Abstract
I test three different ways to calculate an inverse to see performance gains.
Summary
We suspect that for most estimates of the mixing proportions, we will
20210503 ukb pipeline
UKB Bloodcells Multivariate fine-mapping
mvSuSiE benchmark results summary
Commands that generated the results in this notebook can be found here.
Multivariate EBNM based prior for M&M
Running the mixture prior pipeline with this notebook.
The updated mvSuSiE benchmark
During the past few months we have implemented a few fixes with input from Yuxin, who performed mvSuSiE analysis in a GWAS context and ironed out some corner cases. Progress in the udr package also gives us better estimates for the mixture prior. We now rerun all the benchmarks previously developed and look at the updated results.
20201221 ukb ED prior
UKB Blood Cells Prepare Data
Multivariate fine-mapping with missing data examples
This notebook applies mvSuSiE on some GTEx genes using two approaches to handle missing data.
Summarizing EB based mvSuSiE
I have used several other notebooks for simulating and learning priors via the EB approach. This notebook puts together what we have.
20200530 mthess Benchmark
Benchmark with mthess and atlas on a small scale simulation
20200520 MNM Benchmark
Benchmark using non-trivial mixture simulations
Multivariate EBNM based prior for M&M
Here, for the simulation benchmark, we prepare a mixture prior based on a multivariate Empirical Bayes Normal Means model (previously we used Extreme Deconvolution for the task).
M&M benchmark XIII
This is a benchmark using a non-trivial simulation scheme, analyzed with our current implementation of EM updates to the prior scalar. See this notebook for details on how it is executed.
20200427 MNM Benchmark
Benchmark using non-trivial mixture simulations
Create prior mixture for simulation studies
This notebook contains scripts to create mixture prior for use with simulations in DSC.
Comparing M&M with atlas-qtl
See here for some background information. Since atlas-qtl does not give credible sets we compare here the marginal PIP in terms of calibration and precision-recall curves.
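As a rough illustration of what comparing marginal PIP "in terms of calibration and precision-recall curves" can look like, here is a minimal sketch in Python. It is not the code used in the notebook: the function names `pip_calibration` and `precision_recall`, the equal-width binning, and the simulated inputs are all assumptions made for this example, and ties in PIP are not treated specially.

```python
import numpy as np

def pip_calibration(pip, is_causal, n_bins=10):
    """Per PIP bin: (mean PIP, observed fraction of causal variables, count)."""
    pip = np.asarray(pip, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so every PIP in [0, 1] gets an index in 0..n_bins-1.
    bin_index = np.digitize(pip, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            rows.append((pip[in_bin].mean(), is_causal[in_bin].mean(), int(in_bin.sum())))
    return rows

def precision_recall(pip, is_causal):
    """Precision and recall when calling the top-k variables by PIP, for every k."""
    pip = np.asarray(pip, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    order = np.argsort(-pip, kind="stable")
    true_positives = np.cumsum(is_causal[order])
    n_called = np.arange(1, len(pip) + 1)
    precision = true_positives / n_called
    recall = true_positives / max(int(is_causal.sum()), 1)
    return precision, recall

# Hypothetical, perfectly calibrated input: a variable is causal with probability PIP.
rng = np.random.default_rng(1)
pip = rng.uniform(size=5000)
is_causal = rng.uniform(size=5000) < pip
calibration = pip_calibration(pip, is_causal)
precision, recall = precision_recall(pip, is_causal)
```

For a well-calibrated method the first two entries of each row in `calibration` should be close; plotting `precision` against `recall` gives the precision-recall curve.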
Comparing atlasqtl with M&M
Besides MTHESS, which was available when we first conceived the project in 2016, two other papers from the same group of authors have been published with software implementations: locus (paper) and, building on top of it, an efficient approach called atlasqtl (paper). The atlasqtl approach is designed specifically for detecting pleiotropic patterns (which the authors refer to as "hotspots"). Here we challenge ourselves with a simulated example from the atlasqtl documentation.
M&M benchmark XII
This benchmark is an improvement over the previous one, with mostly the same setup but with previously observed issues hopefully fixed.
Investigating MASH logBF computation under EE model
This is a continuation of a previous investigation.
Diagnosis of possible issues with EE model
From this benchmark I see that some runs with the EE model have inflated FDR with no missing data. Here is some digging into it.
Diagnosis of problems revealed in benchmarks
This notebook performs some diagnosis for problems identified from this benchmark.
M&M benchmark XI
This benchmark is an improvement over the previous one in the following aspects.
M&M benchmark VIII
This benchmark uses the latest GTEx V8 genotype data and evaluated the pipeline in the presence of missing data.
20190701 EE Problem
Continued exploration of potential problem with EE model
Linkage vs pleiotropy: the Two-SNP example with missing data
This is a continuation of notebook two_snps_dispute.ipynb.
Linkage vs pleiotropy: the Two-SNP example
Here I pick up a particular data-set and make a specific simulation case of linkage vs pleiotropy.
Filtering single effects by single effect logBF
Implemented as option simple to "estimate" prior variance: I simply set the effect to zero if all logBF are smaller than zero. It helps remove false discoveries, as shown in this notebook.
Scaling the prior matrices from empirical Bayes multivariate analysis
Matrices obtained from fitting the empirical Bayes normal means (EBNM) problem using the exchangeable standard effects model ($\beta/s | s \sim g(\cdot)$) must be scaled by residual variance and sample size in mvSuSiE analysis when the variable matrix $X$ is standardized.
Pre-computing various second-moment related quantities
This saves computation for M&M by precomputing and reusing quantities shared between iterations. It mostly saves $O(R^3)$ computations. This vignette shows that the results agree with the original version. A unit test cannot be used because of a numerical discrepancy between chol of Armadillo and R, which has been shown to be problematic for some computations. I'll have to improve mashr for it.
Visualization of M&M output
A prototype to plot various M&M output information.
OpenMP benchmark for Rcpp based codes
Here I test if OpenMP helps with some of the computations.
Improving calculation of multivariate multiple regression
A prototype for multivariate multiple regression computation assuming fixed residual covariance matrix.
Multivariate regression simple prior with estimated scalar
Here I use a multivariate prior (not a mash mixture) for the mvsusie call, but allow the scalar of the prior to be estimated. That is, I have implemented estimate_prior_variance for this class.
Single effect model sanity check
Check for agreement of different regression methods in VEM updates.
A non-trivial toy example for multivariate regression with mvsusieR
Here I show a toy simulation with 600 samples, 3000 variables and 50 conditions (phenotypes) of interest, just to see how the method works in terms of computational speed,
Running "degenerated" MASH computation
This is a prototype of a unit test to verify that the implementation of the MASH computation is correct, by comparing it to the univariate case when $Y$ has one column and the prior covariance matrix is fixed.
Comparing MASH analysis with simple multivariate analysis
Previously we showed that even though univariate analysis with degenerated MASH model gives identical results to SuSiE as expected (with non-decreasing ELBO), in multivariate calculations the ELBO is not always non-decreasing. To investigate the issue we will 1) further simplify the problem and 2) isolate the problem to posterior calculation versus ELBO calculations and check which part is problematic. The best way to achieve both is to implement a simple Bayesian multivariate regression model with prior $b \sim MVN(0, U)$ where $U$ is known, instead of using MASH prior for $b$.
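To make the simple model concrete, here is a minimal sketch of the single-variable posterior under $b \sim MVN(0, U)$ with $U$ known. It assumes the data enter through an effect estimate $\hat{b}$ with known sampling covariance $S$, that is $\hat{b} \mid b \sim MVN(b, S)$; the function names `posterior_mvn` and `mvn_logpdf` are hypothetical and this is not the mvsusieR implementation.

```python
import numpy as np

def mvn_logpdf(x, cov):
    """Log density of MVN(0, cov) evaluated at x, via a Cholesky factor."""
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, x)
    return -0.5 * (len(x) * np.log(2 * np.pi) + 2 * np.log(np.diag(L)).sum() + z @ z)

def posterior_mvn(bhat, S, U):
    """Posterior of b given bhat | b ~ MVN(b, S) and prior b ~ MVN(0, U).

    Uses post_cov = U (I + S^{-1} U)^{-1}, which equals (U^{-1} + S^{-1})^{-1}
    when U is invertible but never inverts U itself.
    """
    R = len(bhat)
    S_inv = np.linalg.inv(S)
    post_cov = U @ np.linalg.inv(np.eye(R) + S_inv @ U)
    post_mean = post_cov @ S_inv @ bhat
    # Log Bayes factor against the null b = 0: marginally bhat ~ MVN(0, S + U).
    log_bf = mvn_logpdf(bhat, S + U) - mvn_logpdf(bhat, S)
    return post_mean, post_cov, log_bf

# Hypothetical example with a rank-one ("fully shared") prior matrix.
rng = np.random.default_rng(0)
R = 3
U = np.ones((R, R))
S = 0.1 * np.eye(R)
bhat = rng.normal(size=R)
post_mean, post_cov, log_bf = posterior_mvn(bhat, S, U)
```

With $R = 1$ this reduces to the familiar univariate formulas (posterior variance $us/(u+s)$ and posterior mean $\hat{b}\,u/(u+s)$), which is one way to check a multivariate implementation against a univariate one.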
Comparing marginal statistics with Linear regression
For prototyping the get_sumstats() function in the data object.
Investigating behavior of lfsr for condition specific CS
Here I investigate behavior of lfsr of CS per condition, using singleton simulation setting.
EM estimate vs direct optimization issue
Here I compare EM with optim in a larger simulated data-set involving 50 conditions. The example below is one where the EM and optim results differ and the optim result seems better.
A smaller example of prior variance scalar estimate
This is a continuation of the previous notebook, using a smaller example and more explicit code to show the problem. The data can be downloaded here.
EM update for prior variance when prior cannot be directly inverted
The EM approach previously shown for estimating the prior variance scalar has a problem: in its current form it does not work when the input prior matrices have a non-invertible component. I used a generalized inverse in the code but it does not work.
Decreasing ELBO issue
Possibly caused by comparing and setting prior variances to zero when estimating them with EM.
ELBO implementation and comparisons
I have implemented ELBO for M&M model based on write up in this document. See Section 8 for derivation details; also Section B for an independent re-derivation from Yuxin Zou in checking my work.
Comparing multivariate MASH analysis using diagonal priors with univariate computations
This is to verify that the mvsusieR implementation is correct for truly multivariate computations. Previously I have only compared it for the degenerated case where the prior matrix is 1 by 1.
20190627 EE model problem
Problem with EE model
20190627 EE model no problem
No problem with EE model
20190620 benchmark evaluation poly
Pipelines to evaluate fine-mapping benchmarks
M&M on GTEx data pipeline
Pipelines to run M&M analysis. Each gene is already saved in RDS format; see analysis/20180515_Extract_Benchmark_Data.ipynb.
Format MASH weights for M&M
Previously I've analyzed GTEx V8 data with MASH. Here I'll format it for use with V7 data that we have extracted genotypes for.
A summary of genotype sample LD in GTEx data
This is partially in response to the reviewer response for SuSiE paper.
Further investigation of mismatched analysis with identity prior
I observe unexpected FDR inflation for analyzing singleton simulations using identity priors. Here I take a closer look at the problem.
Simulation results summary
Summary of some results from simulation studies for a group meeting demonstration.
Numerical comparison plots
Figure to summarize numerical comparison results. See this notebook for its input data.
Comparing M&M with MT-HESS
MT-HESS does not give credible sets. Herein we compare the marginal PIP in terms of calibration and precision-recall curves.
Analysis with R2HESS
For comparison, here I prototype multivariate fine-mapping with one of the few other software packages out there, R2HESS.
M&M ASH benchmark VI
This is a continuation of Part V where total PVE is set to 0.1 and I assume 1 or 2 causal variables per region. I added an evaluation of lfsr per condition.
M&M ASH benchmark V
This is a continuation of Part V where total PVE is set to 0.15 and I assume 2 causal variables per region. But here, the two SNPs have the same effects sampled from the multivariate distribution. Also I use $R = 5$ conditions and run it on $J=1000$ and 150 genes.
M&M ASH benchmark Part IV
This is a continuation of Part III where instead of looking at 1 causal SNP of PVE = 0.05 I look at a range of causal SNPs per gene with 50% having 1 causal, 30% two causal and 20% three causal. The total PVE is set to 0.15.
M&M ASH benchmark Part III
This is a continuation of Part II where instead of looking at 1 causal SNP of PVE = 0.05 I look at 2 causal SNPs with total PVE set to 0.15.
M&M ASH benchmark Part II
This is a continuation of Part I where I use only $R=2$ conditions, 1 causal SNP of PVE = 0.05, with simple singleton, identity and fully shared patterns. The goal is to ensure all computations are correct.
M&M ASH benchmark Part I
Moving on to multivariate analysis we start with some performance benchmarks.
Re-evaluating the use of null weight in real data examples
Here with susieR version 0.4 I examine an example identified from simulations where SuSiE seems to have made mistakes. The data-set can be downloaded here.
SuSiE paper results based on estimated prior
This notebook displays another version of this page -- in the SuSiE paper analysis we focused on fixed scaled prior variance (PVE = 0.1). Here we show the results of using estimated prior variance.
Adding null component to SuSiE
Here we evaluate the possible benefit of adding a null component to SuSiE. The hope is that the CS will be easier to prune (without using purity) and that the pruned CS can achieve smaller FDR.
Investigating sQTL analysis results of interest: a deeper look
From this notebook we've got some examples, particularly for the 4 QTL case where the smallest p-value has small PIP, and it seems to be driven by adjunct signal clusters. We want to see if SuSiE's results make more sense, specifically whether they fall in splice sites.
20180911 BF Exploration MS
Investigating sQTL analysis results of interest: an overview
We've got 5 introns having 3 QTLs, 3 having 4 QTLs and 2 having 5 QTLs. This notebook shows SuSiE plots for these cases as an overview and a search for interesting patterns to look into.
20180829 LD Heatmap
Make LD heatmap for demonstration data-sets
20180712 Enrichment Workflow
Enrichment analysis workflow for molecular QTL results
20180711 A Hard Case
A hard case fine-mapping example
A detailed look at some of the SuSiE fits
In comparison with CAVIAR follow-ups; workflow implemented in this notebook.
20180704 MolecularQTL Workflow
Molecular QTL workflow
20180630 Runtime
Comparing computational efficiency of methods
20180620 Purity Plot Lite
SuSiE Purity Plot
20180615 Power DAP
Workflow to extract info for power comparison with DAP for a hard case
ROC comparisons
Or rather, precision-recall curve.
20180606 Identify Interesting Dataset
Identify and extract interesting data-set for vignettes
20180606 Coverage Check
Check susie coverage
20180605 PIP Calibrated
Calibration of SNP level PIP
Looking into some DAP-G outliers
Here I look into cases where DAP reports PIP = 1 for SNPs that are not causal in simulation. Data used, along with DAP input temp files, can be downloaded here.
Looking into the 3 CAVIAR outliers
Here we want to understand the examples where CAVIAR -c 1 and susie L 1 do not agree for n = 1 causal variable.
20180531 PIP L1 Comparison
Compare PIP of L1 susie, DAP and CAVIAR
Power comparison susie vs DAP
Here we compare the power of susie and DAP under different numbers of simulated signals for fixed PVE.
20180527 PIP Workflow
Workflow to extract PIP and set information for different methods
20180527 PIP Comparison
Direct comparison of PIP for SuSiE, DAP, CAVIAR and FINEMAP
CS outlier scenarios
This notebook explores scenarios in which CS tend not to contain causal signals (false positives).
Purity result summary
Result of this notebook has been uploaded here.
20180516 Purity Plot
SusieR benchmark plot
SusieR benchmark
A first (comprehensive) set of simulations to learn properties of the new fine-mapping method.
20180515 Extract Benchmark Data
Extract per gene dataset
20180508 ELBO
M&M ELBO
Fine mapping on FMO2 data in GTEx
This is an update and combined version of m&m and susie on multi-tissue fine mapping.
Univariate analysis on FMO2 data in GTEx
To compare with m&m results I analyze the same data-set using varbvs (varbvsnorm) and susieR.
M&M analysis on FMO2 data in GTEx
This vignette shows multi-tissue fine mapping using the "Sum of Single Effects" (SuSiE) approach.
Breaking M&M prototyping to using DSC
This is an attempt to use DSC in a novel way.
20171212 Median TPM GTEx V8
Processing GTEx V8 expression data for various summary statistics
Simulation of multiple phenotypes given genotypes and covariance
Here we simulate effect sizes from a mixture of Gaussians.
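A minimal sketch of this kind of simulation is given below, assuming a multivariate Gaussian mixture over $R$ conditions for the effects of a few causal variables and Gaussian residuals with a given covariance. The function names, the three example prior matrices, the mixture weights and the binomial genotypes are all hypothetical choices for illustration, not the settings used in the notebook.

```python
import numpy as np

def simulate_effects(n_variables, prior_covs, weights, n_causal, rng):
    """Draw a J x R effect matrix: causal rows come from a Gaussian mixture, the rest are zero."""
    R = prior_covs[0].shape[0]
    B = np.zeros((n_variables, R))
    causal = rng.choice(n_variables, size=n_causal, replace=False)
    for j in causal:
        k = rng.choice(len(prior_covs), p=weights)  # pick a mixture component
        B[j] = rng.multivariate_normal(np.zeros(R), prior_covs[k])
    return B, causal

def simulate_phenotypes(X, B, resid_cov, rng):
    """Y = X B + E, with rows of E drawn independently from MVN(0, resid_cov)."""
    R = resid_cov.shape[0]
    E = rng.multivariate_normal(np.zeros(R), resid_cov, size=X.shape[0])
    return X @ B + E

rng = np.random.default_rng(2)
N, J, R = 200, 50, 3
X = rng.binomial(2, 0.3, size=(N, J)).astype(float)  # toy genotypes coded 0/1/2
prior_covs = [
    np.eye(R),                   # independent effects across conditions
    np.ones((R, R)),             # fully shared effect
    np.diag([1.0, 0.0, 0.0]),    # singleton: effect in the first condition only
]
weights = [0.5, 0.3, 0.2]
B, causal = simulate_effects(J, prior_covs, weights, n_causal=2, rng=rng)
Y = simulate_phenotypes(X, B, resid_cov=np.eye(R), rng=rng)
```

Keeping the effect draw separate from the phenotype draw makes it easy to reuse the same `B` with different residual covariances.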
20171207 Tryout Parallel
Try out parallelized numpy computations
20171207 MNMASH Model
M&M model VEM updates
Computing vectorized OLS
Implementing univariate, simple OLS for multiple Y and multiple X without using loops, relying instead on Einstein summation in Python.
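One way to do this with `numpy.einsum` is sketched below: every column of X is regressed on every column of Y separately, with an intercept handled by centering. The function name `vectorized_simple_ols` and the simulated data are assumptions for this example, and the sketch assumes no column of X is constant.

```python
import numpy as np

def vectorized_simple_ols(X, Y):
    """Simple regression of each Y[:, r] on each X[:, j] (with intercept), without loops.

    Returns bhat and its standard error, both of shape (J, R).
    """
    N = X.shape[0]
    Xc = X - X.mean(axis=0)  # centering absorbs the intercept
    Yc = Y - Y.mean(axis=0)
    sxx = np.einsum("nj,nj->j", Xc, Xc)    # sum of squares per variable
    sxy = np.einsum("nj,nr->jr", Xc, Yc)   # cross products for all (j, r) pairs
    syy = np.einsum("nr,nr->r", Yc, Yc)    # sum of squares per response
    bhat = sxy / sxx[:, None]
    rss = syy[None, :] - bhat * sxy        # residual sum of squares per pair
    sbhat = np.sqrt(rss / (N - 2) / sxx[:, None])
    return bhat, sbhat

# Hypothetical toy data.
rng = np.random.default_rng(3)
N, J, R = 100, 6, 4
X = rng.normal(size=(N, J))
Y = rng.normal(size=(N, R)) + 0.5 * X[:, [0]]
bhat, sbhat = vectorized_simple_ols(X, Y)
```

The `"nj,nr->jr"` contraction is the same as `Xc.T @ Yc`; spelling it with `einsum` keeps all three sums in one notation.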
Toy M&M analysis on Thyroid and Lung
This analysis uses some early prototypes of M&M.
20171129 MNMASH Model
M&M model VEM updates
Prototype of core update in M&M ASH model
This is the core update for VEM step of M&M ASH model, version 2, single SNP calculation under MASH model.
Prototype of VEM in M&M ASH model
This is the VEM step of M&M ASH model.
Prepare toy example data-set for M&M ASH model
Here I take a gene from 2 tissues, with covariates.
Summary of single tissue analysis results
From > 25K RDS files
20171030 BIMBAM Plots
BIMBAM plots using Matthew's code 2007
BIMBAM analysis with selected set of FMO2 SNPs for Thyroid
We pick a few hundred SNPs that harbor eQTLs (3 or 4 eQTLs) and analyze them with BIMBAM. Here we focus on the region between 171.12Mb and 171.20Mb on chr1, based on what we've previously learned.
20171027 BVS MR ASH
Analyzing a toy example with varbvs and mr-ash
GTEx V8 genotype data imputation
Revised from old notebook 20170518_Imputation.ipynb.
Extract per-gene per tissue data
For fine mapping demos.
QTL data preprocessing for Yuxin Zou
This is not GTEx related. This workflow converts data from an immune response eQTL study in primary human monocytes to mash format.
MASH results on V6 data
Basically same setup as V6 paper.
MASH results on V8 data
Instead of setting $\hat{V} = cor(Z_{null})$ we set $\hat{V} = I$.
MASH results on V6 data
With $X^TX$ in prior and instead of setting $\hat{V} = cor(Z_{null})$ we set $\hat{V} = I$.
MASH results on V6 data
Without $X^TX$ in prior and instead of setting $\hat{V} = cor(Z_{null})$ we set $\hat{V} = I$.
MASH results on V8 data
Basically same parameter setting as MASH paper.
MASH results on V8 data
Without $X^TX$ in prior.
MR ASH on single tissue data
cis-eQTL analysis using mr-ash, for GTEx V8.
MASH analysis for GTEx V8 data
This is the new mashr version of analysis.
GTEx V8 genotypes
Save V8 genotype data to HDF5 for association analysis.
GTEx V8 expression and covariate data
Save V8 expression and covariates to HDF5 format, for use with association analysis.
Converting tissue specific eQTL summary statistics to HDF5
Here I convert GTEx summary statistics to HDF5 format, making it easier to query and share the results.
Compare MASH results before and after without SNPs in LD
We performed MASH on both original list of SNPs and SNPs filtered by LD > 0.2. Here we compare the results.
MASH analysis for Urbut 2017
Reproducing (using old mashr codes) the Urbut 2017 paper in response to reviewer requests.
Preparing input data for MASH analysis
This includes the input max Z scores from univariate analysis and the training data, both before and after LD pruning.
20170828 PEER Not Orthogonal
PEER analysis result not orthogonal
Compute pairwise LD for selected SNPs
Though it is straightforward enough to do it in R / Python, I use PLINK to compute the LD matrix.
MR-ASH analysis on Lung
cis-eQTL analysis on GTEx V7 Lung data via mr-ash.
MR ASH on GTEx genes
cis-eQTL analysis using mr-ash.
20170630 Simulation Study
mr-ash simulation ash paper scenarios
20170628 MR ASH Toy Example
mr-ash example analysis
Extract dataset for given genes
A procedure useful to create toy data from data bundle for methods development and closer look at real data of interest.
Simulation of quantitative phenotype given genotypes
Here we simulate effect sizes from a Gaussian mixture distribution and match strong effects with "heavily LD convoluted" SNPs.
20170615 MASHR Benchmark
mashr R vs. C++ benchmark
Association mapping covariates
Merge covariates info from multiple sources and find orthonormal basis for covariate matrix.
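As an illustration of the last step, an orthonormal basis for the column space of a covariate matrix can be obtained from a thin SVD, which also drops redundant columns that often appear after merging sources. This is only a sketch: the function name `orthonormal_basis`, the tolerance, and the toy covariates are assumptions, and the notebook may use a different decomposition.

```python
import numpy as np

def orthonormal_basis(C, tol=1e-10):
    """Orthonormal basis for the column space of C, from its thin SVD.

    Directions whose singular value is below tol times the largest one are
    treated as numerically zero and dropped.
    """
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    keep = s > tol * s.max()
    return U[:, keep]

# Hypothetical merged covariates: intercept, sex, age, and a redundant rescaled age.
rng = np.random.default_rng(4)
N = 100
sex = rng.integers(0, 2, size=N).astype(float)
age = rng.normal(50, 10, size=N)
C = np.column_stack([np.ones(N), sex, age, 2.0 * age])
Q = orthonormal_basis(C)
```

Here `Q` has three columns rather than four, because the rescaled age column adds no new direction; regressing on `Q` removes the same covariate effects as regressing on `C`.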
Extract cis-SNP
We annotate genotype data by gene positions and extract cis-SNP. Data are extracted from PLINK files and saved to HDF5 format.
Imputation data post-processing and PCA analysis
Process VCF files from the Michigan Imputation Server by removing imputed sites and fixing variant IDs. We also perform MDS analysis to obtain covariates for association analysis.
GTEx V7 genotype data imputation
The official release currently does not impute missing data. We use Michigan Imputation Server for the task.
RNA-Seq data preprocessing
This workflow includes data normalization and PEER factor analysis.
Download data from dbGaP
A brief documentation of how the dbGaP website is accessed and how to download data.