Analysis

Notebooks

Abstract
   I test three different ways to calculate an inverse to see performance gains.

Summary
   We suspect that for most estimates of the mixing proportions, we will

2021/05

20210503 ukb pipeline
   UKB Bloodcells Multivariate fine-mapping

2021/03

mvSuSiE benchmark results summary
   Commands that generated the results in this notebook can be found here.

2021/02

The updated mvSuSiE benchmark
   During the past few months we have implemented a few fixes with input from Yuxin, who performed mvSuSiE analysis in a GWAS context and ironed out some corner cases. Progress in the udr package also offers us a better estimate of the mixture prior. We now rerun all the previously developed benchmarks and look at the updated results.

2020/12

20201221 ukb ED prior
   UKB Blood Cells Prepare Data

2020/09

Multivariate fine-mapping with missing data examples
   This notebook applies mvSuSiE on some GTEx genes using two approaches to handle missing data.

2020/06

Summarizing EB based mvSuSiE
   I have used several other notebooks for simulating and learning priors via the EB approach. This notebook puts together what we have.

2020/05

20200530 mthess Benchmark
   Benchmark with mthess and atlas on a small scale simulation

20200520 MNM Benchmark
   Benchmark using non-trivial mixture simulations

Multivariate EBNM based prior for M&M
   Here, for the simulation benchmark, we prepare a mixture prior based on a multivariate Empirical Bayes Normal Means model (previously we used Extreme Deconvolution for the task).

2020/04

M&M benchmark XIII
   This benchmark uses a non-trivial simulation scheme and analyzes it using our current implementation of EM updates to the prior scalar. See this notebook for details on how it is executed.

20200427 MNM Benchmark
   Benchmark using non-trivial mixture simulations

Create prior mixture for simulation studies
   This notebook contains scripts to create mixture prior for use with simulations in DSC.

2020/02

Comparing M&M with atlas-qtl
   See here for some background information. Since atlas-qtl does not give credible sets we compare here the marginal PIP in terms of calibration and precision-recall curves.

2020/01

Comparing atlasqtl with M&M
   In addition to MTHESS, available when we first conceived the project in 2016, the same group of authors has since published two other papers with software implementations: locus (paper) and, building on top of it, an efficient approach called atlasqtl (paper). The atlasqtl approach is designed specifically for detecting pleiotropic patterns (which the authors refer to as "hotspots"). Here we challenge ourselves with a simulated example from the atlasqtl documentation.

2019/12

M&M benchmark XII
   This benchmark is an improvement over the previous one, with mostly the same setup but, hopefully, with the previously observed issues fixed.

Diagnosis of possible issues with EE model
   In this benchmark I see that some runs with the EE model have inflated FDR with no missing data. Here I dig into it.

2019/11

Diagnosis of problems revealed in benchmarks
   This notebook performs some diagnosis for problems identified from this benchmark.

M&M benchmark XI
   This benchmark is an improvement over the previous one in the following aspects.

M&M benchmark VIII
   This benchmark uses the latest GTEx V8 genotype data and evaluates the pipeline in the presence of missing data.

2019/07

20190701 EE Problem
   Continued exploration of potential problem with EE model

2019/06

Linkage vs pleiotropy: the Two-SNP example with missing data
   This is a continuation of notebook two_snps_dispute.ipynb.

Linkage vs pleiotropy: the Two-SNP example
   Here I pick up a particular data-set and make a specific simulation case of linkage vs pleiotropy.

Filtering single effects by single effect logBF
   Implemented as the option simple to "estimate" prior variance: I simply set the effect to zero if all logBF are smaller than zero. It helps remove false discoveries, as shown in this notebook.

Scaling the prior matrices from empirical Bayes multivariate analysis
   Matrices obtained by fitting the empirical Bayes normal means (EBNM) problem under the exchangeable standard effects model ($\beta/s | s \sim g(\cdot)$) must be scaled by residual variance and sample size in mvSuSiE analysis when the variable matrix $X$ is standardized.

Pre-computing various second-moment related quantities
   This saves computation for M&M by precomputing and reusing quantities shared between iterations. It mostly saves $O(R^3)$ computations. This vignette shows that the results agree with the original version. A unit test cannot be used because of a numerical discrepancy between chol in Armadillo and in R, which has been shown to be problematic for some computations. I'll have to improve mashr for it.

Visualization of M&M output
   A prototype to plot various M&M output information.

OpenMP benchmark for Rcpp based codes
   Here I test if OpenMP helps with some of the computations.

Improving calculation of multivariate multiple regression
   A prototype for multivariate multiple regression computation assuming a fixed residual covariance matrix.

Multivariate regression simple prior with estimated scalar
   Here I use a multivariate prior (not a mash mixture) for the mvsusie call, but allow the scalar of the prior to be estimated. That is, I have implemented estimate_prior_variance for this class.

Single effect model sanity check
   Check for agreement of different regression methods in VEM updates.

A non-trivial toy example for multivariate regression with mvsusieR
   Here I show a toy simulation with 600 samples, 3000 variables and phenotypes for 50 conditions of interest, just to see how the method works in terms of computational speed,

Running "degenerated" MASH computation
   This is a prototype of a unit test to verify that the implementation of the mash computation is correct, by comparing it to the univariate case where $Y$ has one column and the prior covariance matrices are fixed.

Comparing MASH analysis with simple multivariate analysis
   Previously we showed that even though univariate analysis with degenerated MASH model gives identical results to SuSiE as expected (with non-decreasing ELBO), in multivariate calculations the ELBO is not always non-decreasing. To investigate the issue we will 1) further simplify the problem and 2) isolate the problem to posterior calculation versus ELBO calculations and check which part is problematic. The best way to achieve both is to implement a simple Bayesian multivariate regression model with prior $b \sim MVN(0, U)$ where $U$ is known, instead of using MASH prior for $b$.

Comparing marginal statistics with Linear regression
   For prototyping the get_sumstats() function in the data object.

Investigating behavior of lfsr for condition specific CS
   Here I investigate behavior of lfsr of CS per condition, using singleton simulation setting.

EM estimate vs direct optimization issue
   Here I compare EM with optim in a larger simulated data-set involving 50 conditions. The example below is a case where the EM and optim results differ and the optim result seems better.

A smaller example of prior variance scalar estimate
   This is a continuation of the previous notebook, but using a smaller example and more explicit code to show the problem. The data can be downloaded here.

EM update for prior variance when prior cannot be directly inverted
   The EM approach previously shown for estimating the prior variance scalar has a problem: its current form does not work when the input prior matrices have a non-invertible component. I used a generalized inverse in the code but it does not work.

Decreasing ELBO issue
   Possibly caused by comparing and setting prior variances to zero when estimating them with EM.

ELBO implementation and comparisons
   I have implemented the ELBO for the M&M model based on the write-up in this document. See Section 8 for derivation details, and Section B for an independent re-derivation by Yuxin Zou checking my work.

Comparing multivariate MASH analysis using diagonal priors with univariate computations
   This is to verify that the mvsusieR implementation is correct for truly multivariate computations. Previously I have only compared it for the degenerated case where the prior matrix is 1 by 1.

20190627 EE model problem
   Problem with EE model

20190627 EE model no problem
   No problem with EE model

20190620 benchmark evaluation poly
   Pipelines to evaluate fine-mapping benchmarks

M&M on GTEx data pipeline
   Pipelines to run M&M analysis. Each gene is already saved in RDS format; see analysis/20180515_Extract_Benchmark_Data.ipynb.

Format MASH weights for M&M
   Previously I've analyzed GTEx V8 data with MASH. Here I'll format it for use with V7 data that we have extracted genotypes for.

2019/04

A summary of genotype sample LD in GTEx data
   This is partially in response to reviewer comments on the SuSiE paper.

2019/03

Further investigation of mismatched analysis with identity prior
   I observe unexpected FDR inflation when analyzing singleton simulations using identity priors. Here I take a closer look at the problem.

Simulation results summary
   Summary of some results from simulation studies for a group meeting demonstration.

Numerical comparison plots
   Figure to summarize numerical comparison results. See this notebook for its input data.

2019/02

Comparing M&M with MT-HESS
   MT-HESS does not give credible sets. Herein we compare the marginal PIP in terms of calibration and precision-recall curves.

Analysis with R2HESS
   For comparison, here I prototype multivariate fine-mapping with one of the few other software packages available, R2HESS.

M&M ASH benchmark VI
   This is a continuation of Part V where total PVE is set to 0.1 and I assume 1 or 2 causal variables per region. I added an evaluation of lfsr per condition.

M&M ASH benchmark V
   This is a continuation of Part V where total PVE is set to 0.15 and I assume 2 causal variables per region. But here, the two SNPs have the same effects sampled from the multivariate distribution. Also I use $R = 5$ conditions and run it on $J=1000$ and 150 genes.

2019/01

M&M ASH benchmark Part IV
   This is a continuation of Part III where instead of looking at 1 causal SNP of PVE = 0.05 I look at a range of causal SNPs per gene with 50% having 1 causal, 30% two causal and 20% three causal. The total PVE is set to 0.15.

M&M ASH benchmark Part III
   This is a continuation of Part II where instead of looking at 1 causal SNP of PVE = 0.05 I look at 2 causal SNPs with total PVE set to 0.15.

M&M ASH benchmark Part II
   This is a continuation of Part I where I use only $R=2$ conditions, 1 causal SNP of PVE = 0.05, with simple singleton, identity and fully shared patterns. The goal is to ensure all computations are correct.

M&M ASH benchmark Part I
   Moving on to multivariate analysis we start with some performance benchmarks.

2018/10

Re-evaluating the use of null weight in real data examples
   Here with susieR version 0.4 I examine an example identified from simulations where SuSiE seems to have made mistakes. The data-set can be downloaded here.

SuSiE paper results based on estimated prior
   This notebook displays another version of this page -- in the SuSiE paper analysis we focused on fixed scaled prior variance (PVE = 0.1). Here we show the results of using estimated prior variance.

2018/09

Adding null component to SuSiE
   Here we evaluate the possible benefit of adding a null component to SuSiE. The hope is that the CS will be easier to prune (without using purity) and that the pruned CS can achieve smaller FDR.

Investigating sQTL analysis results of interest: a deeper look
   From this notebook we've got some examples, particularly for the 4 QTL case where the smallest p-value has a small PIP, and it seems to be driven by adjacent signal clusters. We want to see if SuSiE's results make more sense, specifically whether they fall in splice sites.

20180911 BF Exploration MS

Investigating sQTL analysis results of interest: an overview
   We've got 5 introns having 3 QTLs, 3 having 4 QTLs and 2 having 5 QTLs. This notebook shows SuSiE plots for these cases as an overview and a search for interesting patterns to look into.

2018/08

20180829 LD Heatmap
   Make LD heatmap for demonstration data-sets

2018/07

20180712 Enrichment Workflow
   Enrichment analysis workflow for molecular QTL results

20180711 A Hard Case
   A hard case fine-mapping example

A detailed look at some of the SuSiE fits
   In comparison with CAVIAR follow-ups; workflow implemented in this notebook.

20180704 MolecularQTL Workflow
   Molecular QTL workflow

2018/06

20180630 Runtime
   Comparing computational efficiency of methods

20180620 Purity Plot Lite
   SuSiE Purity Plot

20180615 Power DAP
   Workflow to extract info for power comparison with DAP for a hard case

ROC comparisons
   Or rather, precision-recall curve.

20180606 Identify Interesting Dataset
   Identify and extract interesting data-set for vignettes

20180606 Coverage Check
   Check susie coverage

20180605 PIP Calibrated
   Calibration of SNP level PIP

Looking into some DAP-G outliers
   Here I look into cases where DAP reports PIP = 1 for SNPs that are not causal in simulation. Data used, along with DAP input temp files, can be downloaded here.

Looking into the 3 CAVIAR outliers
   Here we want to understand the examples where CAVIAR -c 1 and susie L 1 do not agree for n = 1 causal variable.

2018/05

20180531 PIP L1 Comparison
   Compare PIP of L1 susie, DAP and CAVIAR

Power comparison susie vs DAP
   Here we compare the power of susie and DAP under different numbers of simulated signals for fixed PVE.

20180527 PIP Workflow
   Workflow to extract PIP and set information for different methods

20180527 PIP Comparison
   Direct comparison of PIP for SuSiE, DAP, CAVIAR and FINEMAP

CS outlier scenarios
   This notebook explores scenarios where CS tend not to contain causal signals (false positives).

Purity result summary
   Result of this notebook has been uploaded here.

20180516 Purity Plot
   SusieR benchmark plot

SusieR benchmark
   A first (comprehensive) set of simulations to learn properties of the new fine-mapping method.

20180515 Extract Benchmark Data
   Extract per gene dataset

20180508 ELBO
   M&M ELBO

2018/04

Fine mapping on FMO2 data in GTEx
   This is an update and combined version of m&m and susie on multi-tissue fine mapping.

Univariate analysis on FMO2 data in GTEx
   To compare with m&m results I analyze the same data-set using varbvs (varbvsnorm) and susieR.

M&M analysis on FMO2 data in GTEx
   This vignette shows multi-tissue fine mapping using the "Sum of Single Effects" (SuSiE) approach.

Breaking M&M prototyping to using DSC
   This is an attempt to use DSC in a novel way.

2017/12

20171212 Median TPM GTEx V8
   Processing GTEx V8 expression data for various summary statistics

Simulation of multiple phenotypes given genotypes and covariance
   Here we simulate effect sizes from a Gaussian mixture.

20171207 Tryout Parallel
   Try out parallelized NumPy computations

20171207 MNMASH Model
   M&M model VEM updates

Computing vectorized OLS
   Implementing univariate simple OLS for multiple Y and multiple X without using a loop, instead using Einstein summation in Python.

2017/11

Toy M&M analysis on Thyroid and Lung
   This analysis uses some early prototypes of M&M.

20171129 MNMASH Model
   M&M model VEM updates

Prototype of core update in M&M ASH model
   This is the core update for VEM step of M&M ASH model, version 2, single SNP calculation under MASH model.

Prototype of VEM in M&M ASH model
   This is the VEM step of M&M ASH model.

Prepare toy example data-set for M&M ASH model
   Here I take a gene from 2 tissues, with covariates.

2017/10

Summary of single tissue analysis results
   From > 25K RDS files

20171030 BIMBAM Plots
   BIMBAM plots using Matthew's code 2007

BIMBAM analysis with selected set of FMO2 SNPs for Thyroid
   We pick a few hundred SNPs that harbor eQTLs (3 or 4 eQTLs) and analyze them with BIMBAM. Here we focus on the region from 171.12 Mb to 171.20 Mb on chr1, based on what we've previously learned.

20171027 BVS MR ASH
   Analyzing a toy example with varbvs and mr-ash

GTEx V8 genotype data imputation
   Revised from old notebook 20170518_Imputation.ipynb.

Extract per-gene per tissue data
   For fine mapping demos.

QTL data preprocessing for Yuxin Zou
   This is not GTEx related. This workflow converts data from an immune response eQTL study in primary human monocytes to mash format.

MASH results on V6 data
   Basically same setup as V6 paper.

MASH results on V8 data
   Instead of setting $\hat{V} = cor(Z_{null})$ we set $\hat{V} = I$.

MASH results on V6 data
   With $X^TX$ in prior and instead of setting $\hat{V} = cor(Z_{null})$ we set $\hat{V} = I$.

MASH results on V6 data
   Without $X^TX$ in prior and instead of setting $\hat{V} = cor(Z_{null})$ we set $\hat{V} = I$.

MASH results on V8 data
   Basically same parameter setting as MASH paper.

MASH results on V8 data
   Without $X^TX$ in prior.

MR ASH on single tissue data
   cis-eQTL analysis using mr-ash, for GTEx V8.

MASH analysis for GTEx V8 data
   This is the new mashr version of analysis.

2017/09

GTEx V8 genotypes
   Save V8 genotype data to HDF5 for association analysis.

GTEx V8 expression and covariate data
   Save V8 expression and covariates to HDF5 format, for use with association analysis.

Converting tissue specific eQTL summary statistics to HDF5
   Here I convert GTEx summary statistics to HDF5 format, making it easier to query and share the results.

2017/08

Compare MASH results before and after removing SNPs in LD
   We performed MASH on both original list of SNPs and SNPs filtered by LD > 0.2. Here we compare the results.

MASH analysis for Urbut 2017
   Reproducing (using old mashr code) the Urbut 2017 paper in response to reviewer requests.

Preparing input data for MASH analysis
   This includes the input max Z scores from univariate analysis and the training data, both before and after LD pruning.

20170828 PEER Not Orthogonal
   PEER analysis result not orthogonal

Compute pairwise LD for selected SNPs
   Though it is straightforward enough to do it in R / Python, I use PLINK to compute the LD matrix.

MR-ASH analysis on Lung
   cis-eQTL analysis on GTEx V7 Lung data via mr-ash.

MR ASH on GTEx genes
   cis-eQTL analysis using mr-ash.

2017/06

20170630 Simulation Study
   mr-ash simulation ash paper scenarios

20170628 MR ASH Toy Example
   mr-ash example analysis

Extract dataset for given genes
   A procedure useful to create toy data from data bundle for methods development and closer look at real data of interest.

Simulation of quantitative phenotype given genotypes
   Here we simulate effect sizes from a Gaussian mixture distribution and match strong effects with "heavily LD convoluted" SNPs.

20170615 MASHR Benchmark
   mashr R vs. C++ benchmark

Association mapping covariates
   Merge covariates info from multiple sources and find orthonormal basis for covariate matrix.

2017/05

Extract cis-SNP
   We annotate genotype data by gene positions and extract cis-SNP. Data are extracted from PLINK files and saved to HDF5 format.

Imputation data post-processing and PCA analysis
   Process VCF files from the Michigan Imputation Server by removing imputed sites and fixing variant IDs. We also perform MDS analysis to obtain covariates for association analysis.

GTEx V7 genotype data imputation
   The official release currently does not impute missing data. We use Michigan Imputation Server for the task.

RNA-Seq data preprocessing
   This workflow includes data normalization and PEER factor analysis.

2017/04

Download data from dbGaP
   Brief documentation on how to access the dbGaP website and download data.

Pipelines


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago