20181108
With Matthew.
Agenda
susieR
Gao
Matthew
Gao
Matthew
Gao (also this weekend)
Matthew (hopefully next week)
Hopefully we can discuss and write (together?)
Hopefully we have some drafts of above to discuss.
`coef()` for coefficients. Need something for association testing?

`estimate_prior_variance` seems to have less power in our numerical study, compared to fixing it to 0.1.
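A minimal sketch comparing the two settings, assuming the current `susieR` interface (`scaled_prior_variance`, `estimate_prior_variance`; argument names may differ across versions):

```r
library(susieR)

set.seed(1)
n <- 500; p <- 1000
X <- matrix(rnorm(n * p), n, p)
beta <- rep(0, p)
beta[c(100, 400)] <- 1
y <- drop(X %*% beta) + rnorm(n)

# Fixed prior: effect variance set to 0.1 * var(y)
fit_fixed <- susie(X, y, L = 10, scaled_prior_variance = 0.1,
                   estimate_prior_variance = FALSE)
# Estimated prior: optimize the prior variance within the updates
fit_est <- susie(X, y, L = 10, estimate_prior_variance = TRUE)

# coef() returns posterior mean coefficients (intercept first)
head(coef(fit_fixed))
head(coef(fit_est))
```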
Next move:

In-person discussion with Matthew, mostly on real data application results.

susie-paper website to process DSC results and show people how each figure / table is reproduced.

In-person discussion with Matthew on manuscript and next steps
July 2 - July 12
July 13 - July 25
July 25 - Aug 7
Aug 7 - Aug 17
Aug 17
Hope to finish up the susie paper by Sept 15, and the first draft of the M&M paper with applications by the end of 2018. That is, we'll have 3 months for the multivariate data analysis part.
Slack discussion with Matthew on manuscript outline.
`estimate_residual_variance` set to TRUE. Note: all figures will be updated to reflect a fix in my simulation code, and to incorporate twice as many replicates in the new simulation run.
Show that a Bayesian CS followed by a "purity" filter is good enough in the context of genetic fine-mapping (a sketch of this filter follows this list).

Make `get_susie_CS` a function rather than an external Python script. This is to ensure others can reproduce exactly what we offer for the manuscript.

Show SNP-level PIP and ROC.

Comparison of susie CS with DAP clusters.

Compare speed of `susieR` vs `dap-g`.
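For concreteness, a sketch of what the purity filter could look like, assuming purity is defined as the smallest absolute pairwise correlation among SNPs in a set; the 0.5 threshold is illustrative:

```r
# Purity of one credible set: the smallest absolute pairwise correlation
# among the SNPs in the set (singleton sets are maximally pure).
cs_purity <- function(cs, X) {
  if (length(cs) < 2) return(1)
  r <- cor(X[, cs, drop = FALSE])
  min(abs(r[upper.tri(r)]))
}

# Keep only sets whose purity clears the threshold.
filter_cs <- function(cs_list, X, min_purity = 0.5) {
  purity <- vapply(cs_list, cs_purity, numeric(1), X = X)
  cs_list[purity >= min_purity]
}
```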
Meeting with Matthew.
`random.normal(mean, sd)` issue ...

Meeting with Matthew.
Meeting with Matthew
To convey the core idea of our new fine-mapping approach we'd like to report these quantities:
SNP level: Identify for each of the $L$ effects a 95% HPD set, and report its size (number of SNPs in the set), purity (the smallest absolute pairwise LD within the set; higher values mean higher purity) and lfsr (or, minimum lfsr). There should be a strong association between small size, high purity and low lfsr. We can visualize this, and determine a threshold $L_0$ for what to report (a sketch of the HPD-set construction follows this list).
Fine-mapping results: Once we have determined $L_0$ above, we can imagine having a browser of these sets, where we report each SNP's posterior probability of being an eQTL by plotting the HPD sets identified above. The posterior probability is just the posterior of $\alpha$. In normal cases these sets should not overlap.
Effect estimate: For a tissue-level summary, we can click on each of the sets plotted and display a metaplot of the averaged effect size with standard error bars.
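A sketch of how the per-effect summaries might be assembled, assuming the fitted model exposes an $L \times p$ matrix `alpha` of posterior inclusion probabilities (one row per effect); `cs_purity` is the helper sketched earlier, and lfsr is omitted here:

```r
# Build a 95% HPD set for one effect from its posterior vector alpha_l
# (inclusion probabilities over p SNPs, summing to 1).
hpd_set <- function(alpha_l, coverage = 0.95) {
  ord <- order(alpha_l, decreasing = TRUE)
  n_keep <- which(cumsum(alpha_l[ord]) >= coverage)[1]
  sort(ord[seq_len(n_keep)])
}

# Summarize size and purity for each of the L effects.
summarize_effects <- function(alpha, X, coverage = 0.95) {
  sets <- lapply(seq_len(nrow(alpha)),
                 function(l) hpd_set(alpha[l, ], coverage))
  data.frame(effect = seq_len(nrow(alpha)),
             size   = vapply(sets, length, integer(1)),
             purity = vapply(sets, cs_purity, numeric(1), X = X))
}
```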
Meeting with Matthew
GTEx related:
M&M related:
Meeting with Matthew and Abhishek. Mostly we discussed the big picture of the motivation and intuition behind this proposed VB algorithm / parameterization based on the spike-and-slab model, and action items for the next steps.
We discussed comparison with MCMC-based methods. One big selling point of the VB method is the novelty in defining and interpreting the term "fine-mapping". We believe our definition is more reasonable, and our method will naturally lead to easily interpretable results. We envisage that by re-defining fine-mapping our way, we avoid situations that conventional methods may struggle with when doing fine-mapping the way they define it.
We also discussed the intuition behind the proposed parameterization and the VB algorithm that results in the particular structure of the posterior distribution we can exploit, i.e., posteriors at $L$ blocks. In this case, the model (parameterization of the prior) was in fact motivated by the form of the algorithm devised to produce that posterior structure.
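For reference, here is the parameterization as I currently understand it:

$$
y = X\beta + e, \qquad e \sim N(0, \sigma^2 I_n), \qquad
\beta = \sum_{l=1}^{L} \gamma_l b_l, \qquad
\gamma_l \sim \text{Mult}(1, \pi), \qquad b_l \sim N(0, \sigma_0^2),
$$

where each $\gamma_l$ is a one-hot indicator over the $p$ SNPs, so each of the $L$ single-effect components has exactly one nonzero coordinate; this is what gives the variational posterior its block structure over the $L$ effects.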
There are a few things in the single-tissue application that we will consider next:
We believe these are good enough (even without 3) as a first pass at figuring out the fundamentals of this approach, and can readily be applied to eQTL mapping. Other work is needed to extend to other contexts (GWAS, variationQTL, etc.) that we can worry about next.
We have not discussed multivariate applications in this meeting. But we agree that multivariate application can be done in parallel with single-tissue analysis, and we will focus on harvesting low-hanging fruit in GTEx data as a first pass. The proposed method can utilize existing MASH computations. I have also implemented a version (in Python) following Matthew's outline; hopefully I'll get it to work later this week.
Meeting with Matthew
Discussions on this derivation
`single_snp` is not sparse?

Meeting with Matthew
Showed results of analysis on 3 tissues. We have simulation and some real data results so far on mr-ash, but the real data results have "stability" or "inconsistency" issues.
We decide to move away from mr-ash for now. Eventually it can be used (in comparison with other methods) as a tool for prediction. The real data example is precisely the reason why it is not good for fine-mapping (I think we wanted to do fine-mapping differently anyway, by adding an additional MCMC step).
Comments on current model implementation
Tips on debugging current model
Use the `finemap-vb` model instead as the basis.
I'm tempted to go over Matthew's newVB vignette, write up the basic model in a separate prototyping Jupyter notebook, and move my code for the current model to this new framework. As discussed, I'll keep track of and integrate progress on Abhishek's and Peter's ends, and set up the GTEx data analysis infrastructure like we did for mash and mr-ash.
With Peter (mostly) and Matthew (briefly). We discussed various aspects of the current modeling steps 1 and 2, making sure we are on the same page.
We then discussed a simple MH sampling scheme. The key, as suggested by Matthew and formalized by Peter, is to sample 2 SNPs jointly at each move.
Gao will work out the algorithm details and a draft implementation.
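A rough sketch of what one such move could look like; `log_post(gamma)` here is a hypothetical placeholder for the log posterior of an inclusion-indicator vector `gamma`, not an existing function:

```r
# One joint two-SNP MH move: propose swapping a currently included SNP
# with a currently excluded one, so the model size is preserved.
mh_swap_move <- function(gamma, log_post) {
  in_set  <- which(gamma == 1)
  out_set <- which(gamma == 0)
  if (length(in_set) == 0 || length(out_set) == 0) return(gamma)
  i <- if (length(in_set) == 1) in_set else sample(in_set, 1)
  j <- if (length(out_set) == 1) out_set else sample(out_set, 1)
  gamma_new <- gamma
  gamma_new[c(i, j)] <- c(0, 1)
  # The swap proposal is symmetric, so the acceptance probability
  # reduces to the posterior ratio.
  if (log(runif(1)) < log_post(gamma_new) - log_post(gamma))
    gamma_new else gamma
}
```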
With Matthew and Wei on simulation and some real data results. We mostly finished discussing this notebook, focusing on the situation where there is a very dense true signal yet `mr-ash` cannot recover any of it.
Some interesting discussions are:

When using `mr-ash` for prediction of the response, we should scale it by a factor of $c$ such that $\|Y - cX\hat{\beta}\|^2$ is minimized (the closed form is sketched after this list).

Initialize with `ash` (scaled), and look at the variational lower bound to see if it improves. Another initialization would be the result from ridge regression -- none of the effects will be zero.

`mr-ash` failed to recover any signal when there is no correlation between columns of $X$.

Next steps:

`mr-ash`: check out the BoltLMM paper from the Price Lab.

Combine `mr-ash` and `ash` results and see how well it works.
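The minimizing $c$ is just least squares in $c$; a minimal sketch:

```r
# The c minimizing ||Y - c * X beta_hat||^2 has the closed form
# c = <Y, yhat> / ||yhat||^2, where yhat = X beta_hat.
rescale_prediction <- function(Y, X, beta_hat) {
  yhat  <- drop(X %*% beta_hat)
  c_opt <- sum(Y * yhat) / sum(yhat^2)
  list(c = c_opt, fitted = c_opt * yhat)
}
```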
With Matthew on simulation results

Diagnostics:
Comparison with other methods:
With Matthew on issues with GTEx V7 preprocessing.
Questions:
Feedback:
With Matthew and Wei. We discussed mostly the fine-mapping step of m&m, and some mr-ash related issues. First and foremost, we make it clear that the focus of m&m should be constrained to eQTL fine-mapping for now, because it is an important problem that has not been answered in the multivariate framework we propose.
Although we are interested in fine-mapping eventually, it is perhaps too premature to make very concrete plans until we see the data analysis outcome from Step 2, the mash step (this comment was made in response to my initial request to finalize the fine-mapping MCMC algorithm; instead we brainstormed on what could possibly be done to perform fine-mapping).
Sparse vs non-sparse results:
What should be the output of fine-mapping?
What can we learn directly from Step 2, the mash step?
It is also perhaps too early to make any meaningful plans on what to do with fine-mapping before looking into the data:
We want to start the mr-ash paper by saying that there is great interest in introducing sparsity in regression, that recently there is a method called ash that introduces it in a "smart" way, and that it is relatively straightforward to use the ash idea in the context of regression.
The strengths of mr-ash are computational efficiency (VEM) and flexibility (ASH compared to spike-and-slab), but there are disadvantages (PIP is too concentrated; see Carbonetto and Stephens 2012).
In the data application we can show how different the distribution of eQTL effect sizes is across genes. We can focus on a single tissue, fit a separate mr-ash for each gene, and comment on interesting patterns that emerge; or we can use meta-analysis of multiple tissues on genes of interest, if there is not enough power from single-tissue analysis.
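A rough sketch of the per-gene loop; `mr_ash()` here is a hypothetical stand-in for whatever mr-ash implementation we settle on, assumed to return the estimated effect-size distribution $g$ as mixture weights over a grid of prior variances:

```r
# Hypothetical per-gene analysis: fit mr-ash to one gene's expression
# against its cis genotypes, and keep the estimated g for comparison.
fit_one_gene <- function(expr, geno) {
  fit <- mr_ash(geno, expr)  # hypothetical interface, not an existing call
  fit$g
}

# genes: a list with per-gene expression vector and cis-genotype matrix
# g_by_gene <- lapply(genes, function(g) fit_one_gene(g$expr, g$geno))
```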
Meeting with Matthew. We went through the m&m procedure on Overleaf, revisited issue 8 on GitHub, and talked about next steps on data analysis + simulations. The discussion has led to minor changes in the Overleaf write-up.
Additionally we decide that the next step should be getting Step 1 done, i.e., mr-ash on GTEx data. We will start with analyzing GTEx V6 and verify against the mash results, then move on to V7 data. Step 1 would be an interesting application by itself, as it is some form of univariate fine-mapping.
Meeting with Matthew and Wei, to revive the project by looking at what we have and what remains to be done.
Implementation-wise, should we hold off writing any code until we finalize how the generalized framework is formulated? We should think "modularly", making contributions directly to other modules whenever possible, then build m&m ash with these modules.
`mrash` as a standalone work?

If we start from full data then finalizing `mrash` is a natural first step. It is then just a discussion of whether to create a separate package or to make it part of varbvs.
The meeting outlined the approach we will take towards a modularized m&m. Most items on the agenda have been covered. See this document for details.
Meeting with Matthew. We started with a recap of the motivation for the project, then discussed the M&M ASH model with practical considerations.
The M&M ASH model is motivated by what we noticed in the MASH project. We have observed the effect of a SNP (eQTL) being positive in one tissue yet negative in another. This bothers us. We suspect this type of observation is most likely due to negative LD between two causal SNPs that both have positive effects in two separate tissues: if we make the one-eQTL-per-gene assumption as in MASH, we will observe opposite effects. So if we assume SNPs are independent in association analysis, we obtain $\hat{\beta}$ convolved with the LD of all SNPs.
Let's consider univariate association analysis for a moment. Because of LD, $g(\cdot)$, the distribution of $\beta$ we estimate via univariate methods, will have long tails. In other words, $g(\cdot)$ is inflated by LD with other SNPs (see the note below). Estimates of $g(\cdot)$ from multiple regression with an ASH prior via variational EM (currently called MVASH) will not have this problem. However, when we want to make inference on the effect size $\beta$, there will be an identifiability issue with MVASH, because VEM can reach local optima, and the effect size it reports may be assigned to a SNP other than the one that in fact has the effect. The solution to this problem is to use MCMC for fine-mapping on regions selected via VEM. A hybrid approach is to estimate hyper-parameters via VEM and use MCMC to sample the posterior.
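One way to make the "convolved with LD" point concrete, assuming standardized genotype columns:

$$
E[\hat{\beta}_{\text{uni}}] \approx R\,\beta, \qquad R = \tfrac{1}{n} X^\top X,
$$

so each single-SNP estimate picks up the effects of all SNPs in LD with it, which inflates the tails of the apparent $g(\cdot)$.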
Now, to solve the same issue in the context of multivariate regression, we propose the M&M ASH model, which applies multiple regression with an ASH prior to multiple responses. David Gerard has derived a VEM procedure for the M&M model. The assumptions in David's derivations are:
Matthew suggests we make this model simpler and make sure it works. For starters we should ignore correlation among tissues; that is, we assume the residual variance is a diagonal matrix. Here are a few points on why we should start with diagonal and why, at least as a first pass, we should not make the non-diagonal assumption in M&M ASH:
We should start with the simplest version (residual covariance diagonal) and make it work. The hard part is computation. Using summary data whenever possible may help with computation. Additionally, in updating mixture components we can use noisier estimates, that is, estimates from randomly sampled $\hat{\beta}$ instead of all 20K genes × 1,000 SNPs × 50 conditions data points (sketched below). We will have our next meeting (David and Gao with Matthew) after we get this simple version to work in practice.
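A sketch of the subsampled update for the mixture weights, assuming a precomputed matrix `lik_full` of per-effect component likelihoods $p(\hat{\beta}_j \mid \text{component } k)$ (rows are effects, columns are mixture components):

```r
# One EM update of mixture weights pi using a random subsample of rows,
# instead of all 20K x 1,000 x 50 effect estimates.
update_pi_subsample <- function(lik_full, pi_cur, m = 5000) {
  rows <- sample(nrow(lik_full), m)
  lik  <- lik_full[rows, , drop = FALSE]
  # E-step: responsibilities; M-step: average them to get new weights.
  w <- sweep(lik, 2, pi_cur, `*`)
  w <- w / rowSums(w)
  colMeans(w)
}
```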
Status
`m&m ash` model and implementation.

Next steps