Multivariate Bayesian variable selection regression

Start Simple!

Here we summarize a whiteboard discussion led by @pcarbo along with @NKweiwang and @gaow.

Context and goal

The discussion focused mostly on eQTL mapping across tissues, though m&m ash is potentially a more generic method. Our goal (hypothesis) in this context is to find new patterns of sharing of effects and to increase eQTL detection power by analyzing multiple SNPs jointly. In particular, we report "counts" compared to single-SNP methods, i.e., how many more or fewer eQTLs we report. Additionally, we check whether this approach gives us a more accurate view of sharing.

Start simple

@pcarbo suggests we start simple by considering a problem with $J = 2$ (two tissues) and $P = 2$ (two SNPs).

2 SNPs

This aims to create a toy example where we can evaluate, via simulation or in real data, the difference between the single-SNP and multi-SNP approaches. We will contrast analyzing the 2 SNPs separately with analyzing them jointly. This can be done on GTEx data with straightforward linear regression analysis. @gaow is going to investigate it soon.
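The contrast between the two analyses can be sketched on simulated data. The following is a minimal illustration, not the planned GTEx analysis: the helper names, sample size, correlation, and effect sizes are all assumptions made for the example, and the two "SNPs" are drawn as correlated Gaussian variables rather than real genotypes. Only the first SNP has a true effect.

```python
import numpy as np

def simulate_two_snps(n=2000, r=0.8, b=(0.5, 0.0), sigma=1.0, seed=1):
    """Simulate two correlated predictors (stand-ins for SNPs) and one response.

    All defaults are illustrative assumptions, not values from the discussion.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, r], [r, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = X @ np.asarray(b) + rng.normal(scale=sigma, size=n)
    return X, y

def ols_slopes(X, y):
    """Least-squares fit with an intercept; returns the slope coefficients only."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

X, y = simulate_two_snps()

# Single-SNP analysis: regress y on each SNP separately.
single = np.array([ols_slopes(X[:, [j]], y)[0] for j in range(2)])

# Multi-SNP analysis: regress y on both SNPs jointly.
joint = ols_slopes(X, y)

print("single-SNP estimates:", single.round(2))
print("joint estimates:     ", joint.round(2))
```

In this setup the single-SNP estimate for the second SNP picks up signal from the first through their correlation, while the joint fit attributes the effect to the first SNP. This is the kind of difference in reported "counts" the toy example is meant to expose.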

2 tissues

This aims to simulate / solve a situation simple enough that we can use it to fully investigate properties of the multi-SNP approach in multiple tissues. Currently we are having computational issues with $J > 2$ because the residual covariance of the response is a $J \times J$ matrix, so there can be too many parameters to estimate. @pcarbo points out that if we start with $J = 2$ then, instead of using ash, we can simply enumerate the model underlying the "ground truth" (giving us a "2D spike-slab" mixture) and possibly infer all parameters involved via variational EM. In this setting the residual covariance matrix will have only 3 parameters to estimate at each iteration. This simple model (with $J = 2$ and $P > 2$) and the parameters to infer are outlined as follows:

Solving this model will not only give us estimates of effects (as the ash model does), but also estimates of weights on mixture components that, unlike ash weights, have a clear interpretation.
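To make the "2D spike-slab" idea concrete, here is a hedged generative sketch. It is one possible reading of the setup, not the model written on the whiteboard. It assumes that each SNP falls into one of four configurations (no effect, tissue 1 only, tissue 2 only, both tissues), that nonzero effects are drawn independently from a normal slab, and that the mixture weights, slab scale, and residual covariance take the arbitrary values shown. The function and variable names are hypothetical.

```python
import numpy as np

# The four inclusion patterns for one SNP across two tissues:
# (0,0) no effect, (1,0) tissue 1 only, (0,1) tissue 2 only, (1,1) both.
CONFIGS = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

def simulate_two_tissues(n=300, p=10, weights=(0.7, 0.1, 0.1, 0.1),
                         slab_sd=1.0, Sigma=((1.0, 0.3), (0.3, 1.0)), seed=0):
    """Draw (X, Y, B, z) from an assumed two-tissue spike-slab mixture.

    Sigma is the 2 x 2 residual covariance: two variances and one covariance,
    i.e. 3 free parameters.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                  # stand-in for genotypes
    z = rng.choice(4, size=p, p=weights)         # configuration index per SNP
    gamma = CONFIGS[z]                           # (p, 2) inclusion indicators
    B = gamma * rng.normal(scale=slab_sd, size=(p, 2))  # zero where excluded
    E = rng.multivariate_normal(np.zeros(2), np.asarray(Sigma), size=n)
    Y = X @ B + E
    return X, Y, B, z

X, Y, B, z = simulate_two_tissues()
print("configuration counts:", np.bincount(z, minlength=4))
```

Under this reading the mixture weights are directly interpretable as the proportions of SNPs with no effect, a tissue-specific effect, or an effect in both tissues.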

This model can possibly be solved via:

  • Variational EM
  • MCMC
  • Variational EM + simple MCMC (which we also think of doing for fine mapping)

We may need to be careful about parameterization of this model. For example we may want to re-parameterize the mixture components as follows:

Problems we want to address with this "simple start"

  • How well does VB work in this setup? Intuitively, VB might have a tendency to overestimate sharing.

[to be edited]

Other thoughts

  • I (@gaow) like this simple start approach and would like to pursue it. However, I think it would be great to use this simple model as a generative model for simulation, yet use the ash model (as currently implemented in m&m ash) to perform inference. For this simple case we can solve the updates for the residual covariance $\Sigma$ at each iteration. We can also use this to evaluate the diagonal $\Sigma$ approximation. It will also provide ground-truth effect sizes to compare with m&m ash estimates. My concern with formulating and solving the model as described is that it might still be difficult and computationally intensive, and even if we work out the $J = 2$ case it is hard to justify that in the $J > 2$ case we can safely switch to the ash approach and that all our findings at $J = 2$ will still hold.
  • @NKweiwang points out that effect size estimates from m&m ash may well be as good as those from solving this model, although m&m ash does not provide mixture proportion estimates.
  • @pcarbo thinks that if we can work out the $J = 2$ case alone and find a good data example, it warrants a paper in a biology journal. We can then publish a statistical paper on $J > 2$ with the m&m ash model, which makes additional assumptions to deal with the computational limitations of the $J = 2$ approach.
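The point above about evaluating the diagonal $\Sigma$ approximation can be sketched as follows. This is a simplified illustration under stated assumptions: the effects `B` are treated as known, whereas a variational EM update would also account for posterior uncertainty in the effects, and the simulated data, true covariance, and function names are hypothetical.

```python
import numpy as np

def residual_cov(Y, X, B, diagonal=False):
    """Maximum-likelihood residual covariance given effects B (assumed known).

    With diagonal=True the off-diagonal entries are set to zero.
    """
    R = Y - X @ B
    S = R.T @ R / R.shape[0]
    return np.diag(np.diag(S)) if diagonal else S

def gaussian_loglik(R, S):
    """Log-likelihood of the rows of R under N(0, S)."""
    n, J = R.shape
    _, logdet = np.linalg.slogdet(S)
    quad = np.einsum("ij,jk,ik->", R, np.linalg.inv(S), R)
    return -0.5 * (n * J * np.log(2 * np.pi) + n * logdet + quad)

# Simulated example with correlated residuals across two tissues.
rng = np.random.default_rng(0)
n, p = 1000, 5
Sigma_true = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, 2))
Y = X @ B + rng.multivariate_normal(np.zeros(2), Sigma_true, size=n)

R = Y - X @ B
S_full = residual_cov(Y, X, B)                 # 3 free parameters when J = 2
S_diag = residual_cov(Y, X, B, diagonal=True)  # 2 free parameters
ll_full = gaussian_loglik(R, S_full)
ll_diag = gaussian_loglik(R, S_diag)
print("full Sigma estimate:\n", S_full.round(2))
print("log-likelihood loss from diagonal approximation:", round(ll_full - ll_diag, 1))
```

The log-likelihood gap gives one rough measure of how much the diagonal approximation costs when residuals are correlated across tissues. In the $J = 2$ case the full matrix can always be estimated, so the comparison is cheap.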

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago