Multivariate Bayesian variable selection regression

Meetings

20181108

With Matthew.

Agenda

  • Finalize manuscript: what is missing, what I should write
  • M&M update:
  • Next moves:
    • SuSiE talk? Nov 28?
    • Make the M&M version pass unit tests (via special cases)
    • DSC:
      • No sharing
      • Different patterns of sharing
      • Impact of residual correlation structure? (need to estimate it via the ELBO)
      • Impact of effect correlation structure? (Yuxin is taking care of it?)
      • Comparison to other methods -- will take a few that work for small R and make a case
    • Both of the above use full data; in practice should we do it on summary data?

20180913 meeting

With Matthew.

Updates on susieR

  • worked with Kaiqian on finalizing the sparse branch
    • She'll come in herself for a separate meeting
  • do not want to make additional changes (we discussed an interface change) until Kaiqian's branch is merged
    • no need to deal with merge conflicts like a pro when we do not have to!

Updates since last manuscript meeting

Gao

  • Appendix VB section mostly completed
  • Looked at single case examples

Matthew

  • Went over the numerical comparison section with changes and suggestions

Todo now (by the end of this week)

Gao

  • Go over the results section again, finalizing case examples from the sQTL analysis
  • Adopt Matthew's suggestions on re-organizing figures for the numerical comparison section

Matthew

  • Another round at the Methods section & Appendix -- can leave some details for Gao to derive & fill in
    • mostly a matter of style, or best-practice guidelines: what to show and what to hide, how generically we formulate things

Todo after above (before our meeting next week)

Gao (also this weekend)

  • Draft a short discussion section as agreed
  • Fill in the remaining details in the Methods and Appendix that Matthew pointed out

Matthew (hopefully next week)

  • A look at the results section
  • A look at the discussion section

Todo Wed Sept 19

Hopefully we can discuss and write (together?)

  • The introduction section
  • The trend filtering application, possibly as a small applications section.

Todo Fri Sept 21

Hopefully we have some drafts of above to discuss.

20180910 Manuscript discussion

Methods

  • why use $1/\tau$ to parameterize?
  • Should we output BF from SuSiE? We have coef() for coefficients. Need something for association testing?

Other stuff

  1. Data application
    • Focus on some case examples -- but I want proven examples, not just ones overlapping with motif stuff
  2. When should we introduce other methods: intro, motivating example, or numerical comparison? In particular, when do we introduce DAP-G's signal clusters?
  3. Which section should we focus on finalizing next? More generally, in what sequence should we finalize sections?
    • I suggest Matthew focus on revising / re-writing Methods and Numerical examples
    • Gao focus on adding case examples to Applications
    • Gao will make updated draft to intro and discussion parts after discussing with Matthew today.
  4. Timeline: need to wrap up some errands Tue and Wed. Will finish up everything else by Monday next week.

20180831 Updates

Questions

  • why use $1/\tau$ to parameterize?
  • (2.6) $n$ because $X'X=n$?
  • Possible limitations: violation of additivity? e.g., joint "interaction" effects.
  • Should we output BF from SuSiE?

20180817 Updates

  1. The CAVIAR inconsistency issue is "fixed" by estimate_prior_variance
  2. But estimate_prior_variance seems to have less power in our numerical study, compared to fixing it to 0.1.
  3. And both of the above, using susieR version 0.2, are less powerful than with version 0.1.
  4. Enrichment analysis updated and discussed with Yang.
    • The result section of the manuscript has been updated
  5. A website was created for the manuscript. Every single analysis performed in the manuscript can be reproduced using information from this site.

Next move:

  1. Which version of SuSiE to use? (see 1-3 above)
    • possibly need to rerun splicing QTL analysis (should be very quick)
  2. Additional analysis for the manuscript?
  3. Write it up now. Timeline on both our sides?

20180711

In-person discussion with Matthew, mostly on real data application results

Agenda

  1. Data section outline: see overleaf
  2. Some questions before interpreting the results
    • should we start to interpret now, or to solve the prior and other issues first?
  3. The bad case and the good case -- do we discuss?
  4. Real data results -- figures for finemapping, and some enrichment results.

Recap 3 issues on slack

  1. How to specify the prior -- there is a difference in real data analysis, though not so much in simulation studies, because I do not simulate very large PVE anyway.
  2. How to specify L -- as demonstrated in one of my other notebooks with CAVIAR, it is perhaps not safe to set a large L
  3. Issue 21, same signal captured by 2 CS.

TODO for manuscript

  1. [after DSC rerun] [not urgent] Figure 2: find a new way to plot such that the PRC is nicer. Possibly the Grenander estimator.
  2. [after DSC rerun] Make Figure 3 shorter and use the same bin width
  3. [July 17] Figure 4 goes to supplemental material
  4. [after DSC rerun] Table 1: make a plot of it -- a bar plot with 4 panels. The table itself goes to the supplement
  5. [July 18] Add version numbers for FINEMAP (1.1), CAVIAR and DAP-G
  6. [July 19] Rewrite complexity analysis. Only do it for susie. Do not review that for other methods.
    • susie is either $O(NPL)$ or $O(LP^2)$
  7. [July 20] Explain DAP intervals in the text -- they try to achieve the same thing as we do, but not as naturally
  8. [July 21] Review how other papers define CS, e.g., Maller et al. 2012
  9. [July 22 - 24] Add result for splicing analysis first draft
  10. [July 25 - 29] Add the motivating example first draft

Rerun DSC

  • [July 18] Because of the susieR version 0.2 update! But I think this is also a good chance for me to start the susie-paper website to process DSC results and show people how each figure / table is reproduced.
  • Add an option to use susieR's estimated prior variance!
  • Put together the susie-paper website.

20180702

In person discussion with Matthew on manuscript and next steps

Timeline

July 2 - July 12

  • Data results on molecular QTL. See outline in the overleaf manuscript draft
  • Meeting with Matthew on July 12 to discuss data results, if completed to some degree ...

July 13 - July 25

  • Start working on multivariate again.
  • Finish up ELBO derivation following from FLASH (or better, susie?)
    • The current version of the ELBO is based on the FLASH paper but does not always increase ...
  • Fix a couple of bugs in M&M code
  • Make it based on summary statistics only.
  • Start a new overleaf document to put in these methods details

July 25 - Aug 7

  • Implement a DSC for multivariate simulation and analysis, comparing M&M with susie

Aug 7 - Aug 17

  • Another round of edits to both susie and M&M paper

Aug 17

  • Meeting with Matthew to discuss susie timeline and finish up susie paper
  • Meeting with Matthew to discuss M&M application
    • I'd like to also throw in enrichment-based priors
    • Not tissue-specific annotations yet, but very simple and well-established ones, e.g., physical position, footprints
    • Maybe look at Yang's data again in addition to GTEx data

Hope to finish up the susie paper by Sept 15, and the first draft of the M&M paper with applications by the end of 2018. That is, we'll have 3 months for the multivariate data analysis part.

20180618

Slack discussion with Matthew on manuscript outline.

RE "default susie"

  • Default estimate_residual_variance set to TRUE.
  • By default we scale and center X so that prior becomes PVE.
  • Choice of Prior:
    • Option 1: use prior = 0.1 for all scenarios in our numerical studies, and show in the supplement that results do not change much with other values.
    • Option 2: as discussed in this ticket we can potentially estimate it (see the sketch below).
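
A minimal sketch of what these defaults amount to in a susieR call; the argument names are my assumptions about the current susieR interface, not a final API:

```r
library(susieR)
# "Default susie" per the discussion above; X (n x p) and y (length n)
# are placeholders, and argument names are assumed from susieR.
fit <- susie(X, y,
             L = 10,                            # number of single effects
             scaled_prior_variance = 0.1,       # Option 1: fixed prior (interpretable as PVE since X is scaled and centered)
             estimate_prior_variance = FALSE,   # Option 2: set TRUE to estimate the prior instead
             estimate_residual_variance = TRUE) # the default discussed above
```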

Result section

Note: all figures will be updated to reflect a fix in my simulation code, and to incorporate twice as many replicates in the new simulation run

  1. Show that a Bayesian CS followed by a "purity" filter is good enough in the context of genetic fine-mapping.

  2. Show SNP level PIP and ROC

    • Consistency of PIP with other methods including susie, CAVIAR, DAP and FINEMAP (need to add in FINEMAP V1.1 results)
    • Calibration of PIP against the truth, also showing DAP, CAVIAR and FINEMAP results against the truth (need to add in FINEMAP V1.1 results; a sketch of the calibration check follows this list).
    • ROC in comparison with DAP, CAVIAR and FINEMAP (need to add in FINEMAP V1.1 results; also make the Y-axis start at zero).
  3. Comparison of susie CS with DAP cluster

  4. Compare speed of susieR vs dap-g

    • Need to pull and average from DSC results
    • For susieR: we do use the ELBO to check for convergence -- should I use a line profiler to get an idea of how long the ELBO computation takes, and estimate the savings of using $\alpha$-based convergence instead?
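
A minimal sketch of the PIP calibration check mentioned above, assuming vectors pip (reported posterior inclusion probabilities) and causal (0/1 truth) pooled across simulation replicates; both names are hypothetical:

```r
# Bin SNPs by reported PIP and compare the mean PIP in each bin with
# the observed fraction of truly causal SNPs; a calibrated method
# should give points near the diagonal.
pip_calibration <- function(pip, causal, breaks = seq(0, 1, by = 0.1)) {
  bin <- cut(pip, breaks, include.lowest = TRUE)
  data.frame(mean_pip    = tapply(pip, bin, mean),
             prop_causal = tapply(causal, bin, mean),
             n_snps      = as.vector(table(bin)))
}
```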

20180607

Meeting with Matthew.

Agenda

  • Look together at the comparison of sets between DAP and SuSiE in power and FDP.
  • Discuss data applications.
  • Discuss additional simulation.

TODO

  • [] Double-check that DAP's reported LD agrees with what I compute, for the 1 causal case in particular
  • [X] Re-plot PIP-based ROC focusing on PIP thresholds of interest (zoom in on the X-axis)
  • [X] Use the median sample size for the power table
  • [X] Create a simulation using ~7K variants with 10 causal, to show the susie method scales but DAP's heuristics may have issues
  • [X] Bring back the FINEMAP version < May 9, 2018 to the DSC -- the goal is to confirm that DAP indeed outperforms CAVIAR and FINEMAP
    • make plots of susie vs FINEMAP and DAP vs FINEMAP
  • [X] Rerun all DSC -- fix that "dollar bet" mistake on the random.normal(mean, sd) issue ...
  • [] Take that single example from the CAVIAR paper (or other single examples) and quickly check if anything interesting shows up
  • [X] Talk to Kevin about his molecular QTL data. We decide to do the data application outside GTEx eQTL for various reasons.

20180529

Meeting with Matthew.

Agenda

  • Project status: simulation studies mostly done. Starting to analyze data.
  • Mostly done -- get some more CAVIAR results, double-check DAP. Maybe get an old version of FINEMAP to work? Or just ignore it and move on?

Questions

  • Show why the VB handles SNPs in LD by putting them in one CS, without eliminating signals -- need some proof, or at least some quantification
    • Input: LD. (maybe also say why 0.2 is good enough?)
    • Output: weights in set 1, weights in set 2.
    • When can two truly independent and significant signals go into one set?
  • This case: is it just what it is?
  • Goals of the project: fine-mapping, heritability estimation, prediction. What if people are interested in the others?

20180417

Meeting with Matthew

Agenda

Need to fix

  • Computation of the lfsr: my current approach is conceptually wrong. The lfsr we report for the new model should be per effect, per set of SNPs (per condition; see the definition below). "Set of SNPs" is reflected by the $L$ effects in our model. When there is only one SNP in the set, the lfsr naturally becomes the lfsr for that SNP
  • We can compute the lfdr for our own information but should not report it
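
For reference, the standard definition of the lfsr (Stephens 2017), which the per-effect, per-condition quantity above specializes: for effect $l$ in condition $r$, with $b_{lr}$ the corresponding effect,

$$\text{lfsr}_{lr} = \min\{\Pr(b_{lr} \geq 0 \mid \text{data}),\; \Pr(b_{lr} \leq 0 \mid \text{data})\}.$$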

To catch up in writing

  • How lfsr are computed and interpreted
  • How and why are the 3 quantities of interest defined (maybe use a separate notebook to explain the motivations)
  • New method to compute the ELBO

What to report from M&M?

To convey the core idea of our new fine-mapping approach we'd like to report these quantities:

  1. SNP level: Identify for each of the $L$ effects a 95% HPD set; report its size (number of SNPs in the set), purity (the smallest pairwise LD within the set; larger values mean higher purity; see the sketch after this list) and lfsr (or minimum lfsr). There should be a strong association between small size, high purity and low lfsr. We can visualize this, and determine a threshold $L_0$ to report.

  2. Finemapping results: Once we have determined $L_0$ above, we can imagine having a browser of these sets where we report for SNPs the posterior probability of being an eQTL, by plotting the HPD sets identified above. The posterior probability is just the posterior of $\alpha$. In normal cases these sets should not overlap.

  3. Effect estimate: For tissue level summary, we can click on each of the sets plotted and display a metaplot for averaged effect size and standard error bars.
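
A minimal sketch of the purity computation in item 1, assuming a genotype matrix X and a vector cs of column indices for one reported set (both hypothetical names):

```r
# Purity here = smallest absolute pairwise correlation (LD) among the
# SNPs in the set; values near 1 mean the set is highly pure.
set_purity <- function(X, cs) {
  if (length(cs) < 2) return(1)
  r <- abs(cor(X[, cs, drop = FALSE]))
  min(r[lower.tri(r)])
}
```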

Next steps

  • Get the ELBO to work -- deal with low-rank matrix computations in the prior and KL parts.
  • Make M&M report as discussed above.
  • More comparisons?

20171210

Meeting with Matthew

Agenda

GTEx related

  • New genotype uploaded along with updated results (including eQTL results); will be ready mid-week
    • MASH results: to wait for Zarlab or to produce results as is.
  • Summary of GTEx fine-mapping group meeting
    • Relevance to new-vb (or Susie)
    • Relevance to DSC (results exploration)

M&M related:

  • Algorithm
  • Simulation plan: scenarios and scores
    • Scenarios
      • Is the current theme good enough to include all special cases?
      • What other simulations will be needed?
    • Scores
      • Check CI
      • Check lfsr
  • Computation

20171127

Meeting with Matthew and Abhishek. We mostly discussed the big picture of motivation and intuition for this proposed VB algorithm / parameterization based on the spike-and-slab model, and action lists for the next steps.

We discussed comparison with MCMC-based methods. One big selling point of the VB method is the novelty in defining and interpreting the term "fine-mapping". We believe our definition is more reasonable, and our method will naturally lead to easily interpretable results. We envisage that by re-defining fine-mapping our way, we avoid situations that conventional methods may struggle with when doing fine-mapping the way they define it.

We also discussed the intuition behind the proposed parameterization and the VB algorithm that results in the particular structure of the posterior distribution that we can exploit, i.e., posteriors at $L$ blocks. In this case, the model (the parameterization of the prior) was in fact motivated by the form of the algorithm devised to produce that posterior structure.

There are a few things in single-tissue application that we will consider next:

  1. Estimate (or fix?) the hyper-parameters $\sigma$, $\sigma_b$
  2. Incorporate prior $\pi_j$: estimate or plug-in?
  3. Work with summary statistics

We believe these are good enough (even without 3) as a first pass at figuring out the fundamentals of this approach, and apply readily to eQTL mapping. Other actions are needed to extend to other contexts (GWAS, variationQTL, etc.) that we can worry about next.

We have not discussed multivariate applications in this meeting. But we agree that multivariate applications can proceed in parallel with single-tissue analysis, and we will focus on harvesting low-hanging fruit in GTEx data as a first pass. The proposed method can utilize existing MASH computations. I have also implemented a version (in Python) following Matthew's outline; hopefully I'll get it to work later this week.

20171120

Meeting with Matthew

Discussions on this derivation

  • Exactly one element is non-zero -- because the $\pi$'s for the $p$ SNPs sum to 1? (see the sketch after this list)
  • What to choose for $L$?
  • "Similarly we can optimize F2 over q1 with q2 fixed the same way." -- for multiple SNP case, each time we fix all other $q$'s except for one? does the ordering of $L$ imply "importance"
  • function single_snp is not sparse?
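
A sketch of the single-effect prior I believe the first bullet refers to (the notation here is assumed, not taken from the derivation): with $\pi = (\pi_1, \dots, \pi_p)$ summing to 1,

$$\gamma \sim \text{Mult}(1, \pi), \qquad \beta \sim N(0, \sigma_b^2), \qquad b = \gamma\beta,$$

so $\gamma$ is an indicator vector with a single 1, and exactly one element of $b$ is non-zero.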

20171115

Meeting with Matthew

mr-ash

Showed results of the analysis on 3 tissues. We have simulation and some real data results so far on mr-ash. But the real data results have "stability" or "inconsistency" issues.

We decided to move away from mr-ash for now. Eventually it can be used (in comparison with other methods) as a tool for prediction. The real data example is precisely the reason why it is not good for fine-mapping (I think we wanted to do fine-mapping differently anyway, by adding an additional MCMC step).

m&m model

Comments on current model implementation

  • Change notation: $t$ to $p$, $V$ to $\Sigma$, to match MASH notation
  • Notation: $\xi_j$ vs $\hat{b}_j$. The latter is clearer, yet confusing in its own way
  • We remove the prior on $\Lambda$; just say it's the MLE
  • Missing values: do not attempt to impute in the updates. Just ignore them in computing updates
    • by properly setting elements in matrices & vectors to zeros
  • Numeric issue: use the difference of the logs of multivariate normal densities, rather than the ratio of densities (see the sketch below).
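
A minimal sketch of the numerical advice in the last bullet, using the mvtnorm package; the inputs are illustrative placeholders:

```r
library(mvtnorm)
# Compute a log ratio of two multivariate normal densities as a
# difference of log densities, avoiding the over/underflow that a
# direct ratio of densities can hit.
log_density_ratio <- function(bhat, V1, V0) {
  dmvnorm(bhat, sigma = V1, log = TRUE) - dmvnorm(bhat, sigma = V0, log = TRUE)
}
```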

Tips on debugging current model

  • Use one tissue special case to debug, compare with mr-ash
  • Maybe give up this model (which is a multivariate version of mr-ash) and use the finemap-vb model instead as the basis
    • Check out this outline and generalize to multivariate case
    • Also check out Abhishek's this and that.

Next move

I'm tempted to go over Matthew's newVB vignette, try to write up the basic model in a separate prototyping Jupyter notebook, and move my code for the model I have into this new framework. As discussed I'll keep track of and integrate progress on Abhishek's and Peter's ends, and set up the GTEx data analysis infrastructure like we did for mash and mr-ash.

20171005 m&m fine mapping meeting

With Peter (mostly) and Matthew (briefly). We discussed various aspects of the current modeling steps 1 and 2, making sure we are on the same page.

We then discussed a simple MH sampling scheme. The key is, as suggested by Matthew and formalized by Peter, to sample 2 SNPs jointly at each move.

Gao will work out the algorithm details and a draft implementation.

Project meeting 20170630

With Matthew and Wei on simulation and some real data results. We mostly finished discussing this notebook, focusing on the situation where the true signal is very dense yet mr-ash cannot recover any of it.

Some interesting discussions are:

  1. Why do we see ASH over-shrink the effect size? One intuition is that the $s$ we feed to ash are estimates from univariate regressions, which differ from the (proper) estimate of residual variance comparing the model with/without the corresponding SNP. They may be over-estimates, thus effectively over-shrinking $\hat{\beta}$
  2. Relevant to 1: when using ASH estimates either as initialization values for mr-ash, or for prediction of the response, we should scale them by a factor $c$ such that $||Y-cX\hat{\beta}||^2$ is minimized (see the worked solution after this list).
  3. To rule out any potential convergence issue, we should initialize at some "good" estimates, e.g., from ash (scaled), and look at the variational lower bound to see if it improves. Another initialization would be the result of ridge regression -- none of the effects will be zero.
  4. We can try to simulate from a simple model, the LMM, which has a closed-form solution. We can see how well we fit. In fact this seemingly simple model can be a difficult problem to solve in certain circumstances! Suppose we have $$Y=X\beta+E$$ $$\beta \sim N(0, \sigma_\beta^2I)$$ then $$Y \sim N(0, \sigma_\beta^2XX^T+\sigma_e^2I)$$ When $XX^T$ is the identity, estimates for $\sigma_\beta$ and $\sigma_e$ are not identifiable. This may explain why mr-ash failed to recover any signal when there is no correlation between columns of $X$.
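
For item 2 above, the minimizer has a closed form: differentiating $||Y - cX\hat{\beta}||^2$ with respect to $c$ and setting the derivative to zero gives

$$c = \frac{(X\hat{\beta})^T Y}{||X\hat{\beta}||^2}.$$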

Next steps:

  1. Simulate and fit under the LMM model, and also fit that simple simulation with mr-ash. Check out the BOLT-LMM paper from the Price Lab.
  2. Try to make predictions with mr-ash and ash results and see how well they work
  3. Minor fix: include z-scores in diagnostic plots

Project meeting 20170615

With Matthew on simulation results

  • Analyze with ASH and see what to exploit
  • Try simulation with very big $\sigma$ and observe the distribution of estimates

Diagnostics:

  • Distribution of effects
  • Use averaged CDF: see ASH paper
  • Simulate scenarios where there are qualitative differences and see if mr-ash can capture them
  • Look at estimate of $\sigma^2$

Comparison with other methods:

  • Simulate a 2-component normal mixture and compare with GEMMA
  • Compare with rss-ash

Project meeting 20170518

With Matthew on issues with GTEx V7 preprocessing.

Questions:

  • Normalization: why qnorm? should I standardize it? should I do per-tissue qnorm then standardize them?
  • PEER: should I do per-tissue or altogether? If altogether should I add tissue as a covariate?
    • Variable numbers of PEER factors with GTEx V6 -- what's the deal with V7 now?
  • Analysis: should I remove PEER + covariates first for each gene and then run mr-ash?
  • Any additional concerns with imputed genotypes?

Feedback:

  • We do quantile normalization because it is the most robust, and our model assumes normality (see the sketch after this list).
    • potential issue with RPKM: if some genes are extremely highly expressed they will impact other genes' RPKM
  • No good answer as to whether we should adopt an alternative strategy. We should stick to one defensible strategy, and the official GTEx guideline is the easiest to defend.
  • We can use multiple regression to remove covariates first and work with the residuals. A potential issue is that we may over-correct, but the gain in power is well worth the possible loss. Just make sure we keep P << N so that the change in degrees of freedom is not that big. The official GTEx procedure has some guidelines on how many PEER factors to use.
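
A minimal sketch of the per-gene quantile normalization discussed above (a rank-based inverse normal transform); this is my reading of "qnorm" here, not necessarily the exact GTEx procedure:

```r
# Map a vector of expression values to standard normal quantiles via
# their ranks; ties are averaged.
qnorm_transform <- function(x) {
  qnorm(rank(x, ties.method = "average") / (length(x) + 1))
}
```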

Project meeting 20170502

With Matthew and Wei. We discussed mostly the fine-mapping step of m&m, and some mr-ash related issues. First and foremost, we made it clear that the focus of m&m should be constrained to eQTL fine-mapping for now, because it is an important problem that has not been answered in the multivariate framework we propose.

Although we are interested in fine-mapping eventually, it is perhaps premature to make a very concrete plan until we see the data analysis outcome from Step 2, the mash step (this comment was made in response to my initial request to finalize the fine-mapping MCMC algorithm; instead we brainstormed what could possibly be done to perform fine-mapping).

How to summarize fine mapping results?

Sparse vs non-sparse result:

  • If we adopt a non-sparse model it is not a good idea to evaluate whether $\beta=0$ -- we may want to look at the correctness of the sign instead.
  • Results under a sparse model are potentially computationally easy; see, for example, the Wen 2016 DAP paper. Or we can even look at a one-eQTL-per-gene model, or two eQTL per gene (thus $\binom{2000}{2} \approx 2 \times 10^6$ pairs of the ~2000 cis SNPs), at the fine-mapping step.

What should be the output of fine mapping?

  • One version is to output each SNP's LD with the eQTL
  • We can also output the 95% CI for the LD estimate with eQTL and see if it covers complete LD
    • But which one is the eQTL?
  • Or we can output a set of SNPs such that we have 95% confidence that the set includes the eQTL (see Eskin)
  • For a given SNP, we want to show the nearest eQTL: what is the distance of this SNP to the nearest eQTL? What is the highest LD between this SNP and an eQTL?
  • Given MCMC result, how do we summarize it?
    • What should we do if we want to use the results for data integration?
    • We may want a mode, not a flat PIP, from MCMC
    • We can take samples from the posterior, then for each sample perform the integration analysis, and assess the variance / sensitivity of the results (how robust the conclusion is), and somehow combine the results. This is similar to the idea of multiple imputation.

What can we learn directly from Step 2 the mash step?

  • If we go a bit further and compute the lfsr and posterior mean of effects, we can get the "mode" candidate for the eQTL, i.e., we can then perform fine-mapping around the mode.

It is also perhaps too early to form any meaningful vision of what to do with fine-mapping before looking into the data:

  • How uncertain are the results we'll end up getting? Maybe LD is not a big issue in many of the new eQTLs we identify, which by itself would be an exciting result.

Where do we stop for mr-ash application

We want to start the mr-ash paper by saying that there is great interest in introducing sparsity in regression, that there is a recent method called ash which introduces it in a "smart" way, and that it is relatively straightforward to use the ash idea in the context of regression.

The strengths of mr-ash are computational efficiency (VEM) and flexibility (ASH compared to spike-and-slab), but there are disadvantages (the PIP is too concentrated; see Carbonetto 2012).

In the data application we can show how different the distribution of eQTL effect sizes is across genes. We can focus on a single tissue, fit a separate mr-ash for each gene, and comment on interesting patterns that emerge; or we can use meta-analysis across multiple tissues on genes of interest, if there is not enough power in single-tissue analysis.

Project meeting 20170417

Meeting with Matthew. We went through the m&m procedure on overleaf, revisited issue 8 on github and talked about next steps on data analysis + simulations. The discussion has led to minor changes in the overleaf write up.

Additionally we decided that the next step should be getting Step 1 done, i.e., mr-ash on GTEx data. We will start by analyzing GTEx V6 and verifying against mash results, then move on to V7 data. Step 1 would be an interesting application by itself, as it is some form of univariate fine-mapping.

Project meeting 20170330

Meeting with Matthew and Wei, to revive the project by looking at what we have and what is to be done.

Tentative agenda

Connected work

  • varbvs
  • rss
  • ash
  • mash
  • mrash
  • BMASS

Questions

  • How can m&m ash generalize all these works?
    • We have to think carefully about what to incorporate in the generalized framework, and how to incorporate it
    • In particular, how can we combine mrash / rss + mash?
  • Do we start from summary statistics or full data?
  • Do we need MCMC in addition to VEM?

Implementation-wise, shall we hold off writing any code until we finalize how the generalized framework is formulated? We should think "modularly", make contributions directly to other modules whenever possible, then build m&m ash from these modules.

What to do with mrash as a standalone work?

If we start from full data then finalizing mrash is a natural first step. It is then just a discussion of whether to create a separate package or to make it part of varbvs.

Minutes

The meeting outlined the approach we will take towards a modularized m&m. Most items on the agenda have been covered. See this document for details.

Project meeting 20161103

Meeting with Matthew. We started with a recap of the motivation for the project, then discussed the M&M ASH model with practical considerations.

Motivation

The M&M ASH model is motivated by what we noticed in the MASH project. We have observed the effect of a SNP (eQTL) being positive in one tissue yet negative in another. This bothers us. We suspect this type of observation is most likely due to negative LD between two causal SNPs, each having a positive effect in a separate tissue: if we make the one-eQTL-per-gene assumption as in MASH, we will observe opposite effects. So if we assume SNPs are independent in association analysis, we obtain $\hat{\beta}$ convolved with the LD of all SNPs.

Let's consider univariate association analysis for a moment. Because of LD, $g(.)$, the distribution of $\beta$ we estimate via univariate methods, would have long tails. In other words $g(.)$ is inflated by LD with other SNPs. Estimates of $g(.)$ from multiple regression with an ASH prior via variational EM (currently called MVASH) will not have this problem. However, when we want to make inference on $\beta$, the effect size, there will be an identifiability issue with MVASH: VEM can reach local optima, and the SNP whose effect size it reports may not be the SNP that in fact has an effect. The solution to this problem is to use MCMC for fine-mapping on regions selected via VEM. A hybrid approach is to estimate hyper-parameters via VEM and use MCMC to sample the posterior.

Now, to solve the same issue in the context of multivariate regression, we propose the M&M ASH model, which applies multiple regression with an ASH prior to multiple responses. David Gerard has derived a VEM procedure for the M&M model. The assumptions in David's derivation are:

  • The residual variance of genes (after regressing out eQTL effects) is structured as low-rank + diagonal
  • There will be missing data in the response matrix
  • The mixture proportion can be estimated per test, or be estimated jointly for all tests

M&M ASH with diagonal residual covariance structure

Matthew suggests we make this model simpler and make sure it works. For starters we should ignore correlation among tissues; that is, we assume the residual variance is a diagonal matrix. Here are a few reasons why we should start with diagonal and why, at least as a first pass, we should not make a non-diagonal assumption in M&M ASH:

  • We are not sure yet whether correlated residuals will cause a problem for our inference unless we can show it empirically: we should find real data examples where correlation between tissues is due to correlation between genes, not to similarity of tissues. That would raise a red flag that we should model such correlations.
  • Even if the problem is confirmed we should use the MASH model to show we can solve it, before incorporating the solution into M&M ASH. As the MASH model is simpler, it will get us an assessment from real data quickly, and we'll decide whether the fix is worth pursuing in M&M ASH.
  • To do it in MASH we should assume this residual correlation is the same as the tissues' correlations (the eQTL effect is relatively small) and estimate the 44-by-44 covariance matrix directly from expression data. This is not a trivial problem; many methods produce estimates that shrink the structure towards diagonal. But sparse factor analysis can be a good technique for this, as shown by Wei's work. We then plug this estimate into the MASH model
    • The advantage of this approach (over making the inference jointly, as David has done for M&M ASH) is that it is modular: we can choose a good method (such as SFA or FLASH) for this step of the inference. The method may be biased (it ignores the impact of eQTL) but has lower variance
  • The problem with this approach is that if eQTL induce correlation, we'll wrongly believe there is residual covariance when in fact there is not -- that is, after removing the eQTL effects the residual covariance is diagonal. This would favor the joint approach over the modular approach. To assess whether this is a problem, we can choose genes with a large covariance matrix, remove the effect of the top eQTL, and see whether the residual covariance matrix still retains correlations or is mostly diagonal (see the sketch below).
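
A minimal sketch of the diagnostic in the last bullet, with hypothetical inputs Y (an n-by-K expression matrix) and g (genotype at the top eQTL):

```r
# Regress the top eQTL out of each column of expression, then inspect
# the correlation matrix of the residuals: if it is close to diagonal,
# the eQTL was driving the apparent between-tissue correlation.
residual_cor <- function(Y, g) {
  R <- apply(Y, 2, function(y) resid(lm(y ~ g)))
  cor(R)
}
```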

Next steps

We should start with the simplest version (residual covariance is diagonal) and make it work. The hard part is computation. Using summary data whenever possible may help with computation. Additionally, in updating mixture components we can use noisier estimates, that is, estimates from randomly sampled $\hat{\beta}$'s instead of all 20K genes × 1000 SNPs × 50 conditions data points. We will have our next meeting (David and Gao with Matthew) after we get this simple version to work in practice.

Project meeting 20160921

Tentative schedule

Status

  • The m&m ash model and implementation.
  • Implementation is not working on data due to data structure design
    • When J is 40, P is 2000, K is 50 and L is 20, the S and SI matrices will be of size 3.2 GB * 2 = 6.4 GB. Looping over such data is very slow.
  • Correctness of implementation not tested

Next steps

  • Get implementation working
  • Run simulations
  • Put into the OmicsBMA framework and do real data analysis

Minutes

  • We should conceptually distinguish a model using original data from the one using summary data, though in the VB algorithm they are very similar.
  • Currently the model assumes $\Sigma_{J \times J}$ is known. This is not easy to estimate because it would involve a non-trivial multiple regression. We should try to model $\Sigma_{J \times J}$ as an unknown diagonal matrix and estimate it in the VB framework. For starters, write up the $J = 1$ case.

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago