Multivariate Bayesian variable selection regression

M&M ASH benchmark Part I

Moving on to multivariate analysis we start with some performance benchmarks.

  • A previously established benchmark here roughly implemented the framework of multivariate simulations, but it was incomplete and never used. The benchmark has now been moved here to facilitate potential collaboration using M&M for prediction; development of the DSC benchmark in this repo is therefore discontinued.
  • This website will continue to post benchmark results.
  • Since the applications are in the context of GTEx, we will use this repository for the data application, whether eQTL or sQTL.

Aim, data, methods and experimental design

The relevant materials are in a document on Overleaf shared with project collaborators and are being actively developed. I will not recap them in this note; however, the narrative below follows the structure of that document.

Patterns of sharing evaluated

It is nonetheless useful to recap the patterns of sharing we evaluate in the benchmark. They include:

  1. Condition-specific effects
  2. Low, moderate and high correlations across conditions
    • including a case of 100% correlated effects
  3. Mixture settings:
    • dict(identity=0.1, equal_effects=0.2, singleton=0.2, simple_het_1=0.1, simple_het_2=0.1, simple_het_3=0.1, null=0.2)
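The patterns above correspond to prior covariance matrices $U$ on effects across the $R$ conditions. Below is a minimal sketch of what these matrices look like for $R=5$; the function name `het` and the variable names are illustrative, not the benchmark's actual DSC module code.

```python
import numpy as np

R = 5  # number of conditions

# Identity: effects independent across conditions
identity = np.eye(R)

# Equal effects: the same effect in every condition (100% correlated)
equal_effects = np.ones((R, R))

# Singleton: effect present in one condition only (condition-specific)
singleton = np.zeros((R, R))
singleton[0, 0] = 1.0

def het(rho, R=5):
    """Heterogeneous sharing: unit variances with constant
    pairwise correlation rho (low / moderate / high)."""
    U = np.full((R, R), rho)
    np.fill_diagonal(U, 1.0)
    return U

# Mixture weights matching the dict(...) above; 'null' puts mass on no effect
weights = {"identity": 0.1, "equal_effects": 0.2, "singleton": 0.2,
           "simple_het_1": 0.1, "simple_het_2": 0.1, "simple_het_3": 0.1,
           "null": 0.2}
assert abs(sum(weights.values()) - 1.0) < 1e-9
```

Under the mixture setting, each simulated effect draws its covariance from this set with the given weights.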

Benchmarks evaluated

An initial benchmark as a first pass

As a first pass we use $R=5$ conditions on 100 toy data-sets. We evaluate M&M credible sets (CS) for experiments under the several patterns of sharing documented in the section above. To fit M&M this time we simply input the underlying priors $U$ (and their weights, in the case of mixture simulations) and the residual covariance $V$.

./finemap.dsc --target first_pass

This benchmark takes roughly 10 minutes to complete on my 40-core desktop server.

In [1]:
%cd ~/GIT/github/mnm-twas/dsc
/home/gaow/GIT/github/mnm-twas/dsc
In [10]:
library('dscrutils')
out = dscquery('finemap_output', "hundred_data.dataset sharing_pattern.n_signal susie_scores.total susie_scores.valid susie_scores.size susie_scores.purity susie_scores.top")
Loading dsc-query output from CSV file.
Reading DSC outputs:
 - sharing_pattern.n_signal: extracted atomic values
 - susie_scores.total: extracted atomic values
 - susie_scores.valid: extracted atomic values
 - susie_scores.size: extracted atomic values
 - susie_scores.purity: extracted atomic values
 - susie_scores.top: extracted atomic values
In [11]:
head(out)
DSC  hundred_data.dataset                                 sharing_pattern  n_signal  total  valid  size  purity     top
1    ~/Documents/GTExV8/Thyroid.Lung.FMO2.filled.rds      identity         2         2      2      8     0.9921241  0
1    ~/Documents/GTExV8/Toys/Thyroid.ENSG00000031823.RDS  identity         1         1      1      1     1.0000000  1
1    ~/Documents/GTExV8/Toys/Thyroid.ENSG00000062194.RDS  identity         1         1      1      1     1.0000000  1
1    ~/Documents/GTExV8/Toys/Thyroid.ENSG00000073150.RDS  identity         1         1      1      1     1.0000000  1
1    ~/Documents/GTExV8/Toys/Thyroid.ENSG00000078319.RDS  identity         1         1      1      1     1.0000000  1
1    ~/Documents/GTExV8/Toys/Thyroid.ENSG00000081277.RDS  identity         1         1      1      2     0.9953505  1
In [12]:
res = out[,c(3,4,5,6,7,8,9)]
colnames(res) = c('pattern', 'total_true', 'total', 'valid', 'size', 'purity', 'top_hit')

Purity of CS

In [13]:
aggregate(purity~pattern, res, mean)
pattern    purity
high_het   0.9950535
identity   0.9952877
low_het    0.9939219
mid_het    0.9873239
mixture01  0.8644837
shared     0.9954203
singleton  0.9744209

Size of CS

In [7]:
aggregate(size~pattern, res, median)
pattern    size
high_het   2.00
identity   1.25
low_het    2.00
mid_het    2.00
mixture01  2.00
shared     2.00
singleton  4.25

Power

In [17]:
valid = aggregate(valid ~ pattern, res, sum)
total_true = aggregate(total_true ~ pattern, res, sum)
power = merge(valid, total_true, by = "pattern")
power$power = power$valid/power$total_true
power
pattern    valid  total_true  power
high_het   120    123         0.9756098
identity   117    122         0.9590164
low_het    115    123         0.9349593
mid_het    125    130         0.9615385
mixture01  95     135         0.7037037
shared     118    136         0.8676471
singleton  126    130         0.9692308

FDR

In [18]:
valid = aggregate(valid ~ pattern, res, sum)
total = aggregate(total ~ pattern, res, sum)
fdr = merge(valid, total, by = "pattern")
fdr$fdr = (fdr$total - fdr$valid)/fdr$total
fdr
pattern    valid  total  fdr
high_het   120    123    0.024390244
identity   117    126    0.071428571
low_het    115    121    0.049586777
mid_het    125    128    0.023437500
mixture01  95     98     0.030612245
shared     118    119    0.008403361
singleton  126    127    0.007874016

Top-hit rate (how often the strongest SNP is causal)

In [19]:
top_hit = aggregate(top_hit ~ pattern, res, sum)
total_true = aggregate(total_true ~ pattern, res, sum)
top_rate = merge(top_hit, total_true, by = "pattern")
top_rate$top_rate = top_rate$top_hit/top_rate$total_true
top_rate
pattern    top_hit  total_true  top_rate
high_het   73       123         0.5934959
identity   77       122         0.6311475
low_het    77       123         0.6260163
mid_het    92       130         0.7076923
mixture01  63       135         0.4666667
shared     79       136         0.5808824
singleton  69       130         0.5307692

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago