Multivariate Bayesian variable selection regression

M&M benchmark XII

This benchmark is an improvments over the previous one with mostly the same setup but hopefully previously observed issues are fixed.

The only major difference in setting is that I removed EZ model from mvsusieR package thus the DSC for it.

Conclusion

The corresponding DSC code are from ee50493 and to be reproduced as follows:

./finemap.dsc --host dsc_mnm.yml -o mnm_20191209
In [1]:
%cd ~/GIT/mvarbvs/dsc_mnm
/project2/mstephens/gaow/mvarbvs/dsc_mnm
In [2]:
out = dscrutils::dscquery('mnm_20191209', targets = c('simulate.n_traits', 'mnm', 'mnm.resid_method', 'mnm.L', 
                                                      'susie_scores', 'susie_scores.total', 'susie_scores.valid', 'susie_scores.size', 
                                                      'susie_scores.purity', 'susie_scores.top', 'susie_scores.n_causal', 'susie_scores.included_causal', 
                                                      'susie_scores.overlap_var', 'susie_scores.overlap_cs','susie_scores.false_pos_cond_discoveries', 
                                                      'susie_scores.false_neg_cond_discoveries', 'susie_scores.true_cond_discoveries', 'susie_scores.converged'),
                          module.output.files = "susie_scores", verbose = F)
In [4]:
dim(out)
  1. 15000
  2. 19
In [5]:
saveRDS(out, '../data/finemap_output.20191209.rds')
In [8]:
out$missing = out$mnm
out$missing[which(out$missing=='mnm_high_het')] = FALSE
out$missing[which(out$missing=='mnm_high_het_missing')] = TRUE
out$mnm = NULL
In [9]:
head(out)
DSCsimulate.n_traitsmnm.resid_methodmnm.Lsusie_scores.totalsusie_scores.validsusie_scores.sizesusie_scores.puritysusie_scores.topsusie_scores.n_causalsusie_scores.included_causalsusie_scores.overlap_varsusie_scores.overlap_cssusie_scores.false_pos_cond_discoveriessusie_scores.false_neg_cond_discoveriessusie_scores.true_cond_discoveriessusie_scores.convergedsusie_scores.output.filemissing
1 5 oracle 1 1 1 1 1.0000000 1 1 1 0 0 0 0 5 TRUE susie_scores/full_data_1_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1FALSE
1 5 oracle 1 1 1 1 1.0000000 1 1 1 0 0 0 0 5 TRUE susie_scores/full_data_2_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1FALSE
1 5 oracle 1 1 1 1 1.0000000 1 1 1 0 0 0 0 5 TRUE susie_scores/full_data_3_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1FALSE
1 5 oracle 1 1 1 12 0.9883042 0 1 1 0 0 0 0 5 TRUE susie_scores/full_data_4_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1FALSE
1 5 oracle 1 1 1 22 1.0000000 0 1 1 0 0 0 0 5 TRUE susie_scores/full_data_5_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1FALSE
1 5 oracle 1 1 1 1 1.0000000 1 1 1 0 0 0 0 5 FALSE susie_scores/full_data_6_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1FALSE
In [17]:
res = out[,-1]
colnames(res) = c('n_traits', 'resid_method', 'L', 'total', 'valid', 'size', 'purity', 'top_hit', 'total_true', 'total_true_included', 'overlap_var', 'overlap_cs', 'false_positive_cross_cond', 'false_negative_cross_cond', 'true_positive_cross_cond', 'converged', 'filename', 'missing')

Purity of CS

In [18]:
purity = aggregate(purity~n_traits + resid_method + missing + L, res, mean)
purity = purity[order(purity$n_traits),]
purity
n_traitsresid_methodmissingLpurity
1 5 diag FALSE 1 0.9950368
3 5 flash FALSE 1 0.9936965
5 5 oracle FALSE 1 0.9953622
7 5 diag TRUE 1 0.9853424
9 5 flash TRUE 1 0.9852914
11 5 diag FALSE 2 0.9949696
13 5 flash FALSE 2 0.9934708
15 5 oracle FALSE 2 0.9950214
17 5 diag TRUE 2 0.9853550
19 5 flash TRUE 2 0.9853039
21 5 diag FALSE 10 0.9949696
23 5 flash FALSE 10 0.9934708
25 5 oracle FALSE 10 0.9950214
27 5 diag TRUE 10 0.9853550
29 5 flash TRUE 10 0.9853039
245 diag FALSE 1 0.9999438
445 flash FALSE 1 0.9946040
645 oracle FALSE 1 0.9999487
845 diag TRUE 1 0.9989562
1045 flash TRUE 1 0.9959553
1245 diag FALSE 2 0.9999438
1445 flash FALSE 2 0.9946040
1645 oracle FALSE 2 0.9999487
1845 diag TRUE 2 0.9989562
2045 flash TRUE 2 0.9959553
2245 diag FALSE 10 0.9999438
2445 flash FALSE 10 0.9946040
2645 oracle FALSE 10 0.9999487
2845 diag TRUE 10 0.9989562
3045 flash TRUE 10 0.9959553

Power of CS

In [19]:
total_true_included = aggregate(total_true_included ~ n_traits + resid_method + missing + L, res, sum)
total_true = aggregate(total_true ~  n_traits + resid_method + missing + L, res, sum)
cs_overlap = aggregate(overlap_cs ~  n_traits + resid_method + missing + L, res, sum)
snp_overlap = aggregate(overlap_var ~  n_traits + resid_method + missing + L, res, sum)
power = merge(total_true_included, total_true, by = c( 'n_traits' , 'resid_method' , 'missing' , 'L'))
power = merge(power, cs_overlap,  by = c( 'n_traits' , 'resid_method' , 'missing' , 'L'))
power = merge(power, snp_overlap,  by = c( 'n_traits' , 'resid_method' , 'missing' , 'L'))
power$power = round(power$total_true_included/power$total_true,3)
power$overlap_cs = round(power$overlap_cs, 3)
power$overlap_var = round(power$overlap_var, 3)
power = power[order(power$n_traits),]
power = power[order(power$L),]
power = power[order(power$missing),]
power
n_traitsresid_methodmissingLtotal_true_includedtotal_trueoverlap_csoverlap_varpower
16 5 diag FALSE 1 495 500 0 0 0.990
22 5 flash FALSE 1 497 500 0 0 0.994
28 5 oracleFALSE 1 495 500 0 0 0.990
145 diag FALSE 1 500 500 0 0 1.000
745 flash FALSE 1 498 500 0 0 0.996
1345 oracleFALSE 1 500 500 0 0 1.000
18 5 diag FALSE 2 495 500 0 0 0.990
24 5 flash FALSE 2 497 500 0 0 0.994
30 5 oracleFALSE 2 495 500 0 0 0.990
345 diag FALSE 2 500 500 0 0 1.000
945 flash FALSE 2 498 500 0 0 0.996
1545 oracleFALSE 2 500 500 0 0 1.000
17 5 diag FALSE 10 495 500 0 0 0.990
23 5 flash FALSE 10 497 500 0 0 0.994
29 5 oracleFALSE 10 495 500 0 0 0.990
245 diag FALSE 10 500 500 0 0 1.000
845 flash FALSE 10 498 500 0 0 0.996
1445 oracleFALSE 10 500 500 0 0 1.000
19 5 diag TRUE 1 492 500 0 0 0.984
25 5 flash TRUE 1 492 500 0 0 0.984
445 diag TRUE 1 499 500 0 0 0.998
1045 flash TRUE 1 498 500 0 0 0.996
21 5 diag TRUE 2 492 500 0 0 0.984
27 5 flash TRUE 2 492 500 0 0 0.984
645 diag TRUE 2 499 500 0 0 0.998
1245 flash TRUE 2 498 500 0 0 0.996
20 5 diag TRUE 10 492 500 0 0 0.984
26 5 flash TRUE 10 492 500 0 0 0.984
545 diag TRUE 10 499 500 0 0 0.998
1145 flash TRUE 10 498 500 0 0 0.996

FDR of CS no missing data

In [20]:
valid = aggregate(valid ~ n_traits + resid_method + missing + L, res, sum)
total = aggregate(total ~ n_traits + resid_method + missing + L, res, sum)
fdr = merge(valid, total, by = c( 'n_traits' , 'resid_method' , 'missing' , 'L'))
fdr$fdr = round((fdr$total - fdr$valid)/fdr$total,3)
fdr = fdr[which(fdr$missing==FALSE),-3]
fdr = fdr[order(fdr$n_traits),]
fdr
n_traitsresid_methodLvalidtotalfdr
16 5 diag 1 495 500 0.010
17 5 diag 10 495 501 0.012
18 5 diag 2 495 501 0.012
22 5 flash 1 497 500 0.006
23 5 flash 10 497 501 0.008
24 5 flash 2 497 501 0.008
28 5 oracle 1 495 500 0.010
29 5 oracle10 495 503 0.016
30 5 oracle 2 495 503 0.016
145 diag 1 500 500 0.000
245 diag 10 500 500 0.000
345 diag 2 500 500 0.000
745 flash 1 498 498 0.000
845 flash 10 498 498 0.000
945 flash 2 498 498 0.000
1345 oracle 1 500 500 0.000
1445 oracle10 500 500 0.000
1545 oracle 2 500 500 0.000

FDR of CS with missing data

In [21]:
valid = aggregate(valid ~ n_traits + resid_method + missing + L, res, sum)
total = aggregate(total ~ n_traits + resid_method + missing + L, res, sum)
fdr = merge(valid, total, by = c( 'n_traits' , 'resid_method' , 'missing' , 'L'))
fdr$fdr = round((fdr$total - fdr$valid)/fdr$total,3)
fdr = fdr[which(fdr$missing==TRUE),-3]
fdr = fdr[order(fdr$n_traits),]
fdr
n_traitsresid_methodLvalidtotalfdr
19 5 diag 1 492 500 0.016
20 5 diag 10 492 502 0.020
21 5 diag 2 492 502 0.020
25 5 flash 1 492 500 0.016
26 5 flash10 492 502 0.020
27 5 flash 2 492 502 0.020
445 diag 1 499 500 0.002
545 diag 10 499 500 0.002
645 diag 2 499 500 0.002
1045 flash 1 498 499 0.002
1145 flash10 498 499 0.002
1245 flash 2 498 499 0.002

Convergence

Based on ELBO. In principle all runs should converge by ELBO. If it is not converged, then it means either ELBO is not non-increasing or it exceeds max iteration.

It is only relevant to focus on $L>1$.

In [22]:
elbo_converged = aggregate(converged~n_traits + resid_method + missing +  L, res, mean)
elbo_converged = elbo_converged[which(elbo_converged$missing==FALSE),-3]
elbo_converged = elbo_converged[which(elbo_converged$L!=1),]
elbo_converged = elbo_converged[order(elbo_converged$n_traits),]
elbo_converged
n_traitsresid_methodLconverged
11 5 diag 2 0.870
13 5 flash 2 0.876
15 5 oracle 2 0.860
21 5 diag 10 0.870
23 5 flash 10 0.876
25 5 oracle10 0.860
1245 diag 2 0.904
1445 flash 2 0.934
1645 oracle 2 0.882
2245 diag 10 0.904
2445 flash 10 0.934
2645 oracle10 0.882

However if ELBO is not increasing then we should see a warning. The fact we dont see it (no stderr files present in the DSC folder) means 100 iterations were not good enough for ELBO to converge at our default tolerance level 0.001.

Since for missing data we used convergence of PIP to stop the IBSS algorithm, let's check it:

In [23]:
pip_converged = aggregate(converged~n_traits + resid_method + missing +  L, res, mean)
pip_converged = pip_converged[which(pip_converged$missing==TRUE),-3]
pip_converged = pip_converged[which(pip_converged$L!=1),]
pip_converged = pip_converged[order(pip_converged$n_traits),]
pip_converged
n_traitsresid_methodLconverged
17 5 diag 2 1
19 5 flash 2 1
27 5 diag 10 1
29 5 flash10 1
1845 diag 2 1
2045 flash 2 1
2845 diag 10 1
3045 flash10 1

So it all converges by PIP, if we dont check by ELBO.


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago