Multivariate Bayesian variable selection regression

M&M benchmark XI

This benchmark is an improvments over the previous one, in the following espects.

  1. Use both small $R=5$ and large $R=45$ simulations to compare if merely increasing number of conditions messes it up.
  2. Simulate even simpler: 1 effect and using 1 grid for effect covariance such that the prior is no longer a mixture.
  3. Analyzing it with L = 1, L = 2 and L = 10.
  4. Assess CS overlap at both variable and CS level
  5. Evaluated both EE and EZ model when computing BF for multivariate analysis
  6. Add oracle residual covariance method
  7. Use correlation from FLASH method instead and use actual variance of $Y$ to scale it, for residual variance.
  8. Turn on ELBO computation and add a score to check for convergence; although for now we know that missing data ELBO computation can be problematic. Notice this ELBO evaluation is only relevant when there is no missing data. For cases of non-missing data I still use convergence in PIP.
  9. Increased number of replicates to 500.

Conclusion

  1. Under simple situation, that is, small number of effects, small enough L, and no missing data, increasing $R=5$ to $R=45$ does not seem to result in false positives for EZ model.
    • And as expected from the simulated senario, increased number of conditions help with the power due to sharing of effects by magnitude.
  2. EE model has high FDR when model is mis-specified (L > 1)!
  3. CS overlapping situation exists for EZ model, and gets worse as $R$ increases, even for L=2 and lower power setting compared to previous simulation.
    • Using an oracle covariance does not seem to have helped.
  4. FLASH based covariance hurts the power for EZ model.
  5. EE model sometimes has convergence issue by ELBO although ELBO keeps increasing.

Next steps for this investigation

  1. Missing data situation: when ELBO computation is not involved and when residual covariance is simply diag, missing data handling should be really easy -- not sure how to check for that "bug".
  2. What is wrong with EE model FDR problem?

The corresponding DSC code are from c5d75a5 and to be reproduced as follows:

./finemap.dsc --host dsc_mnm.yml -o mnm_20191116
In [1]:
%cd ~/GIT/mvarbvs/dsc_mnm
/project2/mstephens/gaow/mvarbvs/dsc_mnm
In [5]:
out = dscrutils::dscquery('mnm_20191116', targets = c('simulate.n_traits', 'mnm.resid_method', 'mnm.missing_Y', 'mnm.alpha', 'mnm.L', 'susie_scores', 'susie_scores.total', 'susie_scores.valid', 'susie_scores.size', 'susie_scores.purity', 'susie_scores.top', 'susie_scores.n_causal', 'susie_scores.included_causal', 'susie_scores.overlap_var', 'susie_scores.overlap_cs','susie_scores.false_pos_cond_discoveries', 'susie_scores.false_neg_cond_discoveries', 'susie_scores.true_cond_discoveries', 'susie_scores.converged'),
                                          module.output.files = "susie_scores", verbose = F)
In [6]:
head(out)
DSCsimulate.n_traitsmnm.resid_methodmnm.missing_Ymnm.alphamnm.Lsusie_scores.totalsusie_scores.validsusie_scores.sizesusie_scores.puritysusie_scores.topsusie_scores.n_causalsusie_scores.included_causalsusie_scores.overlap_varsusie_scores.overlap_cssusie_scores.false_pos_cond_discoveriessusie_scores.false_neg_cond_discoveriessusie_scores.true_cond_discoveriessusie_scores.convergedsusie_scores.output.file
1 5 oracle TRUE 0 1 1 1 2 0.9852849 1 1 1 0 0 0 0 5 TRUE susie_scores/full_data_1_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1
1 5 oracle TRUE 0 1 1 1 1 1.0000000 1 1 1 0 0 0 0 5 TRUE susie_scores/full_data_2_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1
1 5 oracle TRUE 0 1 1 1 1 1.0000000 1 1 1 0 0 0 0 5 TRUE susie_scores/full_data_3_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1
1 5 oracle TRUE 0 1 1 1 43 0.9899983 0 1 1 0 0 0 0 5 TRUE susie_scores/full_data_4_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1
1 5 oracle TRUE 0 1 1 1 23 0.9863327 0 1 1 0 0 0 0 5 TRUE susie_scores/full_data_5_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1
1 5 oracle TRUE 0 1 1 1 9 0.9216036 1 1 1 0 0 0 0 5 TRUE susie_scores/full_data_6_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1
In [7]:
dim(out)
  1. 36000
  2. 20
In [8]:
saveRDS(out, '../data/finemap_output.20191116.rds')
In [6]:
res = out[,-1]
colnames(res) = c('n_traits', 'resid_method', 'missing', 'EZ_model', 'L', 'total', 'valid', 'size', 'purity', 'top_hit', 'total_true', 'total_true_included', 'overlap_var', 'overlap_cs', 'false_positive_cross_cond', 'false_negative_cross_cond', 'true_positive_cross_cond', 'elbo_converged', 'filename')

Purity of CS

Purity is higher for $R=45$ simply due to higher power; because in this simulation there is no FDR issue.

In [8]:
purity = aggregate(purity~n_traits + resid_method + missing + EZ_model + L, res, mean)
purity = purity[which(purity$missing==FALSE),-3]
purity = purity[order(purity$n_traits),]
purity
n_traitsresid_methodEZ_modelLpurity
1 5 diag 0 1 0.995564192
3 5 flash 0 1 0.993308992
5 5 oracle 0 1 0.996260685
13 5 diag 1 1 0.898551466
15 5 flash 1 1 0.493463519
17 5 oracle 1 1 0.918378140
25 5 diag 0 2 0.995498060
27 5 flash 0 2 0.993060009
29 5 oracle 0 2 0.995250958
37 5 diag 1 2 0.859395104
39 5 flash 1 2 0.369242831
41 5 oracle 1 2 0.891067109
49 5 diag 0 10 0.991605224
51 5 flash 0 10 0.988157033
53 5 oracle 0 10 0.986834604
61 5 diag 1 10 0.012468447
63 5 flash 1 10 0.001081979
65 5 oracle 1 10 0.024371456
245 diag 0 1 0.999944125
445 flash 0 1 0.998646255
645 oracle 0 1 0.999954269
1445 diag 1 1 0.996597778
1645 flash 1 1 0.353285630
1845 oracle 1 1 0.997202390
2645 diag 0 2 0.992376542
2845 flash 0 2 0.991455608
3045 oracle 0 2 0.991887007
3845 diag 1 2 0.994617874
4045 flash 1 2 0.198183655
4245 oracle 1 2 0.995283875
5045 diag 0 10 0.956458416
5245 flash 0 10 0.965783140
5445 oracle 0 10 0.966548821
6245 diag 1 10 0.975531115
6445 flash 1 10 0.000000000
6645 oracle 1 10 0.978309174

Power of CS

Focusing on $L = 2$ to evaluate overlapping CS status. In this case there still exists overlaps between CS, but not as many as with $L=10$. Overlapping status got worse when increased $R$.

In [14]:
total_true_included = aggregate(total_true_included ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
total_true = aggregate(total_true ~  n_traits + resid_method + missing + EZ_model + L, res, sum)
cs_overlap = aggregate(overlap_cs ~  n_traits + resid_method + missing + EZ_model + L, res, sum)
snp_overlap = aggregate(overlap_var ~  n_traits + resid_method + missing + EZ_model + L, res, sum)
power = merge(total_true_included, total_true, by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
power = merge(power, cs_overlap,  by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
power = merge(power, snp_overlap,  by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
power$power = round(power$total_true_included/power$total_true,3)
power$overlap_cs = round(power$overlap_cs, 3)
power$overlap_var = round(power$overlap_var, 3)
power = power[which(power$missing==FALSE),-3]
power = power[order(power$n_traits),]
power = power[order(power$L),]
power = power[order(power$EZ_model),]
#power = power[order(power$missing),]
power
n_traitsresid_methodEZ_modelLtotal_true_includedtotal_trueoverlap_csoverlap_varpower
37 5 diag 0 1 495 500 0 0 0.990
49 5 flash 0 1 499 500 0 0 0.998
61 5 oracle0 1 492 500 0 0 0.984
145 diag 0 1 498 500 0 0 0.996
1345 flash 0 1 499 500 0 0 0.998
2545 oracle0 1 498 500 0 0 0.996
39 5 diag 0 2 495 500 0 0 0.990
51 5 flash 0 2 499 500 0 0 0.998
63 5 oracle0 2 492 500 0 0 0.984
345 diag 0 2 498 500 0 0 0.996
1545 flash 0 2 499 500 0 0 0.998
2745 oracle0 2 498 500 0 0 0.996
38 5 diag 0 10 495 500 1 1 0.990
50 5 flash 0 10 499 500 1 1 0.998
62 5 oracle0 10 492 500 1 1 0.984
245 diag 0 10 498 500 0 0 0.996
1445 flash 0 10 499 500 0 0 0.998
2645 oracle0 10 497 500 0 0 0.994
40 5 diag 1 1 495 500 0 0 0.990
52 5 flash 1 1 291 500 0 0 0.582
64 5 oracle1 1 498 500 0 0 0.996
445 diag 1 1 500 500 0 0 1.000
1645 flash 1 1 199 500 0 0 0.398
2845 oracle1 1 500 500 0 0 1.000
42 5 diag 1 2 488 500 4 621 0.976
54 5 flash 1 2 221 500 2 92 0.442
66 5 oracle1 2 496 500 2 349 0.992
645 diag 1 2 500 500 3 122 1.000
1845 flash 1 2 115 500 9 609 0.230
3045 oracle1 2 500 500 2 9 1.000
41 5 diag 1 10 9 500 26 8723 0.018
53 5 flash 1 10 1 500 3 2452 0.002
65 5 oracle1 10 18 500 33 11985 0.036
545 diag 1 10 499 500 34 1608 0.998
1745 flash 1 10 0 500 0 0 0.000
2945 oracle1 10 499 500 23 1241 0.998

FDR of CS no missing data

In [20]:
valid = aggregate(valid ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
total = aggregate(total ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
fdr = merge(valid, total, by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
fdr$fdr = round((fdr$total - fdr$valid)/fdr$total,3)
fdr = fdr[which(fdr$missing==FALSE),-3]
fdr = fdr[order(fdr$n_traits),]
fdr
n_traitsresid_methodEZ_modelLvalidtotalfdr
37 5 diag 0 1 495 500 0.010
38 5 diag 0 10 495 568 0.129
39 5 diag 0 2 495 545 0.092
40 5 diag 1 1 495 495 0.000
41 5 diag 1 10 23 23 0.000
42 5 diag 1 2 492 492 0.000
49 5 flash 0 1 499 500 0.002
50 5 flash 0 10 499 577 0.135
51 5 flash 0 2 499 543 0.081
52 5 flash 1 1 291 291 0.000
53 5 flash 1 10 3 3 0.000
54 5 flash 1 2 223 223 0.000
61 5 oracle0 1 492 500 0.016
62 5 oracle0 10 492 600 0.180
63 5 oracle0 2 492 548 0.102
64 5 oracle1 1 498 498 0.000
65 5 oracle1 10 37 37 0.000
66 5 oracle1 2 498 498 0.000
145 diag 0 1 498 500 0.004
245 diag 0 10 498 3075 0.838
345 diag 0 2 498 628 0.207
445 diag 1 1 500 500 0.000
545 diag 1 10 529 529 0.000
645 diag 1 2 503 503 0.000
1345 flash 0 1 499 500 0.002
1445 flash 0 10 499 3325 0.850
1545 flash 0 2 499 632 0.210
1645 flash 1 1 199 200 0.005
1745 flash 1 10 0 0 NaN
1845 flash 1 2 124 124 0.000
2545 oracle0 1 498 500 0.004
2645 oracle0 10 497 3383 0.853
2745 oracle0 2 498 642 0.224
2845 oracle1 1 500 500 0.000
2945 oracle1 10 521 521 0.000
3045 oracle1 2 502 502 0.000

FDR of CS with missing data

In [21]:
valid = aggregate(valid ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
total = aggregate(total ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
fdr = merge(valid, total, by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
fdr$fdr = round((fdr$total - fdr$valid)/fdr$total,3)
fdr = fdr[which(fdr$missing==TRUE),-3]
fdr = fdr[order(fdr$n_traits),]
fdr
n_traitsresid_methodEZ_modelLvalidtotalfdr
43 5 diag 0 1 467 485 0.037
44 5 diag 0 10 484 748 0.353
45 5 diag 0 2 480 561 0.144
46 5 diag 1 1 179 184 0.027
47 5 diag 1 10 0 16 1.000
48 5 diag 1 2 111 118 0.059
55 5 flash 0 1 467 485 0.037
56 5 flash 0 10 484 748 0.353
57 5 flash 0 2 480 561 0.144
58 5 flash 1 1 170 175 0.029
59 5 flash 1 10 0 16 1.000
60 5 flash 1 2 103 110 0.064
67 5 oracle0 1 465 485 0.041
68 5 oracle0 10 480 777 0.382
69 5 oracle0 2 478 561 0.148
70 5 oracle1 1 224 229 0.022
71 5 oracle1 10 1 17 0.941
72 5 oracle1 2 151 159 0.050
745 diag 0 1 403 499 0.192
845 diag 0 10 428 2579 0.834
945 diag 0 2 416 944 0.559
1045 diag 1 1 423 489 0.135
1145 diag 1 10 681 1083 0.371
1245 diag 1 2 443 561 0.210
1945 flash 0 1 377 497 0.241
2045 flash 0 10 411 2592 0.841
2145 flash 0 2 396 935 0.576
2245 flash 1 1 274 370 0.259
2345 flash 1 10 344 812 0.576
2445 flash 1 2 263 413 0.363
3145 oracle0 1 403 499 0.192
3245 oracle0 10 428 2659 0.839
3345 oracle0 2 416 944 0.559
3445 oracle1 1 422 489 0.137
3545 oracle1 10 660 1085 0.392
3645 oracle1 2 442 562 0.214

Convergence

Based on ELBO. In principle all runs should converge by ELBO. If it is not converged, then it means ELBO is not non-increasing.

It is only relevant to focus on $L>1$. For without missing data the runs do converge wrt ELBO.

In [18]:
elbo_converged = aggregate(elbo_converged~n_traits + resid_method + missing +  EZ_model + L, res, mean)
#elbo_converged = elbo_converged[which(elbo_converged$missing==FALSE),-3]
elbo_converged = elbo_converged[which(elbo_converged$L!=1),]
elbo_converged = elbo_converged[order(elbo_converged$n_traits),]
elbo_converged
n_traitsresid_methodmissingEZ_modelLelbo_converged
25 5 diag FALSE 0 2 0.994
27 5 flash FALSE 0 2 1.000
29 5 oracleFALSE 0 2 0.998
31 5 diag TRUE 0 2 1.000
33 5 flash TRUE 0 2 1.000
35 5 oracle TRUE 0 2 1.000
37 5 diag FALSE 1 2 1.000
39 5 flash FALSE 1 2 1.000
41 5 oracleFALSE 1 2 1.000
43 5 diag TRUE 1 2 1.000
45 5 flash TRUE 1 2 1.000
47 5 oracle TRUE 1 2 1.000
49 5 diag FALSE 0 10 1.000
51 5 flash FALSE 0 10 1.000
53 5 oracleFALSE 0 10 1.000
55 5 diag TRUE 0 10 1.000
57 5 flash TRUE 0 10 1.000
59 5 oracle TRUE 0 10 1.000
61 5 diag FALSE 1 10 1.000
63 5 flash FALSE 1 10 1.000
65 5 oracleFALSE 1 10 1.000
67 5 diag TRUE 1 10 1.000
69 5 flash TRUE 1 10 1.000
71 5 oracle TRUE 1 10 1.000
2645 diag FALSE 0 2 0.984
2845 flash FALSE 0 2 0.994
3045 oracleFALSE 0 2 0.992
3245 diag TRUE 0 2 1.000
3445 flash TRUE 0 2 1.000
3645 oracle TRUE 0 2 1.000
3845 diag FALSE 1 2 1.000
4045 flash FALSE 1 2 1.000
4245 oracleFALSE 1 2 1.000
4445 diag TRUE 1 2 1.000
4645 flash TRUE 1 2 1.000
4845 oracle TRUE 1 2 1.000
5045 diag FALSE 0 10 0.988
5245 flash FALSE 0 10 0.992
5445 oracleFALSE 0 10 0.986
5645 diag TRUE 0 10 0.998
5845 flash TRUE 0 10 1.000
6045 oracle TRUE 0 10 0.998
6245 diag FALSE 1 10 1.000
6445 flash FALSE 1 10 1.000
6645 oracleFALSE 1 10 1.000
6845 diag TRUE 1 10 1.000
7045 flash TRUE 1 10 0.994
7245 oracle TRUE 1 10 0.998

The convergence issue for EE model: they still have increasing ELBO; but the model did not converge after 100 iterations (ELBO still increase!)


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago