Multivariate Bayesian variable selection regression

M&M benchmark VIII

This benchmark uses the latest GTEx V8 genotype data and evaluated the pipeline in the presence of missing data.

  1. the number of conditions are increased to $R=45$
  2. missing data in expression are simulated according to missingness pattern in the actual expression cross tissues; Both flashier::flash methods and simple diagonal methods were used to compute covariance of response to use as residual covariance.

Conclusion

  1. Our pipeline with missing data has high false positive rates even though the simulated residual correlation is diagonal.
  2. When the underlying pattern of residual covariance is diagonal, FLASH based method suffer from quite a bit power loss as shown in simulations without missing data.

Next steps for this investigation

  1. Figure out the problem (hopefully bug) with missing data handling in mvsusieR.
    • An obvious thing to do is to add more unit tests for missing data although we already have a couple of unit tests for it. But hopefully more tests can catch something obvious.
  2. Add a diagnostic function to compute in between CS correlation.

The benchmark is now under dsc_mnm, running on UChicago RCC midway

./finemap.dsc --host mnm_dsc.yaml

This executes the default pipeline in finemap.dsc file, as of today (2019.11.08).

In [3]:
%cd ~/GIT/mvarbvs/dsc_mnm
/project2/mstephens/gaow/mvarbvs/dsc_mnm
In [4]:
start_time <- Sys.time()
out = dscrutils::dscquery('finemap_output', targets = c('simulate', 'mnm.resid_method', 'mnm.missing_Y', 'susie_scores.total', 'susie_scores.valid', 'susie_scores.size', 'susie_scores.purity', 'susie_scores.top', 'susie_scores.n_causal', 'susie_scores.included_causal', 'susie_scores.overlap', 'susie_scores.false_pos_cond_discoveries', 'susie_scores.false_neg_cond_discoveries', 'susie_scores.true_cond_discoveries'), verbose = F)
end_time <- Sys.time()
In [5]:
end_time - start_time
Time difference of 14.60613 secs
In [6]:
head(out)
DSCsimulatemnm.resid_methodmnm.missing_Ysusie_scores.totalsusie_scores.validsusie_scores.sizesusie_scores.puritysusie_scores.topsusie_scores.n_causalsusie_scores.included_causalsusie_scores.overlapsusie_scores.false_pos_cond_discoveriessusie_scores.false_neg_cond_discoveriessusie_scores.true_cond_discoveries
1 mid_het flash TRUE 2 2 23.5 0.84828832 1 1 23 0 4 86
1 mid_het flash TRUE 1 1 13.0 0.95270611 3 2 0 0 8 37
1 mid_het flash TRUE 6 1 2.0 0.99743670 1 1 0 16 216 38
1 mid_het flash TRUE 5 5 37.0 0.96913390 2 2 125 0 57 168
1 mid_het flash TRUE 1 1 20.0 0.91783800 2 1 0 0 1 44
1 mid_het flash TRUE 1 1 12.0 0.93026280 3 1 0 0 6 39
In [7]:
dim(out)
  1. 400
  2. 15
In [8]:
saveRDS(out, '../data/finemap_output.20191108.rds')
In [10]:
res = out[,-c(1,2)]
colnames(res) = c('resid_method', 'missing', 'total', 'valid', 'size', 'purity', 'top_hit', 'total_true', 'total_true_included', 'overlap', 'false_positive_cross_cond', 'false_negative_cross_cond', 'true_positive_cross_cond')

Purity of CS

Yes purity is higher with missing data --- but because many of those CS are false positives! (see below)

In [11]:
purity = aggregate(purity~resid_method + missing, res, mean)
purity
resid_methodmissingpurity
diag FALSE 0.9670738
flash FALSE 0.6914748
diag TRUE 0.9349031
flash TRUE 0.8479686

Size of CS

In [12]:
size = aggregate(size~resid_method+missing, res, median)
size
resid_methodmissingsize
diag FALSE6.00
flashFALSE7.00
diag TRUE7.75
flash TRUE7.25

Power of CS

Notice here that many CS overlap -- this is not what was observed with $R=5$.

In [13]:
total_true_included = aggregate(total_true_included ~ resid_method + missing, res, sum)
total_true = aggregate(total_true ~ resid_method + missing, res, sum)
overlap = aggregate(overlap ~ resid_method + missing, res, mean)
power = merge(total_true_included, total_true, by = c("resid_method", "missing"))
power = merge(power, overlap,  by = c("resid_method", "missing"))
power$power = power$total_true_included/power$total_true
power = power[order(power$missing),]
power
resid_methodmissingtotal_true_includedtotal_trueoverlappower
1diag FALSE 158 162 194.40 0.9753086
3flash FALSE 116 162 103.48 0.7160494
2diag TRUE 140 162 242.23 0.8641975
4flash TRUE 112 162 119.73 0.6913580

FDR of CS

The high FDR explains the seemingly high power, and is consistent with the observations that CS are "purer".

In [14]:
valid = aggregate(valid ~ resid_method + missing, res, sum)
total = aggregate(total ~ resid_method + missing, res, sum)
fdr = merge(valid, total, by = c("resid_method", "missing"))
fdr$fdr = (fdr$total - fdr$valid)/fdr$total
fdr = fdr[order(fdr$missing),]
fdr
resid_methodmissingvalidtotalfdr
1diag FALSE 320 320 0.000000000
3flash FALSE 169 170 0.005882353
2diag TRUE 281 369 0.238482385
4flash TRUE 190 284 0.330985915

Power for per signal per condition estimates

We compute lfsr on per signal per condition basis. We call it a signal in the condition if lfsr is smaller than 0.05.

In [15]:
tp = aggregate(true_positive_cross_cond ~ resid_method + missing, res, sum)
fn = aggregate(false_negative_cross_cond ~ resid_method + missing, res, sum)
power = merge(tp, fn, by = c("resid_method", "missing"))
In [16]:
power$power = power$true_positive_cross_cond/(power$true_positive_cross_cond + power$false_negative_cross_cond)
power = power[order(power$missing),]
power
resid_methodmissingtrue_positive_cross_condfalse_negative_cross_condpower
1diag FALSE 8871 5529 0.6160417
3flash FALSE 4467 3176 0.5844564
2diag TRUE 4311 10076 0.2996455
4flash TRUE 3613 7168 0.3351266

FDR for per signal per condition estimates

In [17]:
tp = aggregate(true_positive_cross_cond ~ resid_method + missing, res, sum)
fp = aggregate(false_positive_cross_cond ~ resid_method + missing, res, sum)
fdr = merge(tp, fp, by = c("resid_method", "missing"))
fdr$fdr = fdr$false_positive_cross_cond/(fdr$true_positive_cross_cond + fdr$false_positive_cross_cond)
fdr = fdr[order(fdr$missing),]
fdr
resid_methodmissingtrue_positive_cross_condfalse_positive_cross_condfdr
1diag FALSE 8871 0 0.000000000
3flash FALSE 4467 7 0.001564595
2diag TRUE 4311 2218 0.339715117
4flash TRUE 3613 1999 0.356200998

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago