From this benchmark I see that some runs with the EE model have inflated FDR even with no missing data. Here is some digging into it.
%cd ~/GIT/mvarbvs/dsc_mnm
Here I load previous benchmark results and identify scenarios where there is a problem:
out = readRDS('../data/finemap_output.20191116.rds')
res = out[,-1]
colnames(res) = c('n_traits', 'resid_method', 'missing', 'EZ_model', 'L', 'total', 'valid', 'size', 'purity', 'top_hit', 'total_true', 'total_true_included', 'overlap_var', 'overlap_cs', 'false_positive_cross_cond', 'false_negative_cross_cond', 'true_positive_cross_cond', 'elbo_converged', 'filename')
bad = res[which(res$total-res$valid>0),]
bad = bad[which(bad$missing==FALSE & bad$EZ_model==0 & bad$n_traits==45 & bad$L<10),]
nrow(bad)
There are quite a few of them with 45 traits, but none for $R=5$ or $R=10$. Filtering with `EZ_model == 1` instead, there is only one such instance, as opposed to 412 here.
head(bad)
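The EE vs EZ contrast above can be reproduced by repeating the same filter with `EZ_model == 1` (a sketch, assuming `res` as loaded above; the variable name `bad_ez` is mine):

```r
# Repeat the filter under the EZ model (EZ_model == 1)
bad_ez = res[which(res$total - res$valid > 0),]
bad_ez = bad_ez[which(bad_ez$missing == FALSE & bad_ez$EZ_model == 1 &
                      bad_ez$n_traits == 45 & bad_ez$L < 10),]
nrow(bad_ez)  # per the counts above: 1 here, vs 412 for the EE model
```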
Load the dataset from the 3rd row identified above -- that is, a case where two CS are identified but only one of them is valid:
ex = readRDS('mnm_20191116/mnm_high_het/full_data_8_high_het_2_oracle_generator_1_mnm_high_het_9.rds')
DSC_19D98699 <- dscrutils::load_inputs(c('mnm_20191116/oracle_generator/oracle_generator_1.pkl','mnm_20191116/full_data/full_data_8.rds','mnm_20191116/high_het/full_data_8_high_het_2.pkl'), dscrutils::read_dsc)
meta <- DSC_19D98699$meta
The original result from DSC is shown in a plot of PIP and CS, with true signals in red:
true_pos = as.integer(apply(meta$true_coef, 1, sum) != 0)
susieR::susie_plot(ex$result,y='PIP', main = 'EE model', xlab = 'SNP positions', add_legend = T, b=true_pos)
Start by analyzing with $L=1$:
DSC_REPLICATE <- DSC_19D98699$DSC_DEBUG$replicate
X <- DSC_19D98699$X
Y <- DSC_19D98699$Y
cfg <- DSC_19D98699$configurations
prior = cfg[[as.character(ncol(Y))]][['high_het']]
compute_cov_diag <- function(Y){
covar <- diag(apply(Y, 2, var, na.rm=T))
return(covar)
}
#saveRDS(list(X=X,Y=Y,meta=meta,prior=prior,DSC_REPLICATE=DSC_REPLICATE), 'issue_9_EE.rds')
attach(readRDS('/home/gaow/tmp/05-Dec-2019/issue_9_EE.rds'))
resid_Y <- compute_cov_diag(Y)
This data-set has also been saved here. To reproduce my results, download the data-set from the link above, use `attach(readRDS(...))`
to attach the data to the R workspace, and continue with the commands below.
Use EE model and set $L=1$,
m_init = mvsusieR::create_mash_prior(mixture_prior=list(matrices=prior$xUlist, weights=prior$pi), alpha=0, max_mixture_len=-1)
r1 = mvsusieR::mvsusie(X, Y, L=1, prior_variance=m_init, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=F, precompute_covariance = F)
To plot and compare results,
susieR::susie_plot(r1, y='PIP', main = 'EE L=1', xlab = 'SNP positions', add_legend = T, b=true_pos)
So with $L=1$ the first SNP was captured, but with low PIP. This is apparently also the top BF. Now try $L=2$
to verify the problem:
r1 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=m_init, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=F, precompute_covariance = F)
susieR::susie_plot(r1, y='PIP', main = 'EE L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
But the logBF suggests the 2nd effect should not be considered significant:
r1$lbf
Here the MASH prior is very simple:
str(m_init$prior_variance)
So the prior variance is a single matrix:
U = m_init$prior_variance$xUlist[[2]]
dim(U)
Now fit it with the multivariate Bayesian regression routine, which does not involve MASH. In this case the result should be identical to using the MASH EE model:
r2 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=U, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=F)
It converged reasonably fast. The result remains the same:
susieR::susie_plot(r2, y='PIP', main = 'Bayesian Multivariate regression L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
Next, check the precomputed common sbhat, first with centered and scaled data (the default):
m_init$precompute_cov_matrices(mvsusieR:::DenseData$new(X,Y),resid_Y)
m_init$precomputed$common_sbhat
For data without centering and scaling,
m_init$precompute_cov_matrices(mvsusieR:::DenseData$new(X,Y,center=F,scale=F),resid_Y)
m_init$precomputed$common_sbhat
m_init$remove_precomputed()
And not scaling does change the result a bit, though not completely; see below (why?):
r3 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=m_init, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=F, standardize=F)
susieR::susie_plot(r3, y='PIP', main = 'MASH EE not scale L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
r3 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=m_init, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=F, intercept=F)
susieR::susie_plot(r3, y='PIP', main = 'MASH EE no intercept L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
r3 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=m_init, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=F, intercept=F, standardize=F)
susieR::susie_plot(r3, y='PIP', main = 'MASH EE no intercept no scale L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
Using the EZ model, however, gives a somewhat better result:
m_init = mvsusieR::create_mash_prior(mixture_prior=list(matrices=prior$xUlist, weights=prior$pi), alpha=1, max_mixture_len=-1)
r4 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=m_init, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=F, precompute_covariance = F)
susieR::susie_plot(r4, y='PIP', main = 'EZ L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
Now compare the result of one fit with `BayesianMultivariateRegression` vs `MashRegression`:
data = mvsusieR:::DenseData$new(X,Y)
A = mvsusieR:::BayesianMultivariateRegression$new(ncol(X), resid_Y, prior$xUlist[[1]])
A$fit(data,save_summary_stats=T)
m_init = mvsusieR:::MashInitializer$new(NULL, NULL, xUlist=prior$xUlist, prior_weights=prior$pi, null_weight=prior$null_weight, alpha=0, top_mixtures=-1)
B = mvsusieR:::MashRegression$new(ncol(X), resid_Y, m_init)
B$fit(data, save_summary_stats = T)
testthat::expect_equal(A$bhat,B$bhat)
testthat::expect_equal(A$sbhat, B$sbhat)
testthat::expect_equal(A$posterior_b1, B$posterior_b1)
testthat::expect_equal(as.vector(A$posterior_b2), as.vector(B$posterior_b2))
testthat::expect_equal(A$lbf, B$lbf)
plot(A$lbf, B$lbf,xlab='BayesianMultivariate_lbf', ylab='EE_lbf')
All seems fine, and some lbf are very big; but there is nothing unusual about the corresponding summary statistics:
head(B$bhat[which(B$lbf>800),])
head(B$sbhat[which(B$lbf>800),])
Using EZ I get this,
m_init = mvsusieR:::MashInitializer$new(NULL, NULL, xUlist=prior$xUlist, prior_weights=prior$pi, null_weight=prior$null_weight, alpha=1, top_mixtures=-1)
B = mvsusieR:::MashRegression$new(ncol(X), resid_Y, m_init)
B$fit(data, save_summary_stats = T)
plot(A$lbf, B$lbf,xlab='BayesianMultivariate_lbf', ylab='EZ_lbf')
The trends are the same; only the lbf scale is smaller.
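One quick way to quantify that scale difference (an illustrative check of my own, assuming `A` and `B` from the fits above) is the least-squares slope of one set of lbf against the other:

```r
# Slope of EZ lbf against BayesianMultivariateRegression lbf;
# a slope below 1 indicates the EZ lbf scale is compressed
coef(lm(B$lbf ~ A$lbf))
```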
Now estimate the prior variance scalar with the simple method in the Bayesian multivariate regression:
r2 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=U, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=T, estimate_prior_method='simple')
susieR::susie_plot(r2, y='PIP', main = 'Bayesian Multivariate regression estimate prior, L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
Now it seems to work! So this is a problem with model mis-specification.
Update: I also implemented the simple
approach in MashRegression, by checking all lbf
in the MASH multivariate regression and setting the prior to zero if they are all smaller than zero.
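In pseudocode, the check just described amounts to the following (an illustrative sketch of my own, not the actual package code; the function name is hypothetical):

```r
# Sketch of the 'simple' prior-variance check: if no variable has a
# positive log Bayes factor, the single effect is deemed unsupported
# and its prior variance is zeroed out for that iteration.
simple_prior_check <- function(lbf) {
  if (all(lbf <= 0)) 0 else 1  # multiplier applied to the prior scale
}
```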
m_init = mvsusieR::create_mash_prior(mixture_prior=list(matrices=prior$xUlist, weights=prior$pi), alpha=0, max_mixture_len=-1)
r2 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=m_init, residual_variance=resid_Y, compute_objective=F, estimate_residual_variance=F, estimate_prior_variance=T, estimate_prior_method='simple')
susieR::susie_plot(r2, y='PIP', main = 'Mash Regression estimate prior, L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
Add a 99% null weight to see if that is enough to bring down the 2nd false-positive effect:
m_init = mvsusieR::create_mash_prior(mixture_prior=list(matrices=prior$xUlist, weights=prior$pi),
null_weight = 0.99, alpha=0, max_mixture_len=-1)
r2 = mvsusieR::mvsusie(X, Y, L=2, prior_variance=m_init, residual_variance=resid_Y,
compute_objective=F, estimate_residual_variance=F,
estimate_prior_variance=F)
susieR::susie_plot(r2, y='PIP', main = 'EE with null weight = 0.99, L=2', xlab = 'SNP positions', add_legend = T, b=true_pos)
It does not help.
m_init$prior_variance