This is a continuation of a previous investigation.
Previous result can be downloaded here. good_lbf in this dataset stores the lbf computed by mvsusieR:::BayesianMultivariateRegression.
devtools::load_all('/home/gaow/GIT/software/mashr')
library(testthat) # for the expect_equal() checks below
setwd('/home/gaow/tmp/07-Dec-2019')
dat = readRDS('issue_9_EE_sumstats.rds')
mash_data = mash_set_data(dat$bhat, dat$sbhat, V=dat$V, alpha=0) # alpha = 0 corresponds to the EE model
names(mash_data)
r1 = calc_relative_lik_matrix(mash_data, dat$xUlist, algorithm.version = 'Rcpp')
r2 = calc_relative_lik_matrix(mash_data, dat$xUlist, algorithm.version = 'R')
names(r1)
expect_equal(r1$lfactors, r2$lfactors)
expect_equal(r1$loglik_matrix, r2$loglik_matrix)
The log-likelihood computations agree between the R and Rcpp versions.
posterior_weights <- compute_posterior_weights(dat$pi_s,exp(r1$loglik_matrix))
p1 <- compute_posterior_matrices(mash_data, dat$xUlist, posterior_weights, algorithm.version="Rcpp")
p2 <- compute_posterior_matrices(mash_data, dat$xUlist, posterior_weights, algorithm.version="R")
names(p1)
expect_equal(p1$PosteriorMean, p2$PosteriorMean)
expect_equal(p1$PosteriorSD, p2$PosteriorSD)
expect_equal(p1$lfdr, p2$lfdr)
expect_equal(p1$NegativeProb, p2$NegativeProb)
expect_equal(p1$lfsr, p2$lfsr)
The posterior calculations are also consistent between the R and Rcpp versions.
null_loglik = compute_null_loglik_from_matrix(r1,mash_data$Shat_alpha)
alt_loglik = compute_alt_loglik_from_matrix_and_pi(dat$pi_s,r1,mash_data$Shat_alpha)
lbf = alt_loglik-null_loglik
plot(dat$good_lbf, lbf, xlab="BayesianMultivariate lbf", ylab = "EE lbf")
plot(dat$bad_lbf, lbf, xlab="EE lbf (mvsusieR)", ylab = "EE lbf (mashr)")
The lbf indeed agrees with what mvsusieR:::MashRegression gives, but it is not consistent with what BayesianMultivariateRegression gives.
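To put a number on what the plots show (a quick check of my own; assuming lbf, dat$bad_lbf and dat$good_lbf are ordered over the same effects):
max(abs(lbf - dat$bad_lbf)) # expected to be small if the mashr EE lbf matches mvsusieR:::MashRegression
cor(lbf, dat$good_lbf)      # how well it tracks BayesianMultivariateRegression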
Let's see what the loglik matrix tells us about these zero BFs:
(r1$loglik_matrix + r1$lfactors)[which(lbf>1000),]
head((r1$loglik_matrix + r1$lfactors)[which(abs(lbf)<50),])
It seems the loglik for the null was wrong from the beginning. The reason mvsusieR:::BayesianMultivariateRegression does not have this issue is that it avoids the numerical problems by computing the lbf with its own closed-form expression instead of using dmvnorm().
And indeed, if I uncomment this line to use dmvnorm() for the lbf, the result agrees with mvsusieR:::MashRegression and is likewise wrong for the example on this page.
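To illustrate the kind of floating-point behavior involved, here is a toy sketch with made-up numbers using mvtnorm::dmvnorm (my own illustration, not the actual mvsusieR code):
library(mvtnorm)
b = rep(40, 5)                     # a strong signal across 5 conditions
S0 = diag(5)                       # toy null covariance
dmvnorm(b, sigma = S0)             # raw density underflows to 0
log(dmvnorm(b, sigma = S0))        # -Inf: the null loglik breaks on the raw scale
dmvnorm(b, sigma = S0, log = TRUE) # finite (about -4004.6): on the log scale it is fine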
U = dat$xUlist[[2]]
# Per-condition residual covariance S_j = diag(s_j) V diag(s_j)
S = lapply(1:nrow(dat$sbhat), function(j) diag(dat$sbhat[j,]) %*% dat$V %*% diag(dat$sbhat[j,]))
S_inv = lapply(1:length(S), function(j) solve(S[[j]]))
# Posterior covariance U (I + S_j^{-1} U)^{-1}
post_cov = lapply(1:length(S), function(j) U %*% solve(diag(nrow(U)) + S_inv[[j]] %*% U))
# Closed-form lbf, computed on the original scale (exp() first, then log())
good_lbf = log(sapply(1:length(S), function(j) sqrt(det(S[[j]])/det(S[[j]]+U))*exp(0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])))
# Set infinite lbf values (exp() overflow) to zero, reproducing the BayesianMultivariateRegression behavior
good_lbf[which(is.infinite(good_lbf))] = 0
plot(dat$good_lbf, good_lbf)
We need to solve the loglik calculation issue in mash, because without fixing it the posterior weights will be wrong, and that will undermine the MASH model. I'll try to come up with a minimal example vignette to post on the mashr GitHub.
Yuxin pointed out to me, after she looked at the data, that if I don't set the infinite values to zero but instead work on the log scale, I get the same result as the EE model. Therefore I changed the mvsusieR code from:
good_lbf = log(sapply(1:length(S), function(j) sqrt(det(S[[j]])/det(S[[j]]+U))*exp(0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])))
to
good_lbf = sapply(1:length(S), function(j) 0.5 * (log(det(S[[j]])) - log(det(S[[j]]+U))) + 0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])
plot(dat$good_lbf, good_lbf)
But after the change, those infinite false signals that were once removed now come back as large finite values and contribute to bad lbf, leading to false positives in the results.
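Here is a toy illustration (my own numbers, not from this dataset) of why the two formulations differ only in how they behave numerically: with a large quadratic form the original expression overflows to Inf, which the old code then zeroed out, while the log-scale expression returns the same quantity as a large finite number, so the signal is no longer silently dropped.
q = 1600                             # a large quadratic form t(b) S^-1 post_cov S^-1 b (toy value)
detS = 1; detSU = 4                  # toy determinants of S and S + U
log(sqrt(detS/detSU) * exp(0.5*q))   # Inf: exp(800) overflows before log() is taken
0.5*(log(detS) - log(detSU)) + 0.5*q # about 799.3: same quantity, kept finite on the log scale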