This is a continuation of a previous investigation. The previous result can be downloaded here. good_lbf in this dataset saves the lbf from mvsusieR:::BayesianMultivariateRegression.
devtools::load_all('/home/gaow/GIT/software/mashr')
setwd('/home/gaow/tmp/07-Dec-2019')
dat = readRDS('issue_9_EE_sumstats.rds')
mash_data = mash_set_data(dat$bhat, dat$sbhat, V=dat$V, alpha=0)
names(mash_data)
r1 = calc_relative_lik_matrix(mash_data,dat$xUlist, algorithm.version = 'Rcpp')
r2 = calc_relative_lik_matrix(mash_data, dat$xUlist, algorithm.version = 'R')
names(r1)
library(testthat)
expect_equal(r1$lfactors, r2$lfactors)
expect_equal(r1$loglik_matrix, r2$loglik_matrix)
Loglik computations are okay.
posterior_weights <- compute_posterior_weights(dat$pi_s,exp(r1$loglik_matrix))
p1 <- compute_posterior_matrices(mash_data, dat$xUlist, posterior_weights, algorithm.version="Rcpp")
p2 <- compute_posterior_matrices(mash_data, dat$xUlist, posterior_weights, algorithm.version="R")
names(p1)
expect_equal(p1$PosteriorMean, p2$PosteriorMean)
expect_equal(p1$PosteriorSD, p2$PosteriorSD)
expect_equal(p1$lfdr, p2$lfdr)
expect_equal(p1$NegativeProb, p2$NegativeProb)
expect_equal(p1$lfsr, p2$lfsr)
Posterior calculations are also consistent between the R and Rcpp versions.
null_loglik = compute_null_loglik_from_matrix(r1,mash_data$Shat_alpha)
alt_loglik = compute_alt_loglik_from_matrix_and_pi(dat$pi_s,r1,mash_data$Shat_alpha)
lbf = alt_loglik-null_loglik
plot(dat$good_lbf, lbf, xlab="BayesianMultivariate lbf", ylab = "EE lbf")
plot(dat$bad_lbf, lbf, xlab="EE lbf (mvsusieR)", ylab = "EE lbf (mashr)")
The lbf indeed agrees with what mvsusieR:::MashRegression gives, but is not consistent with what BayesianMultivariateRegression says.
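To quantify the two comparisons above (my addition; this assumes dat$bad_lbf and dat$good_lbf are plain numeric vectors of the same length as lbf):
max(abs(lbf - dat$bad_lbf))   # vs mvsusieR:::MashRegression (EE): should be near zero
max(abs(lbf - dat$good_lbf))  # vs BayesianMultivariateRegression: large discrepancies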
Let's see what the loglik matrix tells us about these zero BFs:
# absolute logliks (relative loglik matrix plus lfactors) for effects with huge EE lbf
(r1$loglik_matrix + r1$lfactors)[which(lbf>1000),]
# the same quantity for effects whose EE lbf is moderate
head((r1$loglik_matrix + r1$lfactors)[which(abs(lbf)<50),])
It seems the loglik for the null was wrong from the beginning. The reason mvsusieR:::BayesianMultivariateRegression does not have this issue is that it avoids the numerical problem by computing the lbf directly, rather than using dmvnorm().
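For reference, the direct formula being referred to, as I read it off the code below (so the notation is an assumption: b̂_j is row j of bhat, S_j the per-effect error covariance built from sbhat and V, U the prior covariance), written on the log scale is

$$\log \mathrm{BF}_j \;=\; \tfrac{1}{2}\left[\log\det S_j - \log\det\left(S_j + U\right)\right] \;+\; \tfrac{1}{2}\,\hat b_j^\top S_j^{-1}\, U\left(I + S_j^{-1} U\right)^{-1} S_j^{-1}\,\hat b_j .$$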
And indeed, if I uncomment this line to use dmvnorm() to compute the lbf, the result agrees with mvsusieR:::MashRegression, and is also wrong for the example on this page.
U = dat$xUlist[[2]]
# per-effect error covariance S_j = diag(s_j) V diag(s_j), and its inverse
S = lapply(1:nrow(dat$sbhat), function(j) diag(dat$sbhat[j,]) %*% dat$V %*% diag(dat$sbhat[j,]))
S_inv = lapply(1:length(S), function(j) solve(S[[j]]))
# posterior covariance U (I + S_j^{-1} U)^{-1}
post_cov = lapply(1:length(S), function(j) U %*% solve(diag(nrow(U)) + S_inv[[j]] %*% U))
# lbf computed by exponentiating the quadratic form and then taking the log,
# which overflows to Inf when the quadratic form is large
good_lbf = log(sapply(1:length(S), function(j) sqrt(det(S[[j]])/det(S[[j]]+U))*exp(0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])))
# set overflowed (infinite) lbf values to zero
good_lbf[which(is.infinite(good_lbf))] = 0
plot(dat$good_lbf, good_lbf)
We need to solve the loglik calculation issue in mash, because without fixing that the posterior weights will be wrong, and that will undermine the MASH model. I'll try to come up with a minimal example vignette to post on the mashr GitHub.
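To make that dependence explicit, here is a minimal sketch (my addition, not mashr's internal code, so the numerical equivalence is an assumption) of how the posterior weights computed above follow from the loglik matrix; any error in the logliks therefore propagates directly into the weights:
# weight of component p for effect j is proportional to pi_s[p] * exp(loglik_matrix[j, p])
lik = exp(r1$loglik_matrix)       # J x P matrix of relative likelihoods
w = t(t(lik) * dat$pi_s)          # scale column p by its prior weight pi_s[p]
w = w / rowSums(w)                # normalize over components for each effect
max(abs(w - posterior_weights))   # should be essentially zero if the sketch matches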
Yuxin pointed out to me, after she looked at the data, that if I don't set the infinities to zero but instead work on the log scale, I get the same result as in the EE model. Therefore I changed the mvsusieR code from:
good_lbf = log(sapply(1:length(S), function(j) sqrt(det(S[[j]])/det(S[[j]]+U))*exp(0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])))
to
good_lbf = sapply(1:length(S), function(j) 0.5 * (log(det(S[[j]])) - log(det(S[[j]]+U))) + 0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])
good_lbf = sapply(1:length(S), function(j) 0.5 * (log(det(S[[j]])) - log(det(S[[j]]+U))) + 0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])
plot(dat$good_lbf, good_lbf)
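As a cross-check (my addition, not part of the original analysis), the log-scale formula should agree with a dmvnorm()-based computation done entirely on the log scale, since the lbf is just log N(bhat_j; 0, S_j + U) minus log N(bhat_j; 0, S_j):
# compare the closed-form log-scale lbf against mvtnorm::dmvnorm() with log = TRUE
library(mvtnorm)
dmvnorm_lbf = sapply(1:length(S), function(j)
  dmvnorm(dat$bhat[j,], sigma = S[[j]] + U, log = TRUE) -
  dmvnorm(dat$bhat[j,], sigma = S[[j]], log = TRUE))
max(abs(dmvnorm_lbf - good_lbf))  # should be small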
But after the change, those infinite false signals that were previously removed (by being set to zero) now come back and contribute to a bad lbf, leading to false positives in the results.
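To see this concretely, here is a hedged check (my addition) that recomputes the lbf with the old exp()-then-log() code path and inspects the entries that overflow to Inf; these are exactly the entries that were previously zeroed out and that now carry very large lbf values:
# redo the overflow-prone computation from above
old_lbf = log(sapply(1:length(S), function(j)
  sqrt(det(S[[j]])/det(S[[j]]+U)) *
    exp(0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])))
overflow = which(is.infinite(old_lbf))
summary(good_lbf[overflow])  # large lbf values that the zeroing step used to remove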