Multivariate Bayesian variable selection regression

Investigating MASH logBF computation under the EE model

This is a continuation of a previous investigation.

Load package and previous results

The previous result can be downloaded here. good_lbf in this dataset stores the lbf from mvsusieR:::BayesianMultivariateRegression.

In [1]:
devtools::load_all('/home/gaow/GIT/software/mashr')
Loading mashr
Loading required package: ashr
In [2]:
setwd('/home/gaow/tmp/07-Dec-2019')
dat = readRDS('issue_9_EE_sumstats.rds')

Set data

In [3]:
mash_data = mash_set_data(dat$bhat, dat$sbhat, V=dat$V, alpha=0)
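As a side note, alpha=0 here selects the EE (exchangeable effects) model, where the prior covariances apply to the effects directly; alpha=1 would select the EZ (exchangeable z-scores) model instead. A hypothetical EZ setup, for contrast, would be:

mash_data_ez = mash_set_data(dat$bhat, dat$sbhat, V=dat$V, alpha=1)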
In [4]:
names(mash_data)
  1. 'Bhat'
  2. 'Shat'
  3. 'Shat_alpha'
  4. 'V'
  5. 'commonV'
  6. 'alpha'

Compare loglik

In [5]:
r1 = calc_relative_lik_matrix(mash_data,dat$xUlist, algorithm.version = 'Rcpp')
In [6]:
r2= calc_relative_lik_matrix(mash_data,dat$xUlist, algorithm.version = 'R')
In [7]:
names(r1)
  1. 'loglik_matrix'
  2. 'lfactors'
In [8]:
expect_equal(r1$lfactors, r2$lfactors)
In [9]:
expect_equal(r1$loglik_matrix, r2$loglik_matrix)

The loglik computations agree between the R and Rcpp versions.

Compare posterior

In [10]:
posterior_weights <- compute_posterior_weights(dat$pi_s,exp(r1$loglik_matrix))
In [11]:
p1 <- compute_posterior_matrices(mash_data, dat$xUlist, posterior_weights, algorithm.version="Rcpp")
In [12]:
p2 <- compute_posterior_matrices(mash_data, dat$xUlist, posterior_weights, algorithm.version="R")
In [13]:
names(p1)
  1. 'PosteriorMean'
  2. 'PosteriorSD'
  3. 'lfdr'
  4. 'NegativeProb'
  5. 'lfsr'
In [14]:
expect_equal(p1$PosteriorMean, p2$PosteriorMean)
In [15]:
expect_equal(p1$PosteriorSD, p2$PosteriorSD)
In [16]:
expect_equal(p1$lfdr, p2$lfdr)
In [17]:
expect_equal(p1$NegativeProb, p2$NegativeProb)
In [18]:
expect_equal(p1$lfsr, p2$lfsr)

Posterior calculations are also consistent between the R and Rcpp versions.

LBF calculations

In [19]:
null_loglik = compute_null_loglik_from_matrix(r1,mash_data$Shat_alpha)
alt_loglik = compute_alt_loglik_from_matrix_and_pi(dat$pi_s,r1,mash_data$Shat_alpha)
In [20]:
lbf = alt_loglik-null_loglik
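To spell out what this difference is: for each effect j,

$$\log \mathrm{BF}_j = \log p(\hat b_j \mid \text{alternative}) - \log p(\hat b_j \mid \text{null}),$$

where, judging from the helper names (an assumption, not verified against the source), the null log-likelihood corresponds to the null component of xUlist and the alternative log-likelihood mixes the components of xUlist with weights pi_s.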
In [21]:
plot(dat$good_lbf, lbf, xlab="BayesianMultivariate lbf", ylab = "EE lbf")
In [22]:
plot(dat$bad_lbf, lbf, xlab="EE lbf (mvsusieR)", ylab = "EE lbf (mashr)")

The lbf computed here indeed agrees with what mvsusieR:::MashRegression gives, but it is not consistent with what mvsusieR:::BayesianMultivariateRegression gives.

Let's see what the loglik matrix tells us about these zero BFs:

In [23]:
(r1$loglik_matrix + r1$lfactors)[which(lbf>1000),]
A matrix: 23 × 2 of type dbl
-1579.044  16.10435
-1687.964  16.01919
-1579.044  16.10435
-1687.964  16.01919
-1687.964  16.01919
-1618.209  16.08220
-1532.376  16.20638
-1661.609  16.06732
-1499.532  16.22189
-1516.505  16.20907
-1584.056  16.14898
-1368.867  16.36046
-1538.453  16.22730
-1538.453  16.22730
-1358.852  16.36753
-1423.749  16.32169
-1487.963  16.26542
-1423.749  16.32169
-1487.963  16.26542
-1455.933  16.31311
-1357.176  16.39357
-1369.450  16.36913
-1238.520  16.52030
In [24]:
head((r1$loglik_matrix + r1$lfactors)[which(abs(lbf)<50),])
A matrix: 6 × 2 of type dbl
48.75543  17.76282
31.86381  17.76036
28.63014  17.67223
37.62594  17.69075
44.40809  17.70641
52.48401  17.75334

It seems the loglik for the null was wrong from the beginning. The reason mvsusieR:::BayesianMultivariateRegression does not have this issue is that it computes lbf in a way that avoids these numerical problems, rather than using dmvnorm().

And indeed, if I uncomment this line to use dmvnorm() to compute lbf, the result agrees with mvsusieR:::MashRegression and is likewise wrong for the example on this page.
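To illustrate the numerical issue, here is a minimal sketch using mvtnorm::dmvnorm (an assumption for illustration; the dmvnorm that mashr calls internally may differ). For an observation far in the tails, the density underflows to zero on the natural scale, so any BF formed as a ratio of such densities is garbage, while the log density remains finite:

library(mvtnorm)
b = c(60, 60)  # hypothetical extreme effect estimate
S = diag(2)    # hypothetical residual covariance
dmvnorm(b, sigma = S)             # 0: underflows on the natural scale
dmvnorm(b, sigma = S, log = TRUE) # about -3601.8: finite on the log scale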

To reproduce the "good" lbf with summary stats

In [25]:
U = dat$xUlist[[2]]
S = lapply(1:nrow(dat$sbhat), function(j) diag(dat$sbhat[j,]) %*% dat$V %*% diag(dat$sbhat[j,]))
In [32]:
S_inv = lapply(1:length(S), function(j) solve(S[[j]]))
post_cov = lapply(1:length(S), function(j) U %*% solve(diag(nrow(U)) + S_inv[[j]] %*% U))
good_lbf = log(sapply(1:length(S), function(j) sqrt(det(S[[j]])/det(S[[j]]+U))*exp(0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])))
good_lbf[which(is.infinite(good_lbf))] = 0
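For reference, this cell implements the standard multivariate Bayes factor for the model $\hat b_j \mid b_j \sim N(b_j, S_j)$ with prior $b_j \sim N(0, U)$ and $S_j = \mathrm{diag}(s_j)\,V\,\mathrm{diag}(s_j)$:

$$\mathrm{BF}_j = \frac{N(\hat b_j;\, 0,\, S_j + U)}{N(\hat b_j;\, 0,\, S_j)} = \sqrt{\frac{\det S_j}{\det (S_j + U)}}\, \exp\!\left(\tfrac12\, \hat b_j^\top S_j^{-1} \tilde U_j S_j^{-1} \hat b_j\right), \qquad \tilde U_j = U\,(I + S_j^{-1} U)^{-1},$$

where $\tilde U_j$ is the posterior covariance computed above as post_cov; the quadratic form follows from the Woodbury identity $S_j^{-1} - (S_j+U)^{-1} = S_j^{-1}\tilde U_j S_j^{-1}$.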
In [33]:
plot(dat$good_lbf, good_lbf)

Conclusion

We need to solve the loglik calculation issue in mash: without fixing it, the posterior weights will be wrong, and that will undermine the MASH model. I'll try to come up with a minimal example vignette to post on the mashr GitHub.
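For what it's worth, here is a generic sketch (my own naming, not mashr's implementation) of computing mixture posterior weights entirely on the log scale, so that a fixed loglik matrix would propagate to the weights without further exp() overflow:

stable_posterior_weights = function(pi_s, loglik_matrix) {
  # w_jk is proportional to pi_k * L_jk, normalized over k for each effect j
  log_w = sweep(loglik_matrix, 2, log(pi_s), "+")
  # log-sum-exp over each row gives the log normalizing constant
  log_norm = apply(log_w, 1, function(x) max(x) + log(sum(exp(x - max(x)))))
  exp(log_w - log_norm)  # rows sum to 1
}

Note that any per-row constant such as lfactors cancels in the row-wise normalization, so passing r1$loglik_matrix directly would give the same weights.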

Update

Yuxin pointed out to me, after she looked at the data, that if I don't set the infinite values to zero but instead work on the log scale, I get the same result as in the EE model. Therefore I changed the mvsusieR code from:

good_lbf = log(sapply(1:length(S), function(j) sqrt(det(S[[j]])/det(S[[j]]+U))*exp(0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])))

to

good_lbf = sapply(1:length(S), function(j) 0.5 * (log(det(S[[j]])) - log(det(S[[j]]+U))) + 0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])
In [34]:
good_lbf = sapply(1:length(S), function(j) 0.5 * (log(det(S[[j]])) - log(det(S[[j]]+U))) + 0.5*t(dat$bhat[j,])%*%S_inv[[j]]%*%post_cov[[j]]%*%S_inv[[j]]%*%dat$bhat[j,])
plot(dat$good_lbf, good_lbf)
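A possible further refinement along the same lines (my own suggestion, not what mvsusieR does): R's determinant() returns the log-determinant directly, which avoids overflow inside det() itself for larger or badly scaled matrices:

good_lbf_v2 = sapply(1:length(S), function(j) {
  # log-determinants computed directly on the log scale
  ld_S  = as.numeric(determinant(S[[j]], logarithm = TRUE)$modulus)
  ld_SU = as.numeric(determinant(S[[j]] + U, logarithm = TRUE)$modulus)
  0.5 * (ld_S - ld_SU) + 0.5 * t(dat$bhat[j,]) %*% S_inv[[j]] %*% post_cov[[j]] %*% S_inv[[j]] %*% dat$bhat[j,]
})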

But after the change, those infinite false signals that were previously removed now come back and contribute bad lbf values, leading to false positives in the results.


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago