Multivariate Bayesian variable selection regression

mashr R vs. C++ benchmark

We've recently implemented a C++ version of mashr that performs very well on small toy data. How does it scale with data size? This notebook benchmarks the speed of both implementations on small to larger data sets.

Before getting to the experiments, let's recap the problem and its expected computational requirements. We fit the mash model to a data set of $J$ effects and $R$ conditions, using $K$ prior covariances, each scaled over $L$ grid points. A straightforward implementation loops over all combinations, generating $LJK$ sub-problems. Each sub-problem involves a multivariate Gaussian density calculation, which requires inverting (or factorizing) an $R \times R$ covariance matrix. The computational cost should therefore scale as $O(LJKR^3)$.
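The cost argument above can be sketched with a toy computation (a numpy stand-in, not the mashr implementation): for each effect and each prior covariance $U$, the marginal log-likelihood needs the log-determinant and a linear solve of the $R \times R$ matrix $V = U + S$, each roughly $O(R^3)$. All names and sizes below are illustrative.

```python
import numpy as np

# Toy sketch (NOT the mashr code): the naive loop evaluates a multivariate
# normal log-density for each of J effects x (K*L) priors, and each
# evaluation factorizes an R x R covariance, roughly O(R^3) work.
rng = np.random.default_rng(0)
J, R, KL = 4, 5, 6                      # tiny stand-ins for J, R, and K*L
bhat = rng.normal(size=(J, R))          # observed effects
S = np.eye(R)                           # toy standard-error covariance
priors = [g * np.eye(R) for g in range(1, KL + 1)]  # toy prior covariances

loglik = np.empty((J, KL))
for p, U in enumerate(priors):
    V = U + S                           # marginal covariance of bhat
    for j in range(J):                  # L*J*K sub-problems in total
        _, logdet = np.linalg.slogdet(V)              # ~O(R^3)
        quad = bhat[j] @ np.linalg.solve(V, bhat[j])  # ~O(R^3)
        loglik[j, p] = -0.5 * (R * np.log(2 * np.pi) + logdet + quad)
```

In practice the $R \times R$ factorization can be shared across the $J$ effects for a fixed prior, which is one place an optimized implementation saves time over the naive loop.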

The benchmark

Load data

In [1]:
library(mashr)
data <- get(load("../../data/cor_mash.rda"))

Get test set

In [2]:
J = 100
R = 53
betahat <- data$betahat[1:R, 1:J]
sebetahat <- data$sebetahat[1:R, 1:J]
mash_data <- set_mash_data(t(betahat), t(sebetahat))
Ulist <- expand_cov(cov_canonical(mash_data), c(1:20)/10)
LK <- length(Ulist)
print(LK)
w = matrix(1, J, LK)
w = w / LK
[1] 1161

Likelihood benchmark

In [ ]:
## Compute the likelihood matrix using the R implementation.
cat(sprintf("Computing %d x %d likelihood matrix for %d conditions using R version.\n", J, LK, R))
out.time <- system.time(out1 <- calc_lik_matrix(mash_data, Ulist, log = TRUE, algorithm.version = "R"))
cat(sprintf("Likelihood calculations took %0.2f seconds.\n", out.time["elapsed"]))
## Compute the likelihood matrix using the Rcpp implementation.
cat(sprintf("Computing %d x %d likelihood matrix for %d conditions using Rcpp version.\n", J, LK, R))
out.time <- system.time(out2 <- calc_lik_matrix(mash_data, Ulist, log = TRUE, algorithm.version = "Rcpp"))
cat(sprintf("Likelihood calculations took %0.2f seconds.\n", out.time["elapsed"]))
Computing 100 x 1161 likelihood matrix for 53 conditions using R version.

Posterior benchmark

In [3]:
## Compute posterior using the R implementation.
cat(sprintf("Computing posterior for %d effects under %d conditions using R version.\n", J, R))
out.time <- system.time(out1 <- compute_posterior_matrices(mash_data, Ulist, w,
                                    algorithm.version = "R"))
cat(sprintf("Calculations took %0.2f seconds.\n", out.time["elapsed"]))
## Compute posterior using the Rcpp implementation.
cat(sprintf("Computing posterior matrix for %d effects under %d conditions using Rcpp version.\n", J, R))
out.time <- system.time(out2 <- compute_posterior_matrices(mash_data, Ulist, w,
                                    algorithm.version = "Rcpp"))
cat(sprintf("Calculations took %0.2f seconds.\n", out.time["elapsed"]))
Computing posterior for 100 effects under 53 conditions using R version.
Calculations took 283.41 seconds.
Computing posterior matrix for 100 effects under 53 conditions using Rcpp version.
Calculations took 187.46 seconds.

Result

I ran the benchmark a few times, 1) for fixed $J$ and 2) for fixed $R$. The results are tabulated below: P is the number of prior covariance matrices, TR is elapsed time for R (seconds), and TC is elapsed time for C++ (seconds).

Likelihood

In [71]:
import pandas as pd
lik = pd.read_csv("../../data/mash_benchmark_loglik.txt")
lik
Out[71]:
J R P TR TC
0 50 5 201 1.92 0.02
1 100 5 201 3.40 0.05
2 500 5 201 16.28 0.23
3 1000 5 201 32.92 0.39
4 100 10 301 5.44 0.15
5 500 10 301 25.65 0.70
6 1000 10 301 52.34 1.44
7 50 10 301 2.81 0.08
8 50 20 501 5.30 0.46
9 50 30 701 9.08 2.21
10 50 40 901 15.59 5.10
11 50 53 1161 28.61 13.50
12 100 53 1161 54.55 18.90
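One way to read the likelihood table is via the R-to-C++ speedup TR/TC. Rebuilding the printed values inline (so the snippet runs without the benchmark file) shows the C++ advantage is largest for small $R$ and narrows substantially by $R = 53$:

```python
import pandas as pd

# Likelihood timings copied from the table above (TR, TC in seconds).
lik = pd.DataFrame({
    "J":  [50, 100, 500, 1000, 100, 500, 1000, 50, 50, 50, 50, 50, 100],
    "R":  [5, 5, 5, 5, 10, 10, 10, 10, 20, 30, 40, 53, 53],
    "P":  [201, 201, 201, 201, 301, 301, 301, 301, 501, 701, 901, 1161, 1161],
    "TR": [1.92, 3.40, 16.28, 32.92, 5.44, 25.65, 52.34, 2.81,
           5.30, 9.08, 15.59, 28.61, 54.55],
    "TC": [0.02, 0.05, 0.23, 0.39, 0.15, 0.70, 1.44, 0.08,
           0.46, 2.21, 5.10, 13.50, 18.90],
})
# Speedup of the C++ version over the R version.
lik["speedup"] = lik["TR"] / lik["TC"]
print(lik.sort_values(["R", "J"]))
```

The speedup runs from roughly 90x at $R = 5$ down to about 2-3x at $R = 53$, consistent with both versions being dominated by the same $O(R^3)$ linear algebra at large $R$.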

Posterior

In [73]:
post = pd.read_csv("../../data/mash_benchmark_posterior.txt")
post
Out[73]:
J R P TR TC
0 50 5 201 1.72 0.05
1 100 5 201 3.49 0.10
2 500 5 201 14.72 0.41
3 1000 5 201 31.20 0.90
4 100 10 301 5.56 0.71
5 500 10 301 27.72 2.45
6 1000 10 301 57.83 5.16
7 50 10 301 3.11 0.27
8 50 20 501 8.68 2.59
9 50 30 701 24.14 10.53
10 50 40 901 57.69 34.68
11 50 53 1161 146.48 102.04
12 100 53 1161 286.90 194.22
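As a rough sanity check on the scaling, we can fit the growth of the R-version posterior time against $R$ at fixed $J = 50$, with values copied from the table above. A log-log slope is only a crude summary, and $P$ also grows with $R$ in these runs, so the fit conflates the two effects:

```python
import numpy as np

# R-version posterior timings at fixed J = 50, copied from the table above
# (R = number of conditions, t = elapsed time in seconds).
R = np.array([5, 10, 20, 30, 40, 53])
t = np.array([1.72, 3.11, 8.68, 24.14, 57.69, 146.48])

# Fit t ~ R^slope on the log-log scale. Caveat: P (number of priors) also
# grows with R across these runs, so the slope mixes both effects.
slope, intercept = np.polyfit(np.log(R), np.log(t), 1)
print(f"fitted exponent: {slope:.2f}")
```

The fitted exponent comes out well below the naive cubic bound, suggesting the implementation amortizes some of the per-combination $R \times R$ work.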

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago