We've recently implemented a C++ version of mashr
which seems to perform very well on small-scale toy data. How does it scale with data size? This notebook benchmarks the speed of both implementations on small to larger data sets.
Before getting to the experiments, let's recap the problem and its expected computational requirements. We are fitting the mash
model to a data set of $J$ effects and $R$ conditions, with $K$ prior covariances each expanded over $L$ grid values. A straightforward implementation loops over all combinations, generating $LJK$ problems. Each problem involves a Gaussian density calculation, which requires inverting (or factorizing) an $R \times R$ matrix, so the computation should scale as $O(LJKR^3)$.
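To make the scaling concrete, here is a minimal sketch of that naive triple loop. It is for illustration only: the toy dimensions, the use of mvtnorm::dmvnorm, and the grid^2 scaling of the priors are assumptions, not the actual mashr internals.
## Illustrative sketch only: a naive loop over L grid values, K priors and J effects,
## where each iteration evaluates a multivariate normal density with an R x R
## covariance (an O(R^3) factorization per evaluation).
library(mvtnorm)
L <- 5; J <- 10; K <- 3; R <- 4
bhat <- matrix(rnorm(J * R), J, R)   # J effects observed in R conditions
S    <- diag(R)                      # toy residual covariance
U    <- replicate(K, crossprod(matrix(rnorm(R * R), R, R)), simplify = FALSE)
grid <- seq(0.1, 1, length.out = L)
loglik <- array(0, c(J, K, L))
for (l in 1:L)
  for (k in 1:K)
    for (j in 1:J)
      loglik[j, k, l] <- dmvnorm(bhat[j, ], sigma = S + grid[l]^2 * U[[k]], log = TRUE)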
Load data
## Load mashr and the benchmark data set.
library(mashr)
data <- get(load("../../data/cor_mash.rda"))
Get test set
## Take the first J effects and R conditions as the test set.
J <- 100
R <- 53
betahat <- data$betahat[1:R, 1:J]
sebetahat <- data$sebetahat[1:R, 1:J]
mash_data <- set_mash_data(t(betahat), t(sebetahat))
## Expand the canonical covariances over a grid of 20 scaling factors.
Ulist <- expand_cov(cov_canonical(mash_data), c(1:20)/10)
LK <- length(Ulist)
print(LK)
## Uniform mixture weights over the LK prior components.
w <- matrix(1, J, LK)
w <- w / LK
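To connect this test set to the complexity argument above: LK already combines the K priors with the L grid values returned by expand_cov, so each benchmark run evaluates on the order of J x LK Gaussian densities, each involving an R x R covariance.
## Rough count of the density evaluations implied by the O(LJK R^3) argument.
cat(sprintf("About %d x %d = %d density evaluations, each with a %d x %d covariance.\n",
            J, LK, J * LK, R, R))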
Likelihood benchmark
## Compute the likelihood matrix using the R implementation.
cat(sprintf("Computing %d x %d likelihood matrix for %d conditions using R version.\n",
            J, LK, R))
out.time <- system.time(out1 <- calc_lik_matrix(mash_data, Ulist, log = TRUE,
                                                algorithm.version = "R"))
cat(sprintf("Likelihood calculations took %0.2f seconds.\n", out.time["elapsed"]))
## Compute the likelihood matrix using the Rcpp implementation.
cat(sprintf("Computing %d x %d likelihood matrix for %d conditions using Rcpp version.\n",
            J, LK, R))
out.time <- system.time(out2 <- calc_lik_matrix(mash_data, Ulist, log = TRUE,
                                                algorithm.version = "Rcpp"))
cat(sprintf("Likelihood calculations took %0.2f seconds.\n", out.time["elapsed"]))
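Before comparing timings, a quick sanity check that the two implementations agree numerically (assuming both return the J x LK log-likelihood matrix on the same scale):
## Largest absolute discrepancy between the R and Rcpp log-likelihood matrices.
cat(sprintf("Max |R - Rcpp| difference: %0.2e\n", max(abs(out1 - out2))))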
Posterior benchmark
## Compute the posterior matrices using the R implementation.
cat(sprintf("Computing posterior matrices for %d effects under %d conditions using R version.\n",
            J, R))
out.time <- system.time(out1 <- compute_posterior_matrices(mash_data, Ulist, w,
                                                            algorithm.version = "R"))
cat(sprintf("Calculations took %0.2f seconds.\n", out.time["elapsed"]))
## Compute the posterior matrices using the Rcpp implementation.
cat(sprintf("Computing posterior matrices for %d effects under %d conditions using Rcpp version.\n",
            J, R))
out.time <- system.time(out2 <- compute_posterior_matrices(mash_data, Ulist, w,
                                                            algorithm.version = "Rcpp"))
cat(sprintf("Calculations took %0.2f seconds.\n", out.time["elapsed"]))
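The same kind of agreement check for the posteriors; the sketch below assumes both versions return a named list of matrices with matching component names.
## Compare each posterior matrix shared by the two implementations.
for (m in intersect(names(out1), names(out2)))
  cat(sprintf("%s: max |R - Rcpp| difference = %0.2e\n",
              m, max(abs(out1[[m]] - out2[[m]]))))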
I ran the benchmark a few times for 1) fixed $J$ and 2) fixed $R$. Here are the results in tables, where P is the number of priors, TR is the elapsed time for the R implementation, and TC is the elapsed time for the C++ implementation.
import pandas as pd
# Benchmark results for the likelihood calculations.
lik = pd.read_csv("../../data/mash_benchmark_loglik.txt")
lik
# Benchmark results for the posterior calculations.
post = pd.read_csv("../../data/mash_benchmark_posterior.txt")
post
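A derived quantity of interest is the speedup TR / TC. The sketch below summarizes it in R, under the assumption that the benchmark files use exactly the column names TR and TC described above.
## Speedup of the C++ implementation over the R implementation (column names assumed).
lik_tbl  <- read.csv("../../data/mash_benchmark_loglik.txt")
post_tbl <- read.csv("../../data/mash_benchmark_posterior.txt")
summary(lik_tbl$TR / lik_tbl$TC)
summary(post_tbl$TR / post_tbl$TC)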