We've recently implemented a C++ version of mashr
which seems to perform very well on small-scale toy data. How does it scale with data size? This notebook benchmarks the speed of both implementations on small to larger data sets.
Before getting to the experiments, let's recap the problem and its expected computational requirements. We are fitting the mash
model to a data set of $J$ effects and $R$ conditions, with $K$ prior covariances each expanded over $L$ grid values. A straightforward implementation loops over all combinations, generating $LJK$ problems. Each problem involves a Gaussian density calculation, which requires inverting (or factorizing) an $R \times R$ matrix, so the computation should scale as $O(LJKR^3)$.
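To make the scaling concrete, here is a minimal sketch of that naive triple loop. It is for illustration only: the toy dimensions, the use of mvtnorm::dmvnorm, and the grid^2 scaling of the priors are assumptions, not the actual mashr internals.
## Illustrative sketch only: a naive loop over L grid values, K priors and J effects,
## where each iteration evaluates a multivariate normal density with an R x R
## covariance (an O(R^3) factorization per evaluation).
library(mvtnorm)
L <- 5; J <- 10; K <- 3; R <- 4
bhat <- matrix(rnorm(J * R), J, R)   # J effects observed in R conditions
S    <- diag(R)                      # toy residual covariance
U    <- replicate(K, crossprod(matrix(rnorm(R * R), R, R)), simplify = FALSE)
grid <- seq(0.1, 1, length.out = L)
loglik <- array(0, c(J, K, L))
for (l in 1:L)
  for (k in 1:K)
    for (j in 1:J)
      loglik[j, k, l] <- dmvnorm(bhat[j, ], sigma = S + grid[l]^2 * U[[k]], log = TRUE)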
Load data
## Load mashr and the benchmark data set.
library(mashr)
data <- get(load("../../data/cor_mash.rda"))
Get test set
## Take the first J effects and R conditions as the test set.
J <- 100
R <- 53
betahat <- data$betahat[1:R, 1:J]
sebetahat <- data$sebetahat[1:R, 1:J]
mash_data <- set_mash_data(t(betahat), t(sebetahat))
## Expand the canonical covariances over a grid of 20 scaling factors.
Ulist <- expand_cov(cov_canonical(mash_data), c(1:20)/10)
LK <- length(Ulist)
print(LK)
## Uniform mixture weights over the LK prior components.
w <- matrix(1, J, LK)
w <- w / LK
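To connect this test set to the complexity argument above: LK already combines the K priors with the L grid values returned by expand_cov, so each benchmark run evaluates on the order of J x LK Gaussian densities, each involving an R x R covariance.
## Rough count of the density evaluations implied by the O(LJK R^3) argument.
cat(sprintf("About %d x %d = %d density evaluations, each with a %d x %d covariance.\n",
            J, LK, J * LK, R, R))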
Likelihood benchmark
## Compute the likelihood matrix using the R implementation.
cat(sprintf("Computing %d x %d likelihood matrix for %d conditions using R version.\n",
            J, LK, R))
out.time <- system.time(out1 <- calc_lik_matrix(mash_data, Ulist, log = TRUE,
                                                algorithm.version = "R"))
cat(sprintf("Likelihood calculations took %0.2f seconds.\n", out.time["elapsed"]))
## Compute the likelihood matrix using the Rcpp implementation.
cat(sprintf("Computing %d x %d likelihood matrix for %d conditions using Rcpp version.\n",
            J, LK, R))
out.time <- system.time(out2 <- calc_lik_matrix(mash_data, Ulist, log = TRUE,
                                                algorithm.version = "Rcpp"))
cat(sprintf("Likelihood calculations took %0.2f seconds.\n", out.time["elapsed"]))
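Before comparing timings, a quick sanity check that the two implementations agree numerically (assuming both return the J x LK log-likelihood matrix on the same scale):
## Largest absolute discrepancy between the R and Rcpp log-likelihood matrices.
cat(sprintf("Max |R - Rcpp| difference: %0.2e\n", max(abs(out1 - out2))))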
Posterior benchmark
## Compute the posterior matrices using the R implementation.
cat(sprintf("Computing posterior matrices for %d effects under %d conditions using R version.\n",
            J, R))
out.time <- system.time(out1 <- compute_posterior_matrices(mash_data, Ulist, w,
                                                            algorithm.version = "R"))
cat(sprintf("Calculations took %0.2f seconds.\n", out.time["elapsed"]))
## Compute the posterior matrices using the Rcpp implementation.
cat(sprintf("Computing posterior matrices for %d effects under %d conditions using Rcpp version.\n",
            J, R))
out.time <- system.time(out2 <- compute_posterior_matrices(mash_data, Ulist, w,
                                                            algorithm.version = "Rcpp"))
cat(sprintf("Calculations took %0.2f seconds.\n", out.time["elapsed"]))
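The same kind of agreement check for the posteriors; the sketch below assumes both versions return a named list of matrices with matching component names.
## Compare each posterior matrix shared by the two implementations.
for (m in intersect(names(out1), names(out2)))
  cat(sprintf("%s: max |R - Rcpp| difference = %0.2e\n",
              m, max(abs(out1[[m]] - out2[[m]]))))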
I ran the benchmark a few times for 1) fixed $J$ and 2) fixed $R$. Here are the results in tables, where P is the number of priors, TR is the elapsed time for the R implementation, and TC is the elapsed time for the C++ implementation.
import pandas as pd
# Benchmark results for the likelihood calculations.
lik = pd.read_csv("../../data/mash_benchmark_loglik.txt")
lik
# Benchmark results for the posterior calculations.
post = pd.read_csv("../../data/mash_benchmark_posterior.txt")
post
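A derived quantity of interest is the speedup TR / TC. The sketch below summarizes it in R, under the assumption that the benchmark files use exactly the column names TR and TC described above.
## Speedup of the C++ implementation over the R implementation (column names assumed).
lik_tbl  <- read.csv("../../data/mash_benchmark_loglik.txt")
post_tbl <- read.csv("../../data/mash_benchmark_posterior.txt")
summary(lik_tbl$TR / lik_tbl$TC)
summary(post_tbl$TR / post_tbl$TC)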