Here I test whether OpenMP helps with some of the computations.
attach(readRDS('em_optim_difference.rds')) # makes X, Y and resid_Y from this saved dataset available
Here, the sample size N is around 800, the number of variables P is around 600, and 50 conditions are involved.
X = cbind(X,X,X) # triple the number of variables
dim(X)
dim(Y)
devtools::load_all('~/GIT/software/mvsusieR')
omp_test = function(m, d, n_thread) {
  # work on a deep copy so each timing run starts from a fresh, unfitted model
  x = m$clone(deep=TRUE)
  x$set_thread(n_thread)
  x$fit(d)
  # return a dummy value; we only care about the timing
  return(0)
}
I will benchmark it on my computer, which has 40 CPU threads, using thread counts ranging from 1 to 96.
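To confirm the hardware capacity before benchmarking, one can query the number of logical CPUs visible to R (a quick check using the base parallel package):
parallel::detectCores() # should report 40 on this machine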
d = DenseData$new(X,Y)
d$standardize(T,T) # center and scale: sbhat is then shared across variables
d$set_residual_variance(resid_Y)
mash_init = MashInitializer$new(list(diag(ncol(Y))), 1) # single-component prior: one identity covariance
B = MashRegression$new(ncol(X), mash_init)
res = microbenchmark::microbenchmark(
  c1 = omp_test(B, d, 1), c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
  c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8), c12 = omp_test(B, d, 12),
  c24 = omp_test(B, d, 24), c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
  times = 30
)
summary(res)[,c('expr', 'mean', 'median')]
There is no advantage here, as expected: when the data is centered and scaled, sbhat is the same for all variables, so the parallelization happens at the level of the mixture prior. Since only one mixture component is used, there is nothing to parallelize.
This will be more computationally intensive than the previous run, because sbhat here differs for every variable. But now the parallelization happens at the variable level.
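As a rough analogy for where the parallel loop sits in each regime (a toy sketch using parallel::mclapply; this is not mvsusieR's actual C++/OpenMP code, and fake_loglik is a made-up placeholder for one unit of likelihood work):
fake_loglik = function(i) sum(rnorm(1e4)) # placeholder computation
# common sbhat: the loop runs over mixture components; with one component there is a single task
invisible(parallel::mclapply(1, fake_loglik, mc.cores = 4))
# variable-specific sbhat: the loop runs over the columns of X, so there are many tasks to share
invisible(parallel::mclapply(seq_len(ncol(X)), fake_loglik, mc.cores = 4))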
d = DenseData$new(X,Y)
d$standardize(F,F) # no centering or scaling: sbhat differs across variables
d$set_residual_variance(resid_Y)
mash_init = MashInitializer$new(list(diag(ncol(Y))), 1)
B = MashRegression$new(ncol(X), mash_init)
res = microbenchmark::microbenchmark(
  c1 = omp_test(B, d, 1), c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
  c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8), c12 = omp_test(B, d, 12),
  c24 = omp_test(B, d, 24), c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
  times = 30
)
summary(res)[,c('expr', 'mean', 'median')]
We see some advantage to using multiple threads here. Performance keeps improving as the number of threads increases, up to 40 threads (the capacity of my computer). Requesting more threads beyond that point results in a performance loss. It seems 4 threads strikes a good balance, reducing compute time by more than half.
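The spread of timings per thread count can also be inspected graphically, since microbenchmark provides a boxplot method for its results (the time axis is log-scaled by default):
boxplot(res) # one box per thread-count setting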
Here, since we are running a mixture prior, the advantage of parallelization should kick in: for common sbhat we parallelize over the prior mixture components.
d = DenseData$new(X,Y)
d$standardize(T,T)
d$set_residual_variance(resid_Y)
mash_init = MashInitializer$new(create_cov_canonical(ncol(Y)), 1) # canonical prior: many mixture components
B = MashRegression$new(ncol(X), mash_init)
res = microbenchmark::microbenchmark(
  c1 = omp_test(B, d, 1), c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
  c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8), c12 = omp_test(B, d, 12),
  c24 = omp_test(B, d, 24), c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
  times = 30
)
summary(res)[,c('expr', 'mean', 'median')]
We see that the advantage of using multiple threads is obvious when the mixture prior has a large number of components (about 60 in this case, for the canonical prior).
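As a sanity check on that count (assuming, as its use above suggests, that create_cov_canonical returns the list of canonical prior covariance matrices):
length(create_cov_canonical(ncol(Y))) # number of mixture components; about 60 here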