Multivariate Bayesian variable selection regression

OpenMP benchmark for Rcpp-based code

Here I test whether OpenMP multithreading helps with some of the computations.

In [1]:
attach(readRDS('em_optim_difference.rds'))

Here the sample size N is around 800, the number of variables P is around 600 (after tripling the columns of X below), and 50 conditions are involved.

In [2]:
X = cbind(X,X,X)
In [3]:
dim(X)
  1. 838
  2. 639
In [4]:
dim(Y)
  1. 838
  2. 50
In [5]:
devtools::load_all('~/GIT/software/mvsusieR')
omp_test = function(m, d, n_thread) {
    # work on a deep copy so repeated runs do not share fitted state
    x = m$clone(deep=TRUE)
    # set the number of OpenMP threads, then fit the model on data d
    x$set_thread(n_thread)
    x$fit(d)
    return(0)
}
Loading mvsusieR

Loading required package: mashr

Loading required package: ashr

Loading required package: susieR

I will benchmark it on my computer, which has 40 CPU threads, using thread counts from 1 to 96.
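Before benchmarking, it is worth confirming that OpenMP is actually enabled in the Rcpp toolchain and seeing how many threads it reports by default. Below is a minimal sketch (not part of mvsusieR) using Rcpp's openmp plugin; the function name omp_max_threads is made up for this check.

library(Rcpp)
# Check whether OpenMP is compiled in and how many threads it will use by default.
cppFunction(plugins = "openmp",
            includes = "#ifdef _OPENMP\n#include <omp.h>\n#endif",
            code = '
int omp_max_threads() {
#ifdef _OPENMP
  return omp_get_max_threads();  // default thread count OpenMP will use
#else
  return 1;                      // OpenMP not available in this toolchain
#endif
}')
omp_max_threads()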

Center and scale the data

In [11]:
d = DenseData$new(X,Y)
d$standardize(T,T)
d$set_residual_variance(resid_Y)

mash_init = MashInitializer$new(list(diag(ncol(Y))), 1)
B = MashRegression$new(ncol(X), mash_init)
In [12]:
res = microbenchmark::microbenchmark(c1 = omp_test(B, d, 1),
c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8),
c12 = omp_test(B, d, 12), c24 = omp_test(B, d, 24),
c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
times = 30
)
In [13]:
summary(res)[,c('expr', 'mean', 'median')]
A data.frame: 9 × 3
expr   mean     median
<fct>  <dbl>    <dbl>
c1     161.0818 136.1470
c2     170.6787 119.0540
c3     175.3710 110.2931
c4     135.8872 118.4377
c8     170.4492 125.5141
c12    151.2837 131.4356
c24    145.8516 124.3913
c40    224.2847 163.7604
c96    345.9077 335.4519

There is no advantage here, as expected: when the data is centered and scaled, the parallelization happens at the level of the mixture prior, and since only one mixture component is used there is nothing to parallelize over.
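To see why sbhat is common across variables in this case: centering and scaling gives every column of X the same sum of squares, so the per-variable standard errors share a single value. A small sketch with base R's scale(), which is analogous to (though not necessarily identical to) d$standardize(T,T):

Xs = scale(X)          # center and scale the columns of X
range(colSums(Xs^2))   # all approximately equal to nrow(X) - 1, hence a common sbhat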

Do not center and scale the data

This will be more computationally intensive than the previous run, because sbhat here is different for every variable. But now the parallelization happens at the variable level.
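The pattern being exploited is an OpenMP parallel-for over the columns (variables) of X, with each iteration doing independent per-variable work. Below is a simplified toy illustration of that pattern, not the actual MashRegression implementation; per_variable_work and its body are made up for this sketch.

library(Rcpp)
cppFunction(plugins = "openmp",
            includes = "#ifdef _OPENMP\n#include <omp.h>\n#endif",
            code = '
NumericVector per_variable_work(NumericMatrix X, int n_thread) {
  int n = X.nrow(), p = X.ncol();
  NumericVector out(p);
#ifdef _OPENMP
  omp_set_num_threads(n_thread);
#endif
  // Each column j is processed independently, so iterations can run in parallel.
  #pragma omp parallel for
  for (int j = 0; j < p; j++) {
    double s = 0;
    for (int i = 0; i < n; i++) s += X(i, j) * X(i, j);  // stand-in for the per-variable computation
    out[j] = s;
  }
  return out;
}')
# e.g. per_variable_work(X, 4) runs the toy per-column computation on 4 threads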

In [6]:
d = DenseData$new(X,Y)
d$standardize(F,F)
d$set_residual_variance(resid_Y)

mash_init = MashInitializer$new(list(diag(ncol(Y))), 1)
B = MashRegression$new(ncol(X), mash_init)
In [16]:
res = microbenchmark::microbenchmark(c1 = omp_test(B, d, 1),
c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8),
c12 = omp_test(B, d, 12), c24 = omp_test(B, d, 24),
c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
times = 30
)
In [17]:
summary(res)[,c('expr', 'mean', 'median')]
A data.frame: 9 × 3
expr   mean     median
<fct>  <dbl>    <dbl>
c1     359.0996 320.1640
c2     229.4660 207.2559
c3     215.6167 180.4148
c4     219.5334 178.6810
c8     171.5940 146.5264
c12    175.7622 152.8917
c24    142.9345 125.4073
c40    168.9303 150.1708
c96    322.8361 305.4616

We see some advantage here from using multiple threads. Performance generally improves as the number of threads increases, up to about 24 threads; going to 40 threads (the capacity of my computer) gives no further gain, and requesting more threads beyond that results in a clear performance loss. Four threads already strikes a good balance, reducing the compute time by roughly 40% relative to a single thread.
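For a quick view of the relative speedups, the benchmark summary can be turned into ratios against the single-thread run; a small sketch using the res object above:

s = summary(res)
# speedup of each thread count relative to the single-thread run, by mean time
round(setNames(s$mean[s$expr == "c1"] / s$mean, as.character(s$expr)), 2)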

Center and scale the data, but use a mixture prior

Here, since we are running a mixture prior, the advantage of parallelization should kick in: because sbhat is common across variables, we parallelize over the components of the prior mixture.

In [10]:
d = DenseData$new(X,Y)
d$standardize(T,T)
d$set_residual_variance(resid_Y)

mash_init = MashInitializer$new(create_cov_canonical(ncol(Y)), 1)
B = MashRegression$new(ncol(X), mash_init)
In [11]:
res = microbenchmark::microbenchmark(c1 = omp_test(B, d, 1),
c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8),
c12 = omp_test(B, d, 12), c24 = omp_test(B, d, 24),
c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
times = 30
)
In [12]:
summary(res)[,c('expr', 'mean', 'median')]
A data.frame: 9 × 3
expr   mean     median
<fct>  <dbl>    <dbl>
c1     489.7533 478.0427
c2     344.7106 323.2162
c3     300.3792 258.1757
c4     269.4045 244.0847
c8     242.0541 210.5421
c12    232.5791 215.5211
c24    246.1973 216.6343
c40    273.2946 244.1338
c96    533.4972 541.2539

We see that the advantage of using multiple threads is obvious when the mixture prior has a large number of components (about 60 in this case, for the canonical prior).
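As a sanity check on that count (assuming create_cov_canonical() returns a plain list with one covariance matrix per mixture component):

# number of canonical prior components for 50 conditions
length(create_cov_canonical(ncol(Y)))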


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago