Here I test whether OpenMP helps with some of the computations.
attach(readRDS('em_optim_difference.rds')) # makes X, Y and resid_Y from this saved dataset available
Here, the sample size N is around 800, the number of variables P is around 600, and 50 conditions are involved.
X = cbind(X,X,X) # triple the number of variables
dim(X)
dim(Y)
devtools::load_all('~/GIT/software/mvsusieR')
omp_test = function(m, d, n_thread) {
  # work on a deep copy so each timing run starts from a fresh, unfitted model
  x = m$clone(deep=TRUE)
  x$set_thread(n_thread)
  x$fit(d)
  # return a dummy value; we only care about the timing
  return(0)
}
I will benchmark it on my computer, which has 40 CPU threads, using thread counts ranging from 1 to 96.
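To confirm the hardware capacity before benchmarking, one can query the number of logical CPUs visible to R (a quick check using the base parallel package):
parallel::detectCores() # should report 40 on this machine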
d = DenseData$new(X,Y)
d$standardize(T,T) # center and scale: sbhat is then shared across variables
d$set_residual_variance(resid_Y)
mash_init = MashInitializer$new(list(diag(ncol(Y))), 1) # single-component prior: one identity covariance
B = MashRegression$new(ncol(X), mash_init)
res = microbenchmark::microbenchmark(
  c1 = omp_test(B, d, 1), c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
  c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8), c12 = omp_test(B, d, 12),
  c24 = omp_test(B, d, 24), c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
  times = 30
)
summary(res)[,c('expr', 'mean', 'median')]
There is no advantage here, as expected: when the data is centered and scaled, sbhat is the same for all variables, so the parallelization happens at the level of the mixture prior. Since only one mixture component is used, there is nothing to parallelize.
This will be more computationally intensive than the previous run, because sbhat here differs for every variable. But now the parallelization happens at the variable level.
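As a rough analogy for where the parallel loop sits in each regime (a toy sketch using parallel::mclapply; this is not mvsusieR's actual C++/OpenMP code, and fake_loglik is a made-up placeholder for one unit of likelihood work):
fake_loglik = function(i) sum(rnorm(1e4)) # placeholder computation
# common sbhat: the loop runs over mixture components; with one component there is a single task
invisible(parallel::mclapply(1, fake_loglik, mc.cores = 4))
# variable-specific sbhat: the loop runs over the columns of X, so there are many tasks to share
invisible(parallel::mclapply(seq_len(ncol(X)), fake_loglik, mc.cores = 4))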
d = DenseData$new(X,Y)
d$standardize(F,F) # no centering or scaling: sbhat differs across variables
d$set_residual_variance(resid_Y)
mash_init = MashInitializer$new(list(diag(ncol(Y))), 1)
B = MashRegression$new(ncol(X), mash_init)
res = microbenchmark::microbenchmark(
  c1 = omp_test(B, d, 1), c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
  c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8), c12 = omp_test(B, d, 12),
  c24 = omp_test(B, d, 24), c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
  times = 30
)
summary(res)[,c('expr', 'mean', 'median')]
We see some advantage to using multiple threads here. Performance keeps improving as the number of threads increases, up to 40 threads (the capacity of my computer). Requesting more threads beyond that point results in a performance loss. It seems 4 threads strikes a good balance, reducing compute time by more than half.
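The spread of timings per thread count can also be inspected graphically, since microbenchmark provides a boxplot method for its results (the time axis is log-scaled by default):
boxplot(res) # one box per thread-count setting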
Here, since we are running a mixture prior, the advantage of parallelization should kick in: for common sbhat we parallelize over the prior mixture components.
d = DenseData$new(X,Y)
d$standardize(T,T)
d$set_residual_variance(resid_Y)
mash_init = MashInitializer$new(create_cov_canonical(ncol(Y)), 1) # canonical prior: many mixture components
B = MashRegression$new(ncol(X), mash_init)
res = microbenchmark::microbenchmark(
  c1 = omp_test(B, d, 1), c2 = omp_test(B, d, 2), c3 = omp_test(B, d, 3),
  c4 = omp_test(B, d, 4), c8 = omp_test(B, d, 8), c12 = omp_test(B, d, 12),
  c24 = omp_test(B, d, 24), c40 = omp_test(B, d, 40), c96 = omp_test(B, d, 96),
  times = 30
)
summary(res)[,c('expr', 'mean', 'median')]
We see that the advantage of using multiple threads is obvious when the mixture prior has a large number of components (about 60 in this case, for the canonical prior).
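As a sanity check on that count (assuming, as its use above suggests, that create_cov_canonical returns the list of canonical prior covariance matrices):
length(create_cov_canonical(ncol(Y))) # number of mixture components; about 60 here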