Multivariate Bayesian variable selection regression

Analysis with precomputed quantatities + missing data in Y

Here we analyze an example of GTEx V7 cis-gene data-set.

In [1]:
raw_data = readRDS('Multi_Tissues.ENSG00000145214.RDS')
In [2]:
names(raw_data)
  1. 'X'
  2. 'y'
  3. 'Z'
  4. 'y_res'
  5. 'chrom'
  6. 'pos'
In [3]:
dim(raw_data$X)
  1. 838
  2. 7962
In [4]:
dim(raw_data$y_res)
  1. 838
  2. 49
In [5]:
head(raw_data$y_res)
A matrix: 6 × 49 of type dbl[,49]
Adipose_SubcutaneousAdipose_Visceral_OmentumAdrenal_GlandArtery_AortaArtery_CoronaryArtery_TibialBrain_AmygdalaBrain_Anterior_cingulate_cortex_BA24Brain_Caudate_basal_gangliaBrain_Cerebellar_HemisphereSkin_Not_Sun_Exposed_SuprapubicSkin_Sun_Exposed_Lower_legSmall_Intestine_Terminal_IleumSpleenStomachTestisThyroidUterusVaginaWhole_Blood
GTEX-1117F 0.02125546 0.31798809 NA NA 0.2393113-1.04494665NANANANA 0.15067370 NA NA NA NA NA NA0.010394380.722842 NA
GTEX-111CU-0.39699365-0.64860112-0.005019028 NA NA NANANANANA 0.63489211 NA0.001705089-0.2423771-0.23110084-0.8254563-0.72159107 NA NA NA
GTEX-111FC 0.12702099 NA NA NA NA-0.11075103NANANANA-0.27617936 0.003240822 NA-0.1753999 NA 0.3182093-0.44989562 NA NA NA
GTEX-111VG-0.56557561 NA NA NA NA NANANANANA-0.18524402-0.370778712 NA NA NA-0.5285808-0.47352474 NA NA NA
GTEX-111YS-0.70396973-0.19603980 0.245463191 0.04715705 NA-0.07033531NANANANA-0.30672580 NA0.492145996 NA-0.08577711-0.4095227-0.47541942 NA NA-1.06802847
GTEX-1122O 0.04918042-0.08174026 0.365081493-0.07085276-0.1343889 0.40367406NANANANA-0.09075013 0.3942400610.404058046-0.2285992 0.14804767 NA-0.02374288 NA NA 0.08137622

Initialize data object

In [6]:
data = mvsusieR:::DenseData$new(raw_data$X, raw_data$y_res)

Setting up MASH object

In [7]:
residual_covar = diag(apply(raw_data$y_res, 2, function(x) var(x, na.rm=T)))
In [8]:
prior_mats = mvsusieR:::create_cov_canonical(ncol(raw_data$y_res), singletons=F)
In [9]:
scaling = c(0.05,0.15,0.25,0.4) # FIXME: use auto-grid
In [13]:
str(prior_mats)
List of 5
 $ : num [1:49, 1:49] 1 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:49, 1:49] 1 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 ...
 $ : num [1:49, 1:49] 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 $ : num [1:49, 1:49] 1 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 ...
 $ : num [1:49, 1:49] 1 1 1 1 1 1 1 1 1 1 ...
In [15]:
str(data)
Classes 'DenseData', 'R6' <DenseData>
  Public:
    add_to_residual: function (value) 
    clone: function (deep = FALSE) 
    compute_MXt: function (M) 
    compute_residual: function (fitted) 
    compute_Xb: function (b) 
    initialize: function (X, Y, center = TRUE, scale = TRUE) 
    n_condition: active binding
    n_effect: active binding
    n_sample: active binding
    remove_from_residual: function (value) 
    rescale_coef: function (b) 
    X: active binding
    X_has_missing: active binding
    X2_sum: active binding
    XtR: active binding
    XtX: active binding
    XtY: active binding
    Y: active binding
    Y_has_missing: active binding
  Private:
    .X: -0.156460235018997 -0.156460235018997 -0.156460235018997 ...
    .Y: 0.0212554552618386 -0.396993654855127 0.127020987443375  ...
    .Y_has_missing: TRUE
    cm: 0.0238948630831407 0.03125 0.0323353769671092 0.21771036 ...
    csd: 0.152721636141217 0.173472166622178 0.176677642549131 0. ...
    d: 708.232129614059 672.955528846154 678.637996254244 598.7 ...
    J: 7962
    N: 838
    R: 49
    residual: 0.0212554552618386 -0.396993654855127 0.127020987443375  ...
    standardize: function (center, scale) 
    Y_mean: 5.89111793866254e-17 4.89245652401156e-17 -1.23831709073 ...
    Y_non_missing: TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE T ... 

Now using EE model,

In [16]:
mash_init = mvsusieR:::MashInitializer$new(prior_mats, scaling, alpha = 0)
In [17]:
mash_init$precompute_cov_matrices(data, residual_covar)

The line above currently takes 3m40s. It is 2.5GB on disk in RDS format. This is to compute for $R = 49, J = 7962, P = 21$. $P$ is 21 for null weight plus at most 20 other components. I saved it to disk,

In [19]:
saveRDS(mash_init, 'mash_init.rds')
-rw-r--r-- 1 gaow gaow  2.5G May 12 07:41 mash_init.rds

And test the memory it takes to keep it -- 7.42GB.

python ~/GIT/github/misc/monitor/monitor.py Rscript -e "mash_init = readRDS('mash_init.rds')"
time elapsed: 25.18s
peak first occurred: 15.40s
peak last occurred: 24.65s
max vms_memory: 7.42GB
max rss_memory: 7.23GB
memory check interval: 1s
return code: 0

Fit one MASH regression model

In [20]:
mvsusie_obj = mvsusieR:::MashRegression$new(ncol(raw_data$X), residual_covar, mash_init)
In [ ]:
mvsusie_obj$fit(data)

The step above now take 2 min. This is for one iteration. MV-SuSiE computation time will depend on how many iterations there are, as we will find out next.

Fit MV-SuSiE model

In [ ]:
#mash_init = readRDS('mash_init.rds')
In [ ]:
res = mvsusieR::susie(raw_data$X,raw_data$y_res,
                  L=10,prior_variance=mash_init,
                  compute_objective=FALSE)

Code above takes 25 minutes to complete for L=5, >1hr for L=10.

In [ ]:
saveRDS(res, "mvsusie_res_10.rds")

Visualize results

In [ ]:
p = mvsusieR::mvsusie_plot(res)
In [ ]:
pdf('mvsusie_plot_ENSG00000145214.pdf', width = 20, height = 20)
print(p$plot)
dev.off()
In [ ]:
pdf('susie_plot_ENSG00000145214.pdf', width=9, height=5)
susieR::susie_plot(res,y='PIP', main = 'Default SuSiE plot for cross-condition PIP', xlab = 'SNP positions', add_legend = T)
dev.off()
In [ ]:
%preview susie_plot_ENSG00000145214.pdf -s png --dpi 150
In [ ]:
%preview mvsusie_plot_ENSG00000145214.pdf -s png --dpi 150

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago