Multivariate Bayesian variable selection regression

Preparing input data for MASH analysis

This include input max Z score from univariate analysis and training data, for both before and after LD pruning.

This is in part responding the reviewer's request that we should use independent subset of SNPs to fit mash model and analyze, and compare with results ignoring the LD (as done in Urbut 2017).

Obtain data

The required data has previously been extracted from GTEx V6 stored in midway:

/project/mstephens/data/internal_supp/gtex-v6-sumstat-hdf5/MatrixEQTLSumStats.Portable.h5

It's a 58MB file containing only max and null summary statistics. For convenience I copy it to my local computer and process from there (see cell below).

In [3]:
[global]
cwd = '~/Documents/GTEx/mash_revision'
input_db = "${cwd!a}/MatrixEQTLSumStats.Portable.h5"
In [2]:
%sosrun get_input
[get_input: provides = input_db]
output: input_db
task:
run:
    rsync -auzP mw:/project/mstephens/data/internal_supp/gtex-v6-sumstat-hdf5/MatrixEQTLSumStats.Portable.h5 ${input_db}
1 task completed: caf9

Extract training and testing data

The cell below loads the data, compute z-score, get a training set (and its correlation estimate $\hat{V}$) and save to an RDS file for use with mash analysis. Parameter exclude_list specifies path to the file of rownames to exclude while extracting: by default it uses all SNPs but if a list is provided (eg list after LD pruning) it will only extract results for those SNPs.

In [3]:
%sosrun extract_zscore
[extract_zscore]
parameter: exclude_list = 'NULL'
parameter: num_train = 20000
depends: R_library("rhdf5")
input: input_db
output: "${input_db!n}.Z.rds" if exclude_list == 'NULL' else "${input_db!n}.${exclude_list!bn}.Z.rds"
task: workdir = cwd
R:
    ConvertP2Z <- function(pval, beta) {
      z <- abs(qnorm(pval / 2))
      z[which(beta < 0)] <- -1 * z[which(beta < 0)]
      return(z)
    }

    GetSS <- function(gene, db) {
      dat <- rhdf5::h5read(db, gene)
      dat$"z-score" <- ConvertP2Z(dat$"p-value", dat$"beta")
      for (name in c("beta", "t-stat", "p-value", "z-score")) {
        dat[[name]] <- t(dat[[name]])
        colnames(dat[[name]]) <- dat$colnames
        rownames(dat[[name]]) <- dat$rownames
      }
      dat$colnames <- dat$rownames <- NULL
      return(dat)
    }

    load_data = function(table) {
        # load data
        mdat = GetSS('max', ${input!r})[[table]]
        ndat = GetSS('null', ${input!r})[[table]]
        # select rows to keep
        num_train = ${num_train}
        if (num_train >= nrow(ndat)) {
            num_train = floor(nrow(ndat) / 2)
        }
        train = ndat[1:num_train,]
        validate = ndat[(num_train+1):nrow(ndat),]
        if (${exclude_list!r} != "NULL") {
            pout = scan(${exclude_list!r}, what="character", sep=NULL)
            names = apply(sapply(strsplit(rownames(mdat), "_"), `[`, c(2,3)), 2, function(x) paste(x, collapse = ':'))
            mdat = mdat[!(names %in% pout),]
            names = apply(sapply(strsplit(rownames(ndat), "_"), `[`, c(2,3)), 2, function(x) paste(x, collapse = ':'))
            ndat = ndat[!(names %in% pout),]
            names = apply(sapply(strsplit(rownames(train), "_"), `[`, c(2,3)), 2, function(x) paste(x, collapse = ':'))
            train = train[!(names %in% pout),]
            names = apply(sapply(strsplit(rownames(validate), "_"), `[`, c(2,3)), 2, function(x) paste(x, collapse = ':'))
            validate = validate[!(names %in% pout),]                                              
        }
        # get vhat (SVS)
        vhat = NULL
        if (table == 'z-score') {
            max_absz = apply(abs(ndat),1, max)
            nullish = which(max_absz < 2)
            nz = ndat[nullish,]
            vhat = cor(nz)
        }
        return(list(train = train,
               validate = validate, 
               test = mdat, vhat = vhat))
    }
    ztable = load_data("z-score")
    btable = load_data("beta")
    # save output
    saveRDS(list(train.z = ztable$train,
                 validate.z = ztable$validate,
                 test.z = ztable$test,
                 train.b = btable$train,
                 validate.b = btable$validate,
                 test.b = btable$test,
                 vhat = ztable$vhat), ${output!r})
In [7]:
R:
    dat = readRDS("${input_db!n}.Z.rds")
    str(dat)
List of 7
 $ train.z   : num [1:20000, 1:44] -0.184 0.161 -1.291 -1.628 0.778 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:20000] "ENSG00000000419.8_20_49782767_C_G_b37" "ENSG00000000419.8_20_49654572_A_G_b37" "ENSG00000000419.8_20_49392478_A_G_b37" "ENSG00000000457.9_1_169117725_TG_T_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ validate.z: num [1:28198, 1:44] 0.107 1.1672 0.2172 0.3499 -0.0481 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:28198] "ENSG00000151665.8_2_47601407_A_G_b37" "ENSG00000151687.10_2_190118691_T_G_b37" "ENSG00000151687.10_2_191462133_C_T_b37" "ENSG00000151687.10_2_190620957_G_A_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ test.z    : num [1:16069, 1:44] 0.14 0.955 0.32 1.844 -4.235 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16069(1d)] "ENSG00000000419.8_20_49461813_G_C_b37" "ENSG00000000457.9_1_169695110_T_A_b37" "ENSG00000000460.12_1_169655079_G_A_b37" "ENSG00000000938.8_1_27888990_T_C_b37" ...
  .. ..$ : chr [1:44(1d)] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ train.b   : num [1:20000, 1:44] -0.01192 0.00986 -0.11973 -0.09686 0.05309 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:20000] "ENSG00000000419.8_20_49782767_C_G_b37" "ENSG00000000419.8_20_49654572_A_G_b37" "ENSG00000000419.8_20_49392478_A_G_b37" "ENSG00000000457.9_1_169117725_TG_T_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ validate.b: num [1:28198, 1:44] 0.00967 0.08876 0.0131 0.02502 -0.00229 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:28198] "ENSG00000151665.8_2_47601407_A_G_b37" "ENSG00000151687.10_2_190118691_T_G_b37" "ENSG00000151687.10_2_191462133_C_T_b37" "ENSG00000151687.10_2_190620957_G_A_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ test.b    : num [1:16069, 1:44] 0.0105 0.1067 0.0168 0.0853 -0.2458 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16069(1d)] "ENSG00000000419.8_20_49461813_G_C_b37" "ENSG00000000457.9_1_169695110_T_A_b37" "ENSG00000000460.12_1_169655079_G_A_b37" "ENSG00000000938.8_1_27888990_T_C_b37" ...
  .. ..$ : chr [1:44(1d)] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ vhat      : num [1:44, 1:44] 1 0.0674 0.0142 0.0686 0.03 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...

Extract from selected list

The default procedure above extracts all SNPs to RDS. Here running previously established workflows we create lists of LD pruned SNPs. To prune SNPs based on what we have in the database and the actual genotypes, first we extract SNP names the database (done previously) and feed them to LD pruning routine.

For test SNPs

sos run analysis/20170828_Compute_LD.ipynb get_ld:1-2 -b ~/Documents/GTEx/bin \
    --input_list $HOME/Documents/GTEx/mash_revision/snp_eqtl.txt
Pruning complete.  4204 of 13030 variants removed.

For train SNPs

sos run analysis/20170828_Compute_LD.ipynb get_ld:1-2 -b ~/Documents/GTEx/bin \
    --input_list $HOME/Documents/GTEx/mash_revision/snp_random.txt
Pruning complete.  19405 of 41038 variants removed.

We end up with two lists: snp_eqtls.extracted.prune.out and snp_random.extracted.prune.out. We consolidate them to one list ld2.out.

In [2]:
!cat /home/gaow/Documents/GTEx/mash_revision/snp_eqtls.extracted.prune.out \
     /home/gaow/Documents/GTEx/mash_revision/snp_random.extracted.prune.out > \
     /home/gaow/Documents/GTEx/mash_revision/ld2.out

To extract data excluding ld2.out simply execute the workflow above with exclude_list set to ld2.out:

sos run analysis/20170829_MASH_Input_Preparation extract_zscore \
    --exclude_list /home/gaow/Documents/GTEx/mash_revision/ld2.out
In [6]:
R:
    dat = readRDS("${input_db!n}.ld2.Z.rds")
    str(dat)
List of 7
 $ train.z   : num [1:12143, 1:44] -0.184 0.161 -1.291 -1.628 0.778 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:12143] "ENSG00000000419.8_20_49782767_C_G_b37" "ENSG00000000419.8_20_49654572_A_G_b37" "ENSG00000000419.8_20_49392478_A_G_b37" "ENSG00000000457.9_1_169117725_TG_T_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ validate.z: num [1:16316, 1:44] 1.167 1.099 3.371 0.293 -0.384 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16316] "ENSG00000151687.10_2_190118691_T_G_b37" "ENSG00000151689.8_2_191969341_C_T_b37" "ENSG00000151690.10_2_191471847_C_T_b37" "ENSG00000151690.10_2_191018883_G_A_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ test.z    : num [1:11619, 1:44] 0.14 0.955 0.32 -4.235 -3.829 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:11619] "ENSG00000000419.8_20_49461813_G_C_b37" "ENSG00000000457.9_1_169695110_T_A_b37" "ENSG00000000460.12_1_169655079_G_A_b37" "ENSG00000000971.11_1_196513323_A_G_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ train.b   : num [1:12143, 1:44] -0.01192 0.00986 -0.11973 -0.09686 0.05309 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:12143] "ENSG00000000419.8_20_49782767_C_G_b37" "ENSG00000000419.8_20_49654572_A_G_b37" "ENSG00000000419.8_20_49392478_A_G_b37" "ENSG00000000457.9_1_169117725_TG_T_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ validate.b: num [1:16316, 1:44] 0.0888 0.064 0.1382 0.0142 -0.0199 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16316] "ENSG00000151687.10_2_190118691_T_G_b37" "ENSG00000151689.8_2_191969341_C_T_b37" "ENSG00000151690.10_2_191471847_C_T_b37" "ENSG00000151690.10_2_191018883_G_A_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ test.b    : num [1:11619, 1:44] 0.0105 0.1067 0.0168 -0.2458 -0.1605 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:11619] "ENSG00000000419.8_20_49461813_G_C_b37" "ENSG00000000457.9_1_169695110_T_A_b37" "ENSG00000000460.12_1_169655079_G_A_b37" "ENSG00000000971.11_1_196513323_A_G_b37" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
 $ vhat      : num [1:44, 1:44] 1 0.0924 0.0127 0.0792 0.0415 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...
  .. ..$ : chr [1:44] "Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" ...

As a result of pruning the training set dropped from 20K to 12K, the test set dropped from 16K to 11.6K.

In [27]:
%sessioninfo

SoS

SoS Version
0.9.8.10

R

Kernel
ir
Language
R
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: BunsenLabs GNU/Linux 8.7 (Hydrogen)

Matrix products: default
BLAS: /usr/lib64/microsoft-r/3.4/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/microsoft-r/3.4/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
 [1] compiler_3.4.0      R6_2.2.0            magrittr_1.5       
 [4] RevoUtils_10.0.4    IRdisplay_0.4.4     pbdZMQ_0.2-5       
 [7] tools_3.4.0         crayon_1.3.2        uuid_0.1-2         
[10] stringi_1.1.5       IRkernel_0.8.7.9000 jsonlite_1.4       
[13] stringr_1.2.0       digest_0.6.12       repr_0.12.0        
[16] evaluate_0.10      

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago