Multivariate Bayesian variable selection regression

MASH analysis for GTEx V8 data

This is the new mashr version of analysis.

It follows in structure from this notebook though not implementation. The data has been prepared beforehand via this notebook. To recap, for V8 data, 39784 genes are found having data in at least one of the 49 tissues; 15632 of them have non-missing data in all tissues and thus extracted, as the "top" set of eQTLs for MASH procedure. Additionally, 9 random (equally spaced in position) SNPs are extracted from all cis-SNPs per gene to "train" the MASH model on. We use Z-score for computations in this workflow.

In [6]:
[global]
sfa_exe = "~/Documents/GTExV8/utils/sfa/bin/sfa_linux"
parameter: data = "~/Documents/GTExV8/MASH/GTExV8.ciseQTL.4MASH.rds"
parameter: cwd = "${data!ad}"
parameter: cov = "xtx"
parameter: vhat = 1

Covariance pattern discovery

This section obtains covariance matrices, ie, the priors, for mash model.

SFA

We analyze data with SFA, which will be used to provide part of the prior matrices list. The cell below downloads SFA software and run it on data with rank K = 5.

In [5]:
%run sfa
[sfa_download: provides = sfa_exe]
task: workdir = "~/Documents/GTExV8/utils"
download: decompress = True
    http://stephenslab.uchicago.edu/assets/software/sfa/sfa1.0.tar.gz

[sfa]
depends: sfa_exe
K = 5
tmpfile = "/tmp/${data!bn}.max.txt"
input: data
output: "${input!n}.sfa.rds"
task: workdir = cwd
R:
    z = readRDS(${input!ar})$max$z
    write.table(z, ${tmpfile!r}, col.names=F,row.names=F)
    cmd = paste0('${sfa_exe} -gen ${tmpfile} -g ', dim(z)[1], ' -n ', dim(z)[2], 
                 ' -k ${K} -iter 50 -rand 999 -o ${input!bn}')
    system(cmd)
    saveRDS(list(F = read.table("${input!n}_F.out"),
                lambda = read.table("${input!n}_lambda.out"),
                sigma2 = read.table("${input!n}_sigma2.out"),
                alpha = read.table("${input!n}_alpha.out")), ${output!r})
bash:
    rm -f *{_F.out,_lambda.out,_sigma2.out,_alpha.out}
INFO: sfa_download (index=0) is ignored due to saved signature
1 task completed: d793

Create and refine multi-rank covariance matrices

Here we create 3 covariance matrices:

  • SFA (rank 5, previously computed)
    • plus single rank SFAs
  • PCA (rank 3, to be computed)
    • plus single rank PCAs
  • Optionally
    • Empirical covariance (via parameter: cov = "cov"; default set to nocov)

and apply Extreme Deconvolution to refine these matrices. We observed that Extreme Deconvolution perserves rank.

Additionally we include 2 other types of covariance matrices:

  • canonical configurations (aka bmalite)
  • simple heterogeneity models
In [ ]:
%run
[mash_1]
# Data-driven covariates with PCA and factor analysis on top signals
# followed by extreme deconvolution refinement
depends: R_library("ExtremeDeconvolution"), R_library("stephenslab/mashr")
K = 5 # same as in mash paper
P = 3 # same as in mash paper
input: "${data!a}", "${data!an}.sfa.rds"
output: "${data!an}.${cov}.K${K}.P${P}.rds"
task: workdir = cwd
R:
    dat = readRDS('${input[0]}')
    sfa_data = readRDS("${input[1]}")
    mash_data = mashr::set_mash_data(as.matrix(dat$max$z), matrix(1, nrow(dat$max$z), ncol(dat$max$z)))
    sfa_res = as.matrix(sfa_data$lambda) %*% as.matrix(sfa_data$F)
    # SFA matrices
    U.sfa = c(mashr::cov_from_factors(as.matrix(sfa_data$F), "sfa${K}"), list("tSFA" = t(sfa_res) %*% sfa_res / nrow(dat$max$z)))
    # SVD matrices
    U.pca = mashr::cov_pca(mash_data, ${P})
    # Emperical data matrices
    # `cov_ed` will take significantly longer when this empirical convariance matrix is added
    D.center = apply(as.matrix(dat$max$z), 2, function(x) x - mean(x))
    # Denoised data-driven matrices
    U.dd = c(U.sfa, U.pca, if (${cov!r} == 'xtx') list("XX" = t(D.center) %*% D.center / nrow(dat$max$z)) else list())
    U.ed = mashr::cov_ed(mash_data, U.dd)
    # Canonical matrices
    U.can = mashr::cov_canonical(mash_data)
    saveRDS(list(Ulist = c(U.ed, U.can), DD_raw = U.dd), ${output!r})

Fit MASH mixture model

Using a random null set, the cell below computes the weights for input covariance matrices (priors) in MASH mixture. The output contains matrix of log-likelihoods as well as weights learned from the hierarchical model.

In [ ]:
[mash_2]
depends: R_library("REBayes")
output: "${input!n}.V${vhat}.mash_model.rds"
task: workdir = cwd
R:
    dat = readRDS(${data!ar})
    if (${vhat}) {
        V = cor(dat$null$z[which(apply(abs(dat$null$z),1, max) < 2),])
    } else {
        V = diag(ncol(dat$null$z))
    }
    mash_data = mashr::set_mash_data(as.matrix(dat$null$z), matrix(1, nrow(dat$null$z), ncol(dat$null$z)), as.matrix(V))
    saveRDS(mashr::mash(mash_data, Ulist = readRDS(${input!r})$Ulist, outputlevel = 1), ${output!r})

Posterior inference

Applying hyperparameters learned from the training (the null) set to the test (the top eQTL) set, the cell below computes posterior quantities.

In [ ]:
[mash_3]
output: "${input!nnn}.V${vhat}.mash_posterior.rds"
task: workdir = cwd
R:
    dat = readRDS(${data!ar})
    if (${vhat}) {
        V = cor(dat$null$z[which(apply(abs(dat$null$z),1, max) < 2),])
    } else {
        V = diag(ncol(dat$null$z))
    }
    mash_data = mashr::set_mash_data(as.matrix(dat$max$z), matrix(1, nrow(dat$max$z), ncol(dat$max$z)), as.matrix(V))
    saveRDS(mashr::mash_compute_posterior_matrices(readRDS(${input!r}), mash_data), ${output!r})

Now MASH analysis is complete. I will use a separate notebook to summarize, plot and visualize the result of analysis.

Run this notebook

For repeated runs it might be easier to execute from commandline instead of in Jupyter:

sos run analysis/20171002_MASH_V8.ipynb sfa # --data
sos run analysis/20171002_MASH_V8.ipynb mash # --data
In [24]:
%sessioninfo

SoS

SoS Version
0.9.9.1

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago