Multivariate Bayesian variable selection regression

MASH analysis for Urbut 2017

Reproducing (using the old mashr code) the analysis of the Urbut 2017 paper, in response to reviewer requests.

It largely follows this vignette.

In [ ]:
[global]
parameter: cwd = '~/Documents/GTEx/mash_revision'
sfa_exe = "${cwd!a}/sfa/bin/sfa_linux"
mashr_src = "${cwd!a}/mashr-paper-master/R/main.R"
parameter: data = "${cwd!a}/MatrixEQTLSumStats.Portable.Z.rds"

Covariance pattern discovery

This step obtains the covariance matrices, i.e., the priors, for the MASH model.

SFA

We analyze the data with SFA (Sparse Factor Analysis). The cell below downloads the SFA software and runs it on the data with rank K = 5.

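For orientation: SFA fits the factor model Z ≈ Lambda F, where Lambda is a J x K matrix of loadings and F a K x R matrix of factors (J = number of gene-SNP pairs, R = number of conditions). Below is a minimal sketch of how such a fit yields a rank-K covariance candidate for the next step; it assumes the construction follows init.covmat in the mashr code and is an illustrative simplification, not the actual internals.

# once the [sfa] step below has run:
sfa = readRDS("MatrixEQTLSumStats.Portable.Z.sfa.rds")
lambda = as.matrix(sfa$lambda)   # J x K loadings
f.mat  = as.matrix(sfa$F)        # K x R factors
# rank-K covariance candidate from the reconstructed effects lambda %*% f.mat
U.sfa = crossprod(lambda %*% f.mat) / nrow(lambda)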
In [ ]:
%sosrun sfa
[sfa_download: provides = sfa_exe]
task: workdir = cwd
download: decompress = True
    http://stephenslab.uchicago.edu/assets/software/sfa/sfa1.0.tar.gz

[sfa]
depends: sfa_exe
K = 5
tmpfile = "/tmp/${data!bn}.max.txt"
input: data
output: "${input!n}.sfa.rds"
task: workdir = cwd
R:
    # export z-scores to a flat file for SFA, run it, then collect the four output matrices
    z = readRDS(${input!ar})$test.z
    write.table(z, ${tmpfile!r}, col.names=F, row.names=F)
    cmd = paste0('${sfa_exe} -gen ${tmpfile} -g ', dim(z)[1], ' -n ', dim(z)[2],
                 ' -k ${K} -iter 50 -rand 999 -o ${input!bn}')
    system(cmd)
    saveRDS(list(F = read.table("${input!n}_F.out"),
                 lambda = read.table("${input!n}_lambda.out"),
                 sigma2 = read.table("${input!n}_sigma2.out"),
                 alpha = read.table("${input!n}_alpha.out")), ${output!r})
bash:
    rm -f *{_F.out,_lambda.out,_sigma2.out,_alpha.out}

Create and refine multi-rank covariance matrices

Here we create three covariance matrices:

  • SFA (rank 5, previously computed)
  • SVD (rank 3, to be computed)
  • Empirical covariance

and apply Extreme Deconvolution (ED) to refine the matrices. We observed that Extreme Deconvolution preserves rank.

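The SVD and empirical candidates are constructed internally by init.covmat; here is a hedged sketch of what they amount to (illustrative names and construction, not the actual internals):

z = readRDS("MatrixEQTLSumStats.Portable.Z.rds")$test.z
s = svd(z, nu = 0, nv = 3)                             # top 3 right singular vectors
U.svd = s$v %*% diag(s$d[1:3]^2) %*% t(s$v) / nrow(z)  # rank-3 candidate
U.emp = crossprod(z) / nrow(z)                         # empirical, full-rank candidate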
In [ ]:
[mashr_download: provides = mashr_src]
task: workdir = cwd
download: decompress = True
    https://github.com/stephenslab/mashr-paper/archive/master.zip

[mash_1]
# Perform SVD + ED
depends: R_library("ExtremeDeconvolution"), mashr_src
K = 3
P = 3
input: data, "${data!n}.sfa.rds"
output: "${data!n}.coved.${K}.${P}.rds"
task: workdir = cwd
R:
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS('${data}')
    t.stat = dat$test.z
    # prior mixture means are fixed at zero (see fixmean=T in the ED call below)
    mean.mat = matrix(0, ncol=ncol(t.stat), nrow=nrow(t.stat))
    v.mat = dat$vhat
    v.j=list()
    for(i in 1:nrow(t.stat)){v.j[[i]]=v.mat}
    K = ${K}
    P = ${P}
    R = ncol(t.stat)
    sfa = readRDS("${data!n}.sfa.rds")
    init.cov = init.covmat(t.stat=t.stat, factor.mat = as.matrix(sfa$F),lambda.mat = as.matrix(sfa$lambda), K=K,P=P)
    init.cov.list = list()
    for(i in 1:K){init.cov.list[[i]]=init.cov[i,,]}
    projection = list();for(l in 1:nrow(t.stat)){projection[[l]]=diag(1,R)}
    e = ExtremeDeconvolution::extreme_deconvolution(ydata=t.stat,ycovar=v.j,xamp=rep(1/K,K),xmean=mean.mat,xcovar=init.cov.list,fixmean=T,projection=projection)
    true.covs = array(dim=c(K,R,R))
    for(i in 1:K){true.covs[i,,]=e$xcovar[[i]]}
    saveRDS(list(true.covs=true.covs,pi=e$xamp), ${output!r})

Add in single-rank covariance matrices

We now additionally include two other types of covariance matrices:

  • canonical configurations (aka bmalite)
  • single-rank SFA

We also expand the list of matrices over a grid of scaling factors; see the sketch below. At the end of this step (the cell below) we are ready to fit the MASH model.

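For intuition, a minimal sketch of both ingredients, assuming the standard MASH constructions (the grid values below are illustrative; compute.hm.covmat.all.max.step derives the actual grid from the data):

R = 44  # number of conditions, e.g. 44 GTEx tissues in Urbut 2017
# canonical ("bmalite") configurations: identity, per-condition singletons, all shared
U.canon = c(list(identity = diag(R), shared = matrix(1, R, R)),
            lapply(1:R, function(r) { u = matrix(0, R, R); u[r, r] = 1; u }))
# grid expansion: every covariance matrix is scaled by a grid of variances
omega = (2^seq(-2, 2))^2  # illustrative scaling grid
U.all = unlist(lapply(omega, function(w) lapply(U.canon, function(u) w * u)),
               recursive = FALSE)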
In [ ]:
[mash_2: shared = {'prior_matrices': 'output'}]
output: "${input!n}.lite.single.expanded.rds"
task: workdir = cwd
R:
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS('${data}')
    z.stat = dat$test.z
    rownames(z.stat) = NULL
    colnames(z.stat) = NULL
    v.mat = dat$vhat
    s.j = matrix(rep(1,ncol(z.stat)*nrow(z.stat)),ncol=ncol(z.stat),nrow=nrow(z.stat))
    sfa = readRDS("${data!n}.sfa.rds")
    res = compute.hm.covmat.all.max.step(b.hat=z.stat,se.hat=s.j,
                                          t.stat=z.stat,Q=5,
                                          lambda.mat=as.matrix(sfa$lambda),
                                          A='.remove_before_rerun',factor.mat=as.matrix(sfa$F),
                                          max.step=readRDS(${input!r}),
                                          zero=TRUE)
    saveRDS(res, ${output!r})

run:
    rm -f *.remove_before_rerun.rds

Fit MASH mixture model

Using a training set, the cell below computes the weights for the input covariance matrices (priors) in the MASH mixture. The output contains the matrix of log-likelihoods as well as the weights learned from the hierarchical model; the sketch below outlines the computation.

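For intuition: under mixture component k, a training z-score vector z_j is modeled as multivariate normal with mean zero and covariance U_k + V, where V is the residual correlation (dat$vhat); EM, accelerated with SQUAREM, then finds the weights pi maximizing the penalized log-likelihood. Illustrative code only, not the internals of compute.hm.train.log.lik.pen.vmat:

library(mvtnorm)
# L[j, k] = likelihood of z_j under component k (Ulist holds the grid-expanded matrices)
lik.matrix = function(Z, Ulist, V)
    sapply(Ulist, function(U) dmvnorm(Z, sigma = U + V))
# the EM objective for weights pi is then sum(log(lik.matrix(Z, Ulist, V) %*% pi)),
# plus a penalty encouraging weight on the null component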
In [ ]:
[mash_3]
depends: R_library("SQUAREM")
output: "${input!n}.pihat.rds", "${input!n}.loglik.rds"
task: workdir = cwd
R:
    library("SQUAREM")
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS('${data}')
    v.mat = dat$vhat
    covmat = readRDS(${input!r})$covmat
    train.z=as.matrix(dat$train.z)
    rownames(train.z)=NULL
    colnames(train.z)=NULL
    train.v = train.z/train.z  # a matrix of ones: z-scores have unit standard error
    res = compute.hm.train.log.lik.pen.vmat(train.b = train.z,
                                            covmat = covmat,
                                            A = '.remove_before_rerun', 
                                            pen = 1,
                                            train.s = train.v,
                                            cormat = v.mat)
    saveRDS(res$pis, ${output[0]!r})
    saveRDS(res$lik.mat, ${output[1]!r})

run:
    rm -f *.remove_before_rerun.rds

Posterior inference

Applying the hyperparameters learned from the training set to the test set, the cell below computes posterior quantities; the sketch just below outlines the underlying math.

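Illustrative only (not the actual internals of total.quant.per.snp.with.vmat): for each SNP, Bayes' rule gives posterior weights over components, and within component k the true effect has a conditional normal posterior with mean U_k (U_k + V)^{-1} z.

posterior.sketch = function(z, Ulist, pis, V) {
    liks = sapply(Ulist, function(U) mvtnorm::dmvnorm(z, sigma = U + V))
    w = pis * liks / sum(pis * liks)  # posterior mixture weights
    # component-wise posterior means of the true effect, combined by the weights
    post.mean = sapply(Ulist, function(U) U %*% solve(U + V, z))
    list(weights = w, mean = drop(post.mean %*% w))
}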
In [ ]:
[mash_4]
depends: sos_variable('prior_matrices')
output: "${input[0]!nn}.posterior.rds"
task: workdir = cwd
R:
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS(${data!r})
    z.stat = dat$test.z
    v.mat = dat$vhat
    s.j=matrix(rep(1,ncol(z.stat)*nrow(z.stat)),ncol=ncol(z.stat),nrow=nrow(z.stat))
    pis = readRDS(${input[0]!r})$pihat
    covmat = readRDS(${prior_matrices!r})$covmat
    res = lapply(1:nrow(z.stat), function(j){
        total.quant.per.snp.with.vmat(j=j, covmat=covmat, 
                                      b.gp.hat=z.stat, 
                                      cormat=v.mat, 
                                      se.gp.hat=s.j, 
                                      pis=pis, 
                                      A='remove_before_rerun', 
                                      checkpoint = TRUE)})
    # combine per-SNP results: rbind each posterior quantity across all SNPs
    out = do.call(Map, c(f = rbind, res))
    saveRDS(out, ${output!r})

The MASH analysis is now complete. I will use a separate notebook to summarize, plot, and visualize the results.

Run this notebook

For repeated runs it might be easier to execute the workflow from the command line instead of in Jupyter:

sos run analysis/20170829_MASH_Paper.ipynb sfa # --data
sos run analysis/20170829_MASH_Paper.ipynb mash # --data ... --cwd ...

The notebook runs with default settings. Additionally I run it on the dataset after LD pruning:

sos run analysis/20170829_MASH_Paper.ipynb sfa \
    --data $HOME/Documents/GTEx/mash_revision/MatrixEQTLSumStats.Portable.ld2.Z.rds
sos run analysis/20170829_MASH_Paper.ipynb mash \
    --data $HOME/Documents/GTEx/mash_revision/MatrixEQTLSumStats.Portable.ld2.Z.rds
In [1]:
%sessioninfo

SoS Version: 0.9.8.10
