Reproducing (using the old mashr code) the Urbut 2017 paper analysis, in response to reviewer requests. It largely follows this vignette.
[global]
parameter: cwd = '~/Documents/GTEx/mash_revision'
sfa_exe = "${cwd!a}/sfa/bin/sfa_linux"
mashr_src = "${cwd!a}/mashr-paper-master/R/main.R"
parameter: data = "${cwd!a}/MatrixEQTLSumStats.Portable.Z.rds"
%sosrun sfa
[sfa_download: provides = sfa_exe]
task: workdir = cwd
download: decompress = True
    http://stephenslab.uchicago.edu/assets/software/sfa/sfa1.0.tar.gz
[sfa]
depends: sfa_exe
K = 5
tmpfile = "/tmp/${data!bn}.max.txt"
input: data
output: "${input!n}.sfa.rds"
task: workdir = cwd
R:
    z = readRDS(${input!ar})$test.z
    write.table(z, ${tmpfile!r}, col.names=F, row.names=F)
    cmd = paste0('${sfa_exe} -gen ${tmpfile} -g ', dim(z)[1], ' -n ', dim(z)[2],
                 ' -k ${K} -iter 50 -rand 999 -o ${input!bn}')
    system(cmd)
    saveRDS(list(F = read.table("${input!n}_F.out"),
                 lambda = read.table("${input!n}_lambda.out"),
                 sigma2 = read.table("${input!n}_sigma2.out"),
                 alpha = read.table("${input!n}_alpha.out")), ${output!r})
bash:
    rm -f *{_F.out,_lambda.out,_sigma2.out,_alpha.out}
Here we create 3 covariance matrices and apply Extreme Deconvolution to refine them. We observed that Extreme Deconvolution preserves rank.
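As a rough illustration of the rank behavior (a minimal sketch with assumed object names and orientations, not the `init.covmat()` call used in the step below): the SFA reconstruction `lambda %*% F` has rank at most K, so a covariance matrix derived from it has rank at most K as well, and Extreme Deconvolution refines such a matrix without inflating its rank.

```r
# Sketch only: build a rank-limited covariance matrix from the SFA output of the
# [sfa] step above. Object names and matrix orientations are assumptions.
sfa    = readRDS("MatrixEQTLSumStats.Portable.Z.sfa.rds")
lambda = as.matrix(sfa$lambda)          # assumed J x K loadings
Fmat   = as.matrix(sfa$F)               # assumed K x R factors
zhat   = lambda %*% Fmat                # rank <= K approximation of the J x R Z matrix
U.sfa  = t(zhat) %*% zhat / nrow(zhat)  # R x R covariance, also rank <= K
qr(U.sfa)$rank                          # at most K (K = 5 in the step above)
```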
[mashr_download: provides = mashr_src]
task: workdir = cwd
download: decompress = True
    https://github.com/stephenslab/mashr-paper/archive/master.zip
[mash_1]
# Perform SVD + ED
depends: R_library("ExtremeDeconvolution"), mashr_src
K = 3
P = 3
input: data, "${data!n}.sfa.rds"
output: "${data!n}.coved.${K}.${P}.rds"
task: workdir = cwd
R:
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS('${data}')
    t.stat = dat$test.z
    mean.mat = matrix(rep(0, ncol(t.stat) * nrow(t.stat)), ncol = ncol(t.stat), nrow = nrow(t.stat))
    s.j = matrix(rep(1, ncol(t.stat) * nrow(t.stat)), ncol = ncol(t.stat), nrow = nrow(t.stat))
    v.mat = dat$vhat
    v.j = list()
    for (i in 1:nrow(t.stat)) { v.j[[i]] = v.mat }
    K = ${K}
    P = ${P}
    R = ncol(t.stat)
    sfa = readRDS("${data!n}.sfa.rds")
    init.cov = init.covmat(t.stat = t.stat, factor.mat = as.matrix(sfa$F),
                           lambda.mat = as.matrix(sfa$lambda), K = K, P = P)
    init.cov.list = list()
    for (i in 1:K) { init.cov.list[[i]] = init.cov[i,,] }
    projection = list()
    for (l in 1:nrow(t.stat)) { projection[[l]] = diag(1, R) }
    e = ExtremeDeconvolution::extreme_deconvolution(ydata = t.stat, ycovar = v.j,
                                                    xamp = rep(1/K, K), xmean = mean.mat,
                                                    xcovar = init.cov.list, fixmean = T,
                                                    projection = projection)
    true.covs = array(dim = c(K, R, R))
    for (i in 1:K) { true.covs[i,,] = e$xcovar[[i]] }
    saveRDS(list(true.covs = true.covs, pi = e$xamp), ${output!r})
Now additionally we include 2 other types of covariance matrices, including bmalite. We also expand the list of matrices by a grid of scaling factors. At the end of this step (cell below) we are ready to fit the mash model.
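For intuition, grid expansion multiplies each candidate covariance matrix by a grid of scaling factors so the mixture can accommodate effects of different magnitudes. The sketch below is illustrative only: the function name and grid values are made up, and the actual expansion happens inside `compute.hm.covmat.all.max.step`.

```r
# Sketch: expand a list of base covariance matrices over a grid of scales.
# `Ulist` and `omega` are hypothetical; the real grid is chosen by the mashr-paper code.
expand.by.grid = function(Ulist, omega = c(0.5, 1, 2)) {
  out = list()
  for (U in Ulist) {
    for (w in omega) {
      out[[length(out) + 1]] = w^2 * U  # scale each covariance by omega^2
    }
  }
  out
}
# e.g. 10 base matrices x 3 grid points -> 30 mixture components
```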
[mash_2: shared = {'prior_matrices': 'output'}]
output: "${input!n}.lite.single.expanded.rds"
task: workdir = cwd
R:
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS('${data}')
    z.stat = dat$test.z
    rownames(z.stat) = NULL
    colnames(z.stat) = NULL
    v.mat = dat$vhat
    s.j = matrix(rep(1, ncol(z.stat) * nrow(z.stat)), ncol = ncol(z.stat), nrow = nrow(z.stat))
    sfa = readRDS("${data!n}.sfa.rds")
    res = compute.hm.covmat.all.max.step(b.hat = z.stat, se.hat = s.j,
                                         t.stat = z.stat, Q = 5,
                                         lambda.mat = as.matrix(sfa$lambda),
                                         A = '.remove_before_rerun', factor.mat = as.matrix(sfa$F),
                                         max.step = readRDS(${input!r}),
                                         zero = TRUE)
    saveRDS(res, ${output!r})
run:
    rm -f *.remove_before_rerun.rds
Using a training set, the cell below computes the weights for the input covariance matrices (priors) in the MASH mixture. The output contains the matrix of log-likelihoods as well as the weights learned from the hierarchical model.
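Conceptually, for each training Z-score vector z_j and each prior covariance U_p the marginal likelihood is N(z_j; 0, U_p + V), and the mixture weights pi are estimated by penalized maximum likelihood over the training set. A hedged sketch of such a likelihood matrix, using the `mvtnorm` package (which may differ from how `compute.hm.train.log.lik.pen.vmat` computes it internally):

```r
# Sketch: per-component log-likelihood matrix for the training Z-scores.
# loglik[j, p] = log N(z_j; 0, U_p + V); pi is then fit by penalized EM / SQUAREM.
library(mvtnorm)
loglik.matrix = function(Z, Ulist, V) {
  sapply(Ulist, function(U)
    mvtnorm::dmvnorm(Z, mean = rep(0, ncol(Z)), sigma = U + V, log = TRUE))
}
# usage (objects from the cell below): loglik.matrix(train.z, covmat, v.mat)
```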
[mash_3]
depends: R_library("SQUAREM")
output: "${input!n}.pihat.rds", "${input!n}.loglik.rds"
task: workdir = cwd
R:
    library("SQUAREM")
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS('${data}')
    v.mat = dat$vhat
    covmat = readRDS(${input!r})$covmat
    train.z = as.matrix(dat$train.z)
    rownames(train.z) = NULL
    colnames(train.z) = NULL
    # Z-scores have unit standard errors; element-wise z/z gives a matrix of ones
    train.v = train.z / train.z
    res = compute.hm.train.log.lik.pen.vmat(train.b = train.z,
                                            covmat = covmat,
                                            A = '.remove_before_rerun',
                                            pen = 1,
                                            train.s = train.v,
                                            cormat = v.mat)
    saveRDS(res$pis, ${output[0]!r})
    saveRDS(res$lik.mat, ${output[1]!r})
run:
    rm -f *.remove_before_rerun.rds
Applying hyperparameters learned from the training set to the test set, the cell below computes posterior quantities.
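For orientation, the posterior for each gene-snp pair is itself a mixture: the posterior weight of component p is proportional to pi_p times its likelihood, and within component p the posterior mean of the effect is U_p (U_p + V)^{-1} z_j. A simplified sketch of the posterior mean only (`total.quant.per.snp.with.vmat` additionally returns standard deviations, lfsr and other quantities), assuming `mvtnorm` as above:

```r
# Sketch: posterior mean of effects for one Z-score vector under the fitted mixture.
# `pis`, `Ulist` (prior covariances) and `V` correspond to pis, covmat and v.mat below.
posterior.mean.one = function(z, pis, Ulist, V) {
  lik = sapply(Ulist, function(U) mvtnorm::dmvnorm(z, sigma = U + V))
  w = pis * lik
  w = w / sum(w)                               # posterior component weights
  mu = rep(0, length(z))
  for (p in seq_along(Ulist)) {                # mixture of conditional posterior means
    mu = mu + w[p] * as.vector(Ulist[[p]] %*% solve(Ulist[[p]] + V, z))
  }
  mu
}
```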
[mash_4]
depends: sos_variable('prior_matrices')
output: "${input[0]!nn}.posterior.rds"
task: workdir = cwd
R:
    setwd("${mashr_src!da}")
    ret = sapply(list.files(pattern = "*.R"), source, .GlobalEnv)
    setwd(${cwd!ar})
    dat = readRDS(${data!r})
    z.stat = dat$test.z
    v.mat = dat$vhat
    s.j = matrix(rep(1, ncol(z.stat) * nrow(z.stat)), ncol = ncol(z.stat), nrow = nrow(z.stat))
    pis = readRDS(${input[0]!r})$pihat
    covmat = readRDS(${prior_matrices!r})$covmat
    res = lapply(seq(1:nrow(z.stat)), function(j) {
        total.quant.per.snp.with.vmat(j = j, covmat = covmat,
                                      b.gp.hat = z.stat,
                                      cormat = v.mat,
                                      se.gp.hat = s.j,
                                      pis = pis,
                                      A = 'remove_before_rerun',
                                      checkpoint = TRUE)})
    # data formatting: row-bind matching elements of the per-SNP results into matrices
    out = do.call(Map, c(f = rbind, res))
    saveRDS(out, ${output!r})
Now the MASH analysis is complete. I will use a separate notebook to summarize, plot and visualize the results of the analysis.
For repeated runs it might be easier to execute from the command line instead of in Jupyter:
sos run analysis/20170829_MASH_Paper.ipynb sfa # --data
sos run analysis/20170829_MASH_Paper.ipynb mash # --data ... --cwd ...
The notebook runs with default settings. Additionally I run it on the dataset after LD pruning:
sos run analysis/20170829_MASH_Paper.ipynb sfa \
--data $HOME/Documents/GTEx/mash_revision/MatrixEQTLSumStats.Portable.ld2.Z.rds
sos run analysis/20170829_MASH_Paper.ipynb mash \
--data $HOME/Documents/GTEx/mash_revision/MatrixEQTLSumStats.Portable.ld2.Z.rds
%sessioninfo