This is the new mashr
version of analysis.
It follows in structure from this notebook though not implementation. The data has been prepared beforehand via this notebook. To recap, for V8 data, 39784 genes are found having data in at least one of the 49 tissues; 15632 of them have non-missing data in all tissues and thus extracted, as the "top" set of eQTLs for MASH procedure. Additionally, 9 random (equally spaced in position) SNPs are extracted from all cis-SNPs per gene to "train" the MASH model on. We use Z-score for computations in this workflow.
[global]
sfa_exe = "~/Documents/GTExV8/utils/sfa/bin/sfa_linux"
parameter: data = "~/Documents/GTExV8/MASH/GTExV8.ciseQTL.4MASH.rds"
parameter: cwd = "${data!ad}"
parameter: cov = "xtx"
parameter: vhat = 1
%run sfa
[sfa_download: provides = sfa_exe]
task: workdir = "~/Documents/GTExV8/utils"
download: decompress = True
http://stephenslab.uchicago.edu/assets/software/sfa/sfa1.0.tar.gz
[sfa]
depends: sfa_exe
K = 5
tmpfile = "/tmp/${data!bn}.max.txt"
input: data
output: "${input!n}.sfa.rds"
task: workdir = cwd
R:
z = readRDS(${input!ar})$max$z
write.table(z, ${tmpfile!r}, col.names=F,row.names=F)
cmd = paste0('${sfa_exe} -gen ${tmpfile} -g ', dim(z)[1], ' -n ', dim(z)[2],
' -k ${K} -iter 50 -rand 999 -o ${input!bn}')
system(cmd)
saveRDS(list(F = read.table("${input!n}_F.out"),
lambda = read.table("${input!n}_lambda.out"),
sigma2 = read.table("${input!n}_sigma2.out"),
alpha = read.table("${input!n}_alpha.out")), ${output!r})
bash:
rm -f *{_F.out,_lambda.out,_sigma2.out,_alpha.out}
Here we create 3 covariance matrices:
parameter: cov = "cov"
; default set to nocov
)and apply Extreme Deconvolution to refine these matrices. We observed that Extreme Deconvolution perserves rank.
Additionally we include 2 other types of covariance matrices:
bmalite
)%run
[mash_1]
# Data-driven covariates with PCA and factor analysis on top signals
# followed by extreme deconvolution refinement
depends: R_library("ExtremeDeconvolution"), R_library("stephenslab/mashr")
K = 5 # same as in mash paper
P = 3 # same as in mash paper
input: "${data!a}", "${data!an}.sfa.rds"
output: "${data!an}.${cov}.K${K}.P${P}.rds"
task: workdir = cwd
R:
dat = readRDS('${input[0]}')
sfa_data = readRDS("${input[1]}")
mash_data = mashr::set_mash_data(as.matrix(dat$max$z), matrix(1, nrow(dat$max$z), ncol(dat$max$z)))
sfa_res = as.matrix(sfa_data$lambda) %*% as.matrix(sfa_data$F)
# SFA matrices
U.sfa = c(mashr::cov_from_factors(as.matrix(sfa_data$F), "sfa${K}"), list("tSFA" = t(sfa_res) %*% sfa_res / nrow(dat$max$z)))
# SVD matrices
U.pca = mashr::cov_pca(mash_data, ${P})
# Emperical data matrices
# `cov_ed` will take significantly longer when this empirical convariance matrix is added
D.center = apply(as.matrix(dat$max$z), 2, function(x) x - mean(x))
# Denoised data-driven matrices
U.dd = c(U.sfa, U.pca, if (${cov!r} == 'xtx') list("XX" = t(D.center) %*% D.center / nrow(dat$max$z)) else list())
U.ed = mashr::cov_ed(mash_data, U.dd)
# Canonical matrices
U.can = mashr::cov_canonical(mash_data)
saveRDS(list(Ulist = c(U.ed, U.can), DD_raw = U.dd), ${output!r})
Using a random null set, the cell below computes the weights for input covariance matrices (priors) in MASH mixture. The output contains matrix of log-likelihoods as well as weights learned from the hierarchical model.
[mash_2]
depends: R_library("REBayes")
output: "${input!n}.V${vhat}.mash_model.rds"
task: workdir = cwd
R:
dat = readRDS(${data!ar})
if (${vhat}) {
V = cor(dat$null$z[which(apply(abs(dat$null$z),1, max) < 2),])
} else {
V = diag(ncol(dat$null$z))
}
mash_data = mashr::set_mash_data(as.matrix(dat$null$z), matrix(1, nrow(dat$null$z), ncol(dat$null$z)), as.matrix(V))
saveRDS(mashr::mash(mash_data, Ulist = readRDS(${input!r})$Ulist, outputlevel = 1), ${output!r})
Applying hyperparameters learned from the training (the null) set to the test (the top eQTL) set, the cell below computes posterior quantities.
[mash_3]
output: "${input!nnn}.V${vhat}.mash_posterior.rds"
task: workdir = cwd
R:
dat = readRDS(${data!ar})
if (${vhat}) {
V = cor(dat$null$z[which(apply(abs(dat$null$z),1, max) < 2),])
} else {
V = diag(ncol(dat$null$z))
}
mash_data = mashr::set_mash_data(as.matrix(dat$max$z), matrix(1, nrow(dat$max$z), ncol(dat$max$z)), as.matrix(V))
saveRDS(mashr::mash_compute_posterior_matrices(readRDS(${input!r}), mash_data), ${output!r})
Now MASH analysis is complete. I will use a separate notebook to summarize, plot and visualize the result of analysis.
For repeated runs it might be easier to execute from commandline instead of in Jupyter:
sos run analysis/20171002_MASH_V8.ipynb sfa # --data
sos run analysis/20171002_MASH_V8.ipynb mash # --data
%sessioninfo