Here, for the simulation benchmark, we prepare the mixture prior based on a multivariate Empirical Bayes Normal Means (EBNM) model (previously we used Extreme Deconvolution for this task).
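Concretely, in standard mash notation the model for gene $j$ observed in $R$ conditions is

$$\hat{b}_j \mid b_j \sim N_R(b_j, S_j), \qquad b_j \sim \sum_{k=1}^K \pi_k N_R(0, U_k),$$

and fitting the EBNM model amounts to estimating the prior covariance matrices $U_k$ and their weights $\pi_k$.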
Here is the analysis plan:
In GTEx we have >35K genes. We want to try using 20K of them because 20K seems to carry enough information to learn the pattern of sharing between conditions for the mixtures I simulated in this notebook.
But we "cheat" a bit: we simulate under identity residual variance for all genes, and fit the EBNM model assuming the residual variance is identity too, or just estimate a global residual variance. This makes the problem easier, because in practice the residual variance can differ (though it may be similar!) across genes.
So the simplified plan is to only do steps 2~5, with step 2 using just the identity matrix for the residual variance.
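To make this concrete, here is a minimal sketch of simulating one gene under identity residual variance (all names and dimensions here are hypothetical, not the benchmark's actual simulation code):

library(MASS)
R = 5 # number of conditions (hypothetical)
U = diag(R) # a single prior covariance component (hypothetical)
b = mvrnorm(1, rep(0, R), U) # true multivariate effect drawn from the prior
bhat = b + rnorm(R) # residual variance is the identity, so noise is N(0, I)
sbhat = rep(1, R) # hence every standard error is 1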
[global]
parameter: cwd = path('/project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats')
parameter: model = 'artificial_mixture_identity' # 'gtex_mixture_identity'
# handle N = per_chunk data-set in one job
parameter: per_chunk = 1000
import glob
%cd /project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats
# extract data for MASH from summary stats
[extract_1]
parameter: seed = 999
parameter: n_random = 4
input: glob.glob(f'{cwd}/{model}/*.rds'), group_by = per_chunk
output: f"{cwd}/{model}/cache/{model}_{_index+1}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
set.seed(${seed})
# return the (row, column) index of a matrix's maximum element
matxMax <- function(mtx) {
return(arrayInd(which.max(mtx), dim(mtx)))
}
remove_rownames = function(x) {
for (name in names(x)) rownames(x[[name]]) = NULL
return(x)
}
extract_one_data = function(infile, n_random) {
# If the input cannot be read for some reason, skip it; losing one data-set is fine.
dat = tryCatch(readRDS(infile)$sumstats, error = function(e) return(NULL))
if (is.null(dat)) return(NULL)
z = abs(dat$bhat/dat$sbhat)
max_idx = matxMax(z)
strong = list(bhat = dat$bhat[max_idx[1],,drop=F], sbhat = dat$sbhat[max_idx[1],,drop=F])
# all rows except the one carrying the strongest signal
sample_idx = setdiff(1:nrow(z), max_idx[1])
random_idx = sample(sample_idx, n_random, replace = T)
random = list(bhat = dat$bhat[random_idx,,drop=F], sbhat = dat$sbhat[random_idx,,drop=F])
return(list(random = remove_rownames(random), strong = remove_rownames(strong)))
}
merge_data = function(res, one_data) {
if (length(res) == 0) {
return(one_data)
} else if (is.null(one_data)) {
return(res)
} else {
for (d in names(one_data)) {
for (s in names(one_data[[d]])) {
res[[d]][[s]] = rbind(res[[d]][[s]], one_data[[d]][[s]])
}
}
return(res)
}
}
res = list()
for (f in c(${_input:r,})) {
res = merge_data(res, extract_one_data(f, ${n_random}))
}
saveRDS(res, ${_output:r})
[extract_2]
input: group_by = "all"
output: f"{cwd}/{model}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
merge_data = function(res, one_data) {
if (length(res) == 0) {
return(one_data)
} else {
for (d in names(one_data)) {
for (s in names(one_data[[d]])) {
res[[d]][[s]] = rbind(res[[d]][[s]], one_data[[d]][[s]])
}
}
return(res)
}
}
dat = list()
for (f in c(${_input:r,})) {
dat = merge_data(dat, readRDS(f))
}
# make output consistent in format with
# https://github.com/stephenslab/gtexresults/blob/master/workflows/mashr_flashr_workflow.ipynb
saveRDS(
list(random.z = dat$random$bhat/dat$random$sbhat,
strong.z = dat$strong$bhat/dat$strong$sbhat,
random.b = dat$random$bhat,
strong.b = dat$strong$bhat,
random.s = dat$random$sbhat,
strong.s = dat$strong$sbhat),
${_output:r})
To run it:
for m in artificial_mixture_identity gtex_mixture_identity; do
sos run analysis/20200502_Prepare_ED_prior.ipynb extract --model $m -c midway2.yml -q midway2
done
mashr
Before this, we need to run the following to generate the FLASH mixture:
for m in artificial_mixture_identity gtex_mixture_identity; do
sos run ~/GIT/gtexresults/workflows/mashr_flashr_workflow.ipynb flash \
--cwd /project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats/ \
--data /project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats/$m.rds \
--effect-model EE -c midway2.yml -q midway2
done
We will use the simple method, estimate_null_correlation_simple from mashr, to compute the residual variance, as implemented in the pipeline below.
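Conceptually, the simple method takes the rows of the random set whose z-scores all look null and uses their correlation as the residual correlation; a rough sketch of the idea (the actual implementation is estimate_null_correlation_simple, called below):

z = dat$random.b / dat$random.s
null_rows = apply(abs(z), 1, max) < 2 # 2 is mashr's default z_thresh; an assumption in this sketch
vhat = cor(z[null_rows, ])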
[mash_ed_1, mash_teem_1, udr_ed_1]
depends: R_library("mashr")
parameter: npc = 3
input: f"{cwd}/{model}.rds", f"{cwd}/{model}.EE.flash.rds"
output: f"{cwd}/{model}.FL_PC{npc}.rds"
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
library(mashr)
dat = readRDS(${_input[0]:r})
vhat = estimate_null_correlation_simple(mash_set_data(dat$random.b, Shat=dat$random.s, zero_Bhat_Shat_reset = 1E3))
mash_data = mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=0, V=vhat, zero_Bhat_Shat_reset = 1E3)
# FLASH matrices
U.flash = readRDS(${_input[1]:r})
# SVD matrices
U.pca = ${"cov_pca(mash_data, %s)" % npc if npc > 0 else "list()"}
# Empirical covariance matrix
X.center = apply(mash_data$Bhat, 2, function(x) x - mean(x))
Ulist = c(U.flash, U.pca, list("XX" = t(X.center) %*% X.center / nrow(X.center)))
saveRDS(list(mash_data = mash_data, Ulist = Ulist), ${_output:r})
[mash_ed_2]
output: f"{_input:n}.ED.rds"
task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 14, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
dat = readRDS(${_input:r})
# Denoised data-driven matrices
res = mashr:::bovy_wrapper(dat$mash_data, dat$Ulist, logfile=${_output:nr}, tol = 1e-06)
# format to input for simulation with DSC (current pipeline)
saveRDS(list(U=res$Ulist, w=res$pi, loglik=scan("${_output:nn}.ED_loglike.log")), ${_output:r})
[mash_teem_2]
output: f"{_input:n}.TEEM.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
library(mashr)
dat = readRDS(${_input:r})
# Denoised data-driven matrices
res = teem_wrapper(dat$mash_data, dat$Ulist)
saveRDS(res, ${_output:r})
[udr_ed_2]
depends: R_library("udr")
output: f"{_input:n}.UD_ED.rds"
task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 14, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
library(udr) # udr commit 5265079 with changes to set lower bound on the eigenvalues
dat = readRDS(${_input:r})
# Denoised data-driven matrices
# the upstream step saves the mash data under $mash_data; use its Bhat and V here
f0 = ud_init(X = as.matrix(dat$mash_data$Bhat), V = dat$mash_data$V, U_scaled = list(), U_unconstrained = dat$Ulist, n_rank1=0)
res = ud_fit(f0, control = list(unconstrained.update = "ed", resid.update = 'none', maxiter=5000),
verbose=FALSE)
# format to input for simulation with DSC (current pipeline)
saveRDS(list(U=res$U, w=res$w, loglik=res$loglik), ${_output:r})
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_ed --model artificial_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_ed --model gtex_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_teem --model artificial_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_teem --model gtex_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb udr_ed --model artificial_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb udr_ed --model gtex_mixture_identity -c midway2.yml -q midway2
It takes many hours to run ED (possibly a day, depending on the number of threads used) but only a few minutes to run TEEM.
%cd ~/GIT/mvarbvs/dsc/mnm_prototype/mnm_sumstats
For the artificially simulated mixture,
a1 = readRDS('artificial_mixture_identity.FL_PC3.ED.rds')
names(a1)
a1$loglik[length(a1$loglik)]
cbind(names(a1$U), a1$w)
tol = 1E-15
names(a1$U)[which(a1$w>tol)]
The component with the strongest weight is tFLASH.
# plot the sharing pattern of a covariance matrix as a correlation heatmap
plot_sharing = function(X) {
clrs <- colorRampPalette(rev(c("#D73027","#FC8D59","#FEE090","#FFFFBF",
"#E0F3F8","#91BFDB","#4575B4")))(64)
lat <- cov2cor(X)
lat[lower.tri(lat)] <- NA
n <- nrow(lat)
print(lattice::levelplot(lat[n:1,],col.regions = clrs,xlab = "",ylab = "",
colorkey = TRUE,at = seq(0,1,length.out = 64),
scales = list(cex = 0.6,x = list(rot = 45))))
}
plot_sharing(a1$U$tFLASH)
plot_sharing(a1$U$XX)
plot_sharing(a1$U$PCA_1)
For the mixture simulated based on GTEx V8 ED matrices,
g1 = readRDS('gtex_mixture_identity.FL_PC3.ED.rds')
g1$loglik[length(g1$loglik)]
g1$w
names(g1$U)[which(g1$w>tol)]
Again, most of the weight is on tFLASH and tPCA.
plot_sharing(g1$U$tFLASH)
plot_sharing(g1$U$tPCA)
plot_sharing(g1$U$XX)
There are currently two caveats:
a2 = readRDS('artificial_mixture_identity.FLASH_PC3.TEEM.rds')
names(a2)
a2$objective[length(a2$objective)]
a2$w
names(a2$U)[which(a2$w>tol)]
Since TEEM does not preserve the rank of the input matrices, these names are not informative; they just show how many components remain compared to before.
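For example, one way to see this is to count the components carrying non-negligible weight in each fit:

c(ED = sum(a1$w > tol), TEEM = sum(a2$w > tol))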
g2 = readRDS('gtex_mixture_identity.FLASH_PC3.TEEM.rds')
g2$objective[length(g2$objective)]
g2$w
names(g2$U)[which(g2$w>tol)]
Here 100% of the weight is on one component, making it effectively a single-component mixture.
Here I initialize TEEM with the true mixture prior, i.e., the covariances and weights under which the true effect sizes b were simulated:
prior = readRDS("../data/prior_simulation.rds")
setwd('~/tmp/07-May-2020/')
a_data = readRDS('artificial_mixture_identity.FLASH_PC3.rds')
g_data = readRDS('gtex_mixture_identity.FLASH_PC3.rds')
names(prior)
length(prior$gtex_mixture$U)
length(prior$gtex_mixture$w)
a_fit = mashr::teem_wrapper(a_data$mash_data, prior$artificial_mixture_50$U, w_init = prior$artificial_mixture_50$w)
g_fit = mashr::teem_wrapper(g_data$mash_data, prior$gtex_mixture$U, w_init = prior$gtex_mixture$w)
Comparing the objectives with the previous run that used data-driven initialization: the log-likelihood is higher for the oracle initialization with the GTEx-based simulation, but not with the artificial simulation.
print(c(a_fit$objective[length(a_fit$objective)], a2$objective[length(a2$objective)], a_fit$objective[length(a_fit$objective)] - a2$objective[length(a2$objective)]))
print(c(g_fit$objective[length(g_fit$objective)], g2$objective[length(g2$objective)], g_fit$objective[length(g_fit$objective)] - g2$objective[length(g2$objective)]))
names(a_fit$U)[which(a_fit$w>tol)]
a_fit$w[which(a_fit$w>tol)]
names(g_fit$U)[which(g_fit$w>tol)]
g_fit$w[which(g_fit$w>tol)]