Here, for the simulation benchmark, we prepare the mixture prior based on a multivariate Empirical Bayes Normal Means (EBNM) model (previously we used Extreme Deconvolution for this task).
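Concretely, in standard mash notation the model for gene $j$ observed in $R$ conditions is

$$\hat{b}_j \mid b_j \sim N_R(b_j, S_j), \qquad b_j \sim \sum_{k=1}^K \pi_k N_R(0, U_k),$$

and fitting the EBNM model amounts to estimating the prior covariance matrices $U_k$ and their weights $\pi_k$.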
Here is the analysis plan:
In GTEx we have >35K genes. We want to try using 20K of them because 20K seems to carry enough information to learn the pattern of sharing between conditions for the mixtures I simulated in this notebook.
But we "cheat" a bit: we simulate under identity residual variance for all genes, and fit the EBNM model assuming the residual variance is identity too, or just estimate a global residual variance. This makes the problem easier, because in practice the residual variance can differ (though it may be similar!) across genes.
So the simplified plan is to only do steps 2~5, with step 2 using just the identity matrix for the residual variance.
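To make this concrete, here is a minimal sketch of simulating one gene under identity residual variance (all names and dimensions here are hypothetical, not the benchmark's actual simulation code):

library(MASS)
R = 5 # number of conditions (hypothetical)
U = diag(R) # a single prior covariance component (hypothetical)
b = mvrnorm(1, rep(0, R), U) # true multivariate effect drawn from the prior
bhat = b + rnorm(R) # residual variance is the identity, so noise is N(0, I)
sbhat = rep(1, R) # hence every standard error is 1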
[global]
parameter: cwd = path('/project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats')
parameter: model = 'artificial_mixture_identity' # 'gtex_mixture_identity'
# handle N = per_chunk data-set in one job
parameter: per_chunk = 1000
import glob
%cd /project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats
# extract data for MASH from summary stats
[extract_1]
parameter: seed = 999
parameter: n_random = 4
input: glob.glob(f'{cwd}/{model}/*.rds'), group_by = per_chunk
output: f"{cwd}/{model}/cache/{model}_{_index+1}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
set.seed(${seed})
# return the (row, column) index of a matrix's maximum element
matxMax <- function(mtx) {
return(arrayInd(which.max(mtx), dim(mtx)))
}
remove_rownames = function(x) {
for (name in names(x)) rownames(x[[name]]) = NULL
return(x)
}
extract_one_data = function(infile, n_random) {
# If the input cannot be read for some reason, skip it; losing one data-set is fine.
dat = tryCatch(readRDS(infile)$sumstats, error = function(e) return(NULL))
if (is.null(dat)) return(NULL)
z = abs(dat$bhat/dat$sbhat)
max_idx = matxMax(z)
strong = list(bhat = dat$bhat[max_idx[1],,drop=F], sbhat = dat$sbhat[max_idx[1],,drop=F])
# all rows except the one carrying the strongest signal
sample_idx = setdiff(1:nrow(z), max_idx[1])
random_idx = sample(sample_idx, n_random, replace = T)
random = list(bhat = dat$bhat[random_idx,,drop=F], sbhat = dat$sbhat[random_idx,,drop=F])
return(list(random = remove_rownames(random), strong = remove_rownames(strong)))
}
merge_data = function(res, one_data) {
if (length(res) == 0) {
return(one_data)
} else if (is.null(one_data)) {
return(res)
} else {
for (d in names(one_data)) {
for (s in names(one_data[[d]])) {
res[[d]][[s]] = rbind(res[[d]][[s]], one_data[[d]][[s]])
}
}
return(res)
}
}
res = list()
for (f in c(${_input:r,})) {
res = merge_data(res, extract_one_data(f, ${n_random}))
}
saveRDS(res, ${_output:r})
[extract_2]
input: group_by = "all"
output: f"{cwd}/{model}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
merge_data = function(res, one_data) {
if (length(res) == 0) {
return(one_data)
} else {
for (d in names(one_data)) {
for (s in names(one_data[[d]])) {
res[[d]][[s]] = rbind(res[[d]][[s]], one_data[[d]][[s]])
}
}
return(res)
}
}
dat = list()
for (f in c(${_input:r,})) {
dat = merge_data(dat, readRDS(f))
}
# make output consistent in format with
# https://github.com/stephenslab/gtexresults/blob/master/workflows/mashr_flashr_workflow.ipynb
saveRDS(
list(random.z = dat$random$bhat/dat$random$sbhat,
strong.z = dat$strong$bhat/dat$strong$sbhat,
random.b = dat$random$bhat,
strong.b = dat$strong$bhat,
random.s = dat$random$sbhat,
strong.s = dat$strong$sbhat),
${_output:r})
To run it:
for m in artificial_mixture_identity gtex_mixture_identity; do
sos run analysis/20200502_Prepare_ED_prior.ipynb extract --model $m -c midway2.yml -q midway2
done
mashr
Before this, we need to run the following to generate the FLASH mixture:
for m in artificial_mixture_identity gtex_mixture_identity; do
sos run ~/GIT/gtexresults/workflows/mashr_flashr_workflow.ipynb flash \
--cwd /project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats/ \
--data /project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats/$m.rds \
--effect-model EE -c midway2.yml -q midway2
done
We will use the simple method, estimate_null_correlation_simple from mashr, to compute the residual variance, as implemented in the pipeline below.
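Conceptually, the simple method takes the rows of the random set whose z-scores all look null and uses their correlation as the residual correlation; a rough sketch of the idea (the actual implementation is estimate_null_correlation_simple, called below):

z = dat$random.b / dat$random.s
null_rows = apply(abs(z), 1, max) < 2 # 2 is mashr's default z_thresh; an assumption in this sketch
vhat = cor(z[null_rows, ])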
[mash_ed_1, mash_teem_1, udr_ed_1]
depends: R_library("mashr")
parameter: npc = 3
input: f"{cwd}/{model}.rds", f"{cwd}/{model}.EE.flash.rds"
output: f"{cwd}/{model}.FL_PC{npc}.rds"
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
library(mashr)
dat = readRDS(${_input[0]:r})
vhat = estimate_null_correlation_simple(mash_set_data(dat$random.b, Shat=dat$random.s, zero_Bhat_Shat_reset = 1E3))
mash_data = mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=0, V=vhat, zero_Bhat_Shat_reset = 1E3)
# FLASH matrices
U.flash = readRDS(${_input[1]:r})
# SVD matrices
U.pca = ${"cov_pca(mash_data, %s)" % npc if npc > 0 else "list()"}
# Empirical covariance matrix
X.center = apply(mash_data$Bhat, 2, function(x) x - mean(x))
Ulist = c(U.flash, U.pca, list("XX" = t(X.center) %*% X.center / nrow(X.center)))
saveRDS(list(mash_data = mash_data, Ulist = Ulist), ${_output:r})
[mash_ed_2]
output: f"{_input:n}.ED.rds"
task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 14, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
dat = readRDS(${_input:r})
# Denoised data-driven matrices
res = mashr:::bovy_wrapper(dat$mash_data, dat$Ulist, logfile=${_output:nr}, tol = 1e-06)
# format to input for simulation with DSC (current pipeline)
saveRDS(list(U=res$Ulist, w=res$pi, loglik=scan("${_output:nn}.ED_loglike.log")), ${_output:r})
[mash_teem_2]
output: f"{_input:n}.TEEM.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
library(mashr)
dat = readRDS(${_input:r})
# Denoised data-driven matrices
res = teem_wrapper(dat$mash_data, dat$Ulist)
saveRDS(res, ${_output:r})
[udr_ed_2]
depends: R_library("udr")
output: f"{_input:n}.UD_ED.rds"
task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 14, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
library(udr) # udr commit 5265079 with changes to set lower bound on the eigenvalues
dat = readRDS(${_input:r})
# Denoised data-driven matrices
# the upstream step saves the mash data under $mash_data; use its Bhat and V here
f0 = ud_init(X = as.matrix(dat$mash_data$Bhat), V = dat$mash_data$V, U_scaled = list(), U_unconstrained = dat$Ulist, n_rank1=0)
res = ud_fit(f0, control = list(unconstrained.update = "ed", resid.update = 'none', maxiter=5000),
verbose=FALSE)
# format to input for simulation with DSC (current pipeline)
saveRDS(list(U=res$U, w=res$w, loglik=res$loglik), ${_output:r})
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_ed --model artificial_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_ed --model gtex_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_teem --model artificial_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb mash_teem --model gtex_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb udr_ed --model artificial_mixture_identity -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb udr_ed --model gtex_mixture_identity -c midway2.yml -q midway2
It takes many hours to run ED (possibly a day, depending on the number of threads used) but only a few minutes to run TEEM.
%cd ~/GIT/mvarbvs/dsc/mnm_prototype/mnm_sumstats
For the artificially simulated mixture,
a1 = readRDS('artificial_mixture_identity.FL_PC3.ED.rds')
names(a1)
a1$loglik[length(a1$loglik)]
cbind(names(a1$U), a1$w)
tol = 1E-15
names(a1$U)[which(a1$w>tol)]
The component with the strongest weight is tFLASH.
# plot the sharing pattern of a covariance matrix as a correlation heatmap
plot_sharing = function(X) {
clrs <- colorRampPalette(rev(c("#D73027","#FC8D59","#FEE090","#FFFFBF",
"#E0F3F8","#91BFDB","#4575B4")))(64)
lat <- cov2cor(X)
lat[lower.tri(lat)] <- NA
n <- nrow(lat)
print(lattice::levelplot(lat[n:1,],col.regions = clrs,xlab = "",ylab = "",
colorkey = TRUE,at = seq(0,1,length.out = 64),
scales = list(cex = 0.6,x = list(rot = 45))))
}
plot_sharing(a1$U$tFLASH)
plot_sharing(a1$U$XX)
plot_sharing(a1$U$PCA_1)
For the mixture simulated based on GTEx V8 ED matrices,
g1 = readRDS('gtex_mixture_identity.FL_PC3.ED.rds')
g1$loglik[length(g1$loglik)]
g1$w
names(g1$U)[which(g1$w>tol)]
Again, most of the weight is on tFLASH and tPCA.
plot_sharing(g1$U$tFLASH)
plot_sharing(g1$U$tPCA)
plot_sharing(g1$U$XX)
There are currently two caveats:
a2 = readRDS('artificial_mixture_identity.FLASH_PC3.TEEM.rds')
names(a2)
a2$objective[length(a2$objective)]
a2$w
names(a2$U)[which(a2$w>tol)]
Since TEEM does not preserve the rank of the input matrices, these names are not informative; they just show how many components remain compared to before.
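For example, one way to see this is to count the components carrying non-negligible weight in each fit:

c(ED = sum(a1$w > tol), TEEM = sum(a2$w > tol))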
g2 = readRDS('gtex_mixture_identity.FLASH_PC3.TEEM.rds')
g2$objective[length(g2$objective)]
g2$w
names(g2$U)[which(g2$w>tol)]
Here 100% of the weight is on one component, making it effectively a single-component mixture.
Here I initialize TEEM with the true mixture prior, i.e., the covariances and weights under which the true effect sizes b were simulated:
prior = readRDS("../data/prior_simulation.rds")
setwd('~/tmp/07-May-2020/')
a_data = readRDS('artificial_mixture_identity.FLASH_PC3.rds')
g_data = readRDS('gtex_mixture_identity.FLASH_PC3.rds')
names(prior)
length(prior$gtex_mixture$U)
length(prior$gtex_mixture$w)
a_fit = mashr::teem_wrapper(a_data$mash_data, prior$artificial_mixture_50$U, w_init = prior$artificial_mixture_50$w)
g_fit = mashr::teem_wrapper(g_data$mash_data, prior$gtex_mixture$U, w_init = prior$gtex_mixture$w)
Comparing the objectives with the previous run that used data-driven initialization: the log-likelihood is higher for the oracle initialization with the GTEx-based simulation, but not with the artificial simulation.
print(c(a_fit$objective[length(a_fit$objective)], a2$objective[length(a2$objective)], a_fit$objective[length(a_fit$objective)] - a2$objective[length(a2$objective)]))
print(c(g_fit$objective[length(g_fit$objective)], g2$objective[length(g2$objective)], g_fit$objective[length(g_fit$objective)] - g2$objective[length(g2$objective)]))
names(a_fit$U)[which(a_fit$w>tol)]
a_fit$w[which(a_fit$w>tol)]
names(g_fit$U)[which(g_fit$w>tol)]
g_fit$w[which(g_fit$w>tol)]