Multivariate Bayesian variable selection regression

The updated mvSuSiE benchmark

Over the past few months we have implemented a number of fixes with input from Yuxin, who applied mvSuSiE in a GWAS context and helped iron out some corner cases. Progress on the udr package also gives us better estimates of the mixture prior. Here we rerun all the previously developed benchmarks and examine the updated results.

Benchmark execution

Under the dsc/mnm_prototype directory, run:

sos run 20210224_MNM_Benchmark simulation_only --data-dir mnm_sumstats
sos run 20210224_MNM_Benchmark extract_sumstats --data-dir mnm_sumstats
sos run 20210224_MNM_Benchmark mixture_model --data-dir mnm_sumstats
sos run 20210224_MNM_Benchmark mvSuSiE --data-dir mnm_20210409
sos run 20210224_MNM_Benchmark mvSuSiE_missing --data-dir mnm_missing_20210409
sos run 20210224_MNM_Benchmark mthess --data-dir mthess_20210409
In [ ]:
[global]
parameter: cwd = path('~/GIT/mvarbvs/dsc/mnm_prototype')
parameter: data_dir = path
def fmtP(x):
    return str(x).replace(".", "p")
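For reference, the `fmtP` helper above just makes numeric parameter values filesystem-safe by replacing the decimal point, presumably so values like 0.05 can be embedded in output file names:

```python
def fmtP(x):
    # Replace "." so the value is safe inside a file name.
    return str(x).replace(".", "p")

print(fmtP(0.05))   # 0p05
print(fmtP(2))      # 2
```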

Data simulation

In [ ]:
[simulation_only]
script: interpreter= 'qsub', expand = True
#!/bin/bash

#SBATCH --time=36:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=16000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

module load R
./gtex_qtl.dsc --host dsc_mnm.yml --target simulate_only --n_dataset 25000 -o {data_dir} -s existing -e ignore &> {data_dir}.log

Extract summary statistics for the mixture model

In [ ]:
[extract_sumstats_1, mixture_model_1]
download: dest_file = 'mixture_prior.ipynb'
    https://raw.githubusercontent.com/cumc/bioworkflows/master/multivariate-fine-mapping/mixture_prior.ipynb

[extract_sumstats_2]
def get_cmd(m):
    return f'''
    cd {m} && ls *.rds | sed 's/\.rds//g' > analysis_units.txt && cd -
    sos run mixture_prior.ipynb extract_effects --cwd {data_dir} \
        --analysis-units {m}/analysis_units.txt \
        --datadir {m} --name {m:b} \
        -c ../../midway2.yml -q midway2 -s build &> extract_sumstats_{m:b}.log
    '''
cmds = [get_cmd(path(m)) for m in [f"{data_dir}/artificial_mixture_identity", f"{data_dir}/gtex_mixture_identity"]]
input: for_each = 'cmds'
script: interpreter= 'qsub', expand = True
#!/bin/bash
  
#SBATCH --time=36:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

module load R
{_cmds}

Mixture model fitting

In [ ]:
[mixture_model_2]
def get_cmd(m):
    c1 = f'''
    sos run mixture_prior.ipynb ud --name {m} --cwd {data_dir} --mixture_components flash flash_nonneg pca \
        -c ../../midway2.yml -q midway2 -s build &> ed_{m}.log
    '''
    c2 = f'''
    sos run mixture_prior.ipynb ud --ud-method teem --name {m} --cwd {data_dir} --mixture_components flash flash_nonneg pca \
        -c ../../midway2.yml -q midway2 -s build &> teem_{m}.log
    '''
    c3 = f'''
    sos run mixture_prior.ipynb ed --name {m} --cwd {data_dir} --mixture_components flash flash_nonneg pca \
        -c ../../midway2.yml -q midway2 -s build &> bovy_{m}.log
    '''
    return [c1,c2,c3]
cmds = sum([get_cmd(m) for m in ["artificial_mixture_identity", "gtex_mixture_identity"]], [])
input: for_each = 'cmds'
script: interpreter= 'qsub', expand = True
#!/bin/bash
  
#SBATCH --time=36:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

module load R
{_cmds}

Get prior input files for benchmark

I then made prior data files from these results:

In [1]:
%cd ../..
/project2/mstephens/gaow/mvarbvs
In [2]:
priors = readRDS('data/prior_simulation.rds')
In [3]:
priors$gtex_mixture$ED = readRDS('dsc/mnm_prototype/mnm_sumstats/gtex_mixture_identity.ed_bovy.rds')
priors$artificial_mixture_50$ED = readRDS('dsc/mnm_prototype/mnm_sumstats/artificial_mixture_identity.ed_bovy.rds')
In [4]:
priors$gtex_mixture$TEEM = readRDS('dsc/mnm_prototype/mnm_sumstats/gtex_mixture_identity.teem.rds')
priors$artificial_mixture_50$TEEM = readRDS('dsc/mnm_prototype/mnm_sumstats/artificial_mixture_identity.teem.rds')
In [5]:
priors$gtex_mixture$ED_UDR = readRDS('dsc/mnm_prototype/mnm_sumstats/gtex_mixture_identity.ed.rds')
priors$artificial_mixture_50$ED_UDR = readRDS('dsc/mnm_prototype/mnm_sumstats/artificial_mixture_identity.ed.rds')
In [6]:
saveRDS(priors, 'data/prior_simulation.rds')

And save the estimated residual null z-score correlations,

In [7]:
nullz = list()
nullz$gtex_mixture = readRDS('dsc/mnm_prototype/mnm_sumstats/gtex_mixture_identity.rds')$null.cor
nullz$artificial_mixture_50 = readRDS('dsc/mnm_prototype/mnm_sumstats/artificial_mixture_identity.rds')$null.cor
saveRDS(nullz, 'data/nullz_cor_simulation.rds')

Mixture model results visualization

sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name artificial_mixture_50 --cwd mnm_sumstats
sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name gtex_mixture --cwd mnm_sumstats # --to-cor
sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name artificial_mixture_50\$ED --cwd mnm_sumstats # --to-cor
sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name gtex_mixture\$ED --cwd mnm_sumstats # --to-cor
sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name artificial_mixture_50\$TEEM --cwd mnm_sumstats # --to-cor
sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name gtex_mixture\$TEEM --cwd mnm_sumstats # --to-cor
sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name artificial_mixture_50\$ED_UDR --cwd mnm_sumstats # --to-cor
sos run ~/GIT/bioworkflows/multivariate-fine-mapping/mixture_prior.ipynb plot_U --model-data ../../data/prior_simulation.rds --name gtex_mixture\$ED_UDR --cwd mnm_sumstats # --to-cor

Run mvSuSiE benchmark

A comparison with atlasqtl is also included in this run.

In [ ]:
[mvSuSiE]
script: interpreter= 'qsub', expand = True
#!/bin/bash
  
#SBATCH --time=96:00:00
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

module load R
# Skip errors with -e ignore because the atlasqtl analysis can fail for reasons beyond our control
./gtex_qtl.dsc --host dsc_mnm.yml -o {data_dir} -s existing -e ignore &> {data_dir}.log

Run mvSuSiE benchmark with missing data

In [ ]:
[mvSuSiE_missing]
script: interpreter= 'qsub', expand = True
#!/bin/bash
  
#SBATCH --time=96:00:00
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

module load R
./gtex_qtl.dsc --host dsc_mnm.yml  --target missing_data -o {data_dir} -s existing -e ignore &> {data_dir}.log

Run mthess benchmark

In [ ]:
[mthess]
script: interpreter= 'qsub', expand = True
#!/bin/bash
  
#SBATCH --time=72:00:00
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

module load R
./gtex_qtl.dsc --host dsc_mnm.yml --target mthess -o {data_dir} -s existing -e ignore --n_dataset 200 &> {data_dir}.log

Benchmark results: PIP

For the full data simulation,

sos run 20210224_MNM_Benchmark.ipynb simulation_result --data-dir mnm_20210409
sos run 20210224_MNM_Benchmark.ipynb missing_simulation_result --data-dir mnm_missing_20210409
In [ ]:
[simulation_result]
script: interpreter= 'qsub', expand = True
#!/bin/bash

#SBATCH --time=72:00:00
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

    ALL_METHODS="mnm_oracle mnm_naive mnm_identity mnm_shared mnm_ed mnm_teem mnm_ed_max10 mnm_rss_ed_corY susie atlasqtl"
    PLOT_METHODS="mnm_oracle+flash mnm_oracle+oracle mnm_ed+flash mnm_ed+oracle mnm_teem+flash mnm_teem+oracle mnm_ed_max10+flash mnm_ed_max10+oracle mnm_naive+flash mnm_identity+flash mnm_shared+flash mnm_rss_ed_corY susie atlasqtl"

    # condition specific PIP
    sos run 20210224_MNM_Benchmark.ipynb "get_pip_meta + get_pip_r + pip_calibration" -s build \
        --data-dir {data_dir} \
        --methods $ALL_METHODS
    # can take 6hrs to get the data for curves, from all methods
    sos run 20210224_MNM_Benchmark.ipynb "get_pip_meta + get_pip_r + curves" -s build \
        --data-dir {data_dir} \
        --methods $ALL_METHODS \
        --plot-methods $PLOT_METHODS
    sos run 20210224_MNM_Benchmark.ipynb "get_pip_meta + get_pip_r + curves" -s build \
        --data-dir {data_dir} \
        --methods $ALL_METHODS  \
        --plot-methods $PLOT_METHODS \
        --table roc --xlim 0.005
    # global PIP
    sos run 20210224_MNM_Benchmark.ipynb "get_pip_meta + get_pip + pip_calibration" -s build \
        --data-dir {data_dir} \
        --methods $ALL_METHODS
    sos run 20210224_MNM_Benchmark.ipynb "get_pip_meta + get_pip + curves" -s build \
        --plot-methods $PLOT_METHODS \
        --data-dir {data_dir} \
        --methods $ALL_METHODS
    sos run 20210224_MNM_Benchmark.ipynb "get_pip_meta + get_pip + curves" -s build \
        --data-dir {data_dir} \
        --methods $ALL_METHODS \
        --plot-methods $PLOT_METHODS \
        --table roc --xlim 0.005
In [ ]:
[missing_simulation_result]
script: interpreter= 'qsub', expand = True
#!/bin/bash
  
#SBATCH --time=72:00:00
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

    ALL_METHODS="mnm_oracle mnm_naive mnm_identity mnm_shared mnm_ed mnm_teem mnm_ed_max10 mnm_rss_ed_corY"
    PLOT_METHODS="mnm_oracle+flash mnm_oracle+oracle mnm_ed+flash mnm_ed+oracle mnm_teem+flash mnm_teem+oracle mnm_ed_max10+flash mnm_ed_max10+oracle mnm_naive+flash mnm_identity+flash mnm_shared+flash mnm_rss_ed_corY"
    # condition specific PIP
    sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip_r + pip_calibration" -s build \
        --data-dir {data_dir} --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
        --methods $ALL_METHODS
    # can take 6hrs to get the data for curves, from all methods
    sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip_r + curves" -s build \
        --data-dir {data_dir} --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
        --methods $ALL_METHODS  \
        --plot-methods $PLOT_METHODS
    sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip_r + curves" -s build \
        --data-dir {data_dir} --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
        --methods $ALL_METHODS   \
        --plot-methods $PLOT_METHODS \
        --table roc --xlim 0.005
    # global PIP
    sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip + pip_calibration" -s build \
        --data-dir {data_dir} --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
        --methods $ALL_METHODS
    sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip + curves" -s build \
        --plot-methods $PLOT_METHODS \
        --data-dir {data_dir} --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
        --methods $ALL_METHODS
    sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip + curves" -s build \
        --data-dir {data_dir} --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
        --methods $ALL_METHODS  \
        --plot-methods $PLOT_METHODS \
        --table roc --xlim 0.005

For the missing data simulation,

ALL_METHODS="mnm_oracle mnm_naive mnm_identity mnm_shared mnm_ed mnm_rss_oracle mnm_rss_naive_corZ mnm_rss_ed_corZ mnm_rss_identity_corZ mnm_rss_shared_corZ"
PLOT_METHODS="mnm_oracle+flash mnm_oracle+oracle mnm_ed+flash mnm_ed+oracle mnm_naive+flash mnm_identity+flash mnm_shared+flash"
PLOT_RSS="mnm_rss_oracle+flash mnm_rss_oracle+oracle mnm_rss_ed_corZ+flash mnm_rss_ed_corZ+oracle mnm_rss_naive_corZ+flash mnm_rss_identity_corZ+flash mnm_rss_shared_corZ+flash"
DATA=mnm_missing_20210409
# condition specific PIP
sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip_r + pip_calibration" -s build \
    --data-dir $DATA --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
    --methods $ALL_METHODS
# can take 6hrs to get the data for curves, from all methods
sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip_r + curves" -s build \
    --data-dir $DATA --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
    --methods $ALL_METHODS  \
    --plot-methods $PLOT_METHODS
sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip_r + curves" -s build \
    --data-dir $DATA --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
    --methods $ALL_METHODS   \
    --plot-methods $PLOT_METHODS \
    --table roc --xlim 0.005
# global PIP
sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip + pip_calibration" -s build \
    --data-dir $DATA --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
    --methods $ALL_METHODS
sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip + curves" -s build \
    --plot-methods $PLOT_METHODS \
    --data-dir $DATA --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
    --methods $ALL_METHODS
sos run 20210224_MNM_Benchmark "get_pip_meta + get_pip + curves" -s build \
    --data-dir $DATA --simulate simulate_missing --simulate-method artificial_mixture_missing gtex_mixture_missing \
    --methods $ALL_METHODS  \
    --plot-methods $PLOT_METHODS \
    --table roc --xlim 0.005

Get meta-data information

In [ ]:
[get_pip_meta (meta info)]
parameter: methods = ["mnm_oracle", "mnm_naive", "mnm_identity", "mnm_shared", "mnm_ed", "mnm_rss_ed_corZ", "susie", "atlasqtl"]
parameter: simulate = 'simulate'
output: f'{cwd}/{data_dir}/{data_dir:b}_pip_meta.rds'
R: expand = '${ }', workdir = cwd
    dat = dscrutils::dscquery(${data_dir:r}, target = c("small_data.dataset", "${simulate}", "method", "method.resid_method"), module.output.files=c("${simulate}", "method"), group = c("method: ${' '.join(methods)}", "mnm:", "mnm_missing:", "mnm_rss:"), ignore.missing.file=T)
    print(dim(dat))
    # remove bad files
    # mostly from atlasqtl
    bad_files = vector()
    for (f in dat$method.output.file) {
      if (!file.exists(paste0("${data_dir}/", f,'.rds'))) bad_files = append(bad_files, f)
    }
    dat = dat[which(!(dat$method.output.file %in% bad_files)),]
    print(dim(dat))
    dat$method_rename = NA
    dat$method_rename[which(!is.na(dat$method.resid_method))] = paste(dat$method[which(!is.na(dat$method.resid_method))], dat$method.resid_method[which(!is.na(dat$method.resid_method))], sep = '+')
    dat$method_rename[which(is.na(dat$method.resid_method))] = dat$method[which(is.na(dat$method.resid_method))]
    names(dat)[names(dat) == '${simulate}'] <- 'simulate'
    names(dat)[names(dat) == '${simulate}.output.file'] <- 'simulate.output.file'
    saveRDS(dat, ${_output:r})

Get condition specific PIP

For mvSuSiE, we use the per-condition 1 - lfsr as the condition-specific PIP.
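To make this convention concrete, here is a toy sketch (plain Python, not part of the pipeline; the values are made up) of turning lfsr values into a flattened vector of condition-specific PIPs, condition by condition, as the R code below does with `1 - lfsr` and `c()`:

```python
# Hypothetical lfsr values for P = 3 SNPs in R = 2 conditions,
# stored condition by condition (i.e. column by column).
lfsr = {
    "cond1": [0.01, 0.50, 0.99],
    "cond2": [0.90, 0.02, 0.99],
}

# Condition-specific PIP is approximated as 1 - lfsr; flattening
# condition by condition mirrors R's c() applied to a P x R matrix.
pip_vector = [1.0 - v for cond in ("cond1", "cond2") for v in lfsr[cond]]
print(pip_vector)
```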

In [ ]:
# Extract condition specific PIP
[get_pip_r (condition specific PIP)]
parameter: simulate_method = ['artificial_mixture', 'gtex_mixture']
parameter: subset = -1
input: for_each = 'simulate_method'
output: f'{cwd}/{data_dir}/{data_dir:b}_{_simulate_method}{("_" + str(subset)) if subset>0 else ""}_pip_condition.rds'
R: expand = '${ }', workdir = f'{cwd}/{data_dir}'
    meta = readRDS(${_input:r})
    # apply some filters
    meta = meta[which(meta$simulate == "${_simulate_method}"),]
    if (${subset}<nrow(meta) && ${subset}>0) {
      set.seed(999)
      meta = meta[sample(1:nrow(meta))[1:${subset}],]
    }
    # now collect matrices for each method, of two columns: pip and true_coef
    res = list()
    for (i in 1:nrow(meta)) {
      true_coef = as.integer(readRDS(paste0(meta[i,]$simulate.output.file, '.rds'))$meta$true_coef != 0)
      # make it a vector
      true_coef = c(true_coef)
      method = meta[i,]$method_rename
      if (method == "atlasqtl") {
          pip = readRDS(paste0(meta[i,]$method.output.file, '.rds'))$result$gam_vb_completed
      } else if (method == "susie") {
          pip = do.call(cbind, lapply(readRDS(paste0(meta[i,]$method.output.file, '.rds'))$fitted, function(x) x$pip))  
      } else {
          # approximate per condition PIP using condition specific 1 - lfsr
          pip = 1 - readRDS(paste0(meta[i,]$method.output.file, '.rds'))$result$lfsr          
      }
      # PIP is matrix of P (SNPs) by R (conditions); now make it a vector
      pip = c(pip)
      if (!(method %in% names(res))) {
        res[[method]] = list(pip = pip, truth = true_coef)
      } else {
        res[[method]]$pip = append(res[[method]]$pip, pip)
        res[[method]]$truth = append(res[[method]]$truth, true_coef)
      }
      if (i%%100==0) {
          print(i)
          print(c(method, tail(res[[method]]$pip)))
      }
    }
    for (method in unique(meta$method_rename)) {
      res[[method]] = do.call(cbind, res[[method]])
    }
    saveRDS(res, ${_output:r})

Get global PIP

In [ ]:
# Extract global PIP
[get_pip (global PIP)]
parameter: simulate_method = ['artificial_mixture', 'gtex_mixture']
parameter: subset = -1
input: for_each = 'simulate_method'
output: f'{cwd}/{data_dir}/{data_dir:b}_{_simulate_method}{("_" + str(subset)) if subset>0 else ""}_pip_global.rds'
R: expand = '${ }', workdir = f'{cwd}/{data_dir}'
    meta = readRDS(${_input:r})
    # apply some filters
    meta = meta[which(meta$simulate == "${_simulate_method}"),]
    if (${subset}<nrow(meta) && ${subset}>0) {
      set.seed(999)
      meta = meta[sample(1:nrow(meta))[1:${subset}],]
    }
    # now collect matrices for each method, of two columns: pip and true_coef
    res = list()
    for (i in 1:nrow(meta)) {
      method = meta[i,]$method_rename
      true_coef = as.integer(rowSums(readRDS(paste0(meta[i,]$simulate.output.file, '.rds'))$meta$true_coef) != 0)
      if (method == "atlasqtl") {
          # use max pip cross condition for atlasqtl global PIP
          pip = apply(readRDS(paste0(meta[i,]$method.output.file, '.rds'))$result$gam_vb_completed, 1, max)
      } else if (method == "susie") {
          pip = do.call(cbind, lapply(readRDS(paste0(meta[i,]$method.output.file, '.rds'))$fitted, function(x) x$pip))
          pip = apply(pip, 1, max)
      } else {
          pip = readRDS(paste0(meta[i,]$method.output.file, '.rds'))$result$pip
      }
      if (!(method %in% names(res))) {
        res[[method]] = list(pip = pip, truth = true_coef)
      } else {
        res[[method]]$pip = append(res[[method]]$pip, pip)
        res[[method]]$truth = append(res[[method]]$truth, true_coef)
      }
      if (i%%100==0) print(i)
    }
    for (method in unique(meta$method_rename)) {
      if (!is.null(res[[method]])) res[[method]] = do.call(cbind, res[[method]])
    }
    saveRDS(res, ${_output:r})

Compile PIP calibration data
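The step below bins PIPs into equal-width intervals and, within each bin, compares the average PIP against the observed frequency of true signals; a well calibrated method lands near the diagonal. A minimal Python sketch of the same binning logic (illustrative only; the data are made up):

```python
# Pairs of (pip, truth), where truth is 1 for a true signal.
data = [(0.05, 0), (0.15, 0), (0.18, 1), (0.85, 1), (0.95, 1), (0.92, 0)]

bin_size = 5  # number of equal-width bins on [0, 1]
bins = [(i / bin_size, (i + 1) / bin_size) for i in range(bin_size)]

calibration = []
for lo, hi in bins:
    in_bin = [(p, t) for p, t in data if lo < p <= hi]
    if in_bin:
        mean_pip = sum(p for p, _ in in_bin) / len(in_bin)
        observed = sum(t for _, t in in_bin) / len(in_bin)
    else:
        mean_pip = observed = 0.0
    # (average PIP in bin, observed frequency of signals, bin count)
    calibration.append((mean_pip, observed, len(in_bin)))
print(calibration)
```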

In [ ]:
# Calibration data
[pip_calibration_1 (PIP calibration data)]
parameter: bin_size = 10
output: f'{_input:n}_calibration.rds'
R: expand = '${ }', workdir = cwd
  dat = readRDS(${_input:r})
  bins = cbind(seq(1:${bin_size})/${bin_size}-1/${bin_size}, seq(1:${bin_size})/${bin_size})
  pip_cali = list()
  for (method in names(dat)) {
    pip_cali[[method]] = matrix(NA, nrow(bins), 3)
    for (i in 1:nrow(bins)) {
      data_in_bin = dat[[method]][which(dat[[method]][,1] > bins[i,1] & dat[[method]][,1] < bins[i,2]),]
      if(!is.null(dim(data_in_bin))) {
          pip_cali[[method]][i,1] = sum(data_in_bin[,1])
          pip_cali[[method]][i,2] = sum(data_in_bin[,2])
          pip_cali[[method]][i,3] = nrow(data_in_bin)
      } else {
        pip_cali[[method]][i,] = c(0,0,0) 
      }
    }
  }
  for (method in names(dat)) {
      pip_cali[[method]][,c(1,2)] = pip_cali[[method]][,c(1,2)] / pip_cali[[method]][,3]
  }
  saveRDS(pip_cali, ${_output:r})

Plot PIP calibration

In [ ]:
# Calibration plot
[pip_calibration_2 (PIP calibration plot)]
depends: R_library('cowplot'), executable('convert')
output: f'{_input:n}.png'
R: expand = '${ }', workdir = cwd
    library(ggplot2)
    library(cowplot)
    rename = list('mnm_oracle+oracle' = 'Oracle prior and residual', 'mnm_oracle+flash' = 'Oracle prior', 'mnm_naive+oracle' = 'Default prior oracle residual', 
                  'mnm_naive+flash' = 'Default prior', 'mnm_ed+oracle' = 'ED prior oracle residual', 'mnm_ed+flash' = 'ED prior', 'mnm_teem+oracle' = 'TEEM prior oracle residual', 'mnm_teem+flash' = 'TEEM prior', 
                  'mnm_ed_max10+oracle' = 'ED prior oracle residual (10 iter)', 'mnm_ed_max10+flash' = 'ED prior (10 iter)',
                  'mnm_identity+oracle' = 'Random effects prior oracle residual', 'mnm_identity+flash' = 'Random effects prior',
                  'mnm_shared+oracle' = 'Fixed effect prior oracle residual', 'mnm_shared+flash' = 'Fixed effect prior', atlasqtl = 'atlasqtl', susie = 'SuSiE',
                  "mnm_rss_naive_corZ+nullz" = "RSS default prior", "mnm_rss_ed_corZ+nullz" = "RSS ED prior", "mnm_rss_oracle+oracle" = "RSS oracle prior and residual",
                  "mnm_rss_oracle+identity" = "RSS oracle prior and diag residual", "mnm_rss_shared_corZ+nullz" = "RSS fixed effect prior", "mnm_rss_oracle+nullz" = "RSS oracle prior",
                  "mnm_rss_oracle+corY" = "RSS oracle prior and unit-specific residual")
    dot_plot = function(dataframe) {
        ggplot(dataframe, aes(x=mean_pip, y=observed_freq)) + 
          geom_errorbar(aes(ymin=observed_freq-se, ymax=observed_freq+se), colour="gray", size = 0.2, width=.01) +
          geom_point(size=1.5, shape=21, fill="#002b36") + # 21 is filled circle
          xlab("Estimated rate") +
          ylab("Observed frequency") +
          coord_cartesian(ylim=c(0,1), xlim=c(0,1)) +
          geom_abline(slope=1,intercept=0,colour='red', size=0.2) +
          ggtitle(rename[[name]]) +
          expand_limits(y=0) +                        # Expand y range
          theme_cowplot()
    }
    dat = readRDS(${_input:r})
    idx = 0
    for (name in sort(names(dat))) {
      idx = idx + 1
      # error bars: 2 x binomial standard error of the observed frequency
      dat[[name]][,3] = sqrt(dat[[name]][,2] * (1 - dat[[name]][,2]) / dat[[name]][,3]) * 2
      dat[[name]] = as.data.frame(dat[[name]])
      colnames(dat[[name]]) = c("mean_pip", "observed_freq", "se")
      pdf(paste0(${_output:nr}, '_' , idx, '.pdf'), width=3, height=3, pointsize=16)
      print(dot_plot(dat[[name]]))
      dev.off()
      system(paste0("convert -density 120 ", ${_output:nr}, '_' , idx, '.pdf', " ", ${_output:nr}, '_' , idx, '.png'))
    }
    files = paste0(${_output:nr}, '_', seq(1:idx), '.png')
    cmd = paste('convert +append', paste(files, collapse=" "), ${_output:r})
    system(cmd)
    system(paste('rm -f', paste(files, collapse=" ")))

Get data for ROC and PR curves
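The step below sweeps a grid of PIP thresholds and, at each threshold, counts true and false positives among variants whose PIP clears it, from which precision/recall (PR) and ROC coordinates follow. A stripped-down Python version of one such sweep (illustrative; names and data are made up):

```python
# (pip, truth) pairs; truth is 1 for a true signal, 0 for a null.
data = [(0.99, 1), (0.90, 1), (0.80, 0), (0.40, 1), (0.10, 0)]

def point_at(threshold, pairs):
    # Everything with PIP >= threshold is called significant.
    called = [(p, t) for p, t in pairs if p >= threshold]
    tp = sum(t for _, t in called)
    fp = len(called) - tp
    total_signal = sum(t for _, t in pairs)
    precision = tp / len(called) if called else 1.0
    recall = tp / total_signal
    return precision, recall, fp

# Sweep thresholds on a coarse grid, as the R code does on a finer one.
curve = [point_at(x / 10, data) for x in range(1, 10)]
print(curve)
```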

In [ ]:
# Data for ROC
[curves_1 (Get data for curves)]
pip_cutoff = 0.05
output: f'{_input:n}_curves.rds'
R: expand = '${ }', workdir = cwd
    curves_data = function(d1, cutoff = c(${pip_cutoff}, 0.999), connect_org = T) {
        grid = 1000
        ttv = seq(1:grid)/grid
        ttv = ttv[which(ttv>=cutoff[1] & ttv<=cutoff[2])]
        rst1 = t(sapply(ttv, function(x) c(sum(d1[,2][d1[,1]>=x]), length(d1[,2][d1[,1]>=x]), sum(d1[,2][d1[,1]>=x]==0))))
        rst1 = cbind(rst1, sum(d1[,2]), sum(1-d1[,2]))
        rst1 = as.data.frame(rst1)
        colnames(rst1) = c('true_positive', 'total_positive', 'false_positive', 'total_signal', 'total_null')
        rst2 = as.data.frame(cbind(rst1$true_positive / rst1$total_positive, rst1$true_positive / rst1$total_signal,  ttv))
        rst3 = as.data.frame(cbind(1 - rst1$false_positive / rst1$total_null, rst1$true_positive / rst1$total_signal,  ttv))
        if (connect_org) {
            # make a stair to origin
            rst2 = rbind(rst2, c(max(0.995, rst2[nrow(rst2),1]), max(rst2[nrow(rst2),2]-0.01, 0), rst2[nrow(rst2),3]))
            rst2 = rbind(rst2, c(1, 0, 1))
            rst3 = rbind(rst3, c(1, 0, 1))
        }
        colnames(rst2) = c('Precision', 'Recall', 'Threshold')
        colnames(rst3) = c('TN', 'TP', 'Threshold')
        return(list(counts = rst1, pr = rst2, roc = rst3))
    }
  
    print("Compiling data for ROC/PR curves ...")
    curves = list()
    dat = readRDS(${_input:r})
    for (method in names(dat)) {
      print(method)
      curves[[method]] = curves_data(dat[[method]])
    }
    saveRDS(curves, ${_output:r})

Plot ROC and PR curves

In [ ]:
# Plot for ROC
[curves_2 (Plot curves)]
depends: R_library('scam')
parameter: chunks = 0
parameter: smooth = 'FALSE'
parameter: xlim = 0.8
parameter: ylim = 0.8
# Only plot for certain methods
parameter: plot_methods = ['mnm_oracle+flash', 'mnm_naive+flash', 'mnm_ed+flash', 'mnm_identity+flash', 'mnm_shared+flash']
# "pr" or "roc" 
parameter: table = "pr" # or, `roc`
if table == "pr":
    main = "Precision-FDR curve"
    ylab = "power"
    xlab = "FDR"
else:
    main = "ROC curve"
    ylab = "True Positive"
    xlab = "False Positive"
opt = f"lwd = 2, xlim = c(0,{xlim}), ylim = c(0,{ylim})"
output: f'{_input:n}_{table}.pdf'
R: expand = '${ }'
    colors = c('#A60628', '#7A68A6', '#348ABD', '#467821', '#FF0000', '#188487', '#E2A233',
                  '#A9A9A9', '#000000', '#FF00FF', '#FFD700', '#ADFF2F', '#00FFFF')
    dat = readRDS(${_input:r})
    library(scam)
    create_chunks = function(item, n) {
      splitted = suppressWarnings(split(item, 1:n))
      return(c(splitted[[1]], splitted[[length(splitted)]][length(splitted[[length(splitted)]])]))
    }
    make_smooth = function(x,y,subset=${chunks}, smooth = ${smooth}) {
      if (smooth) {
          if (subset < length(x) && subset > 0) {
              x = create_chunks(x, subset)
              y = create_chunks(y, subset)
          }
          dat = data.frame(cbind(x,y))
          colnames(dat) = c('x','y')
          y=predict(scam(y ~ s(x, bs = "mpi"), data = dat))
      }
      return(list(x=x,y=y))
    }
    add_text = function(thresholds, x, y, threshold, color, delta = -0.06) {
        idx = which(thresholds == threshold)
        text(x[idx] - delta, y[idx], labels = threshold, col = color, cex=0.8)
        points(x[idx],y[idx])
    }
    rename = list('mnm_oracle+oracle' = 'Oracle prior and residual', 'mnm_oracle+flash' = 'Oracle prior', 'mnm_naive+oracle' = 'Default prior oracle residual', 
                  'mnm_naive+flash' = 'Default prior', 'mnm_ed+oracle' = 'ED prior oracle residual', 'mnm_ed+flash' = 'ED prior', 'mnm_teem+oracle' = 'TEEM prior oracle residual', 'mnm_teem+flash' = 'TEEM prior', 
                  'mnm_ed_max10+oracle' = 'ED prior oracle residual (10 iter)', 'mnm_ed_max10+flash' = 'ED prior (10 iter)',
                  'mnm_identity+oracle' = 'Random effects prior oracle residual', 'mnm_identity+flash' = 'Random effects prior',
                  'mnm_shared+oracle' = 'Fixed effect prior oracle residual', 'mnm_shared+flash' = 'Fixed effect prior', atlasqtl = 'atlasqtl', susie = 'SuSiE',
                  "mnm_rss_naive_corZ+nullz" = "RSS default prior", "mnm_rss_ed_corZ+nullz" = "RSS ED prior", "mnm_rss_oracle+oracle" = "RSS oracle prior and residual",
                  "mnm_rss_oracle+identity" = "RSS oracle prior and diag residual", "mnm_rss_shared_corZ+nullz" = "RSS fixed effect prior", "mnm_rss_oracle+nullz" = "RSS oracle prior",
                  "mnm_rss_oracle+corY" = "RSS oracle prior and unit-specific residual")
    labels = vector()
    pdf(${_output:r}, width=10, height=10, pointsize=15)
    i = 1
    for (method in names(dat)) {
        if (method %in% c(${paths(plot_methods):r,})) {
            yy = make_smooth(1 - dat[[method]]$${table}[,1], dat[[method]]$${table}[,2])
            if (i == 1) {
                plot(yy$x, yy$y, t="l", col=colors[i], ylab = "${ylab}", xlab ="${xlab}", main = "${main}", bty='l', ${opt})
            } else {
                lines(yy$x, yy$y, col=colors[i], ${opt})
            }
            #add_text(dat[[method]]$${table}[,3], yy$x, yy$y, 0.9, colors[i])
            add_text(dat[[method]]$${table}[,3], yy$x, yy$y, 0.95, colors[i])
            labels[i] = rename[[method]]
            i = i + 1
      }
    }
    legend("bottomright", legend=labels, col=colors[1:i], lty=c(1,1,1), cex=0.8)
    dev.off()

Benchmark results: CS

For the full data simulation,

sos run 20210224_MNM_Benchmark get_cs_summary -s build \
    --data-dir mnm_20210409 \
    --methods mnm_oracle mnm_naive mnm_identity mnm_shared mnm_ed

For the missing data simulation,

sos run 20210224_MNM_Benchmark get_cs_summary -s build \
    --data-dir mnm_missing_20210409 --simulate simulate_missing \
    --methods mnm_oracle mnm_naive mnm_identity mnm_shared mnm_ed
In [ ]:
[get_cs_summary_1 (meta info)]
parameter: methods = ["mnm_oracle", "mnm_naive", "mnm_identity", "mnm_shared", "mnm_ed", "mnm_rss_ed_corZ"]
parameter: simulate = 'simulate'
output: f'{cwd}/{data_dir}/{data_dir:b}_cs_meta.rds'
R: expand = "${ }"
    dat = dscrutils::dscquery(${data_dir:r}, targets = c('${simulate}', "method", "method.resid_method", 
                                                      'score', 'score.total', 'score.valid', 'score.size', 
                                                      'score.purity', 'score.top', 'score.n_causal', 
                                                      'score.included_causal', 'score.overlap_var', 'score.overlap_cs',
                                                      'score.false_pos_cond_discoveries', 'score.total_cond_discoveries', 'score.size_cond_cs',
                                                      'score.purity_cond_cs', 'score.avg_diff_eff_size_percentile', 'score.cs_correlation',
                                                      'score.false_neg_cond_discoveries', 'score.true_cond_discoveries', 'score.converged'),
                                                       module.output.files = "score", group = c("method: ${' '.join(methods)}", "score: mvsusie_scores", "mnm:", "mnm_missing:", "mnm_rss:"), ignore.missing.file=T)
    saveRDS(dat, ${_output:r})
In [ ]:
[get_cs_summary_2 (summarize info to a table)]
output: f'{_input:nn}_cs_summary_raw.rds'
R: expand = "${ }"
    dat = readRDS(${_input:r})
    dat = tibble::as_tibble(dat)
    print(dim(dat))
    # str(head(dat,1))
    # score.size: average (median)
    # score.purity: average (median)
    # score.total_cond_discoveries: sum
    # score.false_pos_cond_discoveries: sum
    # score.false_neg_cond_discoveries: sum
    # score.true_cond_discoveries: sum
    # score.size_cond_cs: unlist then average
    # score.purity_cond_cs: unlist then average
    # Reformat data
    method = rep(NA, nrow(dat))
    has_resid = !is.na(dat$method.resid_method)
    method[has_resid] = paste(dat$method[has_resid],
                              paste(dat$method.resid_method[has_resid], "residual", sep = '_'),
                              sep = '+')
    method[!has_resid] = dat$method[!has_resid]
    dat$method = method
    # Will deal with median later
    #dat$score.size = sapply(1:length(dat$score.size), function(i) median(dat$score.size[[i]], na.rm=T))
    #dat$score.purity = sapply(1:length(dat$score.purity), function(i) median(dat$score.purity[[i]], na.rm=T))
    for (col in c("score.total_cond_discoveries", "score.false_pos_cond_discoveries",
                  "score.false_neg_cond_discoveries", "score.true_cond_discoveries")) {
      dat[[col]] = sapply(dat[[col]], function(x) sum(x, na.rm=T))
    }
    #dat$score.size_cond_cs = sapply(1:length(dat$score.size_cond_cs), function(i) median(unlist(dat$score.size_cond_cs[[i]]), na.rm=T))
    #dat$score.purity_cond_cs = sapply(1:length(dat$score.purity_cond_cs), function(i) median(unlist(dat$score.purity_cond_cs[[i]]), na.rm=T)) 
    # fixed = TRUE so the "." is matched literally rather than as a regex wildcard
    colnames(dat) = gsub("score.", "", colnames(dat), fixed = TRUE)
    saveRDS(dat, ${_output:r})
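As a sanity check, the list-column flattening above can be illustrated on a tiny hypothetical data frame (the column name mirrors the benchmark output; the counts are made up): each row carries a vector of per-condition discovery counts, which is collapsed to a single per-dataset sum before aggregating across datasets.

```r
# Hypothetical toy data: two datasets, each with a vector of
# per-condition discovery counts stored in a list column.
dat = data.frame(method = c("A", "B"))
dat$score.total_cond_discoveries = list(c(2, 3, NA), c(1, 1, 1))
# Collapse each list element to a per-dataset sum, ignoring NAs,
# as done for the *_cond_discoveries columns above.
dat$score.total_cond_discoveries = sapply(
  dat$score.total_cond_discoveries,
  function(x) sum(x, na.rm = TRUE)
)
print(dat$score.total_cond_discoveries)  # 5 3
```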
In [58]:
[get_cs_summary_3 (summarize for plot data)]
parameter: simulate = 'simulate'
output: f'{_input:nn}_cs_summary.rds'
R: expand = "${ }"
    res = readRDS(${_input:r})
    tibble_median = function(x) {
      if (is.list(x)) median(unlist(x), na.rm=T)
      else median(x, na.rm=T)
    }
    ###
    # Global metric
    ###
    # Global size
    size = aggregate(size~${simulate} + method, res, tibble_median)
    size = size[order(size$${simulate}, size$size),]
    # Global purity
    purity = aggregate(purity~${simulate} + method, res, tibble_median)
    purity = purity[order(purity$${simulate}, purity$purity),]
    # Global power
    # `valid` is the number of CSs that contain a true signal;
    # `included_causal` is the total number of signals captured by some CS
    total_true_included = aggregate(valid ~ ${simulate} + method, res, sum)
    total_true = aggregate(n_causal ~  ${simulate} + method, res, sum)
    cs_overlap = aggregate(overlap_cs ~  ${simulate} + method, res, sum)
    snp_overlap = aggregate(overlap_var ~  ${simulate} + method, res, sum)
    power = merge(total_true_included, total_true, by = c( '${simulate}' , 'method'))
    power = merge(power, cs_overlap,  by = c( '${simulate}' , 'method'))
    power = merge(power, snp_overlap,  by = c( '${simulate}' , 'method'))  
    power$power = round(power$included_causal/power$n_causal,3)
    power$overlap_cs = round(power$overlap_cs, 3)
    power$overlap_var = round(power$overlap_var, 3)
    power = power[order(power$${simulate}, power$power),]
    # Global FDR
    valid = aggregate(valid ~ ${simulate} + method, res, sum)
    total = aggregate(total ~ ${simulate} + method, res, sum)
    fdr = merge(valid, total, by = c( '${simulate}' , 'method'))
    fdr$fdr = round((fdr$total - fdr$valid)/fdr$total,3)
    fdr = fdr[order(fdr$${simulate}, fdr$fdr),]
    # Percentage of CS with top SNP being causal SNP
    # Not very exciting; will not show
    top = aggregate(top~${simulate} + method, res, sum)
    valid = aggregate(valid~${simulate} + method, res, sum)
    top_rate = merge(top, valid,  by = c( '${simulate}' , 'method'))
    top_rate$top_prop = round(top_rate$top/top_rate$valid,3)
    top_rate = top_rate[order(top_rate$${simulate}, top_rate$top_prop),]
    # convergence
    converged = aggregate(converged~${simulate} + method, res, mean)
    converged = converged[order(converged$${simulate}, converged$converged),]
    # CS correlation, defined as the maximum pair-wise correlation for all combinations of variables between two CSs
    cs_correlation = aggregate(cs_correlation~${simulate} + method, res, mean)
    cs_correlation = cs_correlation[order(cs_correlation$${simulate}, cs_correlation$cs_correlation),]    
    ###
    # Condition-specific metrics: a condition-specific CS is a CS that is significant in a given condition at lfsr below some cutoff
    ###
    # size
    size_cond_cs = aggregate(size_cond_cs~${simulate} + method, res, tibble_median)
    size_cond_cs = size_cond_cs[order(size_cond_cs$${simulate}, size_cond_cs$size_cond_cs),]
    # purity
    purity_cond_cs = aggregate(purity_cond_cs~${simulate} + method, res, tibble_median)
    purity_cond_cs = purity_cond_cs[order(purity_cond_cs$${simulate}, purity_cond_cs$purity_cond_cs),]
    # power and FDR
    TP = aggregate(true_cond_discoveries ~ ${simulate} + method, res, sum)
    FP = aggregate(false_pos_cond_discoveries ~  ${simulate} + method, res, sum)
    FN = aggregate(false_neg_cond_discoveries ~  ${simulate} + method, res, sum)
    power_cond = merge(TP, FP, by = c( '${simulate}' , 'method'))
    power_cond = merge(power_cond, FN, by = c( '${simulate}' , 'method'))
    # FIXME: this is not exactly correct because (true_cond_discoveries + false_neg_cond_discoveries)
    # only counts signals inside some CS; true signals not captured by any CS are missing from the denominator.
    power_cond$power_cond_cs = round(power_cond$true_cond_discoveries / (power_cond$true_cond_discoveries + power_cond$false_neg_cond_discoveries), 3)
    power_cond$fdr_cond_cs = round(power_cond$false_pos_cond_discoveries / (power_cond$false_pos_cond_discoveries + power_cond$true_cond_discoveries), 3)
    power_cond = power_cond[order(power_cond$${simulate}, power_cond$power_cond_cs),]
    ###
    # Consolidate into tables
    ###
    b = c( '${simulate}' , 'method')
    out_all = Reduce(function(x, y) merge(x, y, by = b),
      list(size, purity, power, fdr, cs_correlation,
           size_cond_cs, purity_cond_cs, power_cond, converged))
    out = out_all[,c("${simulate}", "method", "total", "overlap_cs", "size", "purity", "power", "fdr", "cs_correlation", "size_cond_cs", "purity_cond_cs", "power_cond_cs", "fdr_cond_cs", "converged")]
    out = out[order(out$${simulate}, out$power),]
    saveRDS(list(table=out, table_verbose=out_all), ${_output:r})
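For concreteness, here is a minimal sketch (with made-up counts) of how the global power and FDR columns are derived from the summed per-dataset scores, following the formulas used in the step above:

```r
# Hypothetical aggregated totals for one method.
totals = data.frame(method = "mvsusie",
                    valid = 90,            # CSs containing a true signal
                    total = 100,           # CSs reported in total
                    included_causal = 85,  # true signals captured by some CS
                    n_causal = 120)        # true signals simulated
# power: fraction of true signals captured; FDR: fraction of CSs without a signal
totals$power = round(totals$included_causal / totals$n_causal, 3)
totals$fdr   = round((totals$total - totals$valid) / totals$total, 3)
print(totals[, c("power", "fdr")])  # 0.708, 0.1
```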

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago