Multivariate Bayesian variable selection regression

SusieR benchmark

A first (comprehensive) set of simulations to learn properties of the new fine-mapping method.

In [1]:
%revisions -s -n 10
Revision Author Date Message
6ac6528 Gao Wang 2018-05-29 Add draft PIP comparison
fde6c24 Gao Wang 2018-05-21 Fix notebook scratch pad
a5637a6 Gao Wang 2018-05-21 Update susie benchmark
7d8e7a3 Gao Wang 2018-05-21 Use LD purity cutoff to evaluate CS quality
70590b4 Gao Wang 2018-05-18 Update description
c2d5ca9 Gao Wang 2018-05-18 Increase number of genes evaluated
54e5338 Gao Wang 2018-05-16 Update batch plot workflow
70a6d7a Gao Wang 2018-05-16 Update draft plot codes
8f4c2bb Gao Wang 2018-05-15 Add more test data-set and a no-plot pipeline

As a first pass, we run susie on simulations of 100 genes, with

  • real genotypes, of length 1000 for starters
  • randomly-chosen SNPs as eQTLs
  • 1-5 eQTLs per gene
  • explaining 5-40% of the variance in Y

We characterize 95% confident sets (CS) it produces in terms of

  • size: number of variants in it
  • purity: min, median or mean LD (turns out we can live with using min)
  • significance: lfsr of a CS
  • false discovery: whether or not they capture a signal

and, to compare with other methods

  • power: whether or not all simulated signals are captured
  • PIP consistency: whether or not

We run Susie with

  • estimate_residual_variance: TRUE and FALSE
  • prior_variance: 5-40%

We should expect that results will be reasonably robust to these choices.

In [1]:
%cd ~/GIT/github/mvarbvs/dsc
/home/gaow/GIT/github/mvarbvs/dsc
In [ ]:
dsc susie.dsc --target run_susie -c 20

Utility function

In [10]:
get_combined = function(sub, dirname, ld_col, ld_cutoff = 0.2, lfsr_cutoff = 0.05) {
    out_files = sub[,c("fit_susie.output.file", "plot_susie.output.file")]
    combined = list(purity = NULL, lfsr = NULL, size = NULL, 
                    captures = NULL, total_captures = NULL, pip = NULL)
    for (i in 1:nrow(out_files)) {
        fit = readRDS(paste0(dirname, out_files[i,1], '.rds'))$posterior
        purity = readRDS(paste0(dirname, out_files[i,2], '.rds'))
        #
        if (is.null(combined$purity)) combined$purity = purity$purity$V1[,ld_col]
        else combined$purity = cbind(combined$purity, purity$purity$V1[,ld_col])
        #
        if (is.null(combined$size)) combined$size = fit$n_in_CI[,1]
        else combined$size = cbind(combined$size, fit$n_in_CI[,1])
        #
        if (is.null(combined$lfsr)) combined$lfsr = fit$set_lfsr[,1]
        else combined$lfsr = cbind(combined$lfsr, fit$set_lfsr[,1])
        #
        alpha = fit$alpha[which(purity$purity$V1[,ld_col] > ld_cutoff),,drop=FALSE]
        pip = t(1 - apply(1 - alpha, 2, prod))
        if (is.null(combined$pip)) combined$pip = pip
        else combined$pip = cbind(combined$pip, pip)     
        #
        if (is.null(combined$captures)) combined$captures = rowSums(purity$signal$V1)
        else combined$captures = cbind(combined$captures, rowSums(purity$signal$V1))
        #
        detected = apply(t(purity$signal$V1[which(fit$set_lfsr[,1] < lfsr_cutoff),,drop=FALSE]), 1, sum)
        if (is.null(combined$total_captures)) combined$total_captures = detected
        else combined$total_captures = combined$total_captures + detected
    }
    return(combined)
}
In [54]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks', context='talk')

COLORS = ['#348ABD', '#7A68A6', '#A60628', '#467821', '#FF0000', '#188487', '#E2A233',
              '#A9A9A9', '#000000', '#FF00FF', '#FFD700', '#ADFF2F', '#00FFFF']
color_mapper = np.vectorize(lambda x: dict([(i,j) for i,j in enumerate(COLORS)]).get(x))

def plot_purity(data, output, lfsr_cutoff = 0.05):
    purity = np.array(data['purity'])
    lfsr = np.array(data['lfsr'])
    size = np.array(data['size'])
    capture = np.array(data['captures'])
    capture_summary = [f"Signal {idx+1} captured {item}/{purity.shape[1]} times" for idx, item in enumerate(data['total_captures'])]
    idx = 0
    plt.figure(figsize=(12, 8))
    L = purity.shape[0]
    cols = 3
    rows = L // cols + L % cols
    position = range(1,L + 1)
    for x, y, z, c in zip(size, purity, lfsr, capture):
        z_sig = [i for i, zz in enumerate(z) if zz <= lfsr_cutoff]
        z_nsig = [i for i, zz in enumerate(z) if zz > lfsr_cutoff]
        colors = [4 if i == 0 else 0 for i in c]
        plt.subplot(rows,cols,position[idx])
        idx += 1
        if len(z_sig):
            label = f'{idx}: lfsr<={lfsr_cutoff}'
            plt.scatter(np.take(x, z_sig),
                        np.take(y, z_sig),
                        c = color_mapper(np.take(colors, z_sig)), 
                        label = label, marker = '*')
        if len(z_nsig):
            label = f'{idx}: lfsr>{lfsr_cutoff}'
            plt.scatter(np.take(x, z_nsig),
                        np.take(y, z_nsig),
                        c = color_mapper(np.take(colors, z_nsig)), 
                        label = label, marker = 'x')   
        plt.legend(bbox_to_anchor=(0,1.02,1,0.2), loc="lower left",
                mode="expand", borderaxespad=0, ncol=2, handletextpad=0.1)
    plt.subplots_adjust(hspace=0.3, wspace = 0.3)
    plt.suptitle(f"95% CI set sizes vs min(abs(LD))\n{'; '.join(capture_summary)}")
    plt.savefig(output, dpi=500, bbox_inches='tight')                    
    plt.gca()

Results

In [2]:
out = dscrutils::dscquery('benchmark', 
                    target = "liter_data.dataset simple_lm.pve simple_lm.n_signal fit_susie.estimate_residual_variance fit_susie.prior_var fit_susie plot_susie")
Loading dsc-query output from CSV file.
In [59]:
head(out)
DSCliter_data.datasetsimple_lm.pvesimple_lm.n_signalfit_susie.estimate_residual_variancefit_susie.prior_varfit_susie.output.fileeval_susie.output.file
1 ~/Documents/GTExV8/Thyroid.Lung.FMO2.filled.rds 0.05 1 FALSE 0.05 fit_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_1 eval_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_1_eval_susie_1
1 ~/Documents/GTExV8/Thyroid.Lung.FMO2.filled.rds 0.05 1 FALSE 0.10 fit_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_3 eval_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_3_eval_susie_1
1 ~/Documents/GTExV8/Thyroid.Lung.FMO2.filled.rds 0.05 1 FALSE 0.20 fit_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_5 eval_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_5_eval_susie_1
1 ~/Documents/GTExV8/Thyroid.Lung.FMO2.filled.rds 0.05 1 FALSE 0.40 fit_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_7 eval_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_7_eval_susie_1
1 ~/Documents/GTExV8/Thyroid.Lung.FMO2.filled.rds 0.05 1 TRUE 0.05 fit_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_2 eval_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_2_eval_susie_1
1 ~/Documents/GTExV8/Thyroid.Lung.FMO2.filled.rds 0.05 1 TRUE 0.10 fit_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_4 eval_susie/liter_data_1_summarize_ld_1_simple_lm_1_fit_susie_4_eval_susie_1
In [8]:
dim(out)
  1. 8000
  2. 8

It is a lot of results to look at. Here we focus on PVE 20% and 40%, having 2, 3, 5 signals, with prior set to 0.2. We focus on measuring "purity" by min(abs(LD)).

2 signals + 20% PVE + est_residual=FALSE

In [4]:
pve = 0.2
n = 2
est_res = FALSE
prior = 0.2
ld_col = 1 # LD_Min
lfsr_cutoff = 0.05
dirname = 'benchmark/'
sub = out[which(out$simple_lm.pve == pve & out$simple_lm.n_signal == n & out$fit_susie.estimate_residual_variance == est_res & out$fit_susie.prior_var == prior),]
In [ ]:
combined = get_combined(sub, dirname, ld_col)
combined
In [50]:
saveRDS(combined, '../data/20180516_susie_bm.rds')
In [51]:
dsc-io ../data/20180516_susie_bm.rds ../data/20180516_susie_bm.pkl
In [1]:
import pickle
data = pickle.load(open('../data/20180516_susie_bm.pkl', 'rb'))
In [55]:
plot_purity(data, '2_p2pve_false.png')

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago