Multivariate Bayesian variable selection regression

CS outlier scenarios

This notebook explores scenarios in which CS tend not to contain causal signals (false positives).

Among the CS we identify, ideally 95% should contain one causal signal. This is indeed the case, as shown in this workflow (see the ld_5 step). However, from the plot we noticed some outlier cases in which our Bayesian CS contain too many false positives. We suspect that these outliers are "near null" cases, that is, cases with low PVE and a high number of causal signals.
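To make the 95% target concrete, here is a minimal sketch of how the fraction of CS that capture a causal signal could be computed. The function name `cs_coverage`, the representation of a CS as a list of variable indices, and the toy inputs are all assumptions for illustration; they are not taken from the benchmark code. The sketch counts a CS as a hit when it contains at least one causal index.

```python
def cs_coverage(cs_list, causal):
    """Fraction of CS that contain at least one causal index.

    cs_list: list of CS, each a list of variable indices (assumed format).
    causal:  indices of the true causal signals (known in a simulation).
    """
    causal = set(causal)
    if not cs_list:
        # No CS reported: coverage is undefined
        return float("nan")
    hits = sum(1 for cs in cs_list if causal & set(cs))
    return hits / len(cs_list)


# Hypothetical toy example: 4 CS, true causal indices 3 and 17
toy_cs = [[1, 2, 3], [15, 17], [40, 41], [3, 4]]
toy_causal = [3, 17]
coverage = cs_coverage(toy_cs, toy_causal)
print(f"{coverage:.2f}")  # 3 of the 4 toy CS contain a causal index
```

A simulation replicate would count as an outlier in the sense used here when this fraction falls well below the nominal level.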

This is formally explored here.

What we learned

  • Indeed, the outlier cases are difficult cases
  • It seems we should be conservative: that is, do not estimate residual variance
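One way to quantify how "difficult" a scenario is would be the PVE per causal signal, the same ratio this notebook stores as `avg_pve`. The sketch below ranks a few scenarios by that ratio. The settings in the table are hypothetical values chosen for illustration, not results from the benchmark.

```python
import pandas as pd

# Hypothetical simulation settings; values are illustrative only
scenarios = pd.DataFrame({
    "PVE": [0.05, 0.05, 0.2, 0.4],
    "N_Causal": [1, 5, 5, 2],
})

# PVE per causal signal: small values correspond to "near null" scenarios
scenarios["avg_pve"] = scenarios["PVE"] / scenarios["N_Causal"]

# Smallest per-signal PVE first
hardest = scenarios.sort_values("avg_pve").reset_index(drop=True)
print(hardest)
```

In this toy table the scenario with low PVE and many causal signals sorts first, which matches the "near null" description.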
In [4]:
%cd ~/GIT/github/mvarbvs/dsc

import os
import pickle

import pandas as pd

ld_cutoff = 0.25
capture_cutoff = 0.90

# Load the (file, capture rate) pairs stored under the chosen LD cutoff
with open('benchmark/ld_20180516.pkl', 'rb') as f:
    data = pickle.load(f)[ld_cutoff]
# Key each record by its plot file name: drop the last 4 characters, append '.png'
data = [(os.path.basename(x)[:-4] + '.png', y) for x, y in data]
data = pd.DataFrame(data, columns=['output', 'capture_rate'])

result = pd.read_csv('benchmark/purity_20180516/index.csv')
result['output'] = result['output'].apply(os.path.basename)

merged = pd.merge(result, data, on='output')

# PVE per causal signal
merged['avg_pve'] = merged['PVE'] / merged['N_Causal']
pd.options.display.max_rows = 999
# merged.sort_values(by='purity')

# Replace the numeric capture rate with an over/under label relative to the cutoff
merged['capture_rate'] = merged['capture_rate'].apply(lambda x: f'over {capture_cutoff*100:.1f}%' if x > capture_cutoff else f'under {capture_cutoff*100:.1f}%')
/home/gaow/GIT/github/mvarbvs/dsc
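Before plotting, a quick numeric check of whether estimating the residual variance goes along with under-capture could be a cross-tabulation of `est_residual` against the `capture_rate` label. The sketch below uses a small hypothetical table in place of `merged`; the "yes"/"no" encoding of `est_residual` and all row values are assumptions for illustration, since the contents of index.csv are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for `merged`; rows are illustrative, not benchmark results
toy = pd.DataFrame({
    "est_residual": ["yes", "yes", "yes", "no", "no", "no"],
    "capture_rate": ["under 90.0%", "under 90.0%", "over 90.0%",
                     "over 90.0%", "over 90.0%", "under 90.0%"],
})

# Row-normalized table: share of runs over/under the cutoff for each setting
table = pd.crosstab(toy["est_residual"], toy["capture_rate"], normalize="index")
print(table)
```

Applied to the real data, the same call would be `pd.crosstab(merged['est_residual'], merged['capture_rate'], normalize='index')`.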
In [5]:
import seaborn as sns
sns.set(rc={'figure.figsize':(15,6)}, style="whitegrid")
# catplot/height replace factorplot/size, which were renamed in seaborn 0.9
g = sns.catplot(x="PVE", y="N_Causal",
                hue="est_residual", col="capture_rate",
                data=merged, kind="swarm",
                height=4, aspect=.7)
g.fig.suptitle(f'LD = {ld_cutoff}', y=1.02)
g.savefig("benchmark/ld_20180516_outlier.png", dpi=500)

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago