Multivariate Bayesian variable selection regression

Looking into the 3 CAVIAR outliers

Here we want to understand the examples where CAVIAR -c 1 and susie L = 1 do not agree when there is n = 1 causal variable.

The plan is to pinpoint the data in question, retrieve the corresponding dataset, then explore it in detail with interactive analysis.

In [1]:
%revisions -s
Revision Author Date Message
0a2b456 Gao Wang 2018-06-05 Polish calibrated PIP plot
8f4f078 Gao Wang 2018-06-03 Add z-score check
880f1d3 Gao Wang 2018-06-03 Add z-score check
6b6455a Gao Wang 2018-06-01 Add another gene's example for susie single effect model
074f4db Gao Wang 2018-06-01 Add susie L=5 comparison
e4990ea Gao Wang 2018-06-01 Add a comment on -c 1 case for CAVIAR
ecefc0f Gao Wang 2018-06-01 Update CAVIAR table
9894dcf Gao Wang 2018-06-01 Update documentation
e6fce6f Gao Wang 2018-06-01 Finish up one CAVIAR example
7705774 Gao Wang 2018-06-01 Add interactive notebook for CAVIAR issues

Extract simulated dataset

In [2]:
%cd ~/GIT/github/mvarbvs/dsc
/home/gaow/GIT/github/mvarbvs/dsc
In [3]:
dataset = c('~/Documents/GTExV8/Toys/Thyroid.ENSG00000144445.RDS', '~/Documents/GTExV8/Toys/Thyroid.ENSG00000155324.RDS', '~/Documents/GTExV8/Toys/Thyroid.ENSG00000156738.RDS')
out = dscrutils::dscquery('susie_comparison', 
                          targets = "liter_data.dataset lm_less.n_signal lm_less", 
                          conditions = "lm_less.n_signal = 1")
Loading dsc-query output from CSV file.
In [4]:
out[which(out$liter_data.dataset %in% dataset),]
   DSC liter_data.dataset lm_less.n_signal lm_less.output.file
39 1 ~/Documents/GTExV8/Toys/Thyroid.ENSG00000144445.RDS 1 lm_less/liter_data_39_summarize_ld_1_lm_less_1
48 1 ~/Documents/GTExV8/Toys/Thyroid.ENSG00000155324.RDS 1 lm_less/liter_data_48_summarize_ld_1_lm_less_1
49 1 ~/Documents/GTExV8/Toys/Thyroid.ENSG00000156738.RDS 1 lm_less/liter_data_49_summarize_ld_1_lm_less_1
In [5]:
bash:
    cp susie_comparison/lm_less/liter_data_{39,48,49}_summarize_ld_1_lm_less_1.pkl ../data

The ENSG00000156738 example

Here I take one dataset and walk through the analysis step by step, all the way to obtaining the CAVIAR and susie results. Hopefully this transparent process will help us pinpoint the problem.

Load data first:

In [6]:
name = 'ENSG00000156738'
prefix = paste0("/tmp/", name, '_CAVIAR')
In [7]:
dat = dscrutils:::read_dsc('../data/liter_data_49_summarize_ld_1_lm_less_1.pkl')$data

Data preparation

In [8]:
names(dat)
  1. 'X'
  2. 'y'
  3. 'Z'
  4. 'chrom'
  5. 'pos'
  6. 'true_coef'
  7. 'residual_variance'
  8. 'Y'
  9. 'allele_freq'
  10. 'V'
In [9]:
dim(dat$X)
  1. 574
  2. 1001

The data has two response variables. We will focus on Y[,1]:

In [10]:
dim(dat$Y)
  1. 574
  2. 2

The true signal is variable 816:

In [11]:
which(dat$true_coef[,1] != 0)
816

Now output LD and summary stats for CAVIAR

In [12]:
r = cor(dat$X)
write.table(r,paste0(prefix, '.ld'),quote=F,col.names=F,row.names=F)
In [13]:
source('modules/regression.R')
source('modules/fit_caviar.R')
res = mm_regression(as.matrix(dat$X), as.matrix(dat$Y))
z_score = res[1,,]/res[2,,]
cfg = write_caviar_sumstats(z_score, prefix)
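
The z-scores written out above are per-variable univariate regression estimates divided by their standard errors, as computed by `mm_regression`. A minimal sketch of that computation, in Python for illustration (the function name and simulated data are mine, not from the notebook):

```python
import numpy as np

def univariate_z_scores(X, y):
    """z-score (beta_hat / se) from a simple linear regression of y
    on each column of X separately; X and y are centered first."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n = X.shape[0]
    xtx = (X ** 2).sum(axis=0)
    beta = X.T @ y / xtx                     # per-column OLS estimate
    resid = y[:, None] - X * beta            # per-column residuals
    s2 = (resid ** 2).sum(axis=0) / (n - 2)  # residual variance per column
    se = np.sqrt(s2 / xtx)
    return beta / se
```

With one strong true effect, the corresponding column should dominate the absolute z-scores, as variable 816 does here.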

Show the top 10 z-scores as they are:

In [14]:
max10 = head(order(abs(z_score[,1]), decreasing = T),10)
max10
  1. 816
  2. 769
  3. 770
  4. 771
  5. 791
  6. 796
  7. 797
  8. 798
  9. 814
  10. 776
In [15]:
z_score[max10]
  1. -176.56604706574
  2. -163.004840205237
  3. -163.004840205237
  4. -163.004840205237
  5. -163.004840205237
  6. -163.004840205237
  7. -163.004840205237
  8. -163.004840205237
  9. -163.004840205237
  10. -159.404561188874

CAVIAR

Now run CAVIAR, with prior probability 0.001, i.e. 1 causal effect expected among 1000 variables.

In [16]:
cmd = paste("CAVIAR", "-z", cfg$z, "-l", paste0(prefix, ".ld"), "-o", prefix, "-g 0.001")
dscrutils:::run_cmd(cmd)
0
In [17]:
log <- readLines(cfg$log)
library(dplyr)
library(magrittr)
# read output tables
snp <- read.delim(cfg$post)  
stopifnot(ncol(snp) == 3)
names(snp) <- c("snp", "snp_prob_set", "snp_prob")
snp$snp <- as.character(snp$snp)
snp <- rank_snp(snp)

# `set` of snps
set <- readLines(cfg$set)
set_ordered <- left_join(data_frame(snp = set), snp, by = "snp") %>% 
arrange(rank) %$% snp
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:

    filter, lag
The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
In [18]:
set_ordered
  1. '816'
  2. '950'
  3. '925'
In [19]:
head(snp, 15)
rank snp snp_prob    snp_prob_cumsum snp_prob_set
1    816 1.00000e+00 0.5000000       5.00000e-01
2    950 6.37421e-01 0.8187105       3.18711e-01
3    925 3.62579e-01 1.0000000       1.81289e-01
4    906 4.48000e-09 1.0000000       2.24000e-09
5    902 1.91918e-09 1.0000000       9.59592e-10
6    911 1.11758e-14 1.0000000       5.58791e-15
7    860 2.82036e-17 1.0000000       1.41018e-17
8    875 8.75390e-19 1.0000000       4.37695e-19
9    647 3.42789e-19 1.0000000       1.71394e-19
10   837 1.29764e-19 1.0000000       6.48822e-20
11   879 2.48016e-22 1.0000000       1.24008e-22
12   898 1.64572e-23 1.0000000       8.22860e-24
13   890 3.47679e-32 1.0000000       1.73839e-32
14   954 8.88897e-39 1.0000000       4.44449e-39
15   892 9.10015e-65 1.0000000       4.55008e-65

So here CAVIAR reports one set that contains the causal variant 816. Notice:

  • The ordering is largely consistent with the ordering of the z-scores.
  • In the CAVIAR call I did not explicitly specify -c 1, yet this one signal is still reported in the *_set file. When I do specify -c 1, I get a snp_prob_set of 1 for 816 and 0 for the others (result not shown). The default is -c 2.
  • Also, although 816 has snp_prob 1.0, when -c is not set the reported snp_prob_set is 0.5, as shown in the table; the other high-LD SNPs share the remaining 0.5.

Also note that CAVIAR's original output files use Causal_Post._Prob. for snp_prob, interpreted as "the probability that each variant is causal", and Prob_in_pCausalSet for snp_prob_set, interpreted as "the amount this variant contributes to the credible set". See the CAVIAR documentation.
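
The snp_prob_set column is what feeds the reported set: the normalized probabilities are accumulated in decreasing order until they reach the target coverage (0.95 by default). A Python sketch of that selection rule (the function name is mine, not CAVIAR's):

```python
import numpy as np

def credible_set(prob, rho=0.95):
    """Smallest set of variables whose normalized inclusion
    probabilities sum to at least rho, in decreasing order."""
    prob = np.asarray(prob, dtype=float)
    prob = prob / prob.sum()          # normalize, like snp_prob_set
    order = np.argsort(-prob)         # decreasing probability
    cum = np.cumsum(prob[order])
    k = int(np.searchsorted(cum, rho)) + 1
    return order[:k].tolist()
```

Applied to the snp_prob_set column above, this rule yields exactly the 3-SNP set {816, 950, 925} that CAVIAR reports.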

susie single effect

Set L = 1 for the susie fit, which is just a single-effect regression. In summary:

  • susie still picks 816 as the top variable, as expected, but its PIP is only 0.16.
  • There are 14 other variables with PIP under 0.08.
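
The single-effect posterior weights can be understood as normalized per-variable Bayes factors, which is why near-perfectly correlated SNPs split the PIP almost evenly. A sketch of this idea using Wakefield's approximate BF, in Python (an illustration only, not susieR's exact computation; the function name is mine):

```python
import numpy as np

def single_effect_alpha(z, se, prior_var):
    """Posterior inclusion weights for a single-effect model:
    alpha_j proportional to the (approximate) Bayes factor of
    variable j, computed from its z-score and standard error."""
    z = np.asarray(z, dtype=float)
    s2 = np.asarray(se, dtype=float) ** 2
    r = prior_var / (prior_var + s2)
    # log of Wakefield's ABF: 0.5 * (log(1 - r) + z^2 * r)
    lbf = 0.5 * (np.log1p(-r) + z ** 2 * r)
    lbf -= lbf.max()                  # stabilize before exponentiating
    w = np.exp(lbf)
    return w / w.sum()
```

With two z-scores of 10 and 9.9 the weights come out nearly equal, mirroring how 816 and its high-LD neighbors share the PIP here.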
In [20]:
# Here my X and Y are already centered

X = scale(dat$X,center=FALSE, scale=TRUE)
Y = dat$Y[,1]
fit = susieR:::single_effect_regression(Y,X,sa2=0.2,s2=var(dat$Y[,1]))
In [21]:
which.max(fit$alpha)
816

For L=1 the alpha is the PIP:

In [22]:
plot(fit$alpha, pch=20, xlab='variables', ylab = 'alpha')

Notice that the ordering of SNPs is largely consistent between CAVIAR and susie.

In [23]:
order(fit$alpha, decreasing=T)[1:15]
  1. 816
  2. 769
  3. 770
  4. 771
  5. 791
  6. 796
  7. 797
  8. 798
  9. 814
  10. 776
  11. 768
  12. 815
  13. 775
  14. 786
  15. 751

Purity of the susie CS, defined as the minimum of abs(LD) among variables in the set:

In [25]:
cs = which(susieR:::in_CS_x(fit$alpha)>0)
purity = r[cs,cs]
purity
1.0000000 0.9981693 0.9984874 0.9984874 0.9984874 0.9969668 0.9981713 0.9969668 0.9984874 0.9984874 0.9984874 0.9984874 0.9984874 0.9969624 0.9966477
0.9981693 1.0000000 0.9996809 0.9996809 0.9996809 0.9981736 0.9993624 0.9981736 0.9996809 0.9996809 0.9996809 0.9996809 0.9996809 0.9981693 0.9978522
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9969668 0.9981736 0.9984917 0.9984917 0.9984917 1.0000000 0.9981756 0.9969711 0.9984917 0.9984917 0.9984917 0.9984917 0.9984917 0.9969668 0.9966520
0.9981713 0.9993624 0.9996829 0.9996829 0.9996829 0.9981756 1.0000000 0.9981756 0.9996829 0.9996829 0.9996829 0.9996829 0.9996829 0.9981713 0.9978542
0.9969668 0.9981736 0.9984917 0.9984917 0.9984917 0.9969711 0.9981756 1.0000000 0.9984917 0.9984917 0.9984917 0.9984917 0.9984917 0.9969668 0.9966520
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9984874 0.9996809 1.0000000 1.0000000 1.0000000 0.9984917 0.9996829 0.9984917 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9984874 0.9981717
0.9969624 0.9981693 0.9984874 0.9984874 0.9984874 0.9969668 0.9981713 0.9969668 0.9984874 0.9984874 0.9984874 0.9984874 0.9984874 1.0000000 0.9966477
0.9966477 0.9978522 0.9981717 0.9981717 0.9981717 0.9966520 0.9978542 0.9966520 0.9981717 0.9981717 0.9981717 0.9981717 0.9981717 0.9966477 1.0000000
In [26]:
length(cs)
15
In [27]:
min(abs(purity))
0.996647731707717
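
The purity computation above generalizes easily: take the CS indices, pull out the corresponding block of the correlation matrix, and report the smallest absolute value. In Python (illustrative names, assuming a precomputed correlation matrix):

```python
import numpy as np

def cs_purity(R, cs):
    """Minimum absolute pairwise correlation among the variables in a
    credible set; R is the full correlation matrix, cs a list of indices."""
    block = np.abs(np.asarray(R)[np.ix_(cs, cs)])
    return float(block.min())
```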

single effect BFs

In [28]:
plot(exp(fit$lbf), pch = 20, ylab = 'BF', xlab = 'variable')

susie L=5

To compare fairly with CAVIAR when its -c option is not set, here I set susie L = 5 and fit a susie run:

In [5]:
fit = susieR::susie(dat$X,dat$Y[,1],
                               L=5,
                               estimate_residual_variance = F, 
                               prior_variance=0.2, 
                               intercept=FALSE,
                               tol=1e-3)
In [6]:
susieR:::susie_get_niter(fit)
4
In [14]:
pip = 1 - apply(1 - fit$alpha, 2, prod)
In [15]:
plot(pip, pch=20, xlab='variables', ylab = 'pip')

The PIPs I get here are mostly identical to the alpha from the single-effect model.
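
The PIP formula used above treats the L single effects as independent inclusion events: a variable's PIP is one minus the probability that it is excluded from every effect. The same computation sketched in Python:

```python
import numpy as np

def susie_pip(alpha):
    """PIP over L single effects. alpha is an L x p matrix of per-effect
    inclusion probabilities; PIP_j = 1 - prod_l (1 - alpha[l, j])."""
    return 1 - np.prod(1 - np.asarray(alpha), axis=0)
```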

The ENSG00000155324 example

In [25]:
name = 'ENSG00000155324'
dat = dscrutils:::read_dsc('../data/liter_data_48_summarize_ld_1_lm_less_1.pkl')$data
r2 = cor(dat$X)
r2 = r2 ^ 2 * sign(r2)
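
The transform in the cell above squares each correlation while keeping its sign, so strong LD stands out but the direction of the correlation is preserved. Equivalent Python, for illustration:

```python
import numpy as np

def signed_r2(R):
    """Square correlations elementwise while preserving sign
    (same as r2 = r2 ^ 2 * sign(r2) in the R cell above)."""
    R = np.asarray(R, dtype=float)
    return R ** 2 * np.sign(R)
```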

In this example, I want to examine the susie BFs for the case where the identified CS has minimum LD 0.96, as previously reported. Under the L = 5 model it reported 6 SNPs.

In [28]:
X = scale(dat$X,center=FALSE, scale=TRUE)
Y = dat$Y[,1]
fit = susieR:::single_effect_regression(Y,X,sa2=0.2,s2=var(dat$Y[,1]))
In [29]:
plot(fit$alpha, pch=20, xlab='variables', ylab = 'alpha')
In [30]:
cs = which(susieR:::in_CS_x(fit$alpha)>0)
purity = r2[cs,cs]
purity
1.0000000 0.9795675 0.9699838 0.9719937 0.9678303
0.9795675 1.0000000 0.9872564 0.9849938 0.9882706
0.9699838 0.9872564 1.0000000 0.9872842 0.9755299
0.9719937 0.9849938 0.9872842 1.0000000 0.9805825
0.9678303 0.9882706 0.9755299 0.9805825 1.0000000
In [32]:
min(abs(purity))
0.967830269918982
In [35]:
max(abs(purity-diag(nrow(purity))))
0.98827055482377

So under the L = 1 model the CS size is 5. The largest off-diagonal LD is 0.988, which splits away less than 0.2 of the PIP. The BFs are:

In [36]:
plot(exp(fit$lbf), pch = 20, ylab = 'BF', xlab = 'variable')

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago