Multivariate Bayesian variable selection regression

A hard case fine-mapping example

From simulated data we pick this case example where the top z-score might be a result of two adjecent clusters of non-zero effects. See here for a brief description on simulation.

We demonstrate that this setting makes it problematic to run step-wise conditional analysis, but SuSiE's variational procedure can correctly identify the causal signals.

In [1]:
%revisions -s -n 10
Revision Author Date Message
c80c3a6 Gao Wang 2018-07-17 Add illustration to estimate_prior_variance
b2c0458 Gao Wang 2018-07-12 Add an analysis with L=2
8d71645 Gao Wang 2018-07-12 Add comparisons to other methods on the demo example
f383a75 Gao Wang 2018-07-11 Update documentation for a hard-case demo
In [1]:
library(susieR)
set.seed(1)
In [2]:
dat = readRDS(system.file("data", "N2finemapping.rds", package = "susieR"))
names(dat)
  1. 'X'
  2. 'chrom'
  3. 'pos'
  4. 'true_coef'
  5. 'residual_variance'
  6. 'Y'
  7. 'allele_freq'
  8. 'V'

This simulated data-set has two replicates. We focus on the fist replicate:

In [3]:
r = 1

Comparing SuSiE with per-variable regression

We fit SuSiE and obtain its CS and PIP:

In [4]:
fitted = susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.2,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
            coverage = 0.95,
            X = dat$X, 
            min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)

We also perform a per-feature simple regression and get (asymptotic) z-scores from it:

In [5]:
z_score = susieR:::calc_z(dat$X, dat$Y[,r])

We plot p-values from per-feature regression in parallel with PIP from SuSiE.

In [6]:
b = dat$true_coef[,r]
b[which(b!=0)] = 1
png('demo.png', 12, 6, units = 'in', res = 500)
par(mfrow=c(1,2))
susie_pplot(z_score, dtype='z', b=b, main = 'Per-feature regression p-values')
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))
dev.off()
png: 2
In [7]:
%preview demo.png
%preview demo.png
> demo.png (207.7 KiB):

The identified CS are:

In [8]:
print(sets$cs)
$L2
 [1]  850  913  914  915  916  920  924  925  926  927  930  931  933  934  935
[16]  942  946  948  951  952  962  967  968  979  980  982  983  985  988  989
[31]  993  994  996  999 1000 1001 1002

$L3
[1] 337 379 440

L2 corresponds to the blue cluster around position 900. L3 corresponds to the cyan cluster around position 400. In both cases the simulated non-zero effect are captured by the confidence sets.

The CS L2 has many variables. We check how correlated they are:

In [9]:
sets$purity
min.abs.corrmean.abs.corrmedian.abs.corr
L20.97223860.99397380.9949722
L30.85349810.91839930.8944609

The minimum absolute correlation is 0.97 for CS L2. So SuSiE behavior is reasonable.

Looking into SuSiE VEM iterations

It takes 23 iterations for SuSiE to converge on this example. With track_fit=TRUE we keep in fitted$trace object various quantities to help with diagnotics.

In [10]:
fitted$niter
23
In [11]:
susie_iterplot(fitted, 3, 'demo')
Creating GIF animation ...
Iterplot saved to demo.gif
In [12]:
%preview demo.gif
%preview demo.gif
> demo.gif (262.0 KiB):

At the first iteration, the variable with the top z-score was picked up by the first CS. But in later iterations, the first CS vanishes; the 2nd and 3rd CS correctly pick up the two "true" signals.

SuSiE with L=2

We now set L=2 and see how SuSiE converges to the true signals, if capable:

In [13]:
fitted = susie(dat$X, dat$Y[,r], L=2,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.2,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
            coverage = 0.95,
            X = dat$X, 
            min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))
In [14]:
fitted$niter
10
In [15]:
susie_iterplot(fitted, 2, 'demo_l2')
Creating GIF animation ...
Iterplot saved to demo_l2.gif
In [16]:
%preview demo_l2.gif
%preview demo_l2.gif
> demo_l2.gif (174.0 KiB):

It seems at the first iteration, the first CS picks up the induced wrong signal cluster, but the true signal were both picked up by the 2nd CS which is quite wide. Later iterations removed the induced wrong cluster and the 2 CS properly converged to the two true signals as desired.

Is the result robust to prior choice?

Here instead of fixing prior, I ask SuSiE to update prior variance

In [17]:
fitted = susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE, 
               estimate_prior_variance=TRUE,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
            coverage = 0.95,
            X = dat$X, 
            min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)
In [18]:
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))

Now I fix SuSiE prior to a smaller value (scaled prior is 0.01):

In [21]:
fitted = susie(dat$X, dat$Y[,r], L=5,
               scaled_prior_variance=0.01, 
               estimate_prior_variance=TRUE,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
            coverage = 0.95,
            X = dat$X, 
            min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)
In [22]:
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))

Result remains largely the same.

How other methods cope with this

In this folder we provide scripts to run CAVIAR, FINEMAP (version 1.1) and DAP-G on the same data-set for a comparison. We generate summary statistics and run these 3 other methods on the example.

In [20]:
library(abind)
mm_regression = function(X, Y, Z=NULL) {
  if (!is.null(Z)) {
      Z = as.matrix(Z)
  }
  reg = lapply(seq_len(ncol(Y)), function (i) simplify2array(susieR:::univariate_regression(X, Y[,i], Z)))
  reg = do.call(abind, c(reg, list(along=0)))
  # return array: out[1,,] is betahat, out[2,,] is shat
  return(aperm(reg, c(3,2,1)))
}
sumstats = mm_regression(as.matrix(dat$X), as.matrix(dat$Y))
saveRDS(list(data=dat, sumstats=sumstats), 'N2.with_sumstats.rds')
In [21]:
export PATH=~/GIT/github/mvarbvs/dsc/modules/linux:$PATH
Rscript ~/GIT/software/susieR/inst/code/finemap.R input=\"N2.with_sumstats.rds\" output=\"N2finemapping.FINEMAP\" args=\"--n-causal-max\ 2\"
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Running shell command:
finemap --sss --log --in-files /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_1.master --n-causal-max 2

|---------------------------------|
| Welcome to FINEMAP v1.1         |
|                                 |
| (c) 2015 University of Helsinki |
|                                 |
| Help / Documentation:           |
| - ./finemap --help              |
| - www.christianbenner.com       |
|                                 |
| Contact:                        |
| - christian.benner@helsinki.fi  |
| - matti.pirinen@helsinki.fi     |
|---------------------------------|

--------
SETTINGS
--------
- regions         : 1 
- n-causal-max    : 2
- n-configs-top   : 50000
- n-iterations    : 100000
- n-convergence   : 1000
- prob-tol        : 0.001
- corr-config     : 0.95
- prior-std       : 0.05
- prior-k0        : 0

---------------
RUNNING FINEMAP (1/1)
---------------
- GWAS z-scores                   : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_1.z
- SNP correlations                : /tmp/Rtmpatg4Zf/file2f0876a3af8.ld
- Causal SNP configurations       : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_1.config
- Single-SNP statistics           : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_1.snp
- Log file                        : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_1.log

- Reading input                   : done!
- Number of SNPs in region        : 1002
- Number of individuals in GWAS   : 574
- Prior-Pr(# of causal SNPs is k) : 
  (0 -> 0)
   1 -> 0.666667
   2 -> 0.333333
- 125357 configurations evaluated (1.291/100%) : converged after 1291 iterations
- log10(Evidence of at least 1 causal SNP) : 11.6571
- Post-Pr(# of causal SNPs is k)  : 
   0 -> 0
   1 -> 0.00454175
   2 -> 0.995458
- Writing output                  : done!
- Run time                        : 0 hours, 0 minutes, 3 seconds

Running shell command:
finemap --sss --log --in-files /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_2.master --n-causal-max 2

|---------------------------------|
| Welcome to FINEMAP v1.1         |
|                                 |
| (c) 2015 University of Helsinki |
|                                 |
| Help / Documentation:           |
| - ./finemap --help              |
| - www.christianbenner.com       |
|                                 |
| Contact:                        |
| - christian.benner@helsinki.fi  |
| - matti.pirinen@helsinki.fi     |
|---------------------------------|

--------
SETTINGS
--------
- regions         : 1 
- n-causal-max    : 2
- n-configs-top   : 50000
- n-iterations    : 100000
- n-convergence   : 1000
- prob-tol        : 0.001
- corr-config     : 0.95
- prior-std       : 0.05
- prior-k0        : 0

---------------
RUNNING FINEMAP (1/1)
---------------
- GWAS z-scores                   : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_2.z
- SNP correlations                : /tmp/Rtmpatg4Zf/file2f0876a3af8.ld
- Causal SNP configurations       : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_2.config
- Single-SNP statistics           : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_2.snp
- Log file                        : /tmp/Rtmpatg4Zf/file2f08321fe714.finemap_condition_2.log

- Reading input                   : done!
- Number of SNPs in region        : 1002
- Number of individuals in GWAS   : 574
- Prior-Pr(# of causal SNPs is k) : 
  (0 -> 0)
   1 -> 0.666667
   2 -> 0.333333
- 66792 configurations evaluated (1.012/100%) : converged after 1012 iterations
- log10(Evidence of at least 1 causal SNP) : 23.5595
- Post-Pr(# of causal SNPs is k)  : 
   0 -> 0
   1 -> 0.00121549
   2 -> 0.998785
- Writing output                  : done!
- Run time                        : 0 hours, 0 minutes, 2 seconds

In [22]:
export PATH=~/GIT/github/mvarbvs/dsc/modules/linux:$PATH
Rscript ~/GIT/software/susieR/inst/code/caviar.R input=\"N2.with_sumstats.rds\" output=\"N2finemapping.CAVIAR\" args=\"-c\ 2\ -g\ 0.01\"
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Running shell command:Running shell command:

CAVIAR -z /tmp/Rtmp3Owh2y/file2f29f83980.caviar_condition_2.z -l /tmp/Rtmp3Owh2y/file2f20410f882e.ld -o /tmp/Rtmp3Owh2y/file2f29f83980.caviar_condition_2 -c 2 -g 0.01
CAVIAR -z /tmp/Rtmp3Owh2y/file2f28f83980.caviar_condition_1.z -l /tmp/Rtmp3Owh2y/file2f20410f882e.ld -o /tmp/Rtmp3Owh2y/file2f28f83980.caviar_condition_1 -c 2 -g 0.01
@-------------------------------------------------------------@
| CAVIAR!		| 	   v2.2         |  10/Apr/2018| 
|-------------------------------------------------------------|
@-------------------------------------------------------------@
| (C) 2018 Farhad Hormozdiari, GNU General Public License, v2 |
| CAVIAR!		| 	   v2.2         |  10/Apr/2018| 
|-------------------------------------------------------------|
| (C) 2018 Farhad Hormozdiari, GNU General Public License, v2 |
|-------------------------------------------------------------|
| For documentation, citation & bug-report instructions:      |
| 		http://genetics.cs.ucla.edu/caviar/           |
@-------------------------------------------------------------@
|-------------------------------------------------------------|
| For documentation, citation & bug-report instructions:      |
| 		http://genetics.cs.ucla.edu/caviar/           |
@-------------------------------------------------------------@
Max Causal=2
2.38804%Max Causal=2
99.8997%Total Likelihood= 2.410519e+01 SNP=1002 

779 3.414553e-02
764 6.690504e-02
868 9.838524e-02
836 1.298654e-01
867 1.544940e-01
336 1.768328e-01
915 1.950728e-01
848 2.130203e-01
934 2.300667e-01
378 2.470607e-01
439 2.640285e-01
846 2.806369e-01
998 2.970392e-01
837 3.124345e-01
833 3.275188e-01
913 3.425284e-01
823 3.567867e-01
813 3.704611e-01
769 3.826089e-01
966 3.946910e-01
963 4.066651e-01
917 4.184433e-01
816 4.299721e-01
766 4.412555e-01
366 4.521876e-01
992 4.630164e-01
993 4.738438e-01
364 4.844559e-01
919 4.949483e-01
921 5.050235e-01
410 5.150233e-01
995 5.249182e-01
1001 5.344758e-01
999 5.440333e-01
929 5.534842e-01
996 5.628957e-01
925 5.722803e-01
912 5.814287e-01
941 5.905772e-01
932 5.997257e-01
933 6.088741e-01
924 6.180226e-01
926 6.271710e-01
781 6.362407e-01
961 6.451347e-01
803 6.539878e-01
947 6.619442e-01
1000 6.698408e-01
951 6.776310e-01
984 6.851970e-01
930 6.927349e-01
950 7.001960e-01
374 7.075786e-01
967 7.142736e-01
386 7.209363e-01
914 7.274741e-01
872 7.338345e-01
981 7.398116e-01
982 7.457887e-01
979 7.517658e-01
978 7.577429e-01
891 7.636534e-01
976 7.695196e-01
777 7.752637e-01
945 7.808495e-01
805 7.863586e-01
773 7.918351e-01
775 7.973030e-01
987 8.025522e-01
988 8.077198e-01
849 8.127752e-01
936 8.175888e-01
826 8.222920e-01
423 8.269216e-01
371 8.314985e-01
783 8.360307e-01
425 8.405091e-01
923 8.449679e-01
939 8.494181e-01
787 8.538675e-01
986 8.583037e-01
873 8.626282e-01
841 8.669528e-01
835 8.712773e-01
790 8.755104e-01
789 8.797186e-01
791 8.838890e-01
782 8.877957e-01
854 8.916988e-01
464 8.955490e-01
928 8.992789e-01
968 9.029833e-01
972 9.066447e-01
454 9.102568e-01
451 9.138688e-01
802 9.171606e-01
840 9.202919e-01
381 9.233193e-01
442 9.262792e-01
831 9.292185e-01
788 9.319735e-01
811 9.347242e-01
808 9.374748e-01
901 9.401715e-01
776 9.427875e-01
898 9.453469e-01
920 9.478777e-01
340 9.503686e-01

99.8997%Total Likelihood= 5.365968e+01 SNP=1002 

480 1.821856e-01
495 3.616711e-01
565 5.051735e-01
481 6.150687e-01
530 7.173440e-01
500 7.506963e-01
503 7.806433e-01
539 7.969867e-01
735 8.068142e-01
569 8.165952e-01
582 8.253396e-01
506 8.337623e-01
601 8.415058e-01
492 8.491573e-01
675 8.564842e-01
606 8.633343e-01
716 8.700622e-01
677 8.765487e-01
625 8.829365e-01
626 8.893242e-01
637 8.957119e-01
669 9.020997e-01
574 9.084874e-01
705 9.148442e-01
688 9.212010e-01
661 9.275577e-01
618 9.339145e-01
522 9.402712e-01
519 9.466280e-01
699 9.529596e-01

In [23]:
export PATH=~/GIT/github/mvarbvs/dsc/modules/linux:$PATH
python ~/GIT/software/susieR/inst/code/dap-g.py N2.with_sumstats.rds N2finemapping.DAP -ld_control 0.20 --all

Now we look at these results.

In [24]:
caviar = readRDS("N2finemapping.CAVIAR.rds")
finemap = readRDS("N2finemapping.FINEMAP.rds")
dap = readRDS("N2finemapping.DAP.rds")

CAVIAR results

PIP

In [25]:
snp = caviar[[1]]$snp
pip = snp[order(as.numeric(snp$snp)),]$snp_prob
susie_pplot(pip, dtype='PIP', b=b, main = 'CAVIAR')

95% CS

CAVIAR provides single 95% confidence sets as follows:

In [26]:
print(caviar[[1]]$set)
  [1] "780"  "765"  "837"  "869"  "868"  "337"  "916"  "849"  "935"  "379" 
 [11] "440"  "847"  "999"  "838"  "834"  "914"  "824"  "814"  "770"  "967" 
 [21] "964"  "918"  "817"  "767"  "367"  "993"  "994"  "365"  "920"  "922" 
 [31] "411"  "996"  "1000" "1002" "930"  "997"  "926"  "913"  "925"  "927" 
 [41] "933"  "934"  "942"  "782"  "962"  "804"  "948"  "1001" "952"  "985" 
 [51] "931"  "951"  "375"  "968"  "387"  "915"  "873"  "979"  "980"  "982" 
 [61] "983"  "892"  "977"  "778"  "946"  "806"  "774"  "776"  "988"  "989" 
 [71] "850"  "937"  "827"  "424"  "372"  "784"  "426"  "924"  "940"  "788" 
 [81] "987"  "836"  "842"  "874"  "791"  "790"  "792"  "783"  "855"  "465" 
 [91] "929"  "969"  "973"  "452"  "455"  "803"  "841"  "382"  "443"  "832" 
[101] "789"  "809"  "812"  "902"  "777"  "899"  "921"  "341" 
In [27]:
any(which(b>0) %in% caviar[[1]]$set)
TRUE

It is a wide CS, although it does capture the causal SNPs.

FINEMAP results

PIP

In [28]:
snp = finemap[[1]]$snp
pip = snp[order(as.numeric(snp$snp)),]$snp_prob
susie_pplot(pip, dtype='PIP', b=b, main = 'FINEMAP')

95% CS

For each configuration, FINEMAP provides the corresponding configuration probability:

In [29]:
head(finemap[[1]]$set, 10)
rankconfigconfig_probconfig_log10bf
1 780,916 0.00243037615.22015
2 780,935 0.00233733715.20320
3 780,999 0.00233203015.20221
4 765,999 0.00220979515.17883
5 765,935 0.00220849815.17858
6 765,916 0.00217186115.17131
7 837,999 0.00213420115.16371
8 869,999 0.00213420115.16371
9 869,935 0.00213331815.16353
10 837,935 0.00213331815.16353

It requires post-processing to come up with a single 95% set similar to that of CAVIAR with all 3 causal variables captured. The result from running finemap.R provided has been truncated to the minimum set of configurations having cumulative probability of 95%. But the top 10 configurations completely missed the causal signals.

DAP-G results

PIP

In [30]:
snp = dap[[1]]$snp
pip = snp[order(as.numeric(snp$snp)),]$snp_prob
susie_pplot(pip, dtype='PIP', b=b, main = 'DAP-G')

Good job!

95% CS

Similar to SuSiE, DAP-G provides per-signal 95% CS:

In [31]:
dap[[1]]$set
clustercluster_probcluster_avg_r2snp
01 1.000000 0.855 782,778,776,788,790,791,792,780,765,847,869,837,868,824,849,817,838,834,770,814,767,803,783,774,827
12 1.000000 0.966 935,916,999,914,967,993,994,920,926,925,913,927,933,934,942,996,1002,1000,962,1001,930,985,804,892
23 1.000000 0.821 337,379,440,411,365,367
34 0.055970 0.558 538,467,947,737,965,911,956
45 0.000072 1.000 157

Using an average $r^2$ filter of 0.2 ($r=0.44$, compared to SuSiE's $r=0.1$ in earlier analysis), it reports five 95% CS; the first 3 CS contain causal signals. Also the first CS is wider.


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago