From simulated data we pick this case example where the top z-score may be the result of two adjacent clusters of non-zero effects. See here for a brief description of the simulation.
We demonstrate that this setting makes stepwise conditional analysis problematic, whereas SuSiE's variational procedure can correctly identify the causal signals.
%revisions -s -n 10
library(susieR)
set.seed(1)
dat = readRDS(system.file("data", "N2finemapping.rds", package = "susieR"))
names(dat)
This simulated dataset has two replicates. We focus on the first replicate:
r = 1
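As a quick sanity check on the data structure (these fields are all used below), we can look at the dimensions of the genotypes, phenotypes and true coefficients:
dim(dat$X)          # genotype matrix
dim(dat$Y)          # phenotypes, one column per replicate
dim(dat$true_coef)  # simulated effects, one column per replicate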
We fit SuSiE and obtain its CS and PIP:
fitted = susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE,
               scaled_prior_variance=0.2,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
                    coverage = 0.95,
                    X = dat$X,
                    min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)
We also perform a per-feature simple regression and get (asymptotic) z-scores from it:
z_score = susieR:::calc_z(dat$X, dat$Y[,r])
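To illustrate the point about stepwise conditional analysis, here is a minimal one-step sketch (an illustration, not part of the susieR workflow): regress out the variable with the largest z-score and recompute per-feature z-scores on the residuals, to see what a forward-selection procedure would be left to work with.
# One step of stepwise conditional analysis (illustration only)
top = which.max(abs(z_score))                       # variable with the largest marginal z-score
y_resid = residuals(lm(dat$Y[,r] ~ dat$X[, top]))   # residualize the phenotype on that variable
z_cond = susieR:::calc_z(dat$X, y_resid)            # recompute per-feature z-scores
susie_pplot(z_cond, dtype='z', b=b, main = 'z-scores after conditioning on the top variable')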
We plot p-values from per-feature regression in parallel with PIP from SuSiE.
b = dat$true_coef[,r]
b[which(b!=0)] = 1
png('demo.png', 12, 6, units = 'in', res = 500)
par(mfrow=c(1,2))
susie_pplot(z_score, dtype='z', b=b, main = 'Per-feature regression p-values')
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))
dev.off()
%preview demo.png
The identified CS are:
print(sets$cs)
CS L2 corresponds to the blue cluster around position 900, and CS L3 corresponds to the cyan cluster around position 400. In both cases, the simulated non-zero effects are captured by the confidence sets.
The CS L2 contains many variables. We check how correlated they are:
sets$purity
The minimum absolute correlation within CS L2 is 0.97, so SuSiE's behavior is reasonable.
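The purity can also be checked directly by computing pairwise correlations among the variables in the CS; a minimal sketch, assuming sets$cs$L2 holds the column indices of the variables in CS L2:
cs_vars = sets$cs$L2                      # variable indices in CS L2 (assumed field name)
r_mat = cor(dat$X[, cs_vars])             # pairwise correlations among those columns
min(abs(r_mat[upper.tri(r_mat)]))         # should agree with the reported minimum absolute correlation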
It takes 23 iterations for SuSiE to converge on this example. With track_fit=TRUE, various per-iteration quantities are stored in the fitted$trace object to help with diagnostics.
fitted$niter
susie_iterplot(fitted, 3, 'demo')
%preview demo.gif
At the first iteration, the variable with the top z-score is picked up by the first CS. In later iterations, however, the first CS vanishes, and the 2nd and 3rd CS correctly pick up the two "true" signals.
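To see what is actually stored for diagnostics, one can inspect the trace object directly; a minimal sketch (the exact contents depend on the susieR version):
str(fitted$trace, max.level = 1)   # peek at the per-iteration quantities saved by track_fit=TRUE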
L=2
We now set L=2 and see whether SuSiE can still converge to the true signals:
fitted = susie(dat$X, dat$Y[,r], L=2,
               estimate_residual_variance=TRUE,
               scaled_prior_variance=0.2,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
                    coverage = 0.95,
                    X = dat$X,
                    min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))
fitted$niter
susie_iterplot(fitted, 2, 'demo_l2')
%preview demo_l2.gif
At the first iteration, the first CS picks up the spurious (induced) signal cluster, while both true signals are captured by the 2nd CS, which is quite wide. Later iterations remove the spurious cluster, and the 2 CS properly converge to the two true signals, as desired.
Here, instead of fixing the prior, I ask SuSiE to estimate the prior variance:
fitted = susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE,
               estimate_prior_variance=TRUE,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
                    coverage = 0.95,
                    X = dat$X,
                    min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))
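To see what prior variances were estimated for each of the L effects, one can inspect the fitted object; a minimal sketch, assuming the estimates are stored in the V element of the susie() return value:
fitted$V   # estimated prior variances, one per effect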
Now I set the scaled prior variance to a smaller value, 0.01:
fitted = susie(dat$X, dat$Y[,r], L=5,
               scaled_prior_variance=0.01,
               estimate_prior_variance=TRUE,
               tol=1e-3, track_fit=TRUE)
sets = susie_get_CS(fitted,
                    coverage = 0.95,
                    X = dat$X,
                    min_abs_corr = 0.1)
pip = susie_get_PIP(fitted, sets$cs_index)
susie_pplot(pip, res=fitted, CS=sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(sets$cs), 'CS identified'))
The result remains largely the same.
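One quick check is whether the simulated causal variables are still among the top-ranked PIPs under the smaller prior:
which(b == 1)                             # simulated causal variables
head(order(pip, decreasing = TRUE), 5)    # variables with the largest PIPs in this run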
In this folder we provide scripts to run CAVIAR, FINEMAP (version 1.1) and DAP-G on the same dataset for comparison. We generate summary statistics and run these three methods on the example.
library(abind)
# Compute per-feature univariate regression summary statistics (betahat and its
# standard error) for every column of Y, optionally adjusting for covariates Z.
mm_regression = function(X, Y, Z=NULL) {
  if (!is.null(Z)) {
    Z = as.matrix(Z)
  }
  reg = lapply(seq_len(ncol(Y)), function (i) simplify2array(susieR:::univariate_regression(X, Y[,i], Z)))
  reg = do.call(abind, c(reg, list(along=0)))
  # return array: out[1,,] is betahat, out[2,,] is shat
  return(aperm(reg, c(3,2,1)))
}
sumstats = mm_regression(as.matrix(dat$X), as.matrix(dat$Y))
saveRDS(list(data=dat, sumstats=sumstats), 'N2.with_sumstats.rds')
export PATH=~/GIT/github/mvarbvs/dsc/modules/linux:$PATH
Rscript ~/GIT/software/susieR/inst/code/finemap.R input=\"N2.with_sumstats.rds\" output=\"N2finemapping.FINEMAP\" args=\"--n-causal-max\ 2\"
export PATH=~/GIT/github/mvarbvs/dsc/modules/linux:$PATH
Rscript ~/GIT/software/susieR/inst/code/caviar.R input=\"N2.with_sumstats.rds\" output=\"N2finemapping.CAVIAR\" args=\"-c\ 2\ -g\ 0.01\"
export PATH=~/GIT/github/mvarbvs/dsc/modules/linux:$PATH
python ~/GIT/software/susieR/inst/code/dap-g.py N2.with_sumstats.rds N2finemapping.DAP -ld_control 0.20 --all
Now we look at these results.
caviar = readRDS("N2finemapping.CAVIAR.rds")
finemap = readRDS("N2finemapping.FINEMAP.rds")
dap = readRDS("N2finemapping.DAP.rds")
snp = caviar[[1]]$snp
pip = snp[order(as.numeric(snp$snp)),]$snp_prob
susie_pplot(pip, dtype='PIP', b=b, main = 'CAVIAR')
CAVIAR provides a single 95% confidence set, as follows:
print(caviar[[1]]$set)
any(which(b>0) %in% caviar[[1]]$set)
It is a wide CS, although it does capture the causal SNPs.
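To quantify how wide it is, we can count the variables in the reported set (assuming, as the membership check above suggests, that it is a vector of variable indices):
length(caviar[[1]]$set)   # number of variables in CAVIAR's 95% confidence set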
snp = finemap[[1]]$snp
pip = snp[order(as.numeric(snp$snp)),]$snp_prob
susie_pplot(pip, dtype='PIP', b=b, main = 'FINEMAP')
For each configuration, FINEMAP provides the corresponding configuration probability:
head(finemap[[1]]$set, 10)
It requires post-processing to obtain a single 95% set, similar to CAVIAR's, with all 3 causal variables captured. The result from running the provided finemap.R script has been truncated to the minimum set of configurations with cumulative probability of 95%. However, the top 10 configurations completely miss the causal signals.
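A minimal sketch of that post-processing, assuming finemap[[1]]$set is a table with one row per configuration, a column of configuration probabilities (here called config_prob) and a column listing the variables in each configuration (here called config); both column names are assumptions:
cfg = finemap[[1]]$set
cfg = cfg[order(cfg$config_prob, decreasing = TRUE), ]               # sort configurations by probability
k = which(cumsum(cfg$config_prob) >= 0.95)[1]                        # smallest number of configurations reaching 95%
sel = unique(unlist(strsplit(as.character(cfg$config[1:k]), ",")))   # union of variables in those configurations
length(sel)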
snp = dap[[1]]$snp
pip = snp[order(as.numeric(snp$snp)),]$snp_prob
susie_pplot(pip, dtype='PIP', b=b, main = 'DAP-G')
Good job!
Similar to SuSiE, DAP-G provides per-signal 95% CS:
dap[[1]]$set
Using an average $r^2$ filter of 0.2 (i.e., $r \approx 0.45$, compared to SuSiE's min_abs_corr of 0.1 in the earlier analysis), DAP-G reports five 95% CS; the first 3 contain the causal signals. The first CS is also wider.