Here we want to understand the examples that CAVIAR -c 1
and susie L 1
do not agree for n = 1
causal variable.
The plan is to pin-point the data in question and get the corresponding data-set, then use interactive analysis to explore in detail.
%revisions -s
%cd ~/GIT/github/mvarbvs/dsc
dataset = c('~/Documents/GTExV8/Toys/Thyroid.ENSG00000144445.RDS', '~/Documents/GTExV8/Toys/Thyroid.ENSG00000155324.RDS', '~/Documents/GTExV8/Toys/Thyroid.ENSG00000156738.RDS')
out = dscrutils::dscquery('susie_comparison',
targets = "liter_data.dataset lm_less.n_signal lm_less",
conditions = "lm_less.n_signal = 1")
out[which(out$liter_data.dataset %in% dataset),]
bash:
cp susie_comparison/lm_less/liter_data_{39,48,49}_summarize_ld_1_lm_less_1.pkl ../data
ENSG00000156738
example¶Here I take one data-set and use narratives to work all the way to the point we get CAVIAR and susie results. Hopefully this transparent process will help us pin-pointing the problem.
Load data first:
name = 'ENSG00000156738'
prefix = paste0("/tmp/", name, '_CAVIAR')
dat = dscrutils:::read_dsc('../data/liter_data_49_summarize_ld_1_lm_less_1.pkl')$data
names(dat)
dim(dat$X)
The data has two response variables. We will focus on Y[,1]
:
dim(dat$Y)
The true signal is 816
.
which(dat$true_coef[,1] != 0)
Now output LD and summary stats for CAVIAR
r = cor(dat$X)
write.table(r,paste0(prefix, '.ld'),quote=F,col.names=F,row.names=F)
source('modules/regression.R')
source('modules/fit_caviar.R')
res = mm_regression(as.matrix(dat$X), as.matrix(dat$Y))
z_score = res[1,,]/res[2,,]
cfg = write_caviar_sumstats(z_score, prefix)
max10 = head(order(abs(z_score[,1]), decreasing = T),10)
max10
z_score[max10]
Now run CAVIAR, with prior 0.001 for 1 effect in 1000.
cmd = paste("CAVIAR", "-z", cfg$z, "-l", paste0(prefix, ".ld"), "-o", prefix, "-g 0.001")
dscrutils:::run_cmd(cmd)
log <- readLines(cfg$log)
library(dplyr)
library(magrittr)
# read output tables
snp <- read.delim(cfg$post)
stopifnot(ncol(snp) == 3)
names(snp) <- c("snp", "snp_prob_set", "snp_prob")
snp$snp <- as.character(snp$snp)
snp <- rank_snp(snp)
# `set` of snps
set <- readLines(cfg$set)
set_ordered <- left_join(data_frame(snp = set), snp, by = "snp") %>%
arrange(rank) %$% snp
set_ordered
head(snp, 15)
So here CAVIAR reports one set that contains one causal variant 816
. Notice:
-c 1
but I still get this one signal reported in their *_set
file. When I do specify -c 1
, I will get snp_prob_set
of 1 for 816 and 0 for others (result not shown). The default is -c 2
.816
reports snp_prob
1.0, when -c
is not set, the snp_prob_set
it reports is 0.5 as shown in the table. The other high LD SNPs do share the rest 0.5.Also note that in CAVIAR original output file they use Causal_Post._Prob.
for snp_prob
, interpreted as "the probability of each variant is causal", and Prob_in_pCausalSet
for snp_prob_set
, interpreted as "the amount that this variant contributes to credible set". See documentation here.
Set L=1
for the susie fit, which is just a single effect regression. In sum:
816
the top one, as expected, but the PIP is 0.16# Here my X and Y are already centered
X = scale(dat$X,center=FALSE, scale=TRUE)
Y = dat$Y[,1]
fit = susieR:::single_effect_regression(Y,X,sa2=0.2,s2=var(dat$Y[,1]))
which.max(fit$alpha)
For L=1 the alpha
is the PIP:
plot(fit$alpha, pch=20, xlab='variables', ylab = 'alpha')
Notice that ordering of SNPs are largely consistent between CAVIAR and susie.
order(fit$alpha, decreasing=T)[1:15]
Purity of susie CS, defined by the min of abs(LD):
cs = which(susieR:::in_CS_x(fit$alpha)>0)
purity = r[cs,cs]
purity
length(cs)
min(abs(purity))
plot(exp(fit$lbf), pch = 20, ylab = 'BF', xlab = 'variable')
To fairly compare with not setting CAVIAR -c
option here I set susie L = 5
and fit a susie run:
fit = susieR::susie(dat$X,dat$Y[,1],
L=5,
estimate_residual_variance = F,
prior_variance=0.2,
intercept=FALSE,
tol=1e-3)
susieR:::susie_get_niter(fit)
pip = 1 - apply(1 - fit$alpha, 2, prod)
plot(pip, pch=20, xlab='variables', ylab = 'pip')
The PIP I get here are mostly identical to the single effect model alpha
.
ENSG00000155324
example¶name = 'ENSG00000155324'
dat = dscrutils:::read_dsc('../data/liter_data_48_summarize_ld_1_lm_less_1.pkl')$data
r2 = cor(dat$X)
r2 = r2 ^ 2 * sign(r2)
In this example, I want to see the status of susie BF when the CS idenfied has minimum LD 0.96, as previously reported. Under the L=5 model it reported 6 SNPs.
X = scale(dat$X,center=FALSE, scale=TRUE)
Y = dat$Y[,1]
fit = susieR:::single_effect_regression(Y,X,sa2=0.2,s2=var(dat$Y[,1]))
plot(fit$alpha, pch=20, xlab='variables', ylab = 'alpha')
cs = which(susieR:::in_CS_x(fit$alpha)>0)
purity = r2[cs,cs]
purity
min(abs(purity))
max(abs(purity-diag(nrow(purity))))
So under L = 1
model the CS size is 5. The largest LD is 0.988, which splits away <0.2 of the PIP. The BF's are:
plot(exp(fit$lbf), pch = 20, ylab = 'BF', xlab = 'variable')