Looking into the 3 CAVIAR outliers¶

Here we want to understand the examples where CAVIAR (-c 1) and susie (L = 1) do not agree when there is n = 1 causal variable.

The plan is to pinpoint the data in question and get the corresponding dataset, then use interactive analysis to explore it in detail.

%revisions -s

Extract simulated dataset¶

%cd ~/GIT/github/mvarbvs/dsc

/home/gaow/GIT/github/mvarbvs/dsc

dataset = c('~/Documents/GTExV8/Toys/Thyroid.ENSG00000144445.RDS', '~/Documents/GTExV8/Toys/Thyroid.ENSG00000155324.RDS', '~/Documents/GTExV8/Toys/Thyroid.ENSG00000156738.RDS')
out = dscrutils::dscquery('susie_comparison', 
                          targets = "liter_data.dataset lm_less.n_signal lm_less", 
                          conditions = "lm_less.n_signal = 1")

Loading dsc-query output from CSV file.

out[which(out$liter_data.dataset %in% dataset),]

bash:
    cp susie_comparison/lm_less/liter_data_{39,48,49}_summarize_ld_1_lm_less_1.pkl ../data

The `ENSG00000156738` example¶

Here I take one dataset and narrate every step up to the point where we get the CAVIAR and susie results. Hopefully this transparent process will help us pinpoint the problem.

Load data first:

name = 'ENSG00000156738'
prefix = paste0("/tmp/", name, '_CAVIAR')

dat = dscrutils:::read_dsc('../data/liter_data_49_summarize_ld_1_lm_less_1.pkl')$data

Data preparation¶

names(dat)

dim(dat$X)

The data has two response variables. We will focus on Y[,1]:

dim(dat$Y)

The true signal is 816.

which(dat$true_coef[,1] != 0)

Now output LD and summary stats for CAVIAR:

r = cor(dat$X)
write.table(r, paste0(prefix, '.ld'), quote = FALSE, col.names = FALSE, row.names = FALSE)

source('modules/regression.R')
source('modules/fit_caviar.R')
res = mm_regression(as.matrix(dat$X), as.matrix(dat$Y))
z_score = res[1,,]/res[2,,]
cfg = write_caviar_sumstats(z_score, prefix)
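
As a rough illustration of what the z-scores above represent, the sketch below computes a per-variable z-score as the effect estimate divided by its standard error, one simple regression per column. It assumes X and y are already centered and fits no intercept. `univariate_z` is a hypothetical helper written for this note, not the `mm_regression` from modules/regression.R, whose exact conventions may differ.

```r
# Hypothetical helper for illustration only (not modules/regression.R).
# Assumes X and y are centered, so each simple regression has no intercept.
univariate_z <- function(X, y) {
  n    <- nrow(X)
  xtx  <- colSums(X^2)                 # x_j' x_j for every column
  bhat <- drop(crossprod(X, y)) / xtx  # least-squares slope per column
  rss  <- sum(y^2) - bhat^2 * xtx      # residual sum of squares per column
  se   <- sqrt(rss / (n - 1) / xtx)    # standard error of each slope
  bhat / se
}

# Small simulated check against lm() without an intercept
set.seed(1)
X_demo <- scale(matrix(rnorm(50 * 4), 50, 4), center = TRUE, scale = FALSE)
y_demo <- 0.5 * X_demo[, 2] + rnorm(50)
y_demo <- y_demo - mean(y_demo)
z_demo <- univariate_z(X_demo, y_demo)
z_lm   <- sapply(1:4, function(j) {
  summary(lm(y_demo ~ X_demo[, j] - 1))$coefficients[1, "t value"]
})
print(cbind(z_demo, z_lm))
```

The closed form avoids calling `lm()` once per variable, which matters when X has thousands of columns.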

Show the top z-scores as is¶

max10 = head(order(abs(z_score[,1]), decreasing = TRUE), 10)
max10

z_score[max10, 1]

CAVIAR¶

Now run CAVIAR with -g 0.001, i.e. a prior of 1 effect in 1000.

cmd = paste("CAVIAR", "-z", cfg$z, "-l", paste0(prefix, ".ld"), "-o", prefix, "-g 0.001")
dscrutils:::run_cmd(cmd)

log <- readLines(cfg$log)
library(dplyr)
library(magrittr)
# Read the per-SNP posterior table written by CAVIAR
snp <- read.delim(cfg$post)
stopifnot(ncol(snp) == 3)
names(snp) <- c("snp", "snp_prob_set", "snp_prob")
snp$snp <- as.character(snp$snp)
snp <- rank_snp(snp)

# The `set` of SNPs reported by CAVIAR, ordered by rank
set <- readLines(cfg$set)
set_ordered <- left_join(tibble(snp = set), snp, by = "snp") %>%
  arrange(rank) %$% snp

set_ordered

head(snp, 15)

So here CAVIAR reports one set that contains a single causal variant, 816. Notice:

The ordering is largely consistent with the ordering of the z-scores.
In the CAVIAR call I did not explicitly specify -c 1, yet I still get this one signal reported in the *_set file. When I do specify -c 1, I get snp_prob_set of 1 for 816 and 0 for all others (result not shown). The default is -c 2.
Although CAVIAR reports snp_prob of 1.0 for 816, when -c is not set the snp_prob_set it reports is 0.5, as shown in the table. The other SNPs in high LD share the remaining 0.5.

Also note that the original CAVIAR output file uses Causal_Post._Prob. for snp_prob, interpreted as "the probability that each variant is causal", and Prob_in_pCausalSet for snp_prob_set, interpreted as "the amount that this variant contributes to the credible set". See the CAVIAR documentation.
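
The relation between the two columns can be checked directly from the reported numbers. The sketch below uses the top five rows of the CAVIAR table from this notebook and shows that, in this output, snp_prob_set matches snp_prob divided by the total of snp_prob (about 2 here). This is an observation about these numbers only, not a statement about how CAVIAR computes the column internally.

```r
# Top five rows copied from the CAVIAR table in this notebook
snp_id       <- c(816, 950, 925, 906, 902)
snp_prob     <- c(1.00000e+00, 6.37421e-01, 3.62579e-01, 4.48000e-09, 1.91918e-09)
snp_prob_set <- c(5.00000e-01, 3.18711e-01, 1.81289e-01, 2.24000e-09, 9.59592e-10)

# Ratio of the two columns, row by row
ratio <- snp_prob_set / snp_prob

# snp_prob rescaled so that it sums to one over these rows
normalized <- snp_prob / sum(snp_prob)

print(data.frame(snp_id, ratio, normalized, snp_prob_set))
```

The remaining SNPs in the table have snp_prob below 1e-14, so leaving them out of the sum does not change the comparison at the printed precision.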

susie single effect¶

Set L=1 for the susie fit, which is just a single effect regression. In sum:

susie still picks 816 as the top variable, as expected, but its PIP is 0.16.
There are 14 other variables with PIP under 0.08.

# Here my X and Y are already centered

X = scale(dat$X,center=FALSE, scale=TRUE)
Y = dat$Y[,1]
fit = susieR:::single_effect_regression(Y,X,sa2=0.2,s2=var(dat$Y[,1]))

which.max(fit$alpha)

For L=1 the alpha is the PIP:

plot(fit$alpha, pch=20, xlab='variables', ylab = 'alpha')

Notice that the ordering of SNPs is largely consistent between CAVIAR and susie.

order(fit$alpha, decreasing = TRUE)[1:15]

Purity of susie CS, defined by the min of abs(LD):

cs = which(susieR:::in_CS_x(fit$alpha)>0)
purity = r[cs,cs]
purity

length(cs)

min(abs(purity))
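
To make the two steps above concrete, here is a minimal sketch of building a credible set from alpha and computing its purity. Both helpers are hypothetical and written for this note. In particular the 0.9 coverage default is an assumption, and `cs_from_alpha` is not the internal `susieR:::in_CS_x`.

```r
# Hypothetical helpers for illustration only (not susieR internals).
# Smallest set of variables whose alpha sums to at least `coverage`.
cs_from_alpha <- function(alpha, coverage = 0.9) {
  o <- order(alpha, decreasing = TRUE)
  k <- which(cumsum(alpha[o]) >= coverage)[1]
  sort(o[seq_len(k)])
}

# Purity: the smallest absolute pairwise LD within the set.
cs_purity <- function(R, cs) min(abs(R[cs, cs, drop = FALSE]))

# Toy example: three variables in tight LD carry most of the posterior mass
alpha_demo <- c(0.5, 0.3, 0.15, 0.04, 0.01)
R_demo <- diag(5)
R_demo[1, 2] <- R_demo[2, 1] <- 0.99
R_demo[1, 3] <- R_demo[3, 1] <- -0.97
R_demo[2, 3] <- R_demo[3, 2] <- -0.96

cs_demo <- cs_from_alpha(alpha_demo)
purity_demo <- cs_purity(R_demo, cs_demo)
print(cs_demo)
print(purity_demo)
```

Taking the absolute value means strongly negative LD counts as high purity, which is why the toy set above is still considered pure.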

Single effect BFs¶

plot(exp(fit$lbf), pch = 20, ylab = 'BF', xlab = 'variable')
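
For reference, the sketch below shows one way the per-variable log BF and alpha of a single effect regression can be computed, under a normal prior with variance sa2 on the effect and a known residual variance s2. It is a hypothetical re-implementation for illustration (`ser_sketch` is not a susieR function) and assumes a uniform prior over variables and centered X and y.

```r
# Hypothetical sketch for illustration only (not susieR's internal code).
ser_sketch <- function(X, y, sa2, s2) {
  xtx   <- colSums(X^2)
  bhat  <- drop(crossprod(X, y)) / xtx   # per-variable effect estimate
  shat2 <- s2 / xtx                      # its sampling variance
  # log BF: marginal likelihood of bhat under the prior vs. under the null
  lbf <- dnorm(bhat, 0, sqrt(sa2 + shat2), log = TRUE) -
         dnorm(bhat, 0, sqrt(shat2), log = TRUE)
  # alpha: BFs normalized across variables (uniform prior on which one)
  w <- exp(lbf - max(lbf))
  list(bhat = bhat, shat2 = shat2, lbf = lbf, alpha = w / sum(w))
}

# Simulated data with one true effect at variable 3
set.seed(2)
X_demo <- scale(matrix(rnorm(100 * 20), 100, 20))
y_demo <- 2 * X_demo[, 3] + rnorm(100)
y_demo <- y_demo - mean(y_demo)
fit_demo <- ser_sketch(X_demo, y_demo, sa2 = 0.2, s2 = var(y_demo))

# Equivalent closed form in terms of z^2 = bhat^2 / shat2
z2 <- fit_demo$bhat^2 / fit_demo$shat2
lbf_closed <- 0.5 * log(fit_demo$shat2 / (0.2 + fit_demo$shat2)) +
              0.5 * z2 * 0.2 / (0.2 + fit_demo$shat2)
print(which.max(fit_demo$alpha))
print(max(abs(fit_demo$lbf - lbf_closed)))
```

Because alpha is the BF normalized across variables, variables in near-perfect LD get nearly equal BFs and therefore split the posterior mass among themselves.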

susie L=5¶

For a fair comparison with CAVIAR run without the -c option, here I set L = 5 and fit susie:

fit = susieR::susie(dat$X, dat$Y[,1],
                    L = 5,
                    estimate_residual_variance = FALSE,
                    prior_variance = 0.2,
                    intercept = FALSE,
                    tol = 1e-3)

susieR:::susie_get_niter(fit)

pip = 1 - apply(1 - fit$alpha, 2, prod)

plot(pip, pch=20, xlab='variables', ylab = 'pip')
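
The PIP formula used above treats the L single effects as independent: a variable is excluded only if every effect excludes it. A small sketch with made-up alpha values (the numbers are hypothetical, chosen only to show the arithmetic) confirms that for L = 1 the PIP is just alpha.

```r
# Same formula as above, wrapped in a function; alpha is an L x p matrix
pip_from_alpha <- function(alpha) 1 - apply(1 - alpha, 2, prod)

# L = 1: the PIP equals the single row of alpha
alpha_L1 <- matrix(c(0.7, 0.2, 0.1), nrow = 1)
pip_L1 <- pip_from_alpha(alpha_L1)

# L = 2: a variable's PIP is 1 minus the product of (1 - alpha) over effects
alpha_L2 <- rbind(c(0.7, 0.2, 0.1),
                  c(0.1, 0.1, 0.8))
pip_L2 <- pip_from_alpha(alpha_L2)

print(pip_L1)
print(pip_L2)
```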

The PIPs I get here are nearly identical to the single effect model's alpha.

The `ENSG00000155324` example¶

name = 'ENSG00000155324'
dat = dscrutils:::read_dsc('../data/liter_data_48_summarize_ld_1_lm_less_1.pkl')$data
r2 = cor(dat$X)
r2 = r2 ^ 2 * sign(r2)

In this example, I want to look at the susie BFs when the identified CS has a minimum LD of 0.96, as previously reported. Under the L=5 model it reported 6 SNPs.

X = scale(dat$X,center=FALSE, scale=TRUE)
Y = dat$Y[,1]
fit = susieR:::single_effect_regression(Y,X,sa2=0.2,s2=var(dat$Y[,1]))

plot(fit$alpha, pch=20, xlab='variables', ylab = 'alpha')

cs = which(susieR:::in_CS_x(fit$alpha)>0)
purity = r2[cs,cs]
purity

min(abs(purity))

max(abs(purity-diag(nrow(purity))))
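
The two summaries above can be reproduced from the 5 x 5 matrix printed in this notebook's output, which I take to be the purity matrix for this gene. Subtracting the identity zeroes the diagonal, so the second quantity is the largest off-diagonal entry. Since r2 here is the signed squared correlation, a square root recovers the absolute correlation.

```r
# Values copied from the 5 x 5 matrix in the notebook output
purity_demo <- matrix(c(
  1.0000000, 0.9795675, 0.9699838, 0.9719937, 0.9678303,
  0.9795675, 1.0000000, 0.9872564, 0.9849938, 0.9882706,
  0.9699838, 0.9872564, 1.0000000, 0.9872842, 0.9755299,
  0.9719937, 0.9849938, 0.9872842, 1.0000000, 0.9805825,
  0.9678303, 0.9882706, 0.9755299, 0.9805825, 1.0000000
), nrow = 5, byrow = TRUE)

# Smallest absolute LD within the set (the purity)
min_ld <- min(abs(purity_demo))

# Largest off-diagonal LD: remove the unit diagonal before taking the max
max_offdiag <- max(abs(purity_demo - diag(nrow(purity_demo))))

print(min_ld)
print(max_offdiag)
print(sqrt(min_ld))  # absolute correlation corresponding to the minimum r^2
```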

So under the L = 1 model the CS size is 5. The largest LD is 0.988, which splits away < 0.2 of the PIP. The BFs are:

plot(exp(fit$lbf), pch = 20, ylab = 'BF', xlab = 'variable')

Revision	Author	Date	Message
0a2b456	Gao Wang	2018-06-05	Polish calibrated PIP plot
8f4f078	Gao Wang	2018-06-03	Add z-score check
880f1d3	Gao Wang	2018-06-03	Add z-score check
6b6455a	Gao Wang	2018-06-01	Add another gene's example for susie single effect model
074f4db	Gao Wang	2018-06-01	Add susie L=5 comparison
e4990ea	Gao Wang	2018-06-01	Add a comment on -c 1 case for CAVIAR
ecefc0f	Gao Wang	2018-06-01	Update CAVIAR table
9894dcf	Gao Wang	2018-06-01	Update documentation
e6fce6f	Gao Wang	2018-06-01	Finish up one CAVIAR example
7705774	Gao Wang	2018-06-01	Add interactive notebook for CAVIAR issues

Output of head(snp, 15):
rank	snp	snp_prob	snp_prob_cumsum	snp_prob_set
1	816	1.00000e+00	0.5000000	5.00000e-01
2	950	6.37421e-01	0.8187105	3.18711e-01
3	925	3.62579e-01	1.0000000	1.81289e-01
4	906	4.48000e-09	1.0000000	2.24000e-09
5	902	1.91918e-09	1.0000000	9.59592e-10
6	911	1.11758e-14	1.0000000	5.58791e-15
7	860	2.82036e-17	1.0000000	1.41018e-17
8	875	8.75390e-19	1.0000000	4.37695e-19
9	647	3.42789e-19	1.0000000	1.71394e-19
10	837	1.29764e-19	1.0000000	6.48822e-20
11	879	2.48016e-22	1.0000000	1.24008e-22
12	898	1.64572e-23	1.0000000	8.22860e-24
13	890	3.47679e-32	1.0000000	1.73839e-32
14	954	8.88897e-39	1.0000000	4.44449e-39
15	892	9.10015e-65	1.0000000	4.55008e-65

susie CS purity matrix r[cs,cs] for ENSG00000156738:
1.0000000	0.9981693	0.9984874	0.9984874	0.9984874	0.9969668	0.9981713	0.9969668	0.9984874	0.9984874	0.9984874	0.9984874	0.9984874	0.9969624	0.9966477
0.9981693	1.0000000	0.9996809	0.9996809	0.9996809	0.9981736	0.9993624	0.9981736	0.9996809	0.9996809	0.9996809	0.9996809	0.9996809	0.9981693	0.9978522
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9969668	0.9981736	0.9984917	0.9984917	0.9984917	1.0000000	0.9981756	0.9969711	0.9984917	0.9984917	0.9984917	0.9984917	0.9984917	0.9969668	0.9966520
0.9981713	0.9993624	0.9996829	0.9996829	0.9996829	0.9981756	1.0000000	0.9981756	0.9996829	0.9996829	0.9996829	0.9996829	0.9996829	0.9981713	0.9978542
0.9969668	0.9981736	0.9984917	0.9984917	0.9984917	0.9969711	0.9981756	1.0000000	0.9984917	0.9984917	0.9984917	0.9984917	0.9984917	0.9969668	0.9966520
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9984874	0.9996809	1.0000000	1.0000000	1.0000000	0.9984917	0.9996829	0.9984917	1.0000000	1.0000000	1.0000000	1.0000000	1.0000000	0.9984874	0.9981717
0.9969624	0.9981693	0.9984874	0.9984874	0.9984874	0.9969668	0.9981713	0.9969668	0.9984874	0.9984874	0.9984874	0.9984874	0.9984874	1.0000000	0.9966477
0.9966477	0.9978522	0.9981717	0.9981717	0.9981717	0.9966520	0.9978542	0.9966520	0.9981717	0.9981717	0.9981717	0.9981717	0.9981717	0.9966477	1.0000000

susie CS purity matrix r2[cs,cs] for ENSG00000155324:
1.0000000	0.9795675	0.9699838	0.9719937	0.9678303
0.9795675	1.0000000	0.9872564	0.9849938	0.9882706
0.9699838	0.9872564	1.0000000	0.9872842	0.9755299
0.9719937	0.9849938	0.9872842	1.0000000	0.9805825
0.9678303	0.9882706	0.9755299	0.9805825	1.0000000

Output of out[which(out$liter_data.dataset %in% dataset),]:
	DSC	liter_data.dataset	lm_less.n_signal	lm_less.output.file
39	1	~/Documents/GTExV8/Toys/Thyroid.ENSG00000144445.RDS	1	lm_less/liter_data_39_summarize_ld_1_lm_less_1
48	1	~/Documents/GTExV8/Toys/Thyroid.ENSG00000155324.RDS	1	lm_less/liter_data_48_summarize_ld_1_lm_less_1
49	1	~/Documents/GTExV8/Toys/Thyroid.ENSG00000156738.RDS	1	lm_less/liter_data_49_summarize_ld_1_lm_less_1