Here we evaluate the possible benefit adding a null component to SuSiE. The hope is that the CS will be easier to prune (without using purity) and that the pruned CS can achieve smaller FDR.
set.seed(1)
n = 500
p = 1000
b = rep(0,p)
b[200] = 1
b[800] = 1
X = matrix(rnorm(n*p),nrow=n,ncol=p)
X[,200] = X[,400]
X[,600] = X[,800]
y = X %*% b + rnorm(n)
diag_susie = function(purity = 0, dedup = F, ...) {
s = susieR::susie(X, y, L=5, scaled_prior_variance=0.2, track_fit=F, coverage=NULL, ...)
sets = susieR::susie_get_CS(s, X=cbind(X,0), coverage=0.95,min_abs_corr=purity,dedup=dedup)
str(sets$cs)
print('PIP for the null (1st PIP) and causal (other PIPs)')
pip = susieR::susie_get_PIP(s, sets$cs_index)
print(pip[c(ncol(X)+1,200,400,600,800)])
s$sets= sets
s$pip = pip
print("Number of iterations")
print(s$niter)
return(s)
}
run_susie = function(purity = 0.1, dedup = T, ...) {
diag_susie(purity=purity, dedup=dedup, ...)
}
"Diagnostics" means that we set no purity threshold and remove no duplicate CS. The PIP computation will be based on the un-processed result.
First, fit with SuSiE as is, but over-specifiy $L$ and report all CS obtained (setting min_abs_corr
to zero)
s = diag_susie()
So the 3rd and 4th CS are large. Now add a penalty to null,
As expected the CS got slightly narrower, and the PIP for the null is larger than $1/p$ (0.001 in this case). Now I increase the penalty,
s = diag_susie(null_weight=0.01)
s = diag_susie(null_weight=0.5)
s = diag_susie(null_weight=0.9)
s = diag_susie(null_weight=0.98)
Here we set purity threshold to 0.1 and remove duplicate CS.
s = run_susie()
In this example the default SuSiE with purity filter is good enough. No need to bother with a penalty.
s = run_susie(null_weight=0.7)
s = run_susie(null_weight=0.75)
At least from this example, we will get stuck to a null set only when null_weight
is very high.
We simulate random data,
set.seed(1)
n = 500
p = 1000
X = matrix(rnorm(n*p),nrow=n,ncol=p)
y = rnorm(n)
and run SuSiE with / without purity filter:
s = diag_susie()
s = run_susie()
For this simple case, the purity filter itself is good enough to tell between signal and noise.
I took this data-set from our simulation. This is a case SuSiE makes false discovery in multi-signal setting.
%cd ../dsc/susie_comparison
dat = readRDS('lm_less/liter_data_4_summarize_ld_1_lm_less_3.rds')$data
r=1
fitted = susieR::susie(dat$X, dat$Y[,r], L=5,
estimate_residual_variance=TRUE,
scaled_prior_variance=0.1,
tol=1e-3, track_fit=TRUE)
b = dat$true_coef[,r]
b[which(b!=0)] = 1
The data is noisy,
susieR::susie_plot(fitted, y='z', b=b, main = paste('z-scores'))
and SuSiE confidently makes a mistake,
susieR::susie_plot(fitted, y='PIP', b=b, max_cs=0.4, main = paste('SuSiE', length(fitted$sets$cs), 'CS identified'))
Now let's run SuSiE with penalty 0.5,
X=dat$X
y=dat$Y[,r]
s = run_susie(null_weight=0.5)
s$sets$purity
susieR::susie_plot(s, y='PIP', b=b, max_cs=0.1, main = paste('SuSiE', length(fitted$sets$cs), 'CS identified'))
It is not helping. Now try with 0.9,
s = run_susie(null_weight=0.9)
s$sets$purity
susieR::susie_plot(s, y='PIP', b=b, max_cs=0.1, main = paste('SuSiE', length(fitted$sets$cs), 'CS identified'))
dat = dscrutils::load_inputs(c('lm_less/liter_data_12_summarize_ld_1_lm_less_3.pkl'), dscrutils::read_dsc)$data
r=1
fitted = susieR::susie(dat$X, dat$Y[,r], L=5,
estimate_residual_variance=TRUE,
scaled_prior_variance=0.1,
tol=1e-3, track_fit=TRUE)
b = dat$true_coef[,r]
b[which(b!=0)] = 1
susieR::susie_plot(fitted, y='z', b=b, main = paste('z-scores'))
It seems fine. But previous result,
s = readRDS("fit_susie/liter_data_12_summarize_ld_1_lm_less_3_fit_susie_6.rds")$fitted[[r]]
sets = susieR::susie_get_CS(s, X=dat$X, coverage=0.95,min_abs_corr=0.1)
pip = susieR::susie_get_PIP(s, sets$cs_index)
s$sets= sets
s$pip = pip
susieR::susie_plot(s, y='PIP', b=b, main = paste('SuSiE', length(fitted$sets$cs), 'CS identified'))
debug = readRDS("fit_susie/liter_data_12_summarize_ld_1_lm_less_3_fit_susie_6.rds")$DSC_DEBUG
cat(debug$script)
print(debug$session)