Multivariate Bayesian variable selection regression

Re-evaluating the use of null weight in real data examples

Here, with susieR version 0.4, I examine an example identified from simulations where SuSiE seems to have made a mistake. The data-set can be downloaded here.

In [1]:
%cd ../dsc/susie_comparison
/project/mstephens/SuSiE/mvarbvs/dsc/susie_comparison

Conclusion

`null_weight` did help eliminate the false discovery in this example, although on average it did not seem to help much with CS coverage.

Example: A "confident" mistake SuSiE has made

The problem

In [2]:
dat = readRDS('lm_less/liter_data_4_summarize_ld_1_lm_less_3.rds')$data
r=1
fitted = susieR::susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.1, compute_univariate_zscore = TRUE,
               tol=1e-3, track_fit=TRUE)
In [3]:
b = dat$true_coef[,r]
b[which(b!=0)] = 1
susieR::susie_plot(fitted, y='z', b=b, main = 'z-scores')

Clearly, the CS highlighted in cyan is problematic.

In [4]:
susieR::susie_plot(fitted, y='PIP', b=b, main = 'PIP')

Let's look at the fit summary:

In [5]:
summary(fitted)$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
3 33.17007 1.000000 1.000000 217
1 94.13759 0.998824 0.995986 754,758,760,761,764,765,769,773,774,775,776,777,778,779,788,789,792,795

Notice immediately that the 2nd single effect was dropped by the algorithm. This warrants tracking the fit:

In [30]:
susieR::susie_plot_iteration(fitted, 5, 'test_track_fit')
Creating GIF animation ...
Iterplot saved to test_track_fit.gif
In [31]:
%preview test_track_fit.gif
> test_track_fit.gif (423.7 KiB):

The BF is quite large. The z-score for variable 217 is quite small, though:

In [8]:
fitted$z[217]
-0.05499248453103

The true effect sizes are:

In [9]:
cbind(dat$true_coef[,r][which(b==1)], which(b==1))
-0.2253267 168
 1.0921209 229
-1.3953977 779

The first one is perhaps too small to pick up anyway. The observed z-scores for the true causal variables are:

In [10]:
cbind(fitted$z[which(b==1)], which(b==1))
-5.617059 168
-2.998813 229
-7.489842 779

The 2nd signal, which has the smallest z-score in magnitude, is in fact quite close to variable 217 that SuSiE picked up. Their correlation is:

In [11]:
cor(dat$X[,217], dat$X[,229])
0.777030361613249

They are highly correlated, as expected. It is still a bit puzzling why SuSiE would rather pick 217, whose observed z-score is only -0.05.
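To probe this a bit further, one can check how strongly 217 tags each of the three true signals, not just 229 (a quick sketch reusing the `dat` and `b` objects loaded above):

```r
# Correlation of the spurious pick (217) with each true causal variable.
# If 217 partially tags several signals at once, SuSiE may favor it
# despite its tiny marginal z-score.
true_idx <- which(b == 1)
data.frame(variable = true_idx,
           cor_with_217 = sapply(true_idx,
                                 function(j) cor(dat$X[, 217], dat$X[, j])))
```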

Try with estimated prior

In [12]:
f2 = susieR::susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE, 
               estimate_prior_variance=TRUE, 
               compute_univariate_zscore = TRUE,
               tol=1e-3, track_fit=TRUE)
In [13]:
susieR::susie_plot(f2, y='PIP', b=b, main = 'PIP')

Now SuSiE identifies two of the causal signals, but still not the first one. For reference, the PIPs of the true signals under the original fit:

In [14]:
cbind(fitted$pip[which(b==1)], which(b==1))
6.661338e-16 168
2.861821e-03 229
7.659831e-02 779

The PIP for the first signal is very small. But the estimated prior variances are large for the two detected effects:

In [15]:
f2$V / var(dat$Y[,r])
  1. 0.574991283995021
  2. 0
  3. 0.28486861797216
  4. 0.000741382956399607
  5. 0
In [16]:
summary(f2)$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
1 193.78532 0.9988240 0.9959860 754,758,760,761,764,765,769,773,774,775,776,777,778,779,788,789,792,795
3 95.33272 0.9733277 0.9391438 192,198,215,227,229,243

Try set a large fixed prior

In [17]:
f22 = susieR::susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.5,
               compute_univariate_zscore = TRUE,
               tol=1e-3, track_fit=TRUE)
In [18]:
summary(f22)$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
3 38.19586 1.000000 1.000000 217
1 107.93769 0.998824 0.995986 754,758,760,761,764,765,769,773,774,775,776,777,778,779,788,789,792,795

It still decides to include variable 217.

Try with different L

As pointed out by Matthew, with L=5 SuSiE is actually mis-using the "non-significant" effects (2, 4, 5) to induce a correlation at variable 217. Here I try L=3, for example:

In [37]:
summary(susieR::susie(dat$X, dat$Y[,r], L=3,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.1,
               compute_univariate_zscore = TRUE,
               tol=1e-3))$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
1 89.58071 0.998824 0.9959860 754,758,760,761,764,765,769,773,774,775,776,777,778,779,788,789,792,795
3 31.90596 0.791897 0.4415596 136,156,192,198,200,215,217,227,229

Here, 229 is included, though in a low-purity set. For L=4:

In [38]:
summary(susieR::susie(dat$X, dat$Y[,r], L=4,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.1,
               compute_univariate_zscore = TRUE,
               tol=1e-3))$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
1 91.80845 0.9988240 0.9959860 754,758,760,761,764,765,769,773,774,775,776,777,778,779,788,789,792,795
3 32.56526 0.7272491 0.4660795 136,217,227

Again we miss 229, but get 227, which is very close. This is a convergence issue to be revisited.
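Since the result clearly depends on L, a small scan over several values makes the pattern easier to see (a sketch; it simply re-runs the fits above for L = 2 through 6):

```r
# Re-fit with a range of L values and print the credible sets for each,
# to see how the reported CS change with the number of single effects.
for (l in 2:6) {
  fit_l <- susieR::susie(dat$X, dat$Y[,r], L = l,
                         estimate_residual_variance = TRUE,
                         scaled_prior_variance = 0.1, tol = 1e-3)
  cat("L =", l, "\n")
  print(summary(fit_l)$cs)
}
```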

Try with null weights

First, I use a small null weight (just to test the waters):

In [19]:
f3 = susieR::susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.1,
               null_weight = 0.5,
               compute_univariate_zscore = TRUE,
               tol=1e-3, track_fit=TRUE)
In [20]:
summary(f3)$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
1 30.24555 0.9979646 0.9899808 754,758,760,761,764,765,768,769,773,774,775,776,777,778,779,782,788,789,792,795
2 10.80581 0.7911250 0.3237647 253,317,458,477,480,487,498,512,524,538,541,560,572,594,601,606,609,626,627,672,703,721

Interestingly, it now picks another irrelevant CS. Although that CS has a small BF, it is still much larger than the BF of the effects dominated by the null variable. In other words, a small penalty does not help; it makes things worse. It does, however, highlight the originally missed 2nd effect. Convergence happened rather quickly. It seems possible to make the 2nd CS much larger (to contain variables around position 250) and thereby remove it for good.
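To see the effect of the penalty side by side, the per-effect log Bayes factors of the original fit and the null-weighted fit can be compared directly (a sketch using the `fitted` and `f3` objects from above):

```r
# Per-effect log Bayes factors with and without the 0.5 null weight.
# A small null penalty leaves the spurious effect's BF well above the
# null-variable baseline, so it is not removed.
rbind(no_null = fitted$lbf, null_0.5 = f3$lbf)
```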

In [32]:
susieR::susie_plot_iteration(f3, 5, 'test_track_fit_3')
Creating GIF animation ...
Iterplot saved to test_track_fit_3.gif
In [33]:
%preview test_track_fit_3.gif
> test_track_fit_3.gif (37.9 KiB):

The BFs for the effects dominated by the null variable (3 through 5) are small compared to the others:

In [23]:
f3$lbf
  1. 30.2455528655131
  2. 10.8058107717509
  3. 2.79325956022625
  4. 2.79040767159697
  5. 2.75436781147105

Now I increase the null penalty:

In [24]:
f4 = susieR::susie(dat$X, dat$Y[,r], L=5,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.1,
               null_weight = 0.9,
               compute_univariate_zscore = TRUE,
               tol=1e-3, track_fit=TRUE)
In [25]:
summary(f4)$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
1 28.1656 0.9943364 0.956193 753,754,758,760,761,764,765,768,769,773,774,775,776,777,778,779,782,785,788,789,792,795
In [34]:
susieR::susie_plot_iteration(f4, 5, 'test_track_fit_4')
Creating GIF animation ...
Iterplot saved to test_track_fit_4.gif
In [35]:
%preview test_track_fit_4.gif
> test_track_fit_4.gif (27.7 KiB):

It seems the problematic variable has been removed. I now trick summary into displaying more information, just to check how it helped:

In [28]:
f4$null_index=0
f4$sets = susieR::susie_get_CS(f4, cbind(dat$X,0), min_abs_corr=0)
summary(f4)$cs
cs  cs_log10bf  cs_avg_r2  cs_min_r2  variable
3 3.011818 1.0000000 1.00000000 1003
1 28.165605 0.9943364 0.95619299 753,754,758,760,761,764,765,768,769,773,774,775,776,777,778,779,782,785,788,789,792,795
2 11.554834 0.5869098 0.08081462 136,162,217,253,317,477,480,487,498,512,524,538,541,560,572,594,601,606,609,626,627,672,703,721
In [29]:
f4$lbf
  1. 28.1656046669751
  2. 11.5548344446241
  3. 3.0118177636621
  4. 3.01032381524374
  5. 3.00829064435163

Notice that with the null variable in use, the previously "wrong" variable 217 now ends up in a CS with purity 0.08, and is therefore not selected. The null variable itself (index 1003) forms a separate CS, though with a small BF.
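For completeness, the purity that excluded this CS can be recomputed by hand: purity is the minimum absolute pairwise correlation among the CS members, and sets below the `min_abs_corr` threshold (0.5 by default) are discarded. A sketch using the set printed above:

```r
# Members of the low-purity CS from the summary above (the null variable,
# index 1003, is not a column of X and is excluded). Purity is the
# smallest |correlation| between any pair of member variables.
cs2 <- c(136, 162, 217, 253, 317, 477, 480, 487, 498, 512, 524, 538, 541,
         560, 572, 594, 601, 606, 609, 626, 627, 672, 703, 721)
min(abs(cor(dat$X[, cs2])))
```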


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago