Multivariate Bayesian variable selection regression

EM estimate vs direct optimization issue

Here I compare EM with optim on a larger simulated data-set involving 50 conditions. The example below is a case where the EM and optim results differ, and the optim result seems better.

In [41]:
dat = readRDS('lite_data_1_artificial_mixture_1_mnm_shared_1.rds')

This data-set can be found in mvsusieR/inst/prototypes. The script to reproduce the mvsusie analysis is available via cat(dat$DSC_DEBUG$script): simply copy that code, run the data generation, then run the analysis using EM and optim. I save the X, Y, meta and resid_Y data into the file em_optim_difference.rds so they are easy to load for further investigation.
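
For reference, a minimal sketch for reloading those saved objects (this assumes em_optim_difference.rds stores them as a named list; L = 10 is inferred from the length of result$V below):

dat = readRDS('em_optim_difference.rds')
X = dat$X; Y = dat$Y; meta = dat$meta; resid_Y = dat$resid_Y
L = 10  # number of single effects; inferred from length(result$V) below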

EM result

In [25]:
result = mvsusie(X, Y, L=L, prior_variance=matrix(1,50,50), residual_variance=resid_Y, compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                estimate_prior_method='EM', track_fit=T, precompute_covariances=F)
Loading mvsusieR

In [26]:
result$V
  1. 0.133190141967517
  2. 0.133122023124095
  3. 0.132797959361328
  4. 0.132660010456815
  5. 0.132633887742004
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0

It seems the prior estimates are consistently large (all about 0.133) for the non-zero effects. This data-set has 3 simulated signals,

In [3]:
which(rowSums(meta$true_coef) != 0)
  1. 43
  2. 129
  3. 162
In [4]:
meta$true_coef[which(rowSums(meta$true_coef) != 0),]
A matrix: 3 × 50 of type dbl (columns 1–10 and 41–50 shown)
[43,]   0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.000000   1.05848305 ⋯
        0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
[129,] -0.4515795  0.2123624 -0.2313993  1.2908204 -0.1391929  0.4586080  1.1908100  0.5195072 -0.444041   0.55288239 ⋯
        0.8074542  1.1655001  0.55635426 0.4118661 -0.4539847 -0.3545635  0.5953944 -0.7013678  1.0421826  1.2604212
[162,] -0.7054777 -0.7788946 -0.6342108  0.7883361 -0.7436080  0.0631215 -0.7432346 -0.1296268 -1.130015   0.05221385 ⋯
       -0.7423689  0.5043965  0.01963459 -0.4722924 -0.2650975 -0.7530046 -0.9196253 -0.1616042 -0.2687143 -0.8154856

Clearly this will not fit a model of "all shared" effects.
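
To see just how restrictive this prior is (a quick sketch): matrix(1,50,50) is a rank-1 covariance, so it encodes the assumption that every effect is exactly equal across all 50 conditions. This is also why the fitted coefficients shown further below are constant across conditions.

V_shared = matrix(1, 50, 50)
qr(V_shared)$rank  # 1: the prior puts all its mass on b_1 = b_2 = ... = b_50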

Using "all shared" prior we have captured 4 effects with non-zero PIP, 2 of them are real.

In [28]:
susieR::susie_plot(result, y='PIP',b=rowSums(meta$true_coef))
In [21]:
result$sets
$cs
$L1
  1. 153
  2. 156
  3. 162
$L2
129
$L4
  1. 129
  2. 130
  3. 131
  4. 132
  5. 134
  6. 135
$L5
  1. 139
  2. 144
  3. 145
  4. 153
  5. 156
  6. 162
$purity
A data.frame: 4 × 3
   min.abs.corr mean.abs.corr median.abs.corr
          <dbl>         <dbl>           <dbl>
L1    1.0000000     1.0000000       1.0000000
L2    1.0000000     1.0000000       1.0000000
L4    0.9845723     0.9957145       1.0000000
L5    0.9601872     0.9800936       0.9800936
$cs_index
  1. 1
  2. 2
  3. 4
  4. 5
$coverage
0.95

From the single-effect CS results, L4 contains L2 and L5 contains L1. As a result, both L4 and L5 appear to contain a true signal. Because of the overlap between L1 and L5, the 3 variables in L1 (orange) appear to have a marginal PIP of about 0.6 rather than 0.3. Both L4 and L5 have high purity.

In [12]:
result$alpha[,c(153,156,162)]
A matrix: 10 × 3 of type dbl
3.333333e-01 3.333333e-01 3.333333e-01
6.349176e-142 6.349176e-142 6.349176e-142
3.333311e-01 3.333311e-01 3.333311e-01
1.788361e-11 1.788361e-11 1.788361e-11
8.156961e-02 8.156961e-02 8.156961e-02
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
In [13]:
result$pip[c(153,156,162)]
  1. 0.591807364084137
  2. 0.591807364084137
  3. 0.591807364084137
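
These marginal PIPs can be reproduced from alpha with the usual SuSiE formula, PIP_j = 1 - prod_l (1 - alpha_lj), where single effects whose estimated prior variance is zero are excluded (this exclusion rule is my assumption, analogous to the prior_tol argument of susieR::susie_get_pip, but it matches the reported numbers):

keep = result$V > 1e-9  # drop single effects with estimated prior variance 0
1 - apply(1 - result$alpha[keep, c(153, 156, 162)], 2, prod)
# 0.5918074 0.5918074 0.5918074 -- matches result$pip above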

Looking into the effect size,

In [17]:
result$coef[130,]
  1.–50. 0.127650809566136 (this same value repeated in all 50 conditions)
In [33]:
meta$true_coef[129,]
  1. -0.451579520293967
  2. 0.212362419670355
  3. -0.23139926245904
  4. 1.29082044865503
  5. -0.139192877987568
  6. 0.45860797023633
  7. 1.19081004850437
  8. 0.519507177038623
  9. -0.444040958330359
  10. 0.552882389431176
  11. 1.02044255171292
  12. -0.671814615437313
  13. 0.648295380611219
  14. 0.904728920517463
  15. 0.56318766496157
  16. -0.380772279580034
  17. 0.349855351519188
  18. -0.333712926727966
  19. 0.351859726264355
  20. 0.0755333676724879
  21. 1.7465511006099
  22. 0.132764293589086
  23. 0.790278763571395
  24. 1.74287183971923
  25. -0.357770774475342
  26. -0.0669878126419395
  27. 0.984384930972871
  28. 1.34967007988781
  29. 0.653897431049535
  30. 0.286275590107287
  31. 0.373163107935322
  32. 0.725860516251139
  33. 0.45548786043258
  34. 1.25877287508022
  35. 0.389427756565688
  36. -0.534945131447272
  37. 0.177056489702033
  38. -0.403823956895729
  39. 0.0192486701159794
  40. -0.152833418179631
  41. 0.807454176367542
  42. 1.16550009635113
  43. 0.556354264855155
  44. 0.411866111150109
  45. -0.453984697496325
  46. -0.354563471973746
  47. 0.595394391140776
  48. -0.701367761799491
  49. 1.04218255084502
  50. 1.26042124189206

and the second moment of the estimated effect size,

In [19]:
tcrossprod(result$coef[130,])[1,1]
0.0162924351076161
In [16]:
sum(diag(tcrossprod(result$coef[130,])))/50
0.0162924351076161

So the estimated V should be about this scale for the first effect.
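
As a sanity check on that computation (a sketch): sum(diag(tcrossprod(b)))/50 is just mean(b^2), and since every entry of result$coef[130,] is identical it reduces to the squared shared coefficient.

b = result$coef[130, ]
mean(b^2)  # identical to sum(diag(tcrossprod(b)))/50
b[1]^2     # also identical, because all 50 entries of b are equal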

In [24]:
result$coef[result$sets$cs$L4+1,]
A matrix: 6 × 50 of type dbl (each row is constant across all 50 conditions)
0.127650810
0.006085951
0.006085951
0.006085951
0.006085951
0.006085951

optim result

It takes a long time to compute, but the PIPs are cleaner. The estimated prior scalars are also smaller.

In [8]:
result2 = mvsusie(X, Y, L=L, prior_variance=matrix(1,50,50), residual_variance=resid_Y, compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                estimate_prior_method='optim', track_fit=T, precompute_covariances=F)
In [9]:
result2$V
  1. 0.000755795213844379
  2. 0.000929637524553294
  3. 0.000261454666168821
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0
In [5]:
result2$elbo[length(result2$elbo)]
-8854.68325950608
In [10]:
susieR::susie_plot(result2, y='PIP', b=rowSums(meta$true_coef))
In [20]:
result2$sets
$cs
$L1
  1. 153
  2. 156
  3. 162
$L2
129
$purity
A data.frame: 2 × 3
   min.abs.corr mean.abs.corr median.abs.corr
          <dbl>         <dbl>           <dbl>
L1            1             1               1
L2            1             1               1
$cs_index
  1. 1
  2. 2
$coverage
0.95
In [18]:
result2$alpha[,c(153,156,162)]
A matrix: 10 × 3 of type dbl
3.333333e-01 3.333333e-01 3.333333e-01
1.447861e-221 1.447861e-221 1.447861e-221
3.332946e-01 3.332946e-01 3.332946e-01
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03
4.694836e-03 4.694836e-03 4.694836e-03

It seems to have the same issue, in that the first and the 3rd effects overlap. But because they are identical CSs, only one of them is kept.

In [28]:
susieR::susie_get_cs(result2, dedup=F)
$cs
$L1
  1. 153
  2. 156
  3. 162
$L2
129
$L3
  1. 153
  2. 156
  3. 162
$coverage
0.95

However, such duplicates are not adjusted for when computing PIPs. This is why the PIPs for those 3 variables still sum to more than 1.

In [19]:
result2$pip[c(153,156,162)]
  1. 0.555529717829305
  2. 0.555529717829305
  3. 0.555529717829305
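
To illustrate what a dedup-adjusted PIP would look like (a sketch; zero-variance effects are excluded under the same assumption as above): dropping the duplicated L3 brings these variables back to PIP ≈ 1/3, while keeping it inflates them to ≈ 0.56.

keep = which(result2$V > 1e-9)  # effects 1, 2, 3 (L3 duplicates L1)
1 - apply(1 - result2$alpha[keep, c(153, 156, 162)], 2, prod)              # ~0.556, as reported
1 - apply(1 - result2$alpha[setdiff(keep, 3), c(153, 156, 162)], 2, prod)  # ~0.333 after dedup
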
In [20]:
result2$coef[130,]
  1.–50. 0.156619953095882 (this same value repeated in all 50 conditions)

The estimated V is expected to roughly have the scale of

In [29]:
sum(diag(tcrossprod(result2$coef[130,])))/50
0.0245298097077564

But the actual estimated V is a lot smaller.

Update: the conclusion here is wrong, because coef is reported after rescaling the data back to the original scale ... for a comparison before rescaling, see this notebook.

Analysis using naive canonical mixture prior

The example is a bit pathological because we knowingly use a wrong model for it. What if we use a more flexible MASH prior? Here I use a canonical mixture with uniform weights,
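
For intuition, here is a hand-built sketch of the kinds of patterns a canonical covariance set contains, shown for R = 3 conditions; the exact list returned by mvsusieR:::create_cov_canonical may differ.

R = 3
U_singletons = lapply(1:R, function(r) {  # effect in exactly one condition
  U = matrix(0, R, R); U[r, r] = 1; U
})
U_identity = diag(R)          # independent effects across conditions
U_shared   = matrix(1, R, R)  # equal effects in all conditions
xUlist_demo = c(U_singletons, list(U_identity, U_shared))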

In [35]:
prior = list(xUlist = mvsusieR:::create_cov_canonical(ncol(Y)))
m_init = create_mash_prior(mixture_prior = list(matrices=prior$xUlist, weights=prior$pi), null_weight=prior$null_weight, max_mixture_len=-1)
result3 = mvsusie(X, Y, L=L, prior_variance=m_init, residual_variance=resid_Y, 
                 compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                 estimate_prior_method='EM', track_fit=T, precompute_covariances=T)
In [36]:
result3$V
  1. 0.0573641369329346
  2. 0.0114649642051601
  3. 0.000261985863953441
  4. 0.373828753858814
  5. 0
  6. 0.13265845805623
  7. 0
  8. 0
  9. 0
  10. 0
In [37]:
susieR::susie_plot(result3, y='PIP',b=rowSums(meta$true_coef))
In [38]:
result3$sets
$cs
$L1
129
$L2
  1. 153
  2. 156
  3. 162
$L4
43
$L6
  1. 129
  2. 130
  3. 131
  4. 132
  5. 134
  6. 135
$purity
A data.frame: 4 × 3
   min.abs.corr mean.abs.corr median.abs.corr
          <dbl>         <dbl>           <dbl>
L1    1.0000000     1.0000000               1
L2    1.0000000     1.0000000               1
L4    1.0000000     1.0000000               1
L6    0.9845723     0.9957145               1
$cs_index
  1. 1
  2. 2
  3. 4
  4. 6
$coverage
0.95

The result is a bit better. We have now captured all 3 true signals -- the new variable in L4 (variable 43) is a singleton effect.

In [41]:
max(result3$coef[44,])
1.10600340990259
In [42]:
min(result3$coef[44,])
-8.20245834611722e-12

But there is still overlap between L6 and L1, even though L6 does indeed contain a signal. And the estimated prior scalar for the 6th effect is again the magic number 0.13. Notice that the purity for L6 is quite high, as we would also conclude from the plot.

What if we use oracle mixture?

The data are simulated using artificial_mixture. Here I analyze them with the same mixture prior structure and weights, but use EM to estimate the scale.

In [45]:
names(meta$prior$oracle)
  1. 'xUlist'
  2. 'pi'
  3. 'null_weight'
In [46]:
m_init = create_mash_prior(mixture_prior = list(matrices=meta$prior$oracle$xUlist, weights=meta$prior$oracle$pi), null_weight=meta$prior$oracle$null_weight, max_mixture_len=-1)
result4 = mvsusie(X, Y, L=L, prior_variance=m_init, residual_variance=resid_Y, 
                 compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                 estimate_prior_method='EM', track_fit=T, precompute_covariances=T)
In [47]:
result4$V
  1. 0.0573683484422172
  2. 0.0114170110679292
  3. 0.000265189597661553
  4. 0.373827041421149
  5. 0
  6. 0.12995270253752
  7. 0
  8. 0
  9. 0
  10. 0
In [48]:
susieR::susie_plot(result4, y='PIP',b=rowSums(meta$true_coef))
In [49]:
result4$sets
$cs
$L1
129
$L2
  1. 153
  2. 156
  3. 162
$L4
43
$L6
  1. 129
  2. 130
  3. 131
  4. 132
  5. 134
  6. 135
$purity
A data.frame: 4 × 3
   min.abs.corr mean.abs.corr median.abs.corr
          <dbl>         <dbl>           <dbl>
L1    1.0000000     1.0000000               1
L2    1.0000000     1.0000000               1
L4    1.0000000     1.0000000               1
L6    0.9845723     0.9957145               1
$cs_index
  1. 1
  2. 2
  3. 4
  4. 6
$coverage
0.95

Unfortunately, the oracle prior does not help here.

Updated results: more EM iterations

As I looked into this example further, I noticed that the previous EM runs failed to converge within the maximum number of iterations. I therefore increase the iteration limit to 1000:

In [8]:
result5 = mvsusie(X, Y, L=L, prior_variance=mvsusieR::MashInitializer$new(list(matrix(1,50,50)),1), residual_variance=resid_Y, compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                estimate_prior_method='EM', track_fit=T, precompute_covariances=T, max_iter = 1000)
Loading mvsusieR

In [9]:
result5$V
  1. 0.00158092887545148
  2. 0.000940417090182203
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0

The prior estimates are still different from optim's, but the ELBO here is -8847, higher than the -8854 from optim.

In [6]:
result5$elbo[length(result5$elbo)]
-8847.34648400442
In [10]:
susieR::susie_plot(result5, y='PIP',b=rowSums(meta$true_coef))
In [11]:
result5$niter
924
In [2]:
susieR::susie_get_cs(result5, dedup=F)
$cs
$L1
  1. 153
  2. 156
  3. 162
$L2
129
$coverage
0.95

There are no longer overlapping CSs in this analysis.

Now I try the naive mixture prior with more iterations. Convergence is faster, the estimated V is closer to the scale of the effect sizes, and the ELBO is also better. All three signals are captured.

In [19]:
prior = list(xUlist = mvsusieR:::create_cov_canonical(ncol(Y)))
m_init = create_mash_prior(mixture_prior = list(matrices=prior$xUlist, weights=prior$pi), null_weight=prior$null_weight, max_mixture_len=-1)
result6 = mvsusie(X, Y, L=L, prior_variance=m_init, residual_variance=resid_Y, 
                 compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                 estimate_prior_method='EM', track_fit=T, precompute_covariances=T, max_iter=1500)
In [20]:
result6$niter
587
In [21]:
result6$V
  1. 0.0574661081656338
  2. 0.0114651345388783
  3. 0
  4. 0.277511158893728
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0
In [22]:
susieR::susie_plot(result6, y='PIP',b=rowSums(meta$true_coef))
In [23]:
result6$elbo[length(result6$elbo)]
-4762.07666765973

Assessing impact of initialization on agreement between EM and optim

Here we run optim starting from the EM solution and vice versa, to see whether that leads to agreement between the two approaches in estimating the prior scalars.

First, initialize EM with the optim result,

In [32]:
result7 = mvsusie(X, Y, L=L, s_init = result2, prior_variance=matrix(1,50,50), residual_variance=resid_Y, compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                estimate_prior_method='EM', track_fit=T, precompute_covariances=F)
Loading mvsusieR

In [33]:
result7$niter
2

Here is the EM result using optim initialization,

In [34]:
result7$V
  1. 0.000755844772649111
  2. 0.00092964882084519
  3. 0.000261429236288484
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0

Here is the optim result,

In [39]:
result2$V
  1. 0.000755795213844379
  2. 0.000929637524553294
  3. 0.000261454666168821
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0

Now initialize optim with the EM result,

In [35]:
result8 = mvsusie(X, Y, L=L, s_init = result5, prior_variance=matrix(1,50,50), residual_variance=resid_Y, compute_objective=T, estimate_residual_variance=F, estimate_prior_variance=T, 
                estimate_prior_method='optim', track_fit=T, precompute_covariances=F)
In [36]:
result8$niter
2

Here is the optim result using EM initialization,

In [37]:
result8$V
  1. 0.00190715760523663
  2. 0.000930484657641357
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0

Here is the EM result,

In [38]:
result5$V
  1. 0.00158092887545148
  2. 0.000940417090182203
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0
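
Putting the four runs side by side (a convenience sketch; only the first three effects are shown since the rest are zero):

round(rbind(optim         = result2$V[1:3],
            EM_from_optim = result7$V[1:3],
            EM            = result5$V[1:3],
            optim_from_EM = result8$V[1:3]), 6)

Each method essentially stays at the solution it was initialized from: optim and EM-from-optim agree closely, as do EM and optim-from-EM. This is consistent with the two approaches settling into different local optima rather than one of them being broken.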

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago