Multivariate Bayesian variable selection regression

Create prior mixture for simulation studies

This notebook contains scripts to create mixture prior for use with simulations in DSC.

Artificial structure

Here I'm going to make for 50 conditions patterns of covariate with

  1. canonical
  2. paired sharing eg a pair of tissues
  3. block sharing eg brain

With 50 conditions

Canonical patterns of sharing:

In [1]:
R=50
prior = mvsusieR:::create_cov_canonical(R)

Paired sharing:

In [2]:
paired = matrix(0,R,R)
paired[1:2,1:2] = 1
prior[['paired_1']] = paired

Block sharing:

In [3]:
block = matrix(0,R,R)
block[1:R/2, 1:R/2] = 1
block[(R/2+1):R, (R/2+1):R] = 1
prior[['blocked_1']] = block
In [4]:
names(prior)
  1. 'singleton_1'
  2. 'singleton_2'
  3. 'singleton_3'
  4. 'singleton_4'
  5. 'singleton_5'
  6. 'singleton_6'
  7. 'singleton_7'
  8. 'singleton_8'
  9. 'singleton_9'
  10. 'singleton_10'
  11. 'singleton_11'
  12. 'singleton_12'
  13. 'singleton_13'
  14. 'singleton_14'
  15. 'singleton_15'
  16. 'singleton_16'
  17. 'singleton_17'
  18. 'singleton_18'
  19. 'singleton_19'
  20. 'singleton_20'
  21. 'singleton_21'
  22. 'singleton_22'
  23. 'singleton_23'
  24. 'singleton_24'
  25. 'singleton_25'
  26. 'singleton_26'
  27. 'singleton_27'
  28. 'singleton_28'
  29. 'singleton_29'
  30. 'singleton_30'
  31. 'singleton_31'
  32. 'singleton_32'
  33. 'singleton_33'
  34. 'singleton_34'
  35. 'singleton_35'
  36. 'singleton_36'
  37. 'singleton_37'
  38. 'singleton_38'
  39. 'singleton_39'
  40. 'singleton_40'
  41. 'singleton_41'
  42. 'singleton_42'
  43. 'singleton_43'
  44. 'singleton_44'
  45. 'singleton_45'
  46. 'singleton_46'
  47. 'singleton_47'
  48. 'singleton_48'
  49. 'singleton_49'
  50. 'singleton_50'
  51. 'shared_1'
  52. 'shared_2'
  53. 'shared_3'
  54. 'shared_4'
  55. 'shared_5'
  56. 'paired_1'
  57. 'blocked_1'

Now assign some weights:

  1. singleton total 35%
    • singleton_1 has 10% (to be picked up later by ED methods)
    • singleton_2 to singleton_26 has 25% (1% each )
  2. shared total 25%
  3. paired 20% (hopefully picked up by ED methods)
  4. blocked 20% (hopefully picked up by ED methods)
In [5]:
w = c(0.1, rep(0.01,25), rep(0, 24), rep(0.25/5,5), 0.2, 0.2)
In [6]:
prior = prior[which(w>0)]
In [7]:
names(prior)
  1. 'singleton_1'
  2. 'singleton_2'
  3. 'singleton_3'
  4. 'singleton_4'
  5. 'singleton_5'
  6. 'singleton_6'
  7. 'singleton_7'
  8. 'singleton_8'
  9. 'singleton_9'
  10. 'singleton_10'
  11. 'singleton_11'
  12. 'singleton_12'
  13. 'singleton_13'
  14. 'singleton_14'
  15. 'singleton_15'
  16. 'singleton_16'
  17. 'singleton_17'
  18. 'singleton_18'
  19. 'singleton_19'
  20. 'singleton_20'
  21. 'singleton_21'
  22. 'singleton_22'
  23. 'singleton_23'
  24. 'singleton_24'
  25. 'singleton_25'
  26. 'singleton_26'
  27. 'shared_1'
  28. 'shared_2'
  29. 'shared_3'
  30. 'shared_4'
  31. 'shared_5'
  32. 'paired_1'
  33. 'blocked_1'
In [8]:
w = w[which(w>0)]
sum(w)
1
In [9]:
length(w)
33
In [10]:
artificial_mixture_50 = list(U=prior,w=w)

With 6 conditions

In [11]:
R=6
prior = mvsusieR:::create_cov_canonical(R)
paired = matrix(0,R,R)
paired[1:2,1:2] = 1
prior[['paired_1']] = paired
block = matrix(0,R,R)
block[1:R/2, 1:R/2] = 1
block[(R/2+1):R, (R/2+1):R] = 1
prior[['blocked_1']] = block
In [12]:
names(prior)
  1. 'singleton_1'
  2. 'singleton_2'
  3. 'singleton_3'
  4. 'singleton_4'
  5. 'singleton_5'
  6. 'singleton_6'
  7. 'shared_1'
  8. 'shared_2'
  9. 'shared_3'
  10. 'shared_4'
  11. 'shared_5'
  12. 'paired_1'
  13. 'blocked_1'

Now assign some weights:

  1. singleton total 30%
    • singleton_1, singleton_3, singleton_5 each has 10%
  2. shared total 30%
  3. paired 20%
  4. blocked 20%
In [13]:
w = c(0.1, 0, 0.1, 0, 0.1, 0, rep(0.3/5, 5), 0.2, 0.2)
In [14]:
prior = prior[which(w>0)]
names(prior)
  1. 'singleton_1'
  2. 'singleton_3'
  3. 'singleton_5'
  4. 'shared_1'
  5. 'shared_2'
  6. 'shared_3'
  7. 'shared_4'
  8. 'shared_5'
  9. 'paired_1'
  10. 'blocked_1'
In [15]:
w = w[which(w>0)]
sum(w)
1
In [16]:
length(w)
10
In [17]:
artificial_mixture_6 = list(U=prior,w=w)

Mixture from GTEx

Using this workflow,

sos run ~/GIT/gtexresults/workflows/mashr_flashr_workflow.ipynb flash \
    --cwd /project2/compbio/GTEx_eQTL/mashr_flashr_workflow_output \
    --data /project2/compbio/GTEx_eQTL/mashr_flashr_workflow_output/FastQTLSumStats.mash.rds \
    --effect-model EE -c midway2.yml -q midway2
sos run analysis/20200502_Prepare_ED_prior.ipynb ed \
    --cwd /project2/compbio/GTEx_eQTL/mashr_flashr_workflow_output \
    --model FastQTLSumStats.mash -c midway2.yml -q midway2
In [18]:
prior = readRDS('../data/FastQTLSumStats.mash.FL_PC3.ED.rds')
In [19]:
names(prior$U)
  1. 'FLASH_1'
  2. 'FLASH_2'
  3. 'FLASH_3'
  4. 'FLASH_4'
  5. 'FLASH_5'
  6. 'FLASH_6'
  7. 'FLASH_7'
  8. 'FLASH_8'
  9. 'FLASH_9'
  10. 'FLASH_10'
  11. 'FLASH_11'
  12. 'FLASH_12'
  13. 'FLASH_13'
  14. 'FLASH_14'
  15. 'FLASH_15'
  16. 'FLASH_16'
  17. 'FLASH_17'
  18. 'FLASH_18'
  19. 'FLASH_19'
  20. 'FLASH_20'
  21. 'tFLASH'
  22. 'PCA_1'
  23. 'PCA_2'
  24. 'PCA_3'
  25. 'tPCA'
  26. 'XX'
In [20]:
length(prior$U)
26

Most weighths are in tFLASH and XX,

In [21]:
names(prior$U)[prior$w>0.1]
  1. 'tFLASH'
  2. 'XX'

But many other weights are also non-trivial

In [22]:
names(prior$U)[prior$w>0.001]
  1. 'FLASH_1'
  2. 'FLASH_3'
  3. 'FLASH_5'
  4. 'FLASH_6'
  5. 'FLASH_7'
  6. 'FLASH_8'
  7. 'FLASH_9'
  8. 'FLASH_10'
  9. 'FLASH_11'
  10. 'FLASH_12'
  11. 'FLASH_13'
  12. 'FLASH_14'
  13. 'FLASH_16'
  14. 'FLASH_17'
  15. 'FLASH_18'
  16. 'FLASH_19'
  17. 'FLASH_20'
  18. 'tFLASH'
  19. 'tPCA'
  20. 'XX'
In [23]:
tol = 1E-12
U = prior$U[which(prior$w>tol)]
w = prior$w[which(prior$w>tol)]
names(U)
gtex_mixture = list(U=U,w=w)
  1. 'FLASH_1'
  2. 'FLASH_2'
  3. 'FLASH_3'
  4. 'FLASH_4'
  5. 'FLASH_5'
  6. 'FLASH_6'
  7. 'FLASH_7'
  8. 'FLASH_8'
  9. 'FLASH_9'
  10. 'FLASH_10'
  11. 'FLASH_11'
  12. 'FLASH_12'
  13. 'FLASH_13'
  14. 'FLASH_14'
  15. 'FLASH_15'
  16. 'FLASH_16'
  17. 'FLASH_17'
  18. 'FLASH_18'
  19. 'FLASH_19'
  20. 'FLASH_20'
  21. 'tFLASH'
  22. 'PCA_1'
  23. 'PCA_3'
  24. 'tPCA'
  25. 'XX'
In [24]:
saveRDS(list(gtex_mixture=gtex_mixture, artificial_mixture_50=artificial_mixture_50, artificial_mixture_6=artificial_mixture_6), '../data/prior_simulation.rds')

Using weights learned from simulated data

Workflow see this notebook.

In [25]:
dat = readRDS('../data/prior_simulation.rds')
In [26]:
names(dat$gtex_mixture)
  1. 'U'
  2. 'w'

Load ED mixture,

In [27]:
gtex_ed_mixture = readRDS('~/GIT/mvarbvs/dsc/mnm_prototype/mnm_sumstats/gtex_mixture_identity.FL_PC3.ED.rds')
In [28]:
names(gtex_ed_mixture)
  1. 'U'
  2. 'w'
  3. 'loglik'
In [29]:
artificial_ed_mixture_50 = readRDS('~/GIT/mvarbvs/dsc/mnm_prototype/mnm_sumstats/artificial_mixture_identity.FL_PC3.ED.rds')
In [30]:
names(artificial_ed_mixture_50)
  1. 'U'
  2. 'w'
  3. 'loglik'

Filter the components by weight cutoff 1E-15,

In [31]:
tol = 1E-15
artificial_ed_mixture_50$U = artificial_ed_mixture_50$U[which(artificial_ed_mixture_50$w > tol)]
artificial_ed_mixture_50$w = artificial_ed_mixture_50$w[which(artificial_ed_mixture_50$w > tol)]
gtex_ed_mixture$U = gtex_ed_mixture$U[which(gtex_ed_mixture$w > tol)]
gtex_ed_mixture$w = gtex_ed_mixture$w[which(gtex_ed_mixture$w > tol)]
In [32]:
names(artificial_ed_mixture_50$U)
  1. 'FLASH_1'
  2. 'FLASH_4'
  3. 'FLASH_5'
  4. 'FLASH_6'
  5. 'FLASH_7'
  6. 'FLASH_8'
  7. 'FLASH_9'
  8. 'FLASH_10'
  9. 'FLASH_11'
  10. 'FLASH_12'
  11. 'FLASH_14'
  12. 'FLASH_15'
  13. 'FLASH_16'
  14. 'FLASH_17'
  15. 'FLASH_18'
  16. 'FLASH_20'
  17. 'FLASH_22'
  18. 'FLASH_24'
  19. 'FLASH_25'
  20. 'FLASH_27'
  21. 'FLASH_29'
  22. 'FLASH_57'
  23. 'tFLASH'
  24. 'PCA_1'
  25. 'PCA_2'
  26. 'PCA_3'
  27. 'tPCA'
  28. 'XX'
In [33]:
artificial_ed_mixture_50$w
  1. 0.0928199835385606
  2. 0.170916160597565
  3. 3.3348060289056e-13
  4. 0.00103070787447895
  5. 0.00355415988483168
  6. 3.49373257500002e-08
  7. 0.00123344516969065
  8. 4.05310561664215e-05
  9. 1.58651158359917e-05
  10. 6.96387520012721e-14
  11. 0.000303771798467122
  12. 0.00771806711741353
  13. 2.82609862930679e-05
  14. 0.000673891478132913
  15. 0.00743608012106805
  16. 7.04580345327831e-05
  17. 0.00713627997477638
  18. 0.000653767355009305
  19. 0.00643943152051095
  20. 0.00673856900028108
  21. 0.00713532869510493
  22. 3.6453881263049e-05
  23. 0.287018132738336
  24. 0.00381418677387528
  25. 0.0154803817130112
  26. 0.00113763435850343
  27. 0.139767896365797
  28. 0.238800519912781
In [34]:
names(gtex_ed_mixture$U)
  1. 'FLASH_1'
  2. 'FLASH_9'
  3. 'FLASH_13'
  4. 'tFLASH'
  5. 'PCA_1'
  6. 'tPCA'
  7. 'XX'
In [35]:
gtex_ed_mixture$w
  1. 0.164536632815805
  2. 4.01300227957019e-05
  3. 4.01303516159582e-05
  4. 0.393424772106715
  5. 6.97386166989233e-07
  6. 0.00164345872118286
  7. 0.440314178595684
In [36]:
dat$artificial_mixture_50$ED = artificial_ed_mixture_50
dat$gtex_mixture$ED = gtex_ed_mixture
In [37]:
saveRDS(dat, '../data/prior_simulation.rds')

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago