Multivariate Bayesian variable selection regression

Compute pairwise LD for selected SNPs

Though it is straightfoward enough to do it in R / Python, I use PLINK to compute the LD matrix.

PLINK is highly efficient. First I extract the subset of variants of interest from all data, then for a quick look I use --indep-pairwise to prune SNPs at given LD level. Finally I use --r2 option to compute LD stats and formally report a summary.

In [6]:
[global]
cwd = '~/Documents/GTEx'
mash_snps = '~/GIT/github/gtexresults_mash/Data/maxz.txt'
mash_snps_all = "${cwd!a}/mash_revision/MatrixEQTLSumStats.Portable.h5"
snp_list = "${cwd!a}/mash_revision/snp_eqtls.txt"
snp_list_random = "${cwd!a}/mash_revision/snp_random.txt"
genotype_data = "${cwd!a}/genotype_plink/GTEx7.Imputed.bed"

Prepare SNP ID list to extract

For example, codes below prepares SNPs using the eQTLs from mash paper.

From text file

In particular this file.

In [3]:
%preview /home/gaow/GIT/github/gtexresults_mash/Data/maxz.txt --limit 2
> /home/gaow/GIT/github/gtexresults_mash/Data/maxz.txt (12.7 MiB):
"Adipose_Subcutaneous" "Adipose_Visceral_Omentum" "Adrenal_Gland" "Artery_Aorta" "Artery_Coronary" "Artery_Tibial" "Brain_Anterior_cingulate_cortex_BA24" "Brain_Caudate_basal_ganglia" "Brain_Cerebellar_Hemisphere" "Brain_Cerebellum" "Brain_Cortex" "Brain_Frontal_Cortex_BA9" "Brain_Hippocampus" "Brain_Hypothalamus" "Brain_Nucleus_accumbens_basal_ganglia" "Brain_Putamen_basal_ganglia" "Breast_Mammary_Tissue" "Cells_EBV-transformed_lymphocytes" "Cells_Transformed_fibroblasts" "Colon_Sigmoid" "Colon_Transverse" "Esophagus_Gastroesophageal_Junction" "Esophagus_Mucosa" "Esophagus_Muscularis" "Heart_Atrial_Appendage" "Heart_Left_Ventricle" "Liver" "Lung" "Muscle_Skeletal" "Nerve_Tibial" "Ovary" "Pancreas" "Pituitary" "Prostate" "Skin_Not_Sun_Exposed_Suprapubic" "Skin_Sun_Exposed_Lower_leg" "Small_Intestine_Terminal_Ileum" "Spleen" "Stomach" "Testis" "Thyroid" "Uterus" "Vagina" "Whole_Blood"
"ENSG00000000419.8_20_49461813_G_C_b37" 0.140035769899035 0.0475048018545887 -0.185199569115366 -0.707475748922068 1.02660000912618 -0.0771338912837303 0.146658841279342 -0.259401046879882 -0.337210009547116 0.39750489467677 -1.05580658870737 0.913743710447618 -0.128762251875501 0.0558638858658976 1.01386831943158 0.377598395957177 -0.479508220110367 0.388489721109613 -0.48229879064876 -0.364704657885265 -1.87231328500187 -2.06022009981082 -0.298559593121012 -3.19471555767819 0.362237923742557 0.62719316987283 2.00021812889745 0.629467212595717 -0.23561462140269 -0.332783730867599 -1.00055889996764 -0.960125863288923 -0.96386316399376 -1.50881700277794 -0.976703698613329 0.577720300522746 -2.60306648809215 -4.85288408765059 0.362665598516544 -1.65680479255712 -0.344732538236625 -0.268148514893524 -0.366403004075725 -0.900413650345225
"ENSG00000000457.9_1_169695110_T_A_b37" 0.954516011454517 -0.0681166567008093 0.109699644838174 2.86331872495593 0.804939975404017 1.66768958423126 1.64030493197248 1.33440714691758 1.7328767895192 5.31231399034445 2.18652298262293 2.79319687293344 0.245677971654259 0.432459156419959 0.720654077312806 1.16943846086518 1.71398578810753 -0.876469953614605 -0.145387404414833 4.50500960459263 1.64784013770758 1.99891911608046 -1.03246629335908 2.93744925688004 0.732560299095379 1.18471867566798 -0.947922268527059 1.33685666941406 1.39094097748753 2.37188680055047 3.07760951738704 -0.277142986781852 1.55727472300494 1.15513554742342 -0.533098042780075 2.58868900891793 1.26249236100172 1.64771825873614 2.12867732922824 -0.0755260844598542 2.11300849607303 3.3059042901439 0.0979415385612671 1.05121147377042
"ENSG00000000460.12_1_169655079_G_A_b37" 0.319836864977651 -0.28858559927055 -0.293302417448151 -0.382708750244296 -1.88572937411031 -0.731279532668538 -2.28866218174984 -3.15452327558819 -5.62131620925679 -6.28282827031234 -3.99155600986804 -1.02718390039357 -1.47529757341139 -1.04324409042758 -1.12330319823082 -1.77672428663425 0.0924974003354592 0.521176495698238 0.439544399604355 -1.29444083282421 -0.285879455613173 -0.810438102891698 -1.2185792561654 -1.05394253140009 -3.29222412895229 -2.47189298266347 3.12046607511662 -2.9484270988292 -2.41368323230473 -3.02474209862182 -1.32488348689969 -0.144362657854571 -2.53363783918856 -2.06456099567809 -0.0106500419867326 0.394751496292719 -0.183182088998179 0.267137151309772 -0.140226463470851 -1.38261451093832 -4.57185776099714 -3.26786047004333 -1.78685206860429 -2.00467145209083
"ENSG00000000938.8_1_27888990_T_C_b37" 1.84387146947355 1.26017531836356 0.277379075325108 2.31477887178685 -0.0762272593808945 1.81648404994728 1.60902560496413 1.39805785681695 0.714104043472631 0.483387491488957 0.691331931801022 1.94391865458563 1.88911503403548 0.12587020716696 -0.136190700141412 1.02030170309299 1.18785319893026 -0.797752689697484 0.131302193902833 -0.211502169456531 1.67371464571622 1.52537581450487 1.57392594799612 2.5610798915074 0.610113571499602 0.988896273913113 -0.148514063655499 3.33285120204704 1.11503124376176 0.58105653912896 1.39801915019313 -0.307152421902023 -1.257704217961 1.71768552657065 1.90436415367498 0.419554183298909 0.487052615599714 -4.4523573610964 0.678535731823725 1.09278301526146 1.31522001167607 0.560447857269273 0.177705680259496 -0.495288113604864
In [8]:
%sosrun get_list
[get_list]
output: "${snp_list!a}"
import numpy as np
np.savetxt("${snp_list!a}", 
           [':'.join(x.split('_')[1:3]) for x in np.loadtxt("${mash_snps!a}", dtype = 'str', delimiter=" ", skiprows=1, usecols=(0,)).tolist()],
          fmt = '%s')
%preview /home/gaow/Documents/GTEx/mash_revision/snp_eqtls.txt
> /home/gaow/Documents/GTEx/mash_revision/snp_eqtls.txt (183.2 KiB):
20:49461813
1:169695110
1:169655079
1:27888990
1:196513323

From HDF5 table row names

In particular my GTEx V6 sumstat database MatrixEQTLSumStats.Portable.h5

In [17]:
%sosrun get_all_list
[get_all_list]
output: snp_list, snp_list_random
python:
    import h5py, numpy as np
    f = h5py.File(${mash_snps_all!ar})
    max_list = [':'.join(x.decode().split('_')[1:3]) for x in f['max']['rownames']]
    random_list = [':'.join(x.decode().split('_')[1:3]) for x in f['null']['rownames']]
    f.close()
    np.savetxt(${snp_list!ar}, max_list, fmt = '%s')
    np.savetxt(${snp_list_random!ar}, random_list, fmt = '%s')
%preview /home/gaow/Documents/GTEx/mash_revision/snp_eqtls.txt /home/gaow/Documents/GTEx/mash_revision/snp_random.txt
> /home/gaow/Documents/GTEx/mash_revision/snp_eqtls.txt (183.2 KiB):
20:49461813
1:169695110
1:169655079
1:27888990
1:196513323
> /home/gaow/Documents/GTEx/mash_revision/snp_random.txt (549.6 KiB):
20:49782767
20:49654572
20:49392478
1:169117725
1:170340311

Compute LD for given list of variants

Code chunk below extracts SNPs from GTEx genotype data and computes r2 via PLINK.

  • Steps 1 & 2 extracts data and run PLINK to get a quick estimate for, for example, when a pruning cutoff of 0.2 is set.
  • Steps 3 & 4 more formally calculates LD on per-chromosome basis
In [24]:
%sosrun get_ld:1-2 -b ~/Documents/GTEx/bin
[get_ld_1]
# extract SNPs for given list
parameter: input_list = snp_list
depends: genotype_data
input: input_list
output: "${cwd!a}/mash_revision/${input!bn}.extracted.bed"
task: workdir = cwd
run:
    plink --bfile ${genotype_data!n} \
      --extract ${input} \
      --no-sex --no-pheno --no-parents \
      --make-bed \
      --out ${output!n}

[get_ld_2]
# Quickly survey how many SNPs are removed for given cutoff
pairwise_ld_param = '10000 500 0.2'
output: "${input!n}.prune.out"
run:
    plink --bfile ${input!n} \
          --indep-pairwise ${pairwise_ld_param} \
          --out ${output!nn}

[get_ld_3]
# split data by chrom
chroms = [i+1 for i in range(22)]
input: for_each = 'chroms', pattern = '{name}.prune.out'
output: expand_pattern('{_name}_chr{_chroms}.bed')
task: workdir = cwd
run: 
plink --bfile ${_input!n} \
      --chr ${_chroms} \
      --allow-no-sex \
      --make-bed \
      --out ${_output!n}

[get_ld_4]
# compute LD
input: group_by = 1, pattern = '{name}.bed'
output: expand_pattern('{_name}.r2.ld.gz')
task: workdir = cwd
run:
    plink --bfile ${_input!n} \
          --out ${_output!nn} \
          --r2 square gz
1 task completed: 5b8b
plink --bfile /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted \
      --indep-pairwise 10000 500 0.2 \
      --out /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted
PLINK v1.90b4.3 64-bit (9 May 2017)            www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted.log.
Options in effect:
  --bfile /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted
  --indep-pairwise 10000 500 0.2
  --out /home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted

32120 MB RAM detected; reserving 16060 MB for main workspace.
13030 variants loaded from .bim file.
635 people (0 males, 0 females, 635 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 635 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
13030 variants and 635 people pass filters and QC.
Note: No phenotypes present.
Pruned 490 variants from chromosome 1, leaving 891.
Pruned 263 variants from chromosome 2, leaving 613.
Pruned 243 variants from chromosome 3, leaving 497.
Pruned 105 variants from chromosome 4, leaving 339.
Pruned 158 variants from chromosome 5, leaving 442.
Pruned 192 variants from chromosome 6, leaving 467.
Pruned 207 variants from chromosome 7, leaving 432.
Pruned 132 variants from chromosome 8, leaving 356.
Pruned 161 variants from chromosome 9, leaving 361.
Pruned 164 variants from chromosome 10, leaving 397.
Pruned 247 variants from chromosome 11, leaving 493.
Pruned 235 variants from chromosome 12, leaving 472.
Pruned 36 variants from chromosome 13, leaving 181.
Pruned 136 variants from chromosome 14, leaving 294.
Pruned 159 variants from chromosome 15, leaving 320.
Pruned 298 variants from chromosome 16, leaving 375.
Pruned 345 variants from chromosome 17, leaving 503.
Pruned 29 variants from chromosome 18, leaving 156.
Pruned 350 variants from chromosome 19, leaving 653.
Pruned 88 variants from chromosome 20, leaving 259.
Pruned 39 variants from chromosome 21, leaving 99.
Pruned 127 variants from chromosome 22, leaving 226.
Pruning complete.  4204 of 13030 variants removed.
Marker lists written to
/home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted.prune.in and
/home/gaow/Documents/GTEx/mash_revision/GTEx7.Imputed.extracted.prune.out .

Here I executed steps 1 & 2. You'll see PLINK's log if you expand the code chunk above. But the relevant PLINK output message here is:

Pruning complete.  4204 of 13030 variants removed.

Examine LD strengths per chromosome

This section takes a deeper look at LD pattern using results from steps 3 & 4. Cell below loads all data and compute some summary statistics on LD strength.

In [19]:
setwd('~/Documents/GTEx/mash_revision/')
grid = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
prop = matrix(0, length(grid), 22)
snps = matrix(0, length(grid), 22)
n_snps = 0
for (i in 1:22) {
    tmp = read.table(paste0('GTEx7.Imputed.extracted_chr', i, '.r2.ld.gz'))
    tmp[is.na(tmp)] = 0
    rownames(tmp) = colnames(tmp)
    diag(tmp) = 0
    tmp[upper.tri(tmp)] = 0
    n_snps = n_snps + dim(tmp)[1]
    for (j in 1:length(grid)) {
        m = abs(tmp) > grid[j]
        prop[j,i] = sum(m) / ((dim(tmp)[1]^2 - dim(tmp)[1])/ 2)
        ss = c(rownames(m)[row(m)[which(m)]], colnames(m)[col(m)[which(m)]])
        snps[j,i] = length(unique(ss))
    }
}

Here is summary of proportion of pairs on each chromosome (row) having LD greater than thresholds (column):

In [20]:
prop = data.frame(t(prop))
colnames(prop) = grid
rownames(prop) = paste('chr', 1:22)
prop
apply(prop, 2, mean)
0.10.20.30.40.50.60.70.80.90.950.99
chr 10.002647735 0.001113455 0.00076923880.00055200500.00040298460.00030538680.00022772830.00015951470.00010074627.031242e-052.728542e-05
chr 20.002238748 0.001330724 0.00096803650.00072276580.00055055450.00039138940.00029745600.00019830400.00011480766.001305e-051.304631e-05
chr 30.003638957 0.002022455 0.00150678420.00114471710.00082653700.00055224370.00040229670.00032183740.00024137801.353180e-045.485865e-05
chr 40.002470868 0.001535395 0.00115917270.00087446360.00070160450.00056941820.00040672730.00026437270.00017285911.016818e-044.067273e-05
chr 50.002170284 0.001363383 0.00098497500.00081803010.00064552030.00044518640.00035614910.00024485250.00014468568.347245e-051.669449e-05
chr 60.003293191 0.001729617 0.00109773030.00072874530.00057192670.00045200660.00037359730.00028596340.00020294171.245324e-043.689850e-05
chr 70.004022743 0.002099676 0.00150117000.00113323620.00083888910.00062303460.00050039000.00033849910.00023057191.422677e-044.905784e-05
chr 80.003694415 0.001943986 0.00155687210.00131282190.00106035610.00070690410.00053017810.00029454340.00015989501.094018e-041.683105e-05
chr 90.003993205 0.002316500 0.00175759850.00141196200.00107367940.00069862700.00058096350.00045594610.00018384927.353969e-052.206191e-05
chr 100.003673287 0.002107207 0.00160427810.00127323660.00098039220.00072574480.00056659030.00038833720.00021645021.527884e-043.819710e-05
chr 110.003291519 0.001547014 0.00114105990.00090699630.00077899280.00062538860.00047544160.00035109530.00024137801.426325e-045.120140e-05
chr 120.004283350 0.001979397 0.00138637900.00097767770.00070921700.00054092820.00043274260.00033257070.00024441941.482544e-046.411001e-05
chr 130.004053593 0.002261478 0.00157876770.00123741250.00106673490.00068271040.00059737160.00029868580.00012800828.533880e-050.000000e+00
chr 140.004651163 0.002634575 0.00175638320.00126849890.00100829400.00087819160.00073724720.00045535860.00026020491.843118e-044.336749e-05
chr 150.004900376 0.002716608 0.00192171630.00138014170.00096959320.00077742160.00062019020.00049789920.00021837688.735074e-052.620522e-05
chr 160.007256952 0.003975624 0.00292754550.00215364750.00157875190.00126477040.00083138750.00059700700.00032282602.034246e-047.517866e-05
chr 170.003475084 0.001985365 0.00137555410.00111659350.00087434010.00069891520.00054298190.00036198790.00022276181.253035e-043.619879e-05
chr 180.003407756 0.002173913 0.00164512340.00117508810.00099882490.00082256170.00064629850.00047003530.00011750885.875441e-050.000000e+00
chr 190.002358195 0.001201983 0.00087760670.00064875230.00051342980.00039601750.00030248580.00020696390.00013134259.552182e-053.184061e-05
chr 200.004797521 0.002948477 0.00209891560.00139927700.00106611580.00069963850.00046642570.00031650310.00011660646.663224e-050.000000e+00
chr 210.009732360 0.006135618 0.00412567440.00317359570.00232730350.00200994390.00137522480.00137522480.00052893262.115730e-041.057865e-04
chr 220.007098249 0.004217100 0.00309039400.00233389130.00177053820.00151300540.00122328100.00077259850.00049896992.736286e-048.047901e-05
0.1
0.00414316140205588
0.2
0.00233361591922266
0.3
0.00167413526759036
0.4
0.00126107073619321
0.5
0.000968844573194746
0.6
0.000744519756734997
0.7
0.000567870677882685
0.8
0.000408550028183733
0.9
0.000218159995459414
0.95
0.000124366105021849
0.99
3.7725986625247e-05

Here is summary of number of unique SNPs involved on each chromosome (row) having LD greater than thresholds (column):

In [21]:
snps = data.frame(t(snps))
colnames(snps) = grid
rownames(snps) = paste('chr', 1:22)
snps
0.10.20.30.40.50.60.70.80.90.950.99
chr 195777565055345438931623615911649
chr 2531408350278236194166123 76 4610
chr 3483369302264228191146120 94 6029
chr 4234171138110101 88 69 48 34 20 8
chr 5346262211181158122107 76 46 29 6
chr 6405305248194160129110 91 68 4216
chr 7397321273233190149121 98 78 5019
chr 8286208177150131 94 79 62 38 26 4
chr 9313247212172149123108 90 42 20 6
chr 10328261214189147113 97 76 47 3412
chr 11502390330289253216172138103 6725
chr 12446364300267216179151123 94 6629
chr 13 90 65 50 42 36 29 25 11 6 4 0
chr 14277220172134121106 95 64 38 28 8
chr 15325241205174126105 89 76 39 19 6
chr 16526431382334295246196163101 6825
chr 17624512421369303262212150102 6623
chr 18 69 52 42 28 22 21 16 12 4 2 0
chr 19694555456386346280224164111 8232
chr 20189142110 98 87 62 48 37 14 8 0
chr 21 86 69 57 48 38 32 23 23 10 4 2
chr 22240193167142106 92 77 54 40 23 7

Proportion of SNPs involved are:

In [22]:
apply(snps, 2, sum) / n_snps
0.1
0.640675364543361
0.2
0.50353031465848
0.3
0.419570222563315
0.4
0.355717574827322
0.5
0.299539524174981
0.6
0.247275518035303
0.7
0.203146584804298
0.8
0.156178050652341
0.9
0.103146584804298
0.95
0.0675364543361474
0.99
0.0242517267843438
In [23]:
%sessioninfo

Session Info

SoS

SoS Version
0.9.8.10
numpy
1.13.1

R

Kernel
ir
Language
R
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: BunsenLabs GNU/Linux 8.7 (Hydrogen)

Matrix products: default
BLAS: /usr/lib64/microsoft-r/3.4/lib64/R/lib/libRblas.so
LAPACK: /usr/lib/libopenblasp-r0.2.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
 [1] compiler_3.4.0      R6_2.2.0            magrittr_1.5       
 [4] RevoUtils_10.0.4    IRdisplay_0.4.4     pbdZMQ_0.2-5       
 [7] tools_3.4.0         crayon_1.3.2        uuid_0.1-2         
[10] stringi_1.1.5       IRkernel_0.8.7.9000 jsonlite_1.4       
[13] stringr_1.2.0       digest_0.6.12       repr_0.12.0        
[16] evaluate_0.10      

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago