Though it is straightfoward enough to do it in R / Python, I use PLINK to compute the LD matrix.
PLINK is highly efficient. First I extract the subset of variants of interest from all data, then for a quick look I use --indep-pairwise
to prune SNPs at given LD level. Finally I use --r2
option to compute LD stats and formally report a summary.
[global]
cwd = '~/Documents/GTEx'
mash_snps = '~/GIT/github/gtexresults_mash/Data/maxz.txt'
mash_snps_all = "${cwd!a}/mash_revision/MatrixEQTLSumStats.Portable.h5"
snp_list = "${cwd!a}/mash_revision/snp_eqtls.txt"
snp_list_random = "${cwd!a}/mash_revision/snp_random.txt"
genotype_data = "${cwd!a}/genotype_plink/GTEx7.Imputed.bed"
%preview /home/gaow/GIT/github/gtexresults_mash/Data/maxz.txt --limit 2
%sosrun get_list
[get_list]
output: "${snp_list!a}"
import numpy as np
np.savetxt("${snp_list!a}",
[':'.join(x.split('_')[1:3]) for x in np.loadtxt("${mash_snps!a}", dtype = 'str', delimiter=" ", skiprows=1, usecols=(0,)).tolist()],
fmt = '%s')
In particular my GTEx V6 sumstat database MatrixEQTLSumStats.Portable.h5
%sosrun get_all_list
[get_all_list]
output: snp_list, snp_list_random
python:
import h5py, numpy as np
f = h5py.File(${mash_snps_all!ar})
max_list = [':'.join(x.decode().split('_')[1:3]) for x in f['max']['rownames']]
random_list = [':'.join(x.decode().split('_')[1:3]) for x in f['null']['rownames']]
f.close()
np.savetxt(${snp_list!ar}, max_list, fmt = '%s')
np.savetxt(${snp_list_random!ar}, random_list, fmt = '%s')
Code chunk below extracts SNPs from GTEx genotype data and computes r2
via PLINK.
%sosrun get_ld:1-2 -b ~/Documents/GTEx/bin
[get_ld_1]
# extract SNPs for given list
parameter: input_list = snp_list
depends: genotype_data
input: input_list
output: "${cwd!a}/mash_revision/${input!bn}.extracted.bed"
task: workdir = cwd
run:
plink --bfile ${genotype_data!n} \
--extract ${input} \
--no-sex --no-pheno --no-parents \
--make-bed \
--out ${output!n}
[get_ld_2]
# Quickly survey how many SNPs are removed for given cutoff
pairwise_ld_param = '10000 500 0.2'
output: "${input!n}.prune.out"
run:
plink --bfile ${input!n} \
--indep-pairwise ${pairwise_ld_param} \
--out ${output!nn}
[get_ld_3]
# split data by chrom
chroms = [i+1 for i in range(22)]
input: for_each = 'chroms', pattern = '{name}.prune.out'
output: expand_pattern('{_name}_chr{_chroms}.bed')
task: workdir = cwd
run:
plink --bfile ${_input!n} \
--chr ${_chroms} \
--allow-no-sex \
--make-bed \
--out ${_output!n}
[get_ld_4]
# compute LD
input: group_by = 1, pattern = '{name}.bed'
output: expand_pattern('{_name}.r2.ld.gz')
task: workdir = cwd
run:
plink --bfile ${_input!n} \
--out ${_output!nn} \
--r2 square gz
Here I executed steps 1 & 2. You'll see PLINK's log if you expand the code chunk above. But the relevant PLINK output message here is:
Pruning complete. 4204 of 13030 variants removed.
This section takes a deeper look at LD pattern using results from steps 3 & 4. Cell below loads all data and compute some summary statistics on LD strength.
setwd('~/Documents/GTEx/mash_revision/')
grid = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
prop = matrix(0, length(grid), 22)
snps = matrix(0, length(grid), 22)
n_snps = 0
for (i in 1:22) {
tmp = read.table(paste0('GTEx7.Imputed.extracted_chr', i, '.r2.ld.gz'))
tmp[is.na(tmp)] = 0
rownames(tmp) = colnames(tmp)
diag(tmp) = 0
tmp[upper.tri(tmp)] = 0
n_snps = n_snps + dim(tmp)[1]
for (j in 1:length(grid)) {
m = abs(tmp) > grid[j]
prop[j,i] = sum(m) / ((dim(tmp)[1]^2 - dim(tmp)[1])/ 2)
ss = c(rownames(m)[row(m)[which(m)]], colnames(m)[col(m)[which(m)]])
snps[j,i] = length(unique(ss))
}
}
Here is summary of proportion of pairs on each chromosome (row) having LD greater than thresholds (column):
prop = data.frame(t(prop))
colnames(prop) = grid
rownames(prop) = paste('chr', 1:22)
prop
apply(prop, 2, mean)
Here is summary of number of unique SNPs involved on each chromosome (row) having LD greater than thresholds (column):
snps = data.frame(t(snps))
colnames(snps) = grid
rownames(snps) = paste('chr', 1:22)
snps
Proportion of SNPs involved are:
apply(snps, 2, sum) / n_snps
%sessioninfo