Process VCF files from Michigan imputation server by removing imputed sites and fix variant IDs. We also perform MDS analysis to obtain covariates for association analysis.
[global]
original_genotypes = '~/Documents/GTEx/genotype_by_chrom/*.vcf.gz'
imputed_genotypes= '~/Documents/GTEx/gtex7/michigan_imputed/*.vcf.gz'
parameter: cwd = '~/Documents/GTEx'
phenotype = '~/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_SubjectPhenotypesDS.txt'
[vcftools: provides = executable('vcftools')]
output: os.path.join(cwd, "bin/vcftools")
task: workdir = cwd
download:
https://github.com/vcftools/vcftools/archive/master.zip
run:
unzip master.zip
cd vcftools-master
bash autogen.sh
./configure; make
mv src/cpp/vcftools ../bin
cd ../
rm -rf master.zip vcftools-master
The primary goal here is to convert genotypes from VCF format to PLINK binary format, to make it easy to obtain various statistics and perform routine analysis down the road. During the conversion, imputed sites are removed and variant ID for tri-allelic sites are fixed. Also both the original and imputed data are converted and saved separately.
Code chunk below converts from VCF to PLINK format, fixing variant ID issues along the course.
%sosrun -J 8
[umich_to_plink_1, broad_to_plink_1]
# VCF to plink format
input: group_by = 'single'
output: "${_input!nn}.a2flipped.bed"
task: workdir = cwd
run:
plink --vcf ${_input} --out ${_output!n} --make-bed
[umich_to_plink_2]
# Fix multi-allelic sites
# By changing duplicate SNP ID
input: group_by = 'single'
output: "${_input!n}.bim"
task: workdir = cwd
python:
from collections import Counter
import numpy as np
dat = np.genfromtxt(${_output!r}, dtype = '<U48')
counter = Counter()
dup_list = dat[:,1]
deduped = []
for name in dup_list:
new = name + '_' + str(counter[name]) if counter[name] else name
counter.update({name: 1})
deduped.append(new)
dat[:,1] = np.array(deduped)
np.savetxt(${_output!r}, dat, delimiter = '\t', fmt = '%s')
[umich_to_plink_3]
# Allele fix
input: group_by = 'single'
output: "${_input!nn}.bed"
task: workdir = cwd
run:
paste <(cut -f2 ${_input!n}.bim) <(zcat ${_input!nn}.vcf.gz | grep -v "^#" | cut -f4) > ${_input!nn}.a2
plink --bfile ${_input!n} \
--a2-allele ${_input!nn}.a2 2 1 '#' \
--out ${_output!n} --make-bed
rm -f ${_input!nn}.a2
[broad_to_plink_2]
# Remove indel sites
input: group_by = 'single'
output: "${_input!n}.snp_only.bed"
task: workdir = cwd
run:
awk '{if ((length($5) > 1 ) || (length($6) > 1)) {print $2}}' ${_input!n}.bim > ${_input!n}.exclude_id
plink --bfile ${_input!n} --out ${_output!n} --make-bed \
--exclude ${_input!n}.exclude_id \
rm -f ${_input!n}.exclude_id
[umich_to_plink_4, broad_to_plink_3]
# Merge all files
parameter: project_name = "GTEx7"
output: "${cwd!a}/genotype_plink/${project_name}.bed"
task: workdir = cwd
run:
echo ${input!n} | sed 's/ /\n/g' | grep -v chrX > plink.merge-list
plink --bfile ${input[0]!n} --merge-list plink.merge-list --out ${output!n} \
--chr 1-22 --make-bed
rm plink.merge-list
[default_1]
input: original_genotypes
sos_run('broad_to_plink', project_name = "GTEx7.dbGaP")
[default_2]
input: imputed_genotypes
sos_run('imputed_genotypes', project_name = "GTEx7.Imputed")
Code chunk below compares GTEx7.dbGap
and GTEx7.Imputed
and extract only genotyped variants from imputation results. Additional filters on minor allele frequency and minor allele counts are also applied. The result is genotype input for eQTL discovery, saved in PLINK format.
%sosrun variants_filter
[variants_filter_1]
# Exclude imputed variants
include = "${cwd}/genotype_plink/GTEx7.dbGaP.bed"
input: "${cwd}/genotype_plink/GTEx7.Imputed.bed"
output: "${input!n}.genotyped.bed"
task: workdir = cwd
python:
import pandas as pd
imputed = pd.read_csv("${input!n}.bim", dtype = str, usecols = (0,1,3), header = None, sep = '\t')
genotyped = pd.read_csv("${include!n}.bim", dtype = str, usecols = (0,1,3), header = None, sep = '\t')
imputed[0] = imputed[0] + '_' + imputed[3]
imputed.drop([3], axis = 1, inplace = True)
genotyped[0] = genotyped[0] + '_' + genotyped[3]
genotyped.drop([1,3], axis = 1, inplace = True)
res = pd.merge(imputed, genotyped, how='inner', on=[0,0])
res[1].to_csv("${input!n}.extract_id", header = False, index = False)
run:
plink --bfile ${input!n} --out ${output!n} --extract ${input!n}.extract_id --make-bed
rm -f ${input!n}.extract_id
[variants_filter_2]
# Filter by MAF, HWE etc
parameter: maf = 0.01
parameter: mac = 10
output: "${input!n}.filtered.bed"
task: workdir = cwd
run:
plink --bfile ${input!n} --maf ${maf} --mac ${mac} --make-bed --out ${output!n}
After LD pruning I perform MDS analysis via the king
program. We want to find out in terms of PCA plot:
See code chunk below for implementation.
[LD_pruning_1]
# Mark independent variants
parameter: pairwise_ld_param = '50 5 0.2'
parameter: maf = 0.1
depends: executable('plink')
input: pattern = '{name}.{ext}'
output: expand_pattern('{name}.prune.in')
task: workdir = cwd
run:
plink --bfile ${input!n} \
--allow-no-sex \
--maf ${maf} \
--indep-pairwise ${pairwise_ld_param} \
--out ${output!nn}
[LD_pruning_2]
# Select independent common variants
depends: executable('plink')
input: pattern = '{name}.prune.in'
output: "${cwd!a}/genotype_pca/${input!bn}.bed"
task: workdir = cwd
run:
plink --bfile ${input!nn} \
--extract ${input} \
--no-sex --no-pheno --no-parents \
--make-bed \
--out ${output!n}
[global_ancestry_1]
# global ancestry analysis via KING
# And fix family ID (replace by ancestry coding)
depends: executable('king'), phenotype
input: pattern = "{name}.{ext}"
output: expand_pattern('{name}.pc.ped')
task: workdir = cwd
run:
# king -b ${input!n}.bed --pca 20 --prefix ${input!n}.
# Would be much faster if I use MDS
# One should use PCA to be more comparable to GTEx official results, though
king -b ${input!n}.bed --mds --prefix ${input!n}.
python:
import pandas as pd
reference = pd.read_csv(${phenotype!ar}, dtype = str, usecols = (0,4), sep = '\t')
reference['RACE'] = 'POP' + reference['RACE'].astype(str)
fam = pd.read_csv(${output!r}, header = None, sep = ' ',
names = ['fid', 'sid', 'pid', 'mid', 'sex', 'phen'] + ['PC{}'.format(x+1) for x in range(20)])
dat = pd.merge(reference, fam, left_on = 'SUBJID', right_on = "sid")
dat.drop(['SUBJID', 'fid'], axis=1, inplace = True)
dat.to_csv(${output!r}, sep = ' ', header = False, index = False)
[global_ancestry_2]
# scatter plot for global ancestry analysis result
# R plotly implementation
depends: R_library("plotly")
input: pattern = "{name}.ped"
output: expand_pattern('{name}.html')
task: workdir = cwd
R:
library(plotly)
dat <- read.table(${input!r}, sep = ' ')
for (i in 1:(ncol(dat) - 6 - 1)) {
title <- paste(${input!bnr}, "PC", i, "vs", "PC", i + 1, sep = "_")
p <- plot_ly(dat, x = dat[, 6 + i], y = dat[, 7 + i], text = dat[,2], color = dat[,1],
mode = "markers", type = 'scatter', visible = 'legendonly') %>% layout(title = title)
htmlwidgets::saveWidget(as.widget(p), paste0(title, ".html"))
}
run:
foo () {
echo '<html><body>'
sed 's/^.*/<a href="&">&<\/a><br\/>/'
echo '</body></html>'
}
mkdir -p ${input!n}; mv ${input!bn}*.html ${input!n}
ls ${input!n}/*.html | foo > ${output}
[genotype_pca_broad]
input: "${cwd}/genotype_plink/GTEx7.dbGaP.bed"
sos_run("LD_pruning + global_ancestry")
[genotype_pca_umich]
input: "${cwd}/genotype_plink/GTEx7.Imputed.bed"
sos_run("LD_pruning + global_ancestry")
[genotype_pca_umich_filtered]
# Filtered imputation data removing imputed sites
input: "${cwd}/genotype_plink/GTEx7.Imputed.genotyped.filtered.bed"
sos_run("LD_pruning + global_ancestry")
%sosrun genotype_pca_broad
%sosrun genotype_pca_umich
%sosrun genotype_pca_umich_filtered
GTEx has provided in dbGaP the PCA analysis done at Broad before imputation. I want to verify it agrees with my PCA analysis for imputed data. Code chunk below matches sample label with Broad PCA results to prepare data for plot, and saved to a temporary file /tmp/GTExPCA.ped
(previewed below) to be used by the same plot routine I developed here.
%preview dat --limit 20
import pandas as pd
reference = pd.read_csv('/home/gaow/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_SubjectPhenotypesDS.txt', dtype = str, usecols = (0,4), sep = '\t')
reference['RACE'] = 'POP' + reference['RACE'].astype(str)
ped = pd.read_csv('/home/gaow/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PostVarQC_20genotPCs.txt', header = 0, sep = '\t')
ped['FID'] = ['-'.join(x.split('-')[:2]) for x in ped['FID']]
dat = pd.merge(reference, ped, left_on = 'SUBJID', right_on = "FID")
dat.drop(['SUBJID', 'FID'], axis=1, inplace = True)
dat.insert(2, 'pid', 0)
dat.insert(3, 'mid', 0)
dat.insert(4, 'sex', 0)
dat.insert(5, 'phen', 0)
dat.to_csv('/tmp/GTExPCA.ped', sep = ' ', header = False, index = False)
%sosrun pca_plot_broad
[pca_plot_broad]
input: '/tmp/GTExPCA.ped'
sos_run("global_ancestry:2")
Broad provided:
%preview img/gtex_broad_pca.png
Broad, my analysis:
%preview img/gtex_dbgap_pca_gaow.png
Imputed, my analysis:
%preview img/gtex_umich_imputed_pca.png
The results are quite consistent across different datasets.
%sessioninfo