The official release currently does not impute missing data. We use Michigan Imputation Server for the task.
[global]
genotype = "~/Documents/GTEx/gtex7/variant_calls/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.vcf.gz"
cwd = "~/Documents/GTEx"
[plink: provides = executable('plink')]
# Download and extract plink2
output: os.path.join(cwd, "bin/plink")
task: workdir = cwd
download:
https://www.cog-genomics.org/static/bin/plink170509/plink_linux_x86_64.zip
run:
unzip -o plink_linux_x86_64.zip plink
mv plink bin/plink
rm -f plink_linux_x86_64.zip
[minimac3: provides = executable('Minimac3')]
# Download and extract minimac3
# Chosen over impute2 because of results shown here
# But is not needed: turns out the imputation server works great!
# McCarthy, Shane, et al. "A reference panel of 64,976 haplotypes for genotype imputation." Nature genetics (2016).
output: os.path.join(cwd, "bin/Minimac3")
task: workdir = cwd
download:
https://github.com/Santy-8128/Minimac3/archive/master.zip
run:
unzip master.zip
cd Minimac3-master
make
mv bin/* ../bin
cd ../
rm -rf master.zip Minimac3-master
Code below calculates missing rate by loci and by individuals. Plots are made to summarize the results.
[summary_1]
# make summary statistics of genotype data
depends: executable("plink")
input: "${genotype!a}"
output: "${cwd!a}/genotype_missing/${input!bnn}.imiss.gz", "${cwd!a}/genotype_missing/${input!bnn}.lmiss.gz"
task: workdir = "${cwd!a}"
run:
plink --vcf ${input} --missing gz --memory 30000 --threads 8 --out ${output[0]!nn}
[summary_2]
# making genotype summary statistics plot
depends: Py_Module("seaborn")
input: group_by = 1, pattern = "{name}.{ext}"
output: expand_pattern("{_name}.pdf")
task: workdir = "${cwd!a}"
run:
zcat ${_input} | awk '{print $NF}' > ${_input}.fmiss.tmp
python:
import os
import matplotlib.pyplot as plt, seaborn as sns, pandas as pd
fig, axs = plt.subplots(ncols=2)
data = pd.read_csv("${_input}.fmiss.tmp", sep = '\t')
sns.distplot(data["F_MISS"], ax = axs[0], kde = False)
sns.violinplot(data["F_MISS"], ax = axs[1])
axs[0].set_title("Missing rate summary")
fig.savefig(${_output!r})
os.remove("${_input}.fmiss.tmp")
[summary_3]
# Save summary data to PNG in local folder
input: group_by = 1
output: "img/${_input[0]!bn}.png"
run:
convert -density 200 ${_input} ${_output}
%sosrun summary -b ~/Documents/GTEx/bin
%preview img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.imiss.png
%preview img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.lmiss.png
Missing rates are mostly under 1%. This should make imputation relatively straightforward.
Imputation server requires VCF input, by chromosome, each with complete header info. Code chunk below prepares input VCF file for UMich Imputation Server.
[prepare_1]
# split vcf by chroms
parameter: chrom = [x + 1 for x in range(22)] + ['X'] # there is in fact no Y chrom
input: "${genotype!a}", for_each = ['chrom']
output: ["${cwd!a}/genotype_by_chrom/${input!bnn}.chr{}.vcf.gz".format(x) for x in chrom], group_by = 1
task: workdir = "${cwd!a}"
run:
tabix ${input} ${_chrom} --print-header | bgzip > ${_output}
%sosrun prepare -J8
Imputation was done with Michigan Imputation Server (MIS) because it uses Haplotype Reference Consortium (32,914 samples) reference panel which is not publicly available otherwise. Here is how to prepare data for this service. The prepared files are uploaded to Michigan imputation server.
%preview img/UMichImputation.png
%preview img/UMichImputationResult.pdf
We want to compare the imputation result with the original data to address the following questions:
Below are answers to these points. [FIXME: to be completed]
%sessioninfo