GTEx V7 genotype data imputation¶

The official release currently does not impute missing data. We use Michigan Imputation Server for the task.

[global]
genotype = "~/Documents/GTEx/gtex7/variant_calls/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.vcf.gz"
cwd = "~/Documents/GTEx"

[plink: provides = executable('plink')]
# Download and extract plink2
output: os.path.join(cwd, "bin/plink")
task: workdir = cwd
download:
  https://www.cog-genomics.org/static/bin/plink170509/plink_linux_x86_64.zip
run:
  unzip -o plink_linux_x86_64.zip plink
  mv plink bin/plink
  rm -f plink_linux_x86_64.zip

[minimac3: provides = executable('Minimac3')]
# Download and extract minimac3
# Chosen over impute2 because of results shown here
# But is not needed: turns out the imputation server works great!
# McCarthy, Shane, et al. "A reference panel of 64,976 haplotypes for genotype imputation." Nature genetics (2016).
output: os.path.join(cwd, "bin/Minimac3")
task: workdir = cwd
download:
  https://github.com/Santy-8128/Minimac3/archive/master.zip
run:
  unzip master.zip
  cd Minimac3-master
  make
  mv bin/* ../bin
  cd ../
  rm -rf master.zip Minimac3-master

Genotype missingness summary¶

Code below calculates missing rate by loci and by individuals. Plots are made to summarize the results.

[summary_1]
# make summary statistics of genotype data
depends: executable("plink")
input: "${genotype!a}"
output: "${cwd!a}/genotype_missing/${input!bnn}.imiss.gz", "${cwd!a}/genotype_missing/${input!bnn}.lmiss.gz"
task: workdir = "${cwd!a}"
run:
    plink --vcf ${input} --missing gz --memory 30000 --threads 8 --out ${output[0]!nn}

[summary_2]
# making genotype summary statistics plot
depends: Py_Module("seaborn")
input: group_by = 1, pattern = "{name}.{ext}"
output: expand_pattern("{_name}.pdf")
task: workdir = "${cwd!a}"
run:
   zcat ${_input} | awk '{print $NF}' > ${_input}.fmiss.tmp
python:
import os
import matplotlib.pyplot as plt, seaborn as sns, pandas as pd
fig, axs = plt.subplots(ncols=2)
data = pd.read_csv("${_input}.fmiss.tmp", sep = '\t')
sns.distplot(data["F_MISS"], ax = axs[0], kde = False)
sns.violinplot(data["F_MISS"], ax = axs[1])
axs[0].set_title("Missing rate summary")
fig.savefig(${_output!r})
os.remove("${_input}.fmiss.tmp")

[summary_3]
# Save summary data to PNG in local folder
input: group_by = 1
output: "img/${_input[0]!bn}.png"
run:
    convert -density 200 ${_input} ${_output}

%sosrun summary -b ~/Documents/GTEx/bin

%preview img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.imiss.png
%preview img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.lmiss.png

Missing rates are mostly under 1%. This should make imputation relatively straightforward.

Data processing for imputation server¶

Imputation server requires VCF input, by chromosome, each with complete header info. Code chunk below prepares input VCF file for UMich Imputation Server.

[prepare_1]
# split vcf by chroms
parameter: chrom = [x + 1 for x in range(22)] + ['X'] # there is in fact no Y chrom
input: "${genotype!a}", for_each = ['chrom']
output: ["${cwd!a}/genotype_by_chrom/${input!bnn}.chr{}.vcf.gz".format(x) for x in chrom], group_by = 1
task: workdir = "${cwd!a}"
run:
  tabix ${input} ${_chrom} --print-header | bgzip > ${_output}

%sosrun prepare -J8

ab213ede13dd94ac43ed2c7ec7fdf7fd

0 sec

8c1b9a0bd56717f5642e2b213885c8cd

0 sec

d2a0c1f00db2c8c6dab622ff0353059c

0 sec

4c1aedd17d89ea83c02c19bc9048525a

0 sec

ea1e754ba297d6a36a41915ef84a90d5

0 sec

Imputation¶

Imputation was done with Michigan Imputation Server (MIS) because it uses Haplotype Reference Consortium (32,914 samples) reference panel which is not publicly available otherwise. Here is how to prepare data for this service. The prepared files are uploaded to Michigan imputation server.

Configuration¶

%preview img/UMichImputation.png

Outcome¶

%preview img/UMichImputationResult.pdf

Imputation result summary¶

We want to compare the imputation result with the original data to address the following questions:

Overlap between original and imputated, and focus on why some are filtered out
Filter imputated data removing all imputed sites; also filter by MAF
Filter out data having >5% missing in the original
Are genotype prediction done for genotypes that are called?
Summary of uncertainty in imputed SNPs

Below are answers to these points. [FIXME: to be completed]

%sessioninfo

0.9.8.10