Multivariate Bayesian variable selection regression

GTEx V7 genotype data imputation

The official release currently does not impute missing data. We use Michigan Imputation Server for the task.

In [ ]:
[global]
genotype = "~/Documents/GTEx/gtex7/variant_calls/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.vcf.gz"
cwd = "~/Documents/GTEx"

[plink: provides = executable('plink')]
# Download and extract plink2
output: os.path.join(cwd, "bin/plink")
task: workdir = cwd
download:
  https://www.cog-genomics.org/static/bin/plink170509/plink_linux_x86_64.zip
run:
  unzip -o plink_linux_x86_64.zip plink
  mv plink bin/plink
  rm -f plink_linux_x86_64.zip

[minimac3: provides = executable('Minimac3')]
# Download and extract minimac3
# Chosen over impute2 because of results shown here
# But is not needed: turns out the imputation server works great!
# McCarthy, Shane, et al. "A reference panel of 64,976 haplotypes for genotype imputation." Nature genetics (2016).
output: os.path.join(cwd, "bin/Minimac3")
task: workdir = cwd
download:
  https://github.com/Santy-8128/Minimac3/archive/master.zip
run:
  unzip master.zip
  cd Minimac3-master
  make
  mv bin/* ../bin
  cd ../
  rm -rf master.zip Minimac3-master

Genotype missingness summary

Code below calculates missing rate by loci and by individuals. Plots are made to summarize the results.

In [ ]:
[summary_1]
# make summary statistics of genotype data
depends: executable("plink")
input: "${genotype!a}"
output: "${cwd!a}/genotype_missing/${input!bnn}.imiss.gz", "${cwd!a}/genotype_missing/${input!bnn}.lmiss.gz"
task: workdir = "${cwd!a}"
run:
    plink --vcf ${input} --missing gz --memory 30000 --threads 8 --out ${output[0]!nn}

[summary_2]
# making genotype summary statistics plot
depends: Py_Module("seaborn")
input: group_by = 1, pattern = "{name}.{ext}"
output: expand_pattern("{_name}.pdf")
task: workdir = "${cwd!a}"
run:
   zcat ${_input} | awk '{print $NF}' > ${_input}.fmiss.tmp
python:
import os
import matplotlib.pyplot as plt, seaborn as sns, pandas as pd
fig, axs = plt.subplots(ncols=2)
data = pd.read_csv("${_input}.fmiss.tmp", sep = '\t')
sns.distplot(data["F_MISS"], ax = axs[0], kde = False)
sns.violinplot(data["F_MISS"], ax = axs[1])
axs[0].set_title("Missing rate summary")
fig.savefig(${_output!r})
os.remove("${_input}.fmiss.tmp")

[summary_3]
# Save summary data to PNG in local folder
input: group_by = 1
output: "img/${_input[0]!bn}.png"
run:
    convert -density 200 ${_input} ${_output}
In [ ]:
%sosrun summary -b ~/Documents/GTEx/bin
In [1]:
%preview img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.imiss.png
%preview img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.lmiss.png
> img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.lmiss.png (17.2 KiB):
> img/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.imiss.png (21.2 KiB):

Missing rates are mostly under 1%. This should make imputation relatively straightforward.

Data processing for imputation server

Imputation server requires VCF input, by chromosome, each with complete header info. Code chunk below prepares input VCF file for UMich Imputation Server.

In [2]:
[prepare_1]
# split vcf by chroms
parameter: chrom = [x + 1 for x in range(22)] + ['X'] # there is in fact no Y chrom
input: "${genotype!a}", for_each = ['chrom']
output: ["${cwd!a}/genotype_by_chrom/${input!bnn}.chr{}.vcf.gz".format(x) for x in chrom], group_by = 1
task: workdir = "${cwd!a}"
run:
  tabix ${input} ${_chrom} --print-header | bgzip > ${_output}

Imputation

Imputation was done with Michigan Imputation Server (MIS) because it uses Haplotype Reference Consortium (32,914 samples) reference panel which is not publicly available otherwise. Here is how to prepare data for this service. The prepared files are uploaded to Michigan imputation server.

Configuration

In [2]:
%preview img/UMichImputation.png
> img/UMichImputation.png (168.0 KiB):

Outcome

In [3]:
%preview img/UMichImputationResult.pdf
> img/UMichImputationResult.pdf (256.5 KiB):

Imputation result summary

We want to compare the imputation result with the original data to address the following questions:

  1. Overlap between original and imputated, and focus on why some are filtered out
  2. Filter imputated data removing all imputed sites; also filter by MAF
  3. Filter out data having >5% missing in the original
  4. Are genotype prediction done for genotypes that are called?
  5. Summary of uncertainty in imputed SNPs

Below are answers to these points. [FIXME: to be completed]

In [1]:
%sessioninfo

SoS

SoS Version
0.9.8.10

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago