Revised from old notebook 20170518_Imputation.ipynb
We use the UMich imputation server for this task. Its main limitations are that it cannot impute indels and that it does not handle hg38 coordinates well.
[global]
cwd = "~/Documents/GTExV8"
parameter: genotype = "${cwd!a}/genotypes/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.vcf.gz"
parameter: chain_file = "hg38ToHg19"
[plink: provides = executable('plink')]
# Download and extract plink
output: os.path.join(cwd, "utils/plink")
task: workdir = cwd
download:
https://www.cog-genomics.org/static/bin/plink170509/plink_linux_x86_64.zip
run:
unzip -o plink_linux_x86_64.zip plink
mv plink utils/plink
rm -f plink_linux_x86_64.zip
The code below calculates missing rates by locus and by individual. Plots are made to summarize the results.
[summary_1]
# make summary statistics of genotype data
depends: executable("plink")
input: "${genotype!a}"
output: "${cwd!a}/genotype_missing/${input!bnn}.imiss.gz", "${cwd!a}/genotype_missing/${input!bnn}.lmiss.gz"
task: workdir = "${cwd!a}"
run:
plink --vcf ${input} --missing gz --memory 30000 --threads 8 --out ${output[0]!nn}
[summary_2]
# make genotype summary statistics plots
depends: Py_Module("seaborn")
input: group_by = 1, pattern = "{name}.{ext}"
output: expand_pattern("{_name}.pdf")
task: workdir = "${cwd!a}"
run:
zcat ${_input} | awk '{print $NF}' > ${_input}.fmiss.tmp
python:
import os
import matplotlib.pyplot as plt, seaborn as sns, pandas as pd
fig, axs = plt.subplots(ncols=2)
data = pd.read_csv("${_input}.fmiss.tmp", sep = '\t')
sns.distplot(data["F_MISS"], ax = axs[0], kde = False)
sns.violinplot(data["F_MISS"], ax = axs[1])
axs[0].set_title("Missing rate summary")
fig.savefig(${_output!r})
os.remove("${_input}.fmiss.tmp")
[summary_3]
# Save summary data to PNG in local folder
input: group_by = 1
output: "img/${_input[0]!bn}.png"
run:
convert -density 200 ${_input} ${_output}
%sosrun summary -b ~/Documents/GTExV8/utils
%preview img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.imiss.png
%preview img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.lmiss.png
Missing rates are small, but imputation is still warranted.
The imputation server requires VCF input split by chromosome, each file with a complete header. The code chunk below prepares the input VCF files for the UMich Imputation Server.
[preprocess_0 (utils)]
parameter: lift_over_exec = "${cwd!a}/utils/liftOver"
report: workdir = cwd, output = '/tmp/lift-over.py'
import os
import sys
import tempfile
import subprocess
def lift_over(chrom, pos, strand, map_chain):
lift_over_exec = ${lift_over_exec!r}
with tempfile.NamedTemporaryFile() as query:
q = '\t'.join(['chr' + chrom if not chrom.startswith('chr') else chrom, # chrom in .chain startswith `chr`
str(pos - 1), # 1-based to 0-based
str(pos), # end of 0-based half-open interval
'.', # ignored
'0', # ignored
strand]) + '\n'
query.write(q.encode())
query.seek(0)
with tempfile.NamedTemporaryFile() as mapped, tempfile.NamedTemporaryFile() as unmapped:
# $ ./liftOver <oldFile> <map.chain> <newFile> <unMapped>
cmd = [lift_over_exec, query.name, map_chain, mapped.name, unmapped.name]
subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr = subprocess.DEVNULL).communicate()[0]
mapped.seek(0)
for line in mapped:
record = line.decode().strip().split('\t')
return (record[0].replace('chr', ''), record[2], record[5])
def reverse_complement(x):
reverse_complement_map = str.maketrans('ATGC', 'TACG')
return x.translate(reverse_complement_map)
for line in sys.stdin:
if line.startswith('#'):
print(line.strip())
else:
record = line.strip().split('\t')
new = lift_over(record[0],
int(record[1]), # 1-based pos
'+', # vcf is always forward to reference
"${genotype!d}/${chain_file}.over.chain.gz")
if new:
new_chrom, new_pos, new_strand = new
ref = record[3] if new_strand == '+' else reverse_complement(record[3])
alt = record[4] if new_strand == '+' else reverse_complement(record[4])
# snp_id = record[2]
snp_id = '{}_{}_{}_{}'.format(new_chrom, new_pos, ref, alt)
print('\t'.join([new_chrom, new_pos, snp_id, ref, alt] + record[5:]).strip())
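For reference, the coordinate convention that `lift_over()` relies on can be sketched on its own. liftOver takes BED input, so a 1-based VCF position becomes a 0-based half-open single-base interval, and chain files expect the `chr` prefix. This is a hypothetical standalone helper, not part of the workflow:

```python
def vcf_pos_to_bed(chrom, pos):
    """Convert a 1-based VCF position to a single-base BED interval.

    Returns (chrom, start, end), 0-based half-open, with the 'chr'
    prefix that .chain files use.
    """
    chrom = chrom if chrom.startswith('chr') else 'chr' + chrom
    return (chrom, pos - 1, pos)
```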
[preprocess_1 (split)]
# split vcf by chroms
chrom = ['chr{}'.format(x + 1) for x in range(22)] + ['chrX'] # there is in fact no Y chrom
input: "${genotype!a}", for_each = ['chrom']
output: ["${cwd!a}/genotype_by_chrom/${input!bnn}.{}.vcf.gz".format(x) for x in chrom], group_by = 1
task: workdir = "${cwd!a}"
run:
tabix ${input} ${_chrom} --print-header | bgzip > ${_output}
[preprocess_2 (liftover)]
input: group_by = 1, pattern = "{name}.vcf.gz"
output: "${_name}.${chain_file}.vcf.gz"
task: workdir = cwd
bash:
zcat ${_input}| python /tmp/lift-over.py | gzip --best > ${_output}
%sosrun preprocess -J8
The issue with liftover is more complicated than I imagined:
https://www.biostars.org/p/266001/
Basically I see variants on one chromosome, after liftover, get moved to multiple chromosomes. An immediate issue is that this breaks the imputation pipeline, which requires one file per chromosome (fixable); a longer-term concern is that hg19 is not good to use after all.
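One way to quantify the problem is to tally, for each lifted per-chromosome VCF, where its records actually landed. A minimal sketch (the helper is hypothetical; it assumes the lifted records carry the new chromosome in column 1 without the `chr` prefix, as the lift-over script above produces):

```python
import gzip
from collections import Counter

def cross_chrom_counts(vcf_gz, expected_chrom):
    """Tally destination chromosomes in a lifted per-chromosome VCF.

    Returns (Counter of chrom -> record count,
             number of records that moved off expected_chrom).
    """
    counts = Counter()
    with gzip.open(vcf_gz, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue  # skip header lines
            counts[line.split('\t', 1)[0]] += 1
    moved = sum(n for c, n in counts.items() if c != expected_chrom)
    return counts, moved
```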
Imputation was done with the Michigan Imputation Server (MIS) because it uses the Haplotype Reference Consortium reference panel (32,914 samples), which is not otherwise publicly available. Here is how to prepare data for this service. The prepared files are then uploaded to the Michigan Imputation Server.
%preview img/UMichImputation.png
%preview img/UMichImputationResult.pdf
We want to compare the imputation results with the original data to address the following questions:
Below are answers to these points. [FIXME: to be completed]
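As a placeholder for that comparison, here is a minimal concordance sketch. The helper is hypothetical and assumes the hard genotype calls have already been parsed into `{site: genotype}` dicts for the samples shared between the two datasets:

```python
import math

def concordance(original, imputed):
    """Fraction of shared sites where the hard genotype calls agree.

    original, imputed: dicts mapping site ID -> genotype string, e.g. '0/1'.
    Returns NaN when the two call sets share no sites.
    """
    shared = set(original) & set(imputed)
    if not shared:
        return float('nan')
    agree = sum(original[s] == imputed[s] for s in shared)
    return agree / len(shared)
```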
%sessioninfo