Revised from old notebook 20170518_Imputation.ipynb
We use the UMich imputation server for this task. Its main limitations are that it cannot impute indels and that it does not handle hg38 coordinates well.
[global]
cwd = "~/Documents/GTExV8"
parameter: genotype = "${cwd!a}/genotypes/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.vcf.gz"
parameter: chain_file = "hg38ToHg19"
[plink: provides = executable('plink')]
# Download and extract plink
output: os.path.join(cwd, "utils/plink")
task: workdir = cwd
download:
https://www.cog-genomics.org/static/bin/plink170509/plink_linux_x86_64.zip
run:
unzip -o plink_linux_x86_64.zip plink
mv plink utils/plink
rm -f plink_linux_x86_64.zip
The code below calculates missing rates by locus and by individual. Plots are made to summarize the results.
[summary_1]
# make summary statistics of genotype data
depends: executable("plink")
input: "${genotype!a}"
output: "${cwd!a}/genotype_missing/${input!bnn}.imiss.gz", "${cwd!a}/genotype_missing/${input!bnn}.lmiss.gz"
task: workdir = "${cwd!a}"
run:
plink --vcf ${input} --missing gz --memory 30000 --threads 8 --out ${output[0]!nn}
[summary_2]
# make genotype summary statistics plots
depends: Py_Module("seaborn")
input: group_by = 1, pattern = "{name}.{ext}"
output: expand_pattern("{_name}.pdf")
task: workdir = "${cwd!a}"
run:
zcat ${_input} | awk '{print $NF}' > ${_input}.fmiss.tmp
python:
import os
import matplotlib.pyplot as plt, seaborn as sns, pandas as pd
fig, axs = plt.subplots(ncols=2)
data = pd.read_csv("${_input}.fmiss.tmp", sep = '\t')
sns.distplot(data["F_MISS"], ax = axs[0], kde = False)
sns.violinplot(data["F_MISS"], ax = axs[1])
axs[0].set_title("Missing rate summary")
fig.savefig(${_output!r})
os.remove("${_input}.fmiss.tmp")
[summary_3]
# Save summary data to PNG in local folder
input: group_by = 1
output: "img/${_input[0]!bn}.png"
run:
convert -density 200 ${_input} ${_output}
%sosrun summary -b ~/Documents/GTExV8/utils
%preview img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.imiss.png
%preview img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.lmiss.png
Missing rates are small, but imputation is still warranted.
The imputation server requires VCF input split by chromosome, each file with a complete header. The code chunk below prepares the input VCF files for the UMich Imputation Server.
[preprocess_0 (utils)]
parameter: lift_over_exec = "${cwd!a}/utils/liftOver"
report: workdir = cwd, output = '/tmp/lift-over.py'
import os
import sys
import tempfile
import subprocess
def lift_over(chrom, pos, strand, map_chain):
lift_over_exec = ${lift_over_exec!r}
with tempfile.NamedTemporaryFile() as query:
q = '\t'.join(['chr' + chrom if not chrom.startswith('chr') else chrom, # chrom in .chain startswith `chr`
str(pos - 1), # 1-based to 0-based
str(pos), # end of 0-based half-open interval
'.', # ignored
'0', # ignored
strand]) + '\n'
query.write(q.encode())
query.seek(0)
with tempfile.NamedTemporaryFile() as mapped, tempfile.NamedTemporaryFile() as unmapped:
# $ ./liftOver <oldFile> <map.chain> <newFile> <unMapped>
cmd = [lift_over_exec, query.name, map_chain, mapped.name, unmapped.name]
subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr = subprocess.DEVNULL).communicate()[0]
mapped.seek(0)
for line in mapped:
record = line.decode().strip().split('\t')
return (record[0].replace('chr', ''), record[2], record[5])
def reverse_complement(x):
reverse_complement_map = str.maketrans('ATGC', 'TACG')
return x.translate(reverse_complement_map)
for line in sys.stdin:
if line.startswith('#'):
print(line.strip())
else:
record = line.strip().split('\t')
new = lift_over(record[0],
int(record[1]), # 1-based pos
'+', # vcf is always forward to reference
"${genotype!d}/${chain_file}.over.chain.gz")
if new:
new_chrom, new_pos, new_strand = new
ref = record[3] if new_strand == '+' else reverse_complement(record[3])
alt = record[4] if new_strand == '+' else reverse_complement(record[4])
# snp_id = record[2]
snp_id = '{}_{}_{}_{}'.format(new_chrom, new_pos, ref, alt)
print('\t'.join([new_chrom, new_pos, snp_id, ref, alt] + record[5:]).strip())
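For reference, the coordinate convention that `lift_over()` relies on can be sketched on its own. liftOver takes BED input, so a 1-based VCF position becomes a 0-based half-open single-base interval, and chain files expect the `chr` prefix. This is a hypothetical standalone helper, not part of the workflow:

```python
def vcf_pos_to_bed(chrom, pos):
    """Convert a 1-based VCF position to a single-base BED interval.

    Returns (chrom, start, end), 0-based half-open, with the 'chr'
    prefix that .chain files use.
    """
    chrom = chrom if chrom.startswith('chr') else 'chr' + chrom
    return (chrom, pos - 1, pos)
```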
[preprocess_1 (split)]
# split vcf by chroms
chrom = ['chr{}'.format(x + 1) for x in range(22)] + ['chrX'] # there is in fact no Y chrom
input: "${genotype!a}", for_each = ['chrom']
output: ["${cwd!a}/genotype_by_chrom/${input!bnn}.{}.vcf.gz".format(x) for x in chrom], group_by = 1
task: workdir = "${cwd!a}"
run:
tabix ${input} ${_chrom} --print-header | bgzip > ${_output}
[preprocess_2 (liftover)]
input: group_by = 1, pattern = "{name}.vcf.gz"
output: "${_name}.${chain_file}.vcf.gz"
task: workdir = cwd
bash:
zcat ${_input}| python /tmp/lift-over.py | gzip --best > ${_output}
%sosrun preprocess -J8
The issue with liftover is more complicated than I imagined:
https://www.biostars.org/p/266001/
Basically I see variants on one chromosome, after liftover, get moved to multiple chromosomes. An immediate issue is that this breaks the imputation pipeline, which requires one file per chromosome (fixable); a longer-term concern is that hg19 is not good to use after all.
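One way to quantify the problem is to tally, for each lifted per-chromosome VCF, where its records actually landed. A minimal sketch (the helper is hypothetical; it assumes the lifted records carry the new chromosome in column 1 without the `chr` prefix, as the lift-over script above produces):

```python
import gzip
from collections import Counter

def cross_chrom_counts(vcf_gz, expected_chrom):
    """Tally destination chromosomes in a lifted per-chromosome VCF.

    Returns (Counter of chrom -> record count,
             number of records that moved off expected_chrom).
    """
    counts = Counter()
    with gzip.open(vcf_gz, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue  # skip header lines
            counts[line.split('\t', 1)[0]] += 1
    moved = sum(n for c, n in counts.items() if c != expected_chrom)
    return counts, moved
```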
Imputation was done with the Michigan Imputation Server (MIS) because it uses the Haplotype Reference Consortium reference panel (32,914 samples), which is not otherwise publicly available. Here is how to prepare data for this service. The prepared files are then uploaded to the Michigan Imputation Server.
%preview img/UMichImputation.png
%preview img/UMichImputationResult.pdf
We want to compare the imputation results with the original data to address the following questions:
Below are answers to these points. [FIXME: to be completed]
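As a placeholder for that comparison, here is a minimal concordance sketch. The helper is hypothetical and assumes the hard genotype calls have already been parsed into `{site: genotype}` dicts for the samples shared between the two datasets:

```python
import math

def concordance(original, imputed):
    """Fraction of shared sites where the hard genotype calls agree.

    original, imputed: dicts mapping site ID -> genotype string, e.g. '0/1'.
    Returns NaN when the two call sets share no sites.
    """
    shared = set(original) & set(imputed)
    if not shared:
        return float('nan')
    agree = sum(original[s] == imputed[s] for s in shared)
    return agree / len(shared)
```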
%sessioninfo