Multivariate Bayesian variable selection regression

GTEx V8 genotype data imputation

Revised from old notebook 20170518_Imputation.ipynb.

We use the UMich imputation server for this task. Its main limitations are that it cannot call indels and that it does not handle hg38 coordinates well.

In [ ]:
[global]
cwd = "~/Documents/GTExV8"
parameter: genotype = "${cwd!a}/genotypes/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.vcf.gz"
parameter: chain_file = "hg38ToHg19"


[plink: provides = executable('plink')]
# Download and extract plink
output: os.path.join(cwd, "utils/plink")
task: workdir = cwd
download:
  https://www.cog-genomics.org/static/bin/plink170509/plink_linux_x86_64.zip
run:
  unzip -o plink_linux_x86_64.zip plink
  mv plink ${output}
  rm -f plink_linux_x86_64.zip

Genotype missingness summary

The code below calculates missing rates by locus and by individual. Plots are made to summarize the results.

In [ ]:
[summary_1]
# make summary statistics of genotype data
depends: executable("plink")
input: "${genotype!a}"
output: "${cwd!a}/genotype_missing/${input!bnn}.imiss.gz", "${cwd!a}/genotype_missing/${input!bnn}.lmiss.gz"
task: workdir = "${cwd!a}"
run:
    plink --vcf ${input} --missing gz --memory 30000 --threads 8 --out ${output[0]!nn}

[summary_2]
# making genotype summary statistics plot
depends: Py_Module("seaborn")
input: group_by = 1, pattern = "{name}.{ext}"
output: expand_pattern("{_name}.pdf")
task: workdir = "${cwd!a}"
run:
    zcat ${_input} | awk '{print $NF}' > ${_input}.fmiss.tmp
python:
    import os
    import matplotlib.pyplot as plt, seaborn as sns, pandas as pd
    fig, axs = plt.subplots(ncols=2)
    data = pd.read_csv("${_input}.fmiss.tmp", sep = '\t')
    sns.distplot(data["F_MISS"], ax = axs[0], kde = False)
    sns.violinplot(data["F_MISS"], ax = axs[1])
    axs[0].set_title("Missing rate summary")
    fig.savefig(${_output!r})
    os.remove("${_input}.fmiss.tmp")

[summary_3]
# Save summary data to PNG in local folder
input: group_by = 1
output: "img/${_input[0]!bn}.png"
run:
    convert -density 200 ${_input} ${_output}
In [1]:
%sosrun summary -b ~/Documents/GTExV8/utils
Keyboard Interrupt
In [3]:
%preview img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.imiss.pdf
%preview img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.lmiss.pdf
> img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.lmiss.pdf (14.6 KiB):
> img/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.imiss.pdf (13.9 KiB):

Missing rates are small but still warrant imputation.
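To make "small" concrete, the F_MISS column produced by `plink --missing` can be tallied against a threshold. Below is a hypothetical helper, `count_high_missing`, sketched on toy values rather than the actual `.lmiss`/`.imiss` files:

```python
def count_high_missing(fmiss, threshold=0.05):
    """Return (n_total, n_above) for a sequence of F_MISS values.

    `fmiss` is any iterable of missing-rate values (strings or floats);
    `n_above` counts entries strictly above `threshold`.
    """
    vals = [float(x) for x in fmiss]
    n_above = sum(v > threshold for v in vals)
    return len(vals), n_above

# Toy example: 2 of 4 loci exceed 5% missingness
total, above = count_high_missing([0.0, 0.01, 0.06, 0.2])
```

In practice the same computation would run over the F_MISS column extracted by the `awk` step above.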

Data processing for imputation server

The imputation server requires VCF input split by chromosome, each file with a complete header. The code chunk below prepares the input VCF files for the UMich Imputation Server.

In [2]:
[preprocess_0 (utils)]
parameter: lift_over_exec = "${cwd!a}/utils/liftOver"
report: workdir = cwd, output = '/tmp/lift-over.py'
import os
import sys
import tempfile
import subprocess


def lift_over(chrom, pos, strand, map_chain):
    lift_over_exec = ${lift_over_exec!r}

    with tempfile.NamedTemporaryFile() as query:
        q = '\t'.join(['chr' + chrom if not chrom.startswith('chr') else chrom,  # chrom in .chain startswith `chr`
                       str(pos - 1),  # BED start: 1-based pos to 0-based
                       str(pos),      # BED end (half-open interval)
                       '.',           # ignored
                       '0',           # ignored
                       strand]) + '\n'
        query.write(q.encode())
        query.seek(0)

        with tempfile.NamedTemporaryFile() as mapped, tempfile.NamedTemporaryFile() as unmapped:
            # $ ./liftOver <oldFile> <map.chain> <newFile> <unMapped>
            cmd = [lift_over_exec, query.name, map_chain, mapped.name, unmapped.name]
            subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr = subprocess.DEVNULL).communicate()[0]
            mapped.seek(0)
            for line in mapped:
                record = line.decode().strip().split('\t')
                return (record[0].replace('chr', ''), record[2], record[5])

def reverse_complement(x):
    reverse_complement_map = str.maketrans('ATGC', 'TACG')
    return x.translate(reverse_complement_map)

for line in sys.stdin:
    if line.startswith('#'):
        print(line.strip())
    else:
        record = line.strip().split('\t')
        new = lift_over(record[0],
                        int(record[1]),  # 1-based pos
                        '+',             # vcf is always forward to reference
                        "${genotype!d}/${chain_file}.over.chain.gz")
        if new:
            new_chrom, new_pos, new_strand = new
            ref = record[3] if new_strand == '+' else reverse_complement(record[3])
            alt = record[4] if new_strand == '+' else reverse_complement(record[4])
            # snp_id = record[2]
            snp_id = '{}_{}_{}_{}'.format(new_chrom, new_pos, ref, alt)
            print('\t'.join([new_chrom, new_pos, snp_id, ref, alt] + record[5:]).strip())


[preprocess_1 (split)]
# split vcf by chroms
chrom = ['chr{}'.format(x + 1) for x in range(22)] + ['chrX'] # there is in fact no Y chrom
input: "${genotype!a}", for_each = ['chrom']
output: ["${cwd!a}/genotype_by_chrom/${input!bnn}.{}.vcf.gz".format(x) for x in chrom], group_by = 1
task: workdir = "${cwd!a}"
run:
  tabix ${input} ${_chrom} --print-header | bgzip > ${_output}

[preprocess_2 (liftover)]
input: group_by = 1, pattern = "{name}.vcf.gz"
output: "${_name}.${chain_file}.vcf.gz"
task: workdir = cwd
bash:
    zcat ${_input}| python /tmp/lift-over.py | gzip --best > ${_output}
In [ ]:
%sosrun preprocess -J8

The issue with liftover is more complicated than I imagined:

https://www.biostars.org/p/266001/

Basically I see variants on one chromosome get moved to multiple chromosomes after liftover. An immediate issue is that this breaks the imputation pipeline, which requires per-chromosome files (fixable); a longer-term concern is that hg19 is not a good target to use after all.
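One way to quantify the problem is to tally, for each original chromosome, how many variants landed on a different chromosome after liftover. The helper below is a hypothetical sketch operating on (original, lifted) chromosome pairs, which in practice would be extracted from the pre- and post-liftover VCFs:

```python
from collections import Counter

def cross_chrom_moves(pairs):
    """Count variants whose lifted-over chromosome differs from the original.

    `pairs` is an iterable of (orig_chrom, lifted_chrom) tuples;
    returns a Counter keyed by (orig, lifted), mismatches only.
    """
    return Counter((o, l) for o, l in pairs if o != l)

# Toy example: one chr1 variant lifted over to chr5
moves = cross_chrom_moves([('chr1', 'chr1'), ('chr1', 'chr5'), ('chr2', 'chr2')])
```

Any nonzero entries would have to be rerouted (or dropped) before per-chromosome upload to the imputation server.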

Imputation

Imputation was done with the Michigan Imputation Server (MIS) because it uses the Haplotype Reference Consortium (32,914 samples) reference panel, which is not otherwise publicly available. Here is how to prepare data for this service. The prepared files are then uploaded to the Michigan Imputation Server.

Configuration

In [2]:
%preview img/UMichImputation.png
> img/UMichImputation.png (168.0 KiB):

Outcome

In [3]:
%preview img/UMichImputationResult.pdf
> img/UMichImputationResult.pdf (256.5 KiB):

Imputation result summary

We want to compare the imputation result with the original data to address the following questions:

  1. Overlap between original and imputed data, focusing on why some sites are filtered out
  2. Filter imputed data to remove all purely imputed sites; also filter by MAF
  3. Filter out data with >5% missingness in the original
  4. Are genotype predictions made for genotypes that are already called?
  5. Summarize the uncertainty in imputed SNPs

Below are answers to these points. [FIXME: to be completed]
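For point 1, the overlap can be summarized by keying variants as `chrom_pos_ref_alt` (the same ID scheme used in the liftover step above) and set-intersecting the two call sets. The helper below is a hypothetical sketch on toy IDs, not a run against the actual data:

```python
def compare_sites(original, imputed):
    """Summarize overlap between two collections of variant keys (chrom_pos_ref_alt)."""
    original, imputed = set(original), set(imputed)
    return {
        'shared': len(original & imputed),    # present in both call sets
        'dropped': len(original - imputed),   # filtered out by the server
        'novel': len(imputed - original),     # imputed-only sites
    }

# Toy example with one shared, one dropped, and one novel site
summary = compare_sites(['1_100_A_G', '1_200_C_T'],
                        ['1_100_A_G', '1_300_G_A'])
```

The `dropped` set is the one to inspect for why the server filtered those sites.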

In [1]:
%sessioninfo

SoS

SoS Version
0.9.8.10

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago