GTEx V8 genotypes¶

Save V8 genotype data to HDF5 for association analysis.

Genotype here already have been imputed and filtered for MAF > 1%. I only need to convert VCF to HDF5 format for per-gene data.

[global]
cwd = "~/Documents/GTExV8"
parameter: genotype =  "${cwd!a}/genotypes/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze_MAF001_GTonly.vcf.gz"

Convert data to PLINK format¶

%sosrun vcf_to_plink -b ~/Documents/GTExV8/utils
[vcf_to_plink_1]
# VCF to plink rough conversion
missing_geno_filter = 0.05
depends: executable("plink")
input: genotype
output: "${input!nn}.bed"
task: workdir = cwd
run:
  plink --vcf ${input} --a1-allele <(zcat ${input}) 4 3 '#' --out ${output!n} --make-bed \
        --geno ${missing_geno_filter}

[vcf_to_plink_2]
# Fix multi-allelic sites
# By replacing ID
output: "${input!n}.bim"
task: workdir = cwd
python:
    from collections import Counter
    counter = Counter()
    lines = [x.strip().split() for x in open(${output!r}).readlines()]
    for idx, item in enumerate(lines):
        if not lines[idx][1].endswith('b38'):
            counter.update({lines[idx][1]: 1})
            lines[idx][1] = (lines[idx][1] + '_' + str(counter[lines[idx][1]])) if counter[lines[idx][1]] else lines[idx][1]
    with open(${output!r}, 'w') as f:
        f.write('\n'.join(['\t'.join(x) for x in lines]))

Create cisSNP genotype data files¶

This is largely copied from 20170530_CisSNP.ipynb, with fixes on string coding -- as it now assumes the first allele in bim file in fact is the reference allele. Previously it was not the case. In the current version the PLINK/BED file I generate uses reference allele in VCF as allele 1.

Annotate genes¶

Code chunk below annotates genes to chromosomal positions. Genes are taken from the GTEx expression file.

%sosrun ensembl_annotation
[ensembl_annotation]
# Input is GTEx expression file with first column the ENSG IDs
depends: R_library("biomaRt")
#hg_version = "grch37."
hg_version = '' # use version b38
input: "${cwd!a}/rna_seq/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz"
output: "${cwd!a}/${input!bn}.annotation"
task: workdir = cwd
bash:
    zcat ${input} | cut -f 1 | tail -n+4 | cut -f 1 -d '.' > ${input!n}.gene_id

R:
    values = unlist(read.table("${input!n}.gene_id", stringsAsFactors=FALSE)) 
    ensembl = biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl", host="www.${hg_version}ensembl.org")
    res = biomaRt::getBM(attributes = c("chromosome_name", "start_position", "end_position", "ensembl_gene_id", "hgnc_symbol"), filters = c("ensembl_gene_id"), values = values,  mart = ensembl)
    write.table(res, file = ${output!r}, row.names=FALSE, na= "<NA>", col.names = FALSE)

bash:
    cat ${output} | sed -e 's/"//g' | sort -g -o ${output}
    rm -f ${input!n}.gene_id

Make cis-SNP genotype tables¶

Using position annotations previously obtained, we work though the genotype data in PLINK format, , creating for each gene a table containing its cis-SNPs. Variants are first annotated to genes, then cis-SNPs to extract are defined as SNPs 1,000,000 bp up/downstream of the gene's TSS site. The outcome is a single analysis ready file in HDF5 format containing ~50K groups of genotype data (gene-names). Code chunk below implements the procedure.

%sosrun cis_groups_hdf5
[cis_groups_hdf5]
# Convert plink genotype to HDF5 in batches of genes
# Only autosomal genes are considered
# cis: ${cis} up/downstream from TSS of each gene
depends: Py_Module("pandas-plink"), Py_Module("dask"), 
    "${cwd!a}/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.annotation"
parameter: cis = 1000000
parameter: resume = 1
input: "${genotype!nn}.bed"
output: "${cwd!a}/GTExV8.genotype.cis.h5"
#task: workdir = cwd
python:
    import os
    import pandas as pd, numpy as np
    from pandas_plink import read_plink
    keys = []
    if os.path.isfile(${output!r}):
       if ${resume} == 1:
          print("loading existing database ${output!b}")
          keys = pd.HDFStore(${output!r}).keys()
          print("{} existing genes will be skipped".format(len(keys)))
       else:
          os.remove(${output!r})

    (bim, fam, bed) = read_plink(${input!nr})
    ann = [x.strip().split() for x in open("${cwd!a}/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.annotation").readlines()]
    n = len(ann)
    empty_genes = []
    for i, line in enumerate(ann):
        if(i % 100 == 0):
            print('[percent completed] %.2f' % ((float(i)/n)*100))
        if line[0] not in list(map(str, [x+1 for x in range(22)])):
            continue
        if '/chr{}/{}'.format(line[0], line[3]) in keys:
            continue
        snps = bim.query("chrom == '{}' and (pos >= {} and pos <= {})".\
                                format(line[0], int(line[1]) - ${cis}, int(line[1]) + ${cis}))
        if (snps.empty):
            empty_genes.append(line[3])
            continue
        X = pd.DataFrame(bed[snps.i.tolist(),:].compute(), 
                         index = snps.snp.tolist(), 
                         columns = fam.iid.tolist(), dtype = np.float16).T
        X = X.fillna(X.mean()).T
        X.to_hdf(${output!r}, 'chr{}/{}'.format(line[0], line[3]), mode = 'a', 
                 complevel = 9, complib = 'zlib')
    if len(empty_genes):
        with open("${output!n}.log", 'w') as f:
            f.write('# Empty genes\n' + '\n'.join(empty_genes))