Extract cis-SNP¶

We annotate genotype data by gene positions and extract cis-SNP. Data are extracted from PLINK files and saved to HDF5 format.

[global]
rna_cnts = "~/Documents/GTEx/gtex7/rna-seq/GTEx_Data_2016-01-15_v7_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz"
cwd = "~/Documents/GTEx"

Annotate genes¶

Code chunk below annotates genes to chromosomal positions. Genes are taken from the GTEx expression file.

%sosrun ensembl_annotation
[ensembl_annotation]
# Input is GTEx expression file with first column the ENSG IDs
depends: R_library("biomaRt")
input: rna_cnts
output: "${cwd}/h5_formatted/${input!bn}.annotation"
task: workdir = cwd
run:
    zcat ${input} | cut -f 1 | tail -n+4 | cut -f 1 -d '.' > ${input!n}.gene_id

R:
    values = unlist(read.table("${input!n}.gene_id", stringsAsFactors=FALSE)) 
    ensembl = biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl", host="www.ensembl.org")
    res = biomaRt::getBM(attributes = c("chromosome_name", "start_position", "end_position", "ensembl_gene_id", "hgnc_symbol"), filters = c("ensembl_gene_id"), values = values,  mart = ensembl)
    write.table(res, file = ${output!r}, row.names=FALSE, na= "<NA>", col.names = FALSE)

run:
  cat ${output} | sed -e 's/"//g' | sort -g -o ${output}
  rm -f ${input!n}.gene_id

INFO: Workflow 1d1b23bf7101ebf3 has been completed.

Make cis-SNP tables¶

Using position annotations previously obtained, we work though the genotype data GTEx7.Imputed.genotyped.filtered, creating for each gene a table containing its cis-SNPs. Variants are first annotated to genes, then cis-SNPs to extract are defined as SNPs 1,000,000 bp up/downstream of the gene's TSS site. The outcome is a single analysis ready file in HDF5 format containing ~50K groups of genotype data (gene-names). Code chunk below implements the procedure.

%sosrun cis_groups_hdf5
[cis_groups_hdf5]
# Convert plink genotype to HDF5 in batches of genes
# Only autosomal genes are considered
# cis: ${cis} up/downstream from TSS of each gene
depends: Py_Module("pandas-plink"), Py_Module("dask"), "${cwd!a}/h5_formatted/${rna_cnts!bn}.annotation"
parameter: cis = 1000000
parameter: resume = 1
input: "${cwd}/genotype_plink/GTEx7.Imputed.genotyped.filtered.bed"
output: "${input!n}.cis.h5"
task: workdir = cwd
python:
    import os
    import pandas as pd, numpy as np
    from pandas_plink import read_plink

    keys = []
    if os.path.isfile(${output!r}):
       if ${resume} == 1:
          print("loading existing database ${output!b}")
          keys = pd.HDFStore(${output!r}).keys()
          print("{} existing genes will be skipped".format(len(keys)))
       else:
          os.remove(${output!r})

    (bim, fam, bed) = read_plink(${input!nr})
    ann = [x.strip().split() for x in open("${cwd!a}/h5_formatted/${rna_cnts!bn}.annotation").readlines()]
    n = len(ann)
    empty_genes = []
    for i, line in enumerate(ann):
        if(i % 100 == 0):
            print('[percent completed] %.2f' % ((float(i)/n)*100))
        if line[0] not in list(map(str, [x+1 for x in range(22)])):
            continue
        if '/chr{}/{}'.format(line[0], line[3]) in keys:
            continue
        snps = bim.query("chrom == '{}' and (pos >= {} and pos <= {})".\
                                format(line[0], int(line[1]) - ${cis}, int(line[1]) + ${cis}))
        if (snps.empty):
            empty_genes.append(line[3])
            continue
        X = pd.DataFrame(2 - bed[snps.i,:].compute(), 
                         index = [':'.join((x.split('_', 1)[0], y,z)) for x, y, z in zip(snps.snp, snps.a1, snps.a0)], 
                         columns = fam.iid, dtype = np.uint8)
        X.to_hdf(${output!r}, 'chr{}/{}'.format(line[0], line[3]), mode = 'a', complevel = 9, complib = 'zlib')
    with open("${output!n}.empty_genes", 'w') as f:
         f.write('\n'.join(empty_genes))

e5b3ea4e7149a45a2be10706e58f0a36

0 sec

%sessioninfo

0.9.8.10