We annotate genotype data by gene positions and extract cis-SNP. Data are extracted from PLINK
files and saved to HDF5 format.
[global]
rna_cnts = "~/Documents/GTEx/gtex7/rna-seq/GTEx_Data_2016-01-15_v7_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz"
cwd = "~/Documents/GTEx"
Code chunk below annotates genes to chromosomal positions. Genes are taken from the GTEx expression file.
%sosrun ensembl_annotation
[ensembl_annotation]
# Input is GTEx expression file with first column the ENSG IDs
depends: R_library("biomaRt")
input: rna_cnts
output: "${cwd}/h5_formatted/${input!bn}.annotation"
task: workdir = cwd
run:
zcat ${input} | cut -f 1 | tail -n+4 | cut -f 1 -d '.' > ${input!n}.gene_id
R:
values = unlist(read.table("${input!n}.gene_id", stringsAsFactors=FALSE))
ensembl = biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl", host="www.ensembl.org")
res = biomaRt::getBM(attributes = c("chromosome_name", "start_position", "end_position", "ensembl_gene_id", "hgnc_symbol"), filters = c("ensembl_gene_id"), values = values, mart = ensembl)
write.table(res, file = ${output!r}, row.names=FALSE, na= "<NA>", col.names = FALSE)
run:
cat ${output} | sed -e 's/"//g' | sort -g -o ${output}
rm -f ${input!n}.gene_id
Using position annotations previously obtained, we work though the genotype data GTEx7.Imputed.genotyped.filtered
, creating for each gene a table containing its cis-SNPs. Variants are first annotated to genes, then cis-SNPs to extract are defined as SNPs 1,000,000 bp up/downstream of the gene's TSS site. The outcome is a single analysis ready file in HDF5 format containing ~50K groups of genotype data (gene-names). Code chunk below implements the procedure.
%sosrun cis_groups_hdf5
[cis_groups_hdf5]
# Convert plink genotype to HDF5 in batches of genes
# Only autosomal genes are considered
# cis: ${cis} up/downstream from TSS of each gene
depends: Py_Module("pandas-plink"), Py_Module("dask"), "${cwd!a}/h5_formatted/${rna_cnts!bn}.annotation"
parameter: cis = 1000000
parameter: resume = 1
input: "${cwd}/genotype_plink/GTEx7.Imputed.genotyped.filtered.bed"
output: "${input!n}.cis.h5"
task: workdir = cwd
python:
import os
import pandas as pd, numpy as np
from pandas_plink import read_plink
keys = []
if os.path.isfile(${output!r}):
if ${resume} == 1:
print("loading existing database ${output!b}")
keys = pd.HDFStore(${output!r}).keys()
print("{} existing genes will be skipped".format(len(keys)))
else:
os.remove(${output!r})
(bim, fam, bed) = read_plink(${input!nr})
ann = [x.strip().split() for x in open("${cwd!a}/h5_formatted/${rna_cnts!bn}.annotation").readlines()]
n = len(ann)
empty_genes = []
for i, line in enumerate(ann):
if(i % 100 == 0):
print('[percent completed] %.2f' % ((float(i)/n)*100))
if line[0] not in list(map(str, [x+1 for x in range(22)])):
continue
if '/chr{}/{}'.format(line[0], line[3]) in keys:
continue
snps = bim.query("chrom == '{}' and (pos >= {} and pos <= {})".\
format(line[0], int(line[1]) - ${cis}, int(line[1]) + ${cis}))
if (snps.empty):
empty_genes.append(line[3])
continue
X = pd.DataFrame(2 - bed[snps.i,:].compute(),
index = [':'.join((x.split('_', 1)[0], y,z)) for x, y, z in zip(snps.snp, snps.a1, snps.a0)],
columns = fam.iid, dtype = np.uint8)
X.to_hdf(${output!r}, 'chr{}/{}'.format(line[0], line[3]), mode = 'a', complevel = 9, complib = 'zlib')
with open("${output!n}.empty_genes", 'w') as f:
f.write('\n'.join(empty_genes))
%sessioninfo