RNA-Seq data preprocessing¶

This workflow includes data normalization and PEER factor analysis.

Update: on Aug 14 and Aug 20 2017 updated eQTL pipeline for V7 & 8 are released. Major changes involve normalization procedure (relevant here) and choice of number of PEER factors.

[global]
rna_rpkm = "~/Documents/GTEx/gtex7/rna-seq/GTEx_Data_2016-01-15_v7_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz"
rna_cnts = "~/Documents/GTEx/gtex7/rna-seq/GTEx_Data_2016-01-15_v7_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz"
genotype = "~/Documents/GTEx/gtex7/variant_calls/GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.PIR.vcf.gz"
sample_attr = "~/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_SampleAttributesDS.txt"
cwd = "~/Documents/GTEx"

Data file format conversion¶

Code chunk below is prototyping codes to convert RNA-seq file to HDF5 format. It is not needed when the normalization workflow is executed (next section) because normalization workflow performs format conversion on original data. But it is useful to have here, in case the original data will have to be processed separately.

[RNASeq_to_HDF5]
# Convert RNASeq data to HDF5
parameter: dtype = 'np.uint32'
output: "${input[0]!nb}.hdf5"
task: workdir = cwd
python:
    import pandas as pd
    import numpy as np
    import re, os
    def load_data(fdata, fsample, dtype = np.float32):
        '''First col of expression data is ENCODE gene name, 2nd col is HUGO name'''
        head = pd.read_csv(fdata, skiprows = 2, sep = '\t', nrows = 1)
        dt = {'Description': str, 'Name': str}
        dt.update({x: dtype for x in head.columns if x not in dt})
        data = pd.read_csv(fdata, compression='gzip', skiprows=2, 
                           index_col=0, header=0, dtype = dt, sep='\t').drop('Description', 1)
        samples = pd.read_csv(fsample, dtype=str, delimiter='\t', header=0)
        sample_dict = {}
        for row in samples[['SAMPID', 'SMTSD', 'SMAFRZE']].values:
            if row[2] == 'EXCLUDE':
                continue
            if row[1] not in sample_dict:
                sample_dict[row[1]] = []
            if row[0] in data.columns:
                sample_dict[row[1]].append(row[0])
        return data, dict((re.sub("[\W\d]+", "_", k.strip()).strip('_'), v) for k, v in sample_dict.items() if len(v))
    #
    data, sample = load_data(${input[0]!r}, ${input[1]!r}, dtype = ${dtype})
    data = {k: data.loc[:, sample[k]] for k in sample}
    if os.path.isfile(${output!r}):
        os.remove(${output!r})
    for k in data:
        data[k].to_hdf(${output!r}, k, mode = 'a', complevel = 9, complib = 'zlib')

Data file format conversion & normalization¶

Quantile normalization of RNA-seq data

input are rpkm file (for normalization), count file (for QC) and vcf file (for removing samples that do not have genotypes)
Expression values are quantile normalized to the average empirical distribution observed across samples
For each gene, expression values are inverse quantile normalized to a standard normal distribution across samples
- genes are selected based on expression thresholds of >0.1 RPKM in >10 samples and >5 reads in >10 samples

It results in 4 analysis ready expression data files in HDF5 format of different versions / organizations of the same information: emperical quantile normalized and standard normal quantile normalized, saved as flat tables or grouped by tissues. Additionally 2 original Count and RPKM files are converted to HDF5 format grouped by tissues.

See code chunk below for an implementation.

[rnaseq_1]
# Normalization and format conversion
parameter: rpkm_cutoff = 0.1
parameter: read_cutoff = 5
parameter: sample_cutoff = 10
input: rna_rpkm, rna_cnts, genotype, sample_attr
output: "${cwd!a}/rna_processed/${input[0]!nnb}.qnorm.std.flat.h5",
        "${cwd!a}/rna_processed/${input[0]!nnb}.qnorm.flat.h5", 
        "${cwd!a}/rna_processed/${input[0]!nnb}.qnorm.std.h5", 
        "${cwd!a}/rna_processed/${input[0]!nnb}.qnorm.h5",
        "${cwd!a}/rna_processed/${input[0]!nnb}.h5".replace('rpkm', 'reads'),
        "${cwd!a}/rna_processed/${input[0]!nnb}.h5"
task: workdir = cwd, queue = "mstephens", walltime = "20:00:00", cores = 1, mem = "90G"
python:

    # Adopted by Gao Wang from:
    # https://github.com/broadinstitute/gtex-pipeline
    # Originally authored by Francois Aguet

    import numpy as np
    import pandas as pd
    import gzip
    import subprocess
    import scipy.stats as stats
    import re, os

    def annotate_tissue_data(data, fsample):
        '''Save data to tissue specific tables'''
        samples = pd.read_csv(fsample, dtype=str, delimiter='\t', header=0)
        sample_dict = {}
        for row in samples[['SAMPID', 'SMTSD', 'SMAFRZE']].values:
            if row[2] == 'EXCLUDE':
                continue
            if row[1] not in sample_dict:
                sample_dict[row[1]] = []
            if row[0] in data.columns:
                sample_dict[row[1]].append(row[0])
        sample = dict((re.sub("[\W\d]+", "_", k.strip()).strip('_'), v) for k, v in sample_dict.items() if len(v))
        data = {k: data.loc[:, sample[k]] for k in sample}
        return data

    def write_per_tissue_data(data, output):
        if os.path.isfile(output):
            os.remove(output)
        for k in data:
            data[k].to_hdf(output, k, mode = 'a', complevel = 9, complib = 'zlib')

    def get_donors_from_vcf(vcfpath):
        """
        Extract donor IDs from VCF
        """
        with gzip.open(vcfpath) as vcf:
            for line in vcf:
                if line.decode()[:2]=='##': continue
                break
        return line.decode().strip().split('\t')[9:]

    def read_gct(gct_file, donor_ids, dtype):
        """
        Load GCT as DataFrame
        First col of expression data is ENCODE gene name, 2nd col is HUGO name
        ======================================================================
        A more memory friendly version:

        head = pd.read_csv(fdata, skiprows = 2, sep = '\t', nrows = 1)
        dt = {'Description': str, 'Name': str}
        dt.update({x: dtype for x in head.columns if x not in dt})
        data = pd.read_csv(fdata, compression='gzip', skiprows=2,
                           index_col=0, header=0, dtype = dt, sep='\t').drop('Description', 1)

        """
        df = pd.read_csv(gct_file, sep='\t', skiprows=2, index_col=0)
        df.drop('Description', axis=1, inplace=True)
        df.index.name = 'gene_id'
        return df[[i for i in df.columns if '-'.join(i.split('-')[:2]) in donor_ids]].astype(dtype, copy = True)

    def normalize_quantiles(M, inplace=False):
        """
        Note: replicates behavior of R function normalize.quantiles from library("preprocessCore")

        Reference:
         [1] Bolstad et al., Bioinformatics 19(2), pp. 185-193, 2003

        Adapted from https://github.com/andrewdyates/quantile_normalize
        """
        if not inplace:
            M = M.copy()

        Q = M.argsort(axis=0)
        m,n = M.shape

        # compute quantile vector
        quantiles = np.zeros(m)
        for i in range(n):
            quantiles += M[Q[:,i],i]
        quantiles = quantiles / n

        for i in range(n):
            # Get equivalence classes; unique values == 0
            dupes = np.zeros(m, dtype=np.int)
            for j in range(m-1):
                if M[Q[j,i],i]==M[Q[j+1,i],i]:
                    dupes[j+1] = dupes[j]+1

            # Replace column with quantile ranks
            M[Q[:,i],i] = quantiles

            # Average together equivalence classes
            j = m-1
            while j >= 0:
                if dupes[j] == 0:
                    j -= 1
                else:
                    idxs = Q[j-dupes[j]:j+1,i]
                    M[idxs,i] = np.median(M[idxs,i])
                    j -= 1 + dupes[j]
            assert j == -1

        if not inplace:
            return M

    def inverse_quantile_normalization(M):
        """
        After quantile normalization of samples, standardize expression of each gene
        """
        R = stats.mstats.rankdata(M,axis=1)  # ties are averaged
        Q = stats.norm.ppf(R/(M.shape[1]+1))
        return Q

    def normalize_expression(expression_df, counts_df, expression_threshold=0.1, count_threshold=5, min_samples=10, dtype = np.float32):
        """
        Genes are thresholded based on the following expression rules:
          >=min_samples with >expression_threshold expression values
          >=min_samples with >count_threshold read counts
        """
        # donor_ids = ['-'.join(i.split('-')[:2]) for i in expression_df.columns]
        donor_ids = expression_df.columns

        # expression thresholds
        mask = ((np.sum(expression_df>expression_threshold,axis=1)>=min_samples) & (np.sum(counts_df>count_threshold,axis=1)>=min_samples)).values

        # apply normalization
        M = normalize_quantiles(expression_df.loc[mask].values, inplace=False)
        R = inverse_quantile_normalization(M)

        quant_std_df = pd.DataFrame(data=R, columns=donor_ids, index=expression_df.loc[mask].index, dtype = dtype)
        quant_df = pd.DataFrame(data=M, columns=donor_ids, index=expression_df.loc[mask].index, dtype = dtype)
        return quant_std_df, quant_df

    class Environment:
        def __init__(self):
            self.expression_gct = ${input[0]!ar}
            self.counts_gct = ${input[1]!ar}
            self.vcf = ${input[2]!ar}
            self.attributes = ${input[3]!ar}
            self.prefix = ${input[0]!nnbr}
            self.output_dir = ${output[0]!dr}
            self.expression_threshold = ${rpkm_cutoff} 
            self.count_threshold = ${read_cutoff} 
            self.min_samples = ${sample_cutoff}

    args = Environment()
    print('Generating normalized expression files ... ', end='', flush=True)
    donor_ids = get_donors_from_vcf(args.vcf)
    expression_df = read_gct(args.expression_gct, donor_ids, np.float32)
    counts_df = read_gct(args.counts_gct, donor_ids, np.uint32)
    quant_std_df, quant_df = normalize_expression(expression_df, counts_df,
                                                  expression_threshold=args.expression_threshold, 
                                                  count_threshold=args.count_threshold, 
                                                  min_samples=args.min_samples)
    print('Save to HDF5 format, full matrix and per tissue data ...', end='', flush=True)
    prefix = os.path.join(args.output_dir, args.prefix)
    quant_std_per_tissue = annotate_tissue_data(quant_std_df, args.attributes)
    quant_per_tissue = annotate_tissue_data(quant_df, args.attributes)
    expression_per_tissue = annotate_tissue_data(expression_df, args.attributes)
    counts_per_tissue = annotate_tissue_data(counts_df, args.attributes)
    quant_std_df.to_hdf(prefix + ".qnorm.std.flat.h5",  'GTExV7', mode = 'w', complevel = 9, complib = 'zlib')
    quant_df.to_hdf(prefix + ".qnorm.flat.h5", 'GTExV7', mode = 'w', complevel = 9, complib = 'zlib')
    write_per_tissue_data(quant_per_tissue, prefix + ".qnorm.h5")
    write_per_tissue_data(quant_std_per_tissue, prefix + ".qnorm.std.h5")
    write_per_tissue_data(expression_per_tissue, prefix + ".h5")
    write_per_tissue_data(counts_per_tissue, prefix.replace('rpkm', 'reads') + ".h5")
    print('done.')

A glance at normalized expression data¶

Code chunk below shows how to load data in HDF5 to R, and how the information is organized. For example here are all data groups in the HDF5 file of standardized QN transformed expression data:

# source("http://bioconductor.org/biocLite.R")
# biocLite("rhdf5")
library(rhdf5)
fdata = '~/Documents/GTEx/rna-processed/GTEx_v7_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.qnorm.std.h5'
meta = h5ls(fdata)
groups = unique(meta$group)
groups = groups[which(groups != '/')]
groups

and here is preview of table /Lung:

mydata <- h5read(fdata, "/Lung")
str(mydata)

List of 4
 $ axis0        : chr [1:414(1d)] "GTEX-111CU-0326-SM-5GZXO" "GTEX-111FC-1126-SM-5GZWU" "GTEX-111VG-0726-SM-5GIDC" "GTEX-111YS-0626-SM-5GZXV" ...
 $ axis1        : chr [1:43624(1d)] "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
 $ block0_items : chr [1:414(1d)] "GTEX-111CU-0326-SM-5GZXO" "GTEX-111FC-1126-SM-5GZWU" "GTEX-111VG-0726-SM-5GIDC" "GTEX-111YS-0626-SM-5GZXV" ...
 $ block0_values: num [1:414, 1:43624] 0.181 0.012 -0.665 -0.739 0.377 ...

So /Lung has attribute axis0 for sample names, axis1 for gene names, and block0_values the 414 * 43624 data matrix. One can make t(block0_values) a separate matrix and set its rownames to axis1 and colnames to axis0.

PEER factor analysis¶

Code chunk below installs PEER R packages.

[peer]
output: os.path.join(cwd, "R_peer_source_1.3.tgz")
task: workdir = cwd
download:
    https://github.com/downloads/PMBio/peer/R_peer_source_1.3.tgz
run:
    R CMD INSTALL R_peer_source_1.3.tgz
    rm -f R_peer_source_1.3.tgz

PEER factor analysis package has a number of configuable parameters. For this analysis I use default values hard-coded into the script (see code chunk below, step rnaseq_2). Therefore these settings cannot be configured from input parameter though it is straightforward to implement it otherwise.

For every tissue I compute PEER factor using the top 10,000 expressed genes. It takes 1hr to 3hrs to complete each tissue.

[rnaseq_2]
# PEER analysis
depends: R_library('rhdf5'), R_library('peer')
parameter: tissues = get_output("h5ls ${input[2]} | awk '{print $1}'").strip().split('\n')
input: for_each = ['tissues']
output: "${cwd!a}/peer_analysis/${_tissues}_PEER_covariates.txt", 
        "${cwd!a}/peer_analysis/${_tissues}_PEER_alpha.txt", 
        "${cwd!a}/peer_analysis/${_tissues}_PEER_residuals.txt"
task: workdir = cwd, walltime = "30:00:00", cores = 1, mem = "8G", trunk_size=1, trunk_workers=1
R:
    alphaprior_a=0.001
    alphaprior_b=0.01
    epsprior_a=0.1
    epsprior_b=10
    max_iter=1000
    use_genes = 10000
    expr.h5 = ${input[2]!r}

    library(peer, quietly=TRUE)  # https://github.com/PMBio/peer
    library(rhdf5, quietly=TRUE)

    WriteTable <- function(data, filename, index.name) {
        datafile <- file(filename, open = "wt")
        on.exit(close(datafile))
        header <- c(index.name, colnames(data))
        writeLines(paste0(header, collapse="\t"), con=datafile, sep="\n")
        write.table(data, datafile, sep="\t", col.names=FALSE, quote=FALSE)
    }

    loadTable <- function(filename, group, auto_transpose = FALSE) {
      obj <- h5read(filename, group)
      dat <- obj$block0_values
      rownames(dat) <- obj$axis0
      colnames(dat) <- obj$axis1
      if (ncol(dat) > nrow(dat) && auto_transpose) dat <- t(dat)
      return(dat)
    }

    getNumPeer <- function(ss) {
      if (ss<150) return (min(15, ceiling(ss / 5)))
      else if (ss >=150 && ss < 250) return(30)
      else return(35)
    }

    whichpart <- function(x, n) {
      nx <- length(x)
      p <- nx-n
      xp <- sort(x, partial=p)[p]
      which(x > xp)
    }

    getTopGenes <- function(gmat, num = 1000) {
        if (ncol(M) <= num) {
            return(gmat)
        } else {
            top.index <- whichpart(total.expr <- colSums(gmat), num)
            return(gmat[,top.index])
        }
    }

    cat("PEER: loading expression data ... ")
    # rows are number of samples. columns are number of genes
    M <- as.matrix(loadTable(expr.h5, "/${_tissues}"))
    M <- getTopGenes(M, use_genes)
    n = getNumPeer(nrow(M))
    cat("done.\n")

    # run PEER
    cat(paste0("PEER: estimating hidden confounders (", n, " for tissue ", ${_tissues!r} , ")\n"))
    model <- PEER()
    invisible(PEER_setNk(model, n))
    invisible(PEER_setPhenoMean(model, M))
    invisible(PEER_setPriorAlpha(model, alphaprior_a, alphaprior_b))
    invisible(PEER_setPriorEps(model,epsprior_a, epsprior_b))
    invisible(PEER_setNmax_iterations(model, max_iter))
    # if(!is.null(covs)) {
    #   invisible(PEER_setCovariates(model, covs))
    # }
    time <- system.time(PEER_update(model))

    X <- PEER_getX(model)  # samples x PEER factors
    A <- PEER_getAlpha(model)  # PEER factors x 1
    R <- t(PEER_getResiduals(model))  # genes x samples

    # add relevant row/column names
    c <- paste0("InferredCov",1:ncol(X))
    rownames(X) <- rownames(M)
    colnames(X) <- c
    rownames(A) <- c
    colnames(A) <- "Alpha"
    A <- as.data.frame(A)
    A$Relevance <- 1.0 / A$Alpha
    rownames(R) <- colnames(M)
    colnames(R) <- rownames(M)

    # write results
    cat("PEER: writing results ... ")
    WriteTable(t(X), ${_output[0]!r}, "ID")  # format(X, digits=6)
    WriteTable(A, ${_output[1]!r}, "ID")
    WriteTable(R, ${_output[2]!r}, "ID")
    cat("done.\n")

Execute workflows¶

See execution cells below.

%sosrun rnaseq

INFO: 9b995fe7c3b980d751892e558f116c10 received dict_keys(['/home/gaow/Documents/GTEx/rn...

%sessioninfo

0.9.8.10

ir

R

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: BunsenLabs GNU/Linux 8.7 (Hydrogen)

Matrix products: default
BLAS: /usr/lib64/microsoft-r/3.4/lib64/R/lib/libRblas.so
LAPACK: /usr/lib/libopenblasp-r0.2.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
 [1] compiler_3.4.0      R6_2.2.0            magrittr_1.5       
 [4] RevoUtils_10.0.4    IRdisplay_0.4.4     pbdZMQ_0.2-5       
 [7] tools_3.4.0         crayon_1.3.2        uuid_0.1-2         
[10] stringi_1.1.5       IRkernel_0.8.7.9000 jsonlite_1.4       
[13] stringr_1.2.0       digest_0.6.12       repr_0.12.0        
[16] evaluate_0.10