Multivariate Bayesian variable selection regression

GTEx V7 eQTL data analysis procedure

See this page for the analysis outline of other releases.

Preprocessing

What we have

  • Genotype data, filtered and phased, but not yet imputed
  • RNA-Seq gene expression data for all sample tissues, available as both read counts and RPKM
  • Sample phenotype / covariates

What we need to do

Genotype data

  • We need to impute missing genotypes. We use the UMichigan Imputation Server (minimac3) with the HRC reference panel
  • Run PLINK QC, including
    • Keep variants with at least 10 samples having the minor allele (more stringent than a 1% MAF filter)
  • Genotype samples must match RNA-Seq samples by ID
    • The matching is applied to the RNA-Seq data when the HDF5 matrices are created
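
The minor-allele carrier filter and the ID matching above can be sketched in a few lines. This is only an illustration, not the PLINK-based procedure itself: the function names, the dosage encoding (0/1/2 alternate-allele counts with NaN for missing calls), and the variants-by-samples layout are all assumptions made for the example.

```python
import numpy as np

def keep_variants_by_minor_allele_carriers(dosages, min_carriers=10):
    """Boolean mask over variants (rows) of a variants-by-samples matrix.

    Assumed encoding: alternate-allele counts 0, 1, 2; np.nan for missing calls.
    """
    dosages = np.asarray(dosages, dtype=float)
    # Alternate-allele frequency per variant, ignoring missing calls.
    alt_freq = np.nanmean(dosages, axis=1) / 2.0
    # Number of samples carrying at least one alternate / one reference allele.
    # Comparisons with NaN are False, so missing calls are never counted.
    alt_carriers = np.sum(dosages > 0, axis=1)
    ref_carriers = np.sum(dosages < 2, axis=1)
    # The minor allele is the alternate allele when its frequency is <= 0.5.
    minor_carriers = np.where(alt_freq <= 0.5, alt_carriers, ref_carriers)
    return minor_carriers >= min_carriers

def shared_samples(genotype_ids, expression_ids):
    """IDs present in both data sets, in genotype order (hypothetical helper)."""
    expression_set = set(expression_ids)
    return [sample for sample in genotype_ids if sample in expression_set]
```

Counting carriers rather than alleles means a variant seen only in a few homozygous samples is still dropped, which is one way to read the "at least 10 samples" rule.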

RNA-Seq data

  • Genes should have at least 10 samples with RPKM > 0.1 and read counts > 6
  • Use quantile normalization
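
As a rough sketch of the two RNA-Seq steps above, assuming genes-by-samples NumPy matrices: the function names are hypothetical, the filter assumes both thresholds must hold in the same sample, and the normalization shown is the standard across-sample quantile normalization, which may differ in detail from what the pipeline actually runs.

```python
import numpy as np

def filter_genes(rpkm, counts, min_samples=10, min_rpkm=0.1, min_count=6):
    """Boolean mask over genes (rows) of genes-by-samples matrices.

    Assumption: a sample counts toward the threshold only if it passes
    both the RPKM and the read-count cutoff.
    """
    rpkm = np.asarray(rpkm, dtype=float)
    counts = np.asarray(counts, dtype=float)
    passing = (rpkm > min_rpkm) & (counts > min_count)
    return passing.sum(axis=1) >= min_samples

def quantile_normalize(x):
    """Force every sample (column) to share the same empirical distribution."""
    x = np.asarray(x, dtype=float)
    # Sort each column, remembering where each value came from.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_x.mean(axis=1)
    # Write the reference values back in each column's original order.
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out
```

Ties within a column are broken by position here rather than averaged, which keeps the sketch short; a production implementation would usually average tied ranks.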

Covariates

  • 3 PCs, gender, genotyping platform, and PEER factors
  • PEER factors are generated using the top 10000 expressed genes per tissue, after normalization.
    • The V6P guideline seems to imply using all genes from all tissues, but PEER appears unable to handle data of that size (it is too slow to process).
    • The V6 guideline suggests using the top 10K expressed genes per tissue. Since we will remove these covariates via conventional multiple regression anyway, it is OK to correct for them this way.
  • The number of PEER factors depends on the sample size $N$.
    • $N < 150$: use 15 PEER factors; $150 \le N < 250$: use 30 PEER factors; $N \ge 250$: use 35 PEER factors
  • Regress out these factors separately for each tissue, and save the residuals to HDF5 as the Y for later analysis
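
The sample-size rule and the per-tissue residualization above might look like the following minimal sketch. The function names are hypothetical, the covariate matrix is assumed to already hold the PCs, gender, platform, and PEER factors as numeric columns, and the HDF5 write is left out.

```python
import numpy as np

def num_peer_factors(n_samples):
    """Number of PEER factors for a tissue with n_samples samples."""
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    return 35

def regress_out(expression, covariates):
    """Residuals of expression after ordinary least squares on covariates.

    expression: (n_samples, n_genes); covariates: (n_samples, n_covariates).
    An intercept column is added, so the residuals are also mean-centered.
    """
    y = np.asarray(expression, dtype=float)
    x = np.asarray(covariates, dtype=float)
    design = np.column_stack([np.ones(x.shape[0]), x])
    # Solve the least-squares problem for all genes at once.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef
```

Calling `regress_out` once per tissue, with that tissue's own covariate matrix, gives the residual matrix that would serve as Y in the later analysis.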

FastQTL analysis

Adapted from the Broad FastQTL wrapper tool.


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago