GTEx V7 eQTL data analysis procedure¶

See this page for an outline of the analysis for other releases.

Preprocessing¶

Genotype data, filtered and phased but not yet imputed
RNA-Seq gene expression data for all sample tissues, available as both counts and RPKM
Sample phenotype / covariates

We need to impute missing genotypes. We use the University of Michigan imputation server (minimac3) with the HRC reference panel.
Do some PLINK QC, including
- Keep variants with at least 10 samples having the minor allele (more stringent than a 1% MAF filter)
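The carrier-count rule above can be sketched in plain Python. This is only an illustration of the criterion, not the PLINK command used in the analysis; the function names and the 0/1/2 dosage encoding with `None` for missing calls are assumptions made for the example.

```python
def count_minor_allele_carriers(dosages):
    """Count samples carrying at least one copy of the minor allele.

    `dosages` holds alternate-allele counts (0, 1 or 2) per sample, with
    None for a missing call (an encoding assumed for this sketch).
    """
    called = [d for d in dosages if d is not None]
    if not called:
        return 0
    alt_freq = sum(called) / (2 * len(called))
    if alt_freq <= 0.5:
        # Alternate allele is the minor one: carriers have dosage 1 or 2.
        return sum(1 for d in called if d > 0)
    # Reference allele is the minor one: carriers have dosage 0 or 1.
    return sum(1 for d in called if d < 2)


def keep_variant(dosages, min_carriers=10):
    """Apply the 'at least 10 samples with the minor allele' rule."""
    return count_minor_allele_carriers(dosages) >= min_carriers


# Toy variants for 100 samples (made-up data).
print(keep_variant([1] * 10 + [0] * 90))  # 10 carriers
print(keep_variant([1] * 9 + [0] * 91))   # 9 carriers
```

With 10 heterozygous carriers among $N$ samples the MAF is $10 / (2N)$, so the smallest MAF that passes this rule exceeds 1% whenever $N < 500$.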
Genotype samples must match RNA-Seq samples by ID
- Apply the same ID matching to the RNA-Seq data when creating the HDF5 matrices
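A minimal sketch of the ID matching, assuming that genotype IDs are donor-level (e.g. `GTEX-AAAA`) and that RNA-Seq sample IDs extend the donor ID with further dash-separated fields. The IDs below are made up, and `donor_id` and `match_samples` are hypothetical helper names.

```python
def donor_id(sample_id):
    """Reduce an ID to its first two dash-separated fields (assumed donor ID)."""
    return "-".join(sample_id.split("-")[:2])


def match_samples(genotype_ids, rnaseq_ids):
    """Return (genotype_id, rnaseq_id) pairs sharing a donor, in genotype order."""
    rna_by_donor = {}
    for rid in rnaseq_ids:
        # Keep the first RNA-Seq sample seen for each donor.
        rna_by_donor.setdefault(donor_id(rid), rid)
    return [
        (gid, rna_by_donor[donor_id(gid)])
        for gid in genotype_ids
        if donor_id(gid) in rna_by_donor
    ]


genotype_ids = ["GTEX-AAAA", "GTEX-BBBB", "GTEX-CCCC"]
rnaseq_ids = ["GTEX-CCCC-0001-SM-XXXX", "GTEX-AAAA-0002-SM-YYYY"]
pairs = match_samples(genotype_ids, rnaseq_ids)
print(pairs)
```

Samples without a partner on the other side (here `GTEX-BBBB`) are dropped, so the genotype and expression matrices end up with the same samples in the same order.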

Covariates to correct for: 3 PCs, gender, genotyping platform, and PEER factors
PEER factors are generated using the top 10000 expressed genes per tissue, after normalization.
- The V6P guideline seems to imply, somewhat vaguely, using all genes from all tissues. But PEER does not seem able to handle data of that size (too slow to process).
- The V6 guideline suggests using the top 10K expressed genes per tissue. Since we will remove these covariates via conventional multiple regression anyway, it is OK to correct for them this way.
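Selecting the genes that feed into PEER might look like the sketch below. Ranking genes by their mean expression across samples is an assumption here (the text only says "top expressed"), and `top_expressed_genes` is a hypothetical helper name.

```python
def top_expressed_genes(expression, n_top=10000):
    """Return the n_top genes with the highest mean expression.

    `expression` maps a gene ID to its list of per-sample values for one
    tissue. Ranking by the mean is an assumption of this sketch.
    """
    means = {gene: sum(vals) / len(vals) for gene, vals in expression.items()}
    return sorted(means, key=means.get, reverse=True)[:n_top]


# Toy data for one tissue (made-up values), keeping the top 2 genes.
toy = {
    "geneA": [5.0, 7.0],
    "geneB": [0.1, 0.3],
    "geneC": [9.0, 11.0],
}
print(top_expressed_genes(toy, n_top=2))
```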
The number of PEER factors depends on the sample size $N$.
- $N < 150$: 15 PEER factors; $150 \le N < 250$: 30 PEER factors; $N \ge 250$: 35 PEER factors
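The sample-size thresholds translate directly into a small lookup function. The name `n_peer_factors` is a hypothetical helper, not part of PEER.

```python
def n_peer_factors(n_samples):
    """Number of PEER factors to use for a tissue with n_samples samples."""
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    return 35


for n in (100, 150, 249, 250):
    print(n, n_peer_factors(n))
```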
Regress out these factors separately for each tissue, and save the residuals to HDF5 as the response $Y$ for later analysis
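Regressing out covariates amounts to an ordinary least-squares fit of each gene on the covariate matrix, keeping the residuals. The sketch below uses NumPy on simulated data for a single tissue. The intercept column, the function name `regress_out`, and the array shapes are assumptions, and writing the result to HDF5 is not shown.

```python
import numpy as np


def regress_out(Y, C):
    """Return residuals of Y after least-squares regression on covariates C.

    Y: (samples x genes) expression matrix for one tissue.
    C: (samples x covariates) matrix, e.g. PCs, gender, platform, PEER factors.
    An intercept column is added (an assumption of this sketch).
    """
    X = np.column_stack([np.ones(C.shape[0]), C])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ beta


# Simulated data: 50 samples, 4 covariates, 3 genes.
rng = np.random.default_rng(0)
C = rng.normal(size=(50, 4))
Y = C @ rng.normal(size=(4, 3)) + rng.normal(size=(50, 3))
R = regress_out(Y, C)
print(R.shape)
```

By construction the residuals are orthogonal to every covariate column and have zero mean per gene, which is a quick sanity check to run before saving them.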