What we have¶
- Genotype data, filtered and phased, but not yet imputed
- RNA-Seq gene expression data for all sample tissues, available as both read counts and RPKM
- Sample phenotype / covariates
What we need to do¶
Genotype data¶
- We need to impute missing genotypes. We use the University of Michigan imputation server (minimac3) with the HRC reference panel
- Do some PLINK QC, including:
  - Keep variants with at least 10 samples carrying the minor allele (more stringent than a 1% MAF filter)
  - Genotype samples have to match RNA-Seq samples by ID
    - Apply the same ID matching to the RNA-Seq data when creating the HDF5 matrices
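The two QC rules above can be sketched in NumPy as follows. This is only an illustration, not the PLINK commands actually used: the function names, the variants-by-samples dosage layout (0/1/2 alternate-allele counts, `nan` for missing), and the choice to count a sample as a carrier when it has at least one copy of the minor allele are all assumptions.

```python
import numpy as np

def minor_allele_carrier_filter(dosages, min_carriers=10):
    """Return a boolean mask of variants with >= min_carriers minor-allele carriers.

    dosages: variants x samples array of alt-allele counts (0, 1, 2), np.nan if missing.
    (Hypothetical layout, chosen for this sketch.)
    """
    dosages = np.asarray(dosages, dtype=float)
    alt_freq = np.nanmean(dosages, axis=1) / 2.0
    minor_is_alt = alt_freq <= 0.5
    carries_alt = dosages > 0   # at least one alt allele (False for missing)
    carries_ref = dosages < 2   # at least one ref allele (False for missing)
    carriers = np.where(minor_is_alt[:, None], carries_alt, carries_ref)
    return carriers.sum(axis=1) >= min_carriers

def match_samples(geno_ids, expr_ids):
    """Return shared sample IDs plus their column indices in each data set."""
    geno_pos = {s: i for i, s in enumerate(geno_ids)}
    expr_pos = {s: i for i, s in enumerate(expr_ids)}
    shared = sorted(set(geno_pos) & set(expr_pos))
    return shared, [geno_pos[s] for s in shared], [expr_pos[s] for s in shared]
```

The index lists returned by `match_samples` can be used to subset and reorder the columns of both matrices so that they line up before writing to HDF5.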
RNA-Seq data¶
- Genes should have at least 10 samples with RPKM > 0.1 and read counts > 6
- Use quantile normalization
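A minimal sketch of the expression filter and quantile normalization, assuming genes-by-samples NumPy arrays. The function names are hypothetical, the filter assumes both thresholds must hold in the same sample, and ties within a sample are broken arbitrarily by sort order rather than averaged.

```python
import numpy as np

def expression_filter(rpkm, counts, min_samples=10, min_rpkm=0.1, min_count=6):
    """Boolean mask of genes expressed in at least min_samples samples.

    A sample counts toward a gene only if RPKM > min_rpkm AND read count > min_count
    (assumed reading of the rule above).
    """
    passing = (np.asarray(rpkm) > min_rpkm) & (np.asarray(counts) > min_count)
    return passing.sum(axis=1) >= min_samples

def quantile_normalize(mat):
    """Give every sample (column) the same distribution: the mean of the sorted columns."""
    mat = np.asarray(mat, dtype=float)
    order = np.argsort(mat, axis=0)
    ranks = np.argsort(order, axis=0)              # rank of each value within its column
    reference = np.sort(mat, axis=0).mean(axis=1)  # mean value at each rank
    return reference[ranks]
```

After normalization every column contains the same set of values, and only the within-sample ordering of genes differs between samples.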
Covariates¶
- 3 PCs, gender, genotyping platform, and PEER factors
- PEER factors are generated using the top 10000 expressed genes per tissue, after normalization.
  - The V6P guideline seems to vaguely imply using all genes from all tissues, but PEER does not seem able to handle data of that size (too slow to process).
  - The V6 guideline suggests using the top 10K expressed genes per tissue. Since we will remove these covariates via conventional multiple regression anyway, it is OK to correct for them this way.
- The number of PEER factors depends on sample size $N$:
  - $N < 150$: 15 PEER factors; $150 \le N < 250$: 30 PEER factors; $N \ge 250$: 35 PEER factors
- Regress out these factors separately for each tissue, and save the residuals to HDF5 as the Y for later analysis
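The sample-size rule and the residualization step can be sketched as below. This is an illustration under assumptions rather than the code used here: the function names are hypothetical, the expression matrix is assumed to be genes by samples, the covariate matrix (PCs, gender, platform, PEER factors) is assumed to be samples by covariates with categorical variables already numerically encoded, and an intercept is added. Running PEER itself and writing to HDF5 are not shown.

```python
import numpy as np

def n_peer_factors(n_samples):
    """Number of PEER factors for a tissue with n_samples samples."""
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    return 35

def regress_out(expr, covariates):
    """Return residuals of each gene after ordinary least squares on the covariates.

    expr: genes x samples; covariates: samples x k. Output has the shape of expr.
    """
    expr = np.asarray(expr, dtype=float)
    cov = np.asarray(covariates, dtype=float)
    design = np.column_stack([np.ones(cov.shape[0]), cov])  # intercept + covariates
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    return (expr.T - design @ coef).T
```

Calling `regress_out` once per tissue, on that tissue's normalized expression matrix and covariates, gives the residual matrix that would serve as Y.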
FastQTL analysis¶
Adapted from the Broad Institute's FastQTL wrapper tool.