This includes the input max Z-scores from univariate analysis and the training data, both before and after LD pruning.
This is in part a response to the reviewer's request that we use an independent subset of SNPs to fit the mash model and analyze, and compare with results that ignore LD (as done in Urbut 2017).
The required data was previously extracted from GTEx V6 and is stored on midway:
/project/mstephens/data/internal_supp/gtex-v6-sumstat-hdf5/MatrixEQTLSumStats.Portable.h5
It is a 58MB file containing only the `max` and `null` summary statistics. For convenience I copy it to my local computer and process it from there (see cell below).
[global]
cwd = '~/Documents/GTEx/mash_revision'
input_db = "${cwd!a}/MatrixEQTLSumStats.Portable.h5"
%sosrun get_input
[get_input: provides = input_db]
output: input_db
task:
run:
rsync -auzP mw:/project/mstephens/data/internal_supp/gtex-v6-sumstat-hdf5/MatrixEQTLSumStats.Portable.h5 ${input_db}
The cell below loads the data, computes z-scores, creates a training set (and its correlation estimate $\hat{V}$), and saves everything to an RDS file for use with mash analysis. The parameter `exclude_list` specifies the path to a file of row names to exclude during extraction: by default all SNPs are kept, but if a list is provided (e.g., the SNPs removed by LD pruning) those SNPs are dropped and results are extracted only for the remaining SNPs.
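For reference, the p-value to signed z-score conversion performed in this step can be sketched in Python, with scipy's `norm.ppf` playing the role of R's `qnorm`. The helper `p_to_z` below is illustrative, not part of the pipeline:

```python
import numpy as np
from scipy.stats import norm

def p_to_z(pval, beta):
    """Convert two-sided p-values to signed z-scores,
    taking the sign from the effect estimate beta."""
    z = np.abs(norm.ppf(np.asarray(pval) / 2))
    return np.where(np.asarray(beta) < 0, -z, z)

# A two-sided p-value of 0.05 maps to |z| of about 1.96;
# the sign follows the sign of beta.
print(p_to_z([0.05, 0.05], [1.0, -2.0]))
```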
%sosrun extract_zscore
[extract_zscore]
parameter: exclude_list = 'NULL'
parameter: num_train = 20000
depends: R_library("rhdf5")
input: input_db
output: "${input_db!n}.Z.rds" if exclude_list == 'NULL' else "${input_db!n}.${exclude_list!bn}.Z.rds"
task: workdir = cwd
R:
ConvertP2Z <- function(pval, beta) {
  # Two-sided p-value to signed z-score; sign taken from beta
  z <- abs(qnorm(pval / 2))
  z[which(beta < 0)] <- -1 * z[which(beta < 0)]
  return(z)
}
GetSS <- function(gene, db) {
  # Read summary statistics for a SNP set and transpose matrices
  # so that rows are gene-SNP pairs and columns are tissues
  dat <- rhdf5::h5read(db, gene)
  dat$"z-score" <- ConvertP2Z(dat$"p-value", dat$"beta")
  for (name in c("beta", "t-stat", "p-value", "z-score")) {
    dat[[name]] <- t(dat[[name]])
    colnames(dat[[name]]) <- dat$colnames
    rownames(dat[[name]]) <- dat$rownames
  }
  dat$colnames <- dat$rownames <- NULL
  return(dat)
}
load_data = function(table) {
  # load data
  mdat = GetSS('max', ${input!r})[[table]]
  ndat = GetSS('null', ${input!r})[[table]]
  # split the null set into training and validation rows
  num_train = ${num_train}
  if (num_train >= nrow(ndat)) {
    num_train = floor(nrow(ndat) / 2)
  }
  train = ndat[1:num_train,]
  validate = ndat[(num_train+1):nrow(ndat),]
  if (${exclude_list!r} != "NULL") {
    pout = scan(${exclude_list!r}, what="character", sep=NULL)
    # Row names are "_"-delimited; fields 2-3 joined by ":" give the
    # chr:pos style IDs used in the prune lists
    snp_id = function(x) apply(sapply(strsplit(x, "_"), `[`, c(2,3)), 2,
                               function(i) paste(i, collapse = ':'))
    mdat = mdat[!(snp_id(rownames(mdat)) %in% pout),]
    ndat = ndat[!(snp_id(rownames(ndat)) %in% pout),]
    train = train[!(snp_id(rownames(train)) %in% pout),]
    validate = validate[!(snp_id(rownames(validate)) %in% pout),]
  }
  # get vhat (SVS)
  vhat = NULL
  if (table == 'z-score') {
    max_absz = apply(abs(ndat), 1, max)
    nullish = which(max_absz < 2)
    nz = ndat[nullish,]
    vhat = cor(nz)
  }
  return(list(train = train,
              validate = validate,
              test = mdat, vhat = vhat))
}
ztable = load_data("z-score")
btable = load_data("beta")
# save output
saveRDS(list(train.z = ztable$train,
             validate.z = ztable$validate,
             test.z = ztable$test,
             train.b = btable$train,
             validate.b = btable$validate,
             test.b = btable$test,
             vhat = ztable$vhat), ${output!r})
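The $\hat{V}$ step in the R cell above (correlating z-scores across tissues, using only rows where every tissue has $|z| < 2$) can be sketched in Python on simulated data. The helper `estimate_vhat` is illustrative, not pipeline code:

```python
import numpy as np

def estimate_vhat(z, thresh=2.0):
    """Estimate the correlation matrix of z-scores under the null,
    using only rows whose largest absolute z-score is below thresh."""
    nullish = np.max(np.abs(z), axis=1) < thresh
    return np.corrcoef(z[nullish], rowvar=False)

# Simulated independent z-scores for 5000 gene-SNP pairs in 4 conditions:
# the estimate should be close to the identity matrix.
rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 4))
vhat = estimate_vhat(z)
print(vhat.shape)  # (4, 4)
```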
R:
dat = readRDS("${input_db!n}.Z.rds")
str(dat)
The default procedure above extracts all SNPs to RDS. Here, using previously established workflows, we create lists of LD-pruned SNPs. To prune SNPs based on what we have in the database and the actual genotypes, we first extract SNP names from the database (done previously) and feed them to the LD pruning routine.
For the test SNPs:
sos run analysis/20170828_Compute_LD.ipynb get_ld:1-2 -b ~/Documents/GTEx/bin \
--input_list $HOME/Documents/GTEx/mash_revision/snp_eqtl.txt
Pruning complete. 4204 of 13030 variants removed.
For the train SNPs:
sos run analysis/20170828_Compute_LD.ipynb get_ld:1-2 -b ~/Documents/GTEx/bin \
--input_list $HOME/Documents/GTEx/mash_revision/snp_random.txt
Pruning complete. 19405 of 41038 variants removed.
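The pruning itself is done by a previously established workflow. As a rough illustration of what PLINK-style pairwise pruning does, here is a toy greedy sketch; `greedy_prune` is a hypothetical helper that ignores window handling and other details of the actual routine:

```python
import numpy as np

def greedy_prune(genotypes, r2_thresh=0.8):
    """Toy LD pruning: walk through SNP columns in order and drop any
    SNP whose squared correlation with an already-kept SNP exceeds
    r2_thresh. A simplified sketch of pairwise pruning, not PLINK."""
    kept, pruned = [], []
    for j in range(genotypes.shape[1]):
        if any(np.corrcoef(genotypes[:, j], genotypes[:, k])[0, 1] ** 2 > r2_thresh
               for k in kept):
            pruned.append(j)
        else:
            kept.append(j)
    return kept, pruned

# Three toy SNPs: column 1 duplicates column 0, column 2 is independent
rng = np.random.default_rng(1)
a = rng.integers(0, 3, size=(100, 1)).astype(float)
c = rng.integers(0, 3, size=(100, 1)).astype(float)
geno = np.hstack([a, a.copy(), c])
kept, pruned = greedy_prune(geno)
print(kept, pruned)  # [0, 2] [1]
```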
We end up with two lists: `snp_eqtls.extracted.prune.out` and `snp_random.extracted.prune.out`. We consolidate them into one list, `ld2.out`.
!cat /home/gaow/Documents/GTEx/mash_revision/snp_eqtls.extracted.prune.out \
/home/gaow/Documents/GTEx/mash_revision/snp_random.extracted.prune.out > \
/home/gaow/Documents/GTEx/mash_revision/ld2.out
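For the exclusion to work, the chr:pos style IDs in the prune lists must match IDs derived from the database row names. The matching logic in the extraction code (split row names on "_", keep fields 2 and 3, join with ":") can be sketched in Python; the row names below are hypothetical stand-ins for real database entries:

```python
def snp_id(rowname):
    """Map a '_'-delimited row name to a 'chr:pos' ID, mirroring the
    strsplit/paste logic in the R extraction code."""
    fields = rowname.split("_")
    return ":".join(fields[1:3])  # fields 2-3 in R's 1-based indexing

# Hypothetical row names and a prune list of removed SNPs
rownames = ["ENSG000001_1_10583_A_G", "ENSG000002_2_20301_C_T"]
prune_out = {"1:10583"}

kept = [r for r in rownames if snp_id(r) not in prune_out]
print(kept)  # ['ENSG000002_2_20301_C_T']
```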
To extract data excluding the SNPs in `ld2.out`, simply execute the workflow above with `exclude_list` set to `ld2.out`:
sos run analysis/20170829_MASH_Input_Preparation extract_zscore \
--exclude_list /home/gaow/Documents/GTEx/mash_revision/ld2.out
R:
dat = readRDS("${input_db!n}.ld2.Z.rds")
str(dat)
As a result of pruning, the training set dropped from 20K to 12K SNPs and the test set dropped from 16K to 11.6K.
%sessioninfo