Merge covariates info from multiple sources and find orthonormal basis for covariate matrix.
Covariates for analysis we've got so far include sample phenotypes (sex), sample attributes (genotyping platform), top principle components for population structure, and PEER factors. All saved in various files.
Workflow below consolidates these files and generates a single analysis ready covariate file in HDF5 format. There are two versions I save: the orginal covar matrix and its orthonormal basis (which will be used as input to association analysis down the line).
%run format_covariates
parameter: cwd = '~/Documents/GTEx'
sample_attr = '~/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_SampleAttributesDS.txt'
phenotype = '~/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_SubjectPhenotypesDS.txt'
platform_info = "${cwd!a}/h5_formatted/${sample_attr!bn}.platform_info"
[format_covariates]
# Consolidate covariates to HDF5 format
# Covariates are: sex, platform, 3 PC and PEER factors
depends: platform_info
parameter: peer_factors = glob.glob("${cwd!a}/peer_analysis/*_PEER_covariates.txt")
parameter: pc_file = "${cwd!a}/genotype_plink/GTEx7.Imputed.prune.pc.ped"
parameter: covar_file = "${phenotype!a}"
input: peer_factors, pc_file, covar_file, platform_info
output: "${cwd!a}/h5_formatted/GTEx7.covariates.raw.h5", "${cwd!a}/h5_formatted/GTEx7.covariates.orth.h5"
task: workdir = cwd
python:
import os
import pandas as pd
import scipy.linalg
if os.path.isfile(${output[0]!ar}):
os.remove(${output[0]!ar})
if os.path.isfile(${output[1]!ar}):
os.remove(${output[1]!ar})
pc = pd.read_csv(${pc_file!r}, header = None, sep = ' ', index_col = 1,
names = ['fid','pid','mid','sex','phen'] + ["PC{}".format(i+1) for i in range(20)])[["PC1", "PC2", "PC3"]]
platform = pd.read_csv(${platform_info!r}, header = 0, sep = ',', index_col = 0)
covar = pd.read_csv(${covar_file!r}, header = 0, sep = '\t', index_col = 0)['SEX'].to_frame()
dat = covar.merge(platform, left_index = True, right_index = True)
dat = dat.merge(pc, left_index = True, right_index = True)
# Add PEER
for item in [${peer_factors!ar,}]:
peer = pd.read_csv(item, header = 0, sep = '\t', index_col = 0).transpose()
samples = {}
for x in peer.index:
samples[x] = dat.loc['-'.join(x.split('-')[:2])].tolist()
samples = pd.DataFrame(samples).transpose()
samples.columns = dat.columns
samples = samples.merge(peer, left_index = True, right_index = True)
samples.to_hdf(${output[0]!ar}, '/{}'.format(os.path.basename(item[:-20])), mode = 'a', complevel = 9, complib = 'zlib')
samples_orth = pd.DataFrame(scipy.linalg.orth(samples), index = samples.index)
samples_orth.to_hdf(${output[1]!ar}, '/{}'.format(os.path.basename(item[:-20])), mode = 'a', complevel = 9, complib = 'zlib')
[recode_platform: provides = platform_info]
# Covariate "platform" needs to be recoded to numeric
input: sample_attr
output: platform_info
task: workdir = cwd
python:
import pandas as pd
samples = pd.read_csv(${input!r}, dtype=str, delimiter='\t', header=0)
res = [('SUBJID', 'GENO_PLATFORM')]
platform = []
for row in samples[['SAMPID', 'SMGEBTCHT', 'SMAFRZE']].values:
if row[2] == 'WGS':
row[0] = '-'.join(row[0].split('-')[:2])
if not row[1] in platform:
platform.append(row[1])
res.append((row[0], str(platform.index(row[1]))))
with open(${output!r}, "w") as f:
f.write('\n'.join([','.join(x) for x in res]))
In my initial analysis I've got error message from varbvs
package:
> varbvs::varbvsmix(X, Z, y, sa = c(0,mixsd^2))
Error in solve.default(crossprod(Z), c(y %*% Z)) :
system is computationally singular: reciprocal condition number = 3.53577e-17
The reason crossprod(Z)
is near singular is because some colums of Z
are highly correlated (cor(Z)
near singular). Since PEER factors are not orthogonal it may create such problem. Since we use an improper prior on Z
and we do not need to interpret the Z
part of the model, we will use such orthogonal basis that spans the same space as Z
with reduced rank to avoid the numerical problem. That way we do not have to improve prior on Z
.
%sessioninfo