Multivariate Bayesian variable selection regression

Association mapping covariates

Merge covariates info from multiple sources and find orthonormal basis for covariate matrix.

Covariates for analysis we've got so far include sample phenotypes (sex), sample attributes (genotyping platform), top principle components for population structure, and PEER factors. All saved in various files.

Workflow below consolidates these files and generates a single analysis ready covariate file in HDF5 format. There are two versions I save: the orginal covar matrix and its orthonormal basis (which will be used as input to association analysis down the line).

In [1]:
%run format_covariates

parameter: cwd = '~/Documents/GTEx'
sample_attr = '~/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_SampleAttributesDS.txt'
phenotype = '~/Documents/GTEx/gtex7/sample_annotations/GTEx_Analysis_2016-01-15_v7_SubjectPhenotypesDS.txt'
platform_info = "${cwd!a}/h5_formatted/${sample_attr!bn}.platform_info"

[format_covariates]
# Consolidate covariates to HDF5 format
# Covariates are: sex, platform, 3 PC and PEER factors
depends: platform_info
parameter: peer_factors = glob.glob("${cwd!a}/peer_analysis/*_PEER_covariates.txt")
parameter: pc_file = "${cwd!a}/genotype_plink/GTEx7.Imputed.prune.pc.ped"
parameter: covar_file = "${phenotype!a}"
input: peer_factors, pc_file, covar_file, platform_info
output: "${cwd!a}/h5_formatted/GTEx7.covariates.raw.h5", "${cwd!a}/h5_formatted/GTEx7.covariates.orth.h5"
task: workdir = cwd
python:
    import os
    import pandas as pd
    import scipy.linalg
    if os.path.isfile(${output[0]!ar}):
       os.remove(${output[0]!ar})
    if os.path.isfile(${output[1]!ar}):
       os.remove(${output[1]!ar})
    pc = pd.read_csv(${pc_file!r}, header = None, sep = ' ', index_col = 1,
         names = ['fid','pid','mid','sex','phen'] + ["PC{}".format(i+1) for i in range(20)])[["PC1", "PC2", "PC3"]]
    platform = pd.read_csv(${platform_info!r}, header = 0, sep = ',', index_col = 0)
    covar = pd.read_csv(${covar_file!r}, header = 0, sep = '\t', index_col = 0)['SEX'].to_frame()
    dat = covar.merge(platform, left_index = True, right_index = True)
    dat = dat.merge(pc, left_index = True, right_index = True)
    # Add PEER
    for item in [${peer_factors!ar,}]:
        peer = pd.read_csv(item, header = 0, sep = '\t', index_col = 0).transpose()
        samples = {}
        for x in peer.index:
            samples[x] = dat.loc['-'.join(x.split('-')[:2])].tolist()
        samples = pd.DataFrame(samples).transpose()
        samples.columns = dat.columns
        samples = samples.merge(peer, left_index = True, right_index = True)
        samples.to_hdf(${output[0]!ar}, '/{}'.format(os.path.basename(item[:-20])), mode = 'a', complevel = 9, complib = 'zlib')
        samples_orth = pd.DataFrame(scipy.linalg.orth(samples), index = samples.index)
        samples_orth.to_hdf(${output[1]!ar}, '/{}'.format(os.path.basename(item[:-20])), mode = 'a', complevel = 9, complib = 'zlib')

[recode_platform: provides = platform_info]
# Covariate "platform" needs to be recoded to numeric
input: sample_attr
output: platform_info
task: workdir = cwd
python:
    import pandas as pd
    samples = pd.read_csv(${input!r}, dtype=str, delimiter='\t', header=0)
    res = [('SUBJID', 'GENO_PLATFORM')]
    platform = []
    for row in samples[['SAMPID', 'SMGEBTCHT', 'SMAFRZE']].values:
        if row[2] == 'WGS':
           row[0] = '-'.join(row[0].split('-')[:2])
           if not row[1] in platform:
              platform.append(row[1])
           res.append((row[0], str(platform.index(row[1]))))
    with open(${output!r}, "w") as f:
        f.write('\n'.join([','.join(x) for x in res]))
1 task completed: 9b25

About orthonormal basis

In my initial analysis I've got error message from varbvs package:

> varbvs::varbvsmix(X, Z, y, sa = c(0,mixsd^2))
Error in solve.default(crossprod(Z), c(y %*% Z)) : 
  system is computationally singular: reciprocal condition number = 3.53577e-17

The reason crossprod(Z) is near singular is because some colums of Z are highly correlated (cor(Z) near singular). Since PEER factors are not orthogonal it may create such problem. Since we use an improper prior on Z and we do not need to interpret the Z part of the model, we will use such orthogonal basis that spans the same space as Z with reduced rank to avoid the numerical problem. That way we do not have to improve prior on Z.

In [2]:
%sessioninfo

SoS

SoS Version
0.9.8.10

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago