Multivariate Bayesian variable selection regression

Molecular QTL workflow

In [1]:
%revisions -s -n 10

Data

Molecular QTL data from Yang et al (2016) Science. Inputs are genotypes of ~100 YRI samples along with their molecular QTL phenotypes measured in LCLs. Alternative splicing (AS) data is of primary interest here, although this workflow can be used to analyze all phenotypes.

Genotypes

Genotype data for YRI is in the conventional VCF format, but stores dosages rather than hard genotype calls.

Phenotypes

Phenotype data has fastqtl format:

#Chr    start   end ID  18486   18487   18488   18489   18498   18499
chr1    880180  880422  chr1:880180:880422:clu_15502    0.201694364955  0.665990212763  -1.21881815589  -0.342480185427 0.165404160483  -1.58524292941

The first 4 columns are genomic coordinate information; the remaining columns are per-sample molecular phenotype values. The original data (e.g. the alternative splicing data) may not come in this exact format but should be easy to reformat -- in fact Kevin has already formatted all these data to the fastqtl format, so I will simply take it from there.
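
As a minimal R sketch (assuming a fastqtl-format file such as the one above; the file name below is a placeholder), the table can be split into coordinate columns and a per-sample phenotype matrix as follows:

# Read a fastqtl-format phenotype table (tab-delimited, '#Chr' header line).
# 'pheno_file' is a placeholder path for illustration only.
pheno_file <- "fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz"
dat <- read.table(pheno_file, header = TRUE, comment.char = '', check.names = FALSE)
coords <- dat[, 1:4]                     # chr, start, end, ID
pheno  <- as.matrix(dat[, 5:ncol(dat)])  # one column per sample
dim(pheno)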

Analysis plan

  • For each analysis unit (gene, or intron cluster for AS), extract variants within 100Kb to 1Mb up/downstream of the unit's start site from the genotype data
  • Remove the top phenotype PCs from the phenotype data
  • Fine-map using SuSiE and DAP

Workflow overview

In [2]:
!sos run 20180704_MolecularQTL_Workflow.ipynb -h
usage: sos run 20180704_MolecularQTL_Workflow.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  preprocess
  index_vcf
  SuSiE
  SuSiE_summary
  DAP
  CAVIAR_follow_up
  lm_follow_up

Global Workflow Options:
  --x-data . (as path)
                        X data, the genotype VCF file path
  --y-data . (as path)
                        Y data, the phenotype file paths
  --trait 'unnamed_phenotype'
                        Trait name
  --cwd  f'{y_data:d}/{trait}_output'

                        Specify work / output directory
  --max-dist 100000 (as int)
                        Maximum distance to site of interest, set to eg. 100Kb
                        or 1Mb up/downstream to start site of analysis unit

Sections
  preprocess_1:         PCA on phenotype and remove top PCs
    Workflow Options:
      --num-pcs 3 (as int)
                        Num. PC to remove from phenotype
      --colname-pattern '^[0-9]+'
                        column name pattern for `grep` in R to select phenotype
                        columns eg. "^NA[0-9]+" to extract sample names
  index_vcf:            this step provides VCF file index
  preprocess_2:         Extract cis-SNPs and make fine-mapping datasets
  preprocess_3:         Summary statistics and PVE estimates
  SuSiE:                Run SuSiE
    Workflow Options:
      --maxL 5 (as int)
                        SuSiE parameter: L
      --prior-var 0.0 (as float)
                        SuSiE parameter: prior variance; set to 0 to use per-
                        dataset PVE estimates
  SuSiE_summary:        Make SuSiE result plots for significant results
  DAP_1:                Convert to DAP format
  DAP_2:                Run DAP-g
  CAVIAR_follow_up_1:   Run CAVIAR on SuSiE results of interest
    Workflow Options:
      --caviar-args '-c 2'
  CAVIAR_follow_up_2:   Plot CAVIAR vs SuSiE results
  lm_follow_up_1:       Run linear regression conditional on SuSiE CS
    Workflow Options:
      --use-top-pip TRUE
                        Only use top pip from each cluster rather than using all
                        variables from the cluster
  lm_follow_up_2:       Summarize the outcome
    Workflow Options:
      --pval-cutoff '1E-4'
In [ ]:
[global]
# X data, the genotype VCF file path
parameter: x_data = path()
# Y data, the phenotype file paths
parameter: y_data = path()
# Trait name
parameter: trait = "unnamed_phenotype"
# Specify work / output directory
parameter: cwd = f'{y_data:d}/{trait}_output'
# Maximum distance to site of interest, set to eg. 100Kb or 1Mb up/downstream to start site of analysis unit
parameter: max_dist = 100000
fail_if(not x_data.is_file(), msg = 'Please provide valid ``--x-data``!')
fail_if(not y_data.is_file(), msg = 'Please provide valid ``--y-data``!')
pop = 'YRI'

We also set some bash variables used by the commands below:

xdata=/project/compbio/jointLCLs/genotype/hg19/YRI/genotypesYRI.gen.txt.gz
ydata=/project/compbio/jointLCLs/phenotypes/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz
ncpu=16
trait=AS
cwd=/project/compbio/jointLCLs/results/SuSiE

Preprocessing analysis

The preprocessing pipeline can be executed locally; it takes about 3 hours on my 40-thread machine:

sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess \
    --x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
    --max-dist 100000 --num-pcs 3

Regressing out top PCs on phenotype

There may well be better approaches to control for covariates, but this follows the authors' workflow and was deemed sufficient. See the supplemental table of Yang et al 2016 for how many PCs to use for each molecular QTL.

We need to cope with missing phenotype data here; see the na.omit call passed to prcomp and na.action=na.exclude in lm below.
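
As a quick standalone illustration (toy data, not part of the workflow) of why na.action=na.exclude matters: with the default na.omit, residuals() drops the missing observation, while na.exclude pads it with NA so the residual vector stays aligned with the original samples.

set.seed(1)
y <- c(1.2, NA, 0.7, 2.1, 1.5)
x <- rnorm(5)
length(residuals(lm(y ~ x)))                          # 4: NA observation dropped
length(residuals(lm(y ~ x, na.action = na.exclude)))  # 5: NA kept in place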

In [ ]:
# PCA on phenotype and remove top PCs
[preprocess_1 (Remove top phenotype PC)]
# Num. PC to remove from phenotype
parameter: num_pcs = 3 # Table S2 of NIHMS835311-supplement-supplement.pdf
# column name pattern for `grep` in R to select phenotype columns
# eg. "^NA[0-9]+" to extract sample names
parameter: colname_pattern = '^[0-9]+'
input: y_data
output: f"{cwd}/{_input:bn}.PC{num_pcs}.removed.gz"
R: expand = "${ }", workdir = cwd, stdout = f"{_output:n}.stdout"
    num_pcs = ${num_pcs}
    dat <- read.table(${_input:r}, header=T, comment.char='', check.names=F)
    phenotype.matrix <- dat[,5:ncol(dat)]
    # extract columns of interest
    phenotype.matrix <- phenotype.matrix[,grep("${colname_pattern}", colnames(phenotype.matrix), value = T)]
    # perform principal component analysis
    pca <- prcomp(na.omit(phenotype.matrix))
    # PCA summary
    print(summary(pca))
    cat("output", num_pcs, "PCs \n")
    # remove top PC from phenotype; takes a while
    cov_pcs <- pca$rotation[, 1:num_pcs]
    new.phenotype.matrix <- do.call(rbind, lapply(1:nrow(phenotype.matrix), function(i) residuals(lm(t(phenotype.matrix[i,]) ~ as.matrix(cov_pcs), na.action=na.exclude))))
    colnames(new.phenotype.matrix) <- colnames(phenotype.matrix)
    new.dat <- cbind(dat[,1:4], new.phenotype.matrix)
    colnames(new.dat)[1] <- 'chr'
    write.table(new.dat, gzfile(${_output:r}), sep="\t", quote=F, col.names=T, row.names=F)

Extract per unit variables

In [ ]:
# this step provides VCF file index
[index_vcf: provides = '{filename}.gz.tbi']
depends: executable('tabix')
input: f"{filename}.gz"
bash: expand=True
   tabix -p vcf {_input}

# Extract cis-SNPs and make fine-mapping datasets
[preprocess_2 (Get per-unit dataset): shared = {'unit_dataset': "step_output"}]
depends: Py_Module('pysam'), Py_Module('pandas'), Py_Module('feather'), f"{x_data}.tbi"
chroms = [f'chr{x+1}' for x in range(22)]
input: for_each = 'chroms', concurrent = True
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/{_chroms}/MANIFEST'
python: workdir = cwd, expand = "${ }"
    def read_header(gzfile):
        import gzip
        with gzip.open(gzfile,'r') as f:
            for line in f:
                res = [x.decode() for x in line.split()]
                break
        return res
    chrom = "${_chroms}"
    phenotype_id = [f'${pop}_{x}' for x in read_header(${_input:r})[4:]]
    vcf_id = [f'${pop}_{x}' for x in read_header(${x_data:r})[9:]]
    from pathlib import Path
    import pysam
    tbx = pysam.TabixFile(${x_data:r})    
    import pandas as pd, numpy as np
    from feather import write_dataframe
    qts = pd.read_csv(${_input:r}, sep = '\t')
    qts = qts.loc[qts['chr'] == chrom]
    #
    import re, time
    start_time = time.time()
    i = 0
    out_files = []
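    # Phenotype rows sharing a start coordinate (e.g. introns from the same
    # cluster) are grouped into one fine-mapping unit in the loop below.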
    for site in sorted(set(qts['start'].tolist())):
        if(i % 100 == 0):
            print('[chrom %s percent completed] %.1f (%.1f sec elapsed)' % (chrom, (float(i+1)/qts.shape[0])*100, time.time() - start_time))        
        unit = qts.loc[qts['start'] == site]
        i += unit.shape[0]
        basename = f'${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/{chrom}/{chrom}_{site}_{max(unit["end"].tolist())}'
        start = max(site - ${max_dist}, 0)
        end = site + ${max_dist}
        genotypes = np.array([row for row in tbx.fetch(chrom, start, end, parser=pysam.asTuple())])
        if len(genotypes) == 0:
            continue
        Y_data = unit.drop(["chr", "start", "end", "ID"], axis=1).T
        Y_data.columns = ["${trait}_" + re.sub('[^0-9a-zA-Z]+', '_', x).strip('_') for x in unit['ID']]
        Y_data.index = phenotype_id
        X_data = pd.DataFrame(genotypes[:,9:].T,
                              columns = ['_'.join(x) for x in genotypes[:,[2,0,1,3,4]]], 
                              index = vcf_id)
        merged = Y_data.join(X_data, how='inner').astype(np.float32)
        Path(f'${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/{chrom}').mkdir(exist_ok=True, parents=True)
        # FIXME: this will be a large file because feather format is not yet compressed ... 
        # This is being discussed by the feather development group and hopefully compression will be supported in later 2018
        write_dataframe(merged, basename + '.feather')
        out_files.append(basename + '.feather')
    with open(${_output:r}, 'w') as f:
        f.write('\n'.join(out_files))

Summary statistics and PVE estimate

We want to estimate PVE from the top signal in each unit, to get an idea of how the SuSiE prior variance should be configured. We also compute summary statistics at this stage, saving only the z-scores.
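
Stripped of the SoS templating, the core per-unit computation is roughly the following (a sketch assuming X, an n x p genotype matrix, and y, a length-n phenotype vector, are already loaded):

# Per-SNP summary statistics and PVE of the single strongest signal.
# X and y are assumed to be loaded; see the workflow step below for details.
reg <- susieR:::univariate_regression(X, y)
z   <- reg$betahat / reg$sebetahat   # per-SNP z-scores
top <- which.max(abs(z))             # strongest single-SNP signal
pve <- var(X[, top] * reg$betahat[top]) / var(y)
pve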

In [ ]:
# Summary statistics and PVE estimates
[preprocess_3 (PVE)]
depends: R_library('feather'), R_library('susieR')
input: paths([get_output(f'cat {x}').strip().split('\n') for x in unit_dataset]), group_by = 1, concurrent = True
output: f'{_input:n}.rds'
R: expand = '${ }'
    library(feather)
    dat = read_feather(${_input:r})
    n_y = length(grep("^${trait}", colnames(dat), value = T))
    Y = as.matrix(dat[,1:n_y,drop=F])
    X = as.matrix(dat[,(n_y+1):ncol(dat),drop=F])
    storage.mode(X) = 'double'
    storage.mode(Y) = 'double'
    bad = which(sapply(1:ncol(X), function(i) all(is.na(X[,i]))))
    if (length(bad) >= 1) {
      snps = colnames(X)[-bad]
      X = X[,-bad,drop=F]
    } else {
      snps = colnames(X)
    }
    res = list()
    for (r in 1:ncol(Y)) {
      keep_rows = which(!is.na(Y[,r]))
      x = X[keep_rows,,drop=F]
      y = Y[,r][keep_rows]
      reg = susieR:::univariate_regression(x,y)
      z_score = reg$betahat/reg$sebetahat
      names(z_score) = snps
      top_idx = which.max(abs(z_score))
      pve = var(x[,top_idx] * reg$betahat[top_idx]) / var(y)
      res[[r]] = list(X=x, y=y, z_score=z_score, pve=pve, uname=colnames(Y)[[r]])
    }
    saveRDS(res, ${_output:r})
_input.zap()

The distribution of PVE estimates can then be assessed in R:

> files = Sys.glob('chr*/*.rds')
> length(files)
[1] 44068
> pve = sapply(1:length(files), function(i) readRDS(files[i])[[1]]$pve)
> summary(pve)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.008189 0.065382 0.085535 0.096619 0.111102 0.956613

Finemapping with SuSiE

I run two sets of analyses: one with --prior_var left at its default of 0, which uses the region-level PVE-based value, and another with --prior_var 0.096, the average value computed above.

This step takes about 3 hours on my 40-thread machine:

sos run analysis/20180704_MolecularQTL_Workflow.ipynb SuSiE \
    --x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
    --max-dist 100000 --prior_var 0.096

Not particularly recommended, but what I have also tried is using the region-level PVE-based value (this will overwrite the output of the previous command, so please empty the previous output directory if you would like to try it):

sos run analysis/20180704_MolecularQTL_Workflow.ipynb SuSiE \
    --x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
    --max-dist 100000

We only plot results for which SuSiE reports at least 1 CS:

sos run analysis/20180704_MolecularQTL_Workflow.ipynb SuSiE_summary \
    --x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
    --max-dist 100000
sos run analysis/20180704_MolecularQTL_Workflow.ipynb signal_summary \
    --x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
    --max-dist 100000
In [ ]:
# Run SuSiE
[SuSiE (SuSiE analysis)]
depends: R_library('susieR')
# SuSiE parameter: L
parameter: maxL = 5
# SuSiE parameter: prior variance; set to 0 to use per-dataset PVE estimates
parameter: prior_var = 0.0
input: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/chr*/*.rds', group_by = 1, concurrent = True
# here I do not keep track of output because it is otherwise too computationally 
# intensive for dynamic output (SoS issue #1028)
#output: dynamic(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_*/*.rds')
#task: trunk_workers = 1, queue = 'midway2_head', walltime = '10m', trunk_size = 1000, mem = '2G', cores = 1, workdir = cwd, concurrent = True
R: expand = "${ }"
    library(susieR)
    set.seed(1)
    dat = readRDS(${_input:r})
    for (r in 1:length(dat)) {
      fitted = susie(dat[[r]]$X, dat[[r]]$y,
               L=${maxL},
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=${prior_var if prior_var > 0 else "dat[[r]]$pve"},
               tol=1e-3)
      sets = susie_get_CS(fitted,
                    coverage = 0.95,
                    X = dat[[r]]$X, 
                    min_abs_corr = 0.4)
      pip = susie_get_PIP(fitted, sets$cs_index)
      names(pip) = colnames(dat[[r]]$X)
      dirname = paste0('${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/SuSiE_CS_', length(sets$cs_index), '/')
      system(paste("mkdir -p", dirname))
      if (length(sets$cs_index) > 0) {
          saveRDS(list(input=${_input:r},idx=r,fitted=fitted,sets=sets,pip=pip), paste0(dirname, dat[[r]]$uname, '.rds'))
      } else {
          saveRDS(list(input=${_input:r},idx=r,fitted=fitted), paste0(dirname, dat[[r]]$uname, '.rds'))       
      }
    }
In [ ]:
# Make SuSiE result plots for significant results
[SuSiE_summary (plot SuSiE results)]
depends: R_library('susieR')
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:n}.png'
R: expand = '${ }', stdout = f'{_output:n}.log'
    library(susieR)
    dat = readRDS(${_input:r})
    z_score = readRDS(dat$input)[[dat$idx]]$z_score
    b = rep(0,length(z_score))
    b[which.max(abs(z_score))] = 1
    png(${_output:r}, 12, 6, units = 'in', res = 500)
    par(mfrow=c(1,2))
    susie_pplot(z_score, dtype='z', b=b, main = 'Per-feature regression p-values')
    susie_pplot(dat$pip, res=dat$fitted, CS=dat$sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE CS purity:', paste(round(dat$sets$purity[,1],2), collapse=';')))
    dev.off()
In [ ]:
# Count discoveries
[signal_summary]
pip_cutoff = 0.8
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds')
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_Summary.rds'
R: expand = '${ }'
    files = c(${_input:r,})
    results = lapply(1:length(files), function(i) readRDS(files[i])[c("sets","pip")])
    cs = lapply(1:length(results), function(i) results[[i]]$sets$cs)
    L = sapply(1:length(results), function(i) length(cs[[i]]))
    size = sapply(1:length(results), function(i) sapply(1:length(cs[[i]]), function(j) length(cs[[i]][[j]])))
    purity = lapply(1:length(results), function(i) results[[i]]$sets$purity[,1])
    high_pips = sapply(1:length(results), function(i) sum(results[[i]]$pip > ${pip_cutoff}))
    saveRDS(list(L=L, size=size, purity=purity,high_pips=high_pips,cs=cs),${_output:r})
In [1]:
%cd /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb
/project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb
In [4]:
dat = readRDS('SuSiE_CS_Summary.rds')
In [2]:
print(length(dat$L))
cnt = 0
for (i in 1:6) {
    tmp = sum(dat$L==i)
    print(tmp)
    cnt = cnt + tmp * i
}
print(cnt)
[1] 2496
[1] 2357
[1] 129
[1] 5
[1] 3
[1] 2
[1] 0
[1] 2652

Number of CS with different sizes

In [3]:
sum(unlist(dat$size) == 1)
457
In [4]:
sum(unlist(dat$size) == 2)
239
In [5]:
median(unlist(dat$size))
7
In [6]:
median(unlist(dat$purity))
0.942132875772012
In [7]:
mean(unlist(dat$purity))
0.874715633902827
In [8]:
sum(dat$high_pips)
619

CS overlapping status

There are no overlapping CS in our results.

In [6]:
for (s in 1:length(dat$cs)) {
    for (i in 1:length(dat$cs[[s]])) {
        for (j in 1:i) {
            if (i==j) next
            status = intersect(dat$cs[[s]][[i]], dat$cs[[s]][[j]])
            if (length(status)) print(status)
        }
    }
}
print(length(dat$cs))
[1] 2496

Finemapping with DAP

I made this for Kevin's analysis. Here is the list of analyses I have run:

sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess+DAP \
    --x-data $xdata \
    --y-data ~/GIT/LargeFiles/JointLCL/fastqtl_m6A_merged_peaks_IPcounts_YangVCF.txt.gz \
    --max-dist 100000 --num-pcs 7 --trait m6A -j 38 \
    -b ~/GIT/github/mvarbvs/dsc/modules/linux/
sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess+DAP \
    --x-data $xdata \
    --y-data ~/GIT/LargeFiles/JointLCL/fastqtl_ribo_normExpr_YangVCF.txt.gz \
    --max-dist 100000 --num-pcs 2 --trait ribo -j 38 \
    -b ~/GIT/github/mvarbvs/dsc/modules/linux/
sos run analysis/20180704_MolecularQTL_Workflow.ipynb DAP \
    --x-data $xdata \
    --y-data ~/GIT/LargeFiles/JointLCL/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
    --max-dist 100000 --num-pcs 3 --trait AS -j 38 \
    -b ~/GIT/github/mvarbvs/dsc/modules/linux/
sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess+DAP \
    --x-data $xdata \
    --y-data ~/GIT/LargeFiles/JointLCL/fastqtl_apa_distProxRatio_YangVCF.txt.gz \
    --max-dist 100000 --num-pcs 7 --trait APA -j 38 \
    -b ~/GIT/github/mvarbvs/dsc/modules/linux/
In [ ]:
# Convert to DAP format
[DAP_1 (convert to DAP data format)]
input: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/chr*/*.rds', group_by = 1, concurrent = True
output: dynamic(f'{_input:dd}/DAP-g/*.{_input:bn}.gz')
R: expand = '${ }'
    dat = readRDS(${_input:r})
    dirname = paste0(${_input:ddr}, '/DAP-g/')
    system(paste('mkdir -p', dirname))
    for (r in 1:length(dat)) {
      pheno = c('pheno', '${y_data:bnn}', dat[[r]]$uname, dat[[r]]$y)
      geno = cbind(rep('geno', ncol(dat[[r]]$X)), colnames(dat[[r]]$X), rep(dat[[r]]$uname, ncol(dat[[r]]$X)), t(dat[[r]]$X))
      write.table(rbind(pheno, geno), gzfile(paste0(dirname, dat[[r]]$uname, '.${_input:bn}.gz')), quote=F,col.names=F,row.names=F)
  }
In [ ]:
# Run DAP-g
[DAP_2 (run DAP-g)]
depends: Py_Module('dsc'), Py_Module('rpy2'), executable('dap-g')
input: group_by = 1, concurrent = True
output: f'{_input:n}.output.rds'
bash: expand = True, stderr = f'{_output:n}.log', workdir = '/tmp'
    dap-g -d <(zcat {_input}) -o {_output:n} -t 1 --all
    exit 0

python: expand = '${ }'
    import pandas as pd
    def extract_dap_output(fn):
        out = [x.strip().split() for x in open(fn).readlines()]
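        # DAP-g text output, as parsed here, has two parts: per-SNP lines whose
        # first field looks like '((k))' carrying PIP, log10(BF) and cluster id,
        # followed by a section introduced by a 'cluster_pip' header that lists
        # per-cluster inclusion probability and average r2.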
        pips = []
        clusters = []
        still_pip = True
        for line in out:
            if len(line) == 0:
                continue
            if len(line) > 2 and line[2] == 'cluster_pip':
                still_pip = False
                continue
            if still_pip and (not line[0].startswith('((')):
                continue
            if still_pip:
                pips.append([line[1], float(line[2]), float(line[3]), int(line[4])])
            else:
                clusters.append([len(clusters) + 1, float(line[2]), float(line[3])])
        pips = pd.DataFrame(pips, columns = ['snp', 'snp_prob', 'snp_log10bf', 'cluster'])
        clusters = pd.DataFrame(clusters, columns = ['cluster', 'cluster_prob', 'cluster_avg_r2'])
        clusters = pd.merge(clusters, pips.groupby(['cluster'])['snp'].apply(','.join).reset_index(), on = 'cluster')
        return {'snp': pips, 'set': clusters}

    from dsc.dsc_io import save_rds
    save_rds(extract_dap_output(${_output:nr}), ${_output:r})
_input.zap()

Follow-up: verify with CAVIAR

I found ~1600 splicing QTLs. Only about 150 report more than 1 CS. Most of these 2+ CS QTLs, however, come from scenarios where the z-scores visually suggest a single signal cluster, yet SuSiE decides to use two CS to capture it.

Another observation from the SuSiE analysis is that in a number of cases the variant with the smallest p-value is not picked up.

To verify how many signals are present when there visually appears to be only one cluster, we run CAVIAR on those data sets. Inputs for CAVIAR are z-scores and LD matrices.
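
Concretely, the CAVIAR inputs for one unit can be prepared along these lines (a sketch; z is a named vector of z-scores, X the matching genotype matrix, and the file names are placeholders):

# Write CAVIAR inputs for one unit: a z-score file and an LD matrix.
# 'z' and 'X' are assumed to be loaded; file names are placeholders.
write.table(z, "unit.z", quote = FALSE, col.names = FALSE)
write.table(cor(X), "unit.ld", quote = FALSE, col.names = FALSE, row.names = FALSE)
# CAVIAR is then invoked as, e.g.: CAVIAR -z unit.z -l unit.ld -o unit -c 2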

sos run analysis/20180704_MolecularQTL_Workflow.ipynb CAVIAR_follow_up \
    --x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
    --max-dist 100000 \
    -b ~/GIT/github/mvarbvs/dsc/modules/linux/
In [ ]:
# Run CAVIAR on SuSiE results of interest
[CAVIAR_follow_up_1]
parameter: caviar_args = '-c 2'
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[2-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:dd}/CAVIAR_follow_up/{_input:bn}.CAVIAR.rds'
R: expand = '${ }', stdout = False, stderr = False
    library('dplyr')
    library('magrittr')
    #' CAVIAR I/O
    write_caviar_sumstats <- function(z, prefix) {
      cfg = list(z=paste0(prefix,".z"),
                 set=paste0(prefix,"_set"),
                 post=paste0(prefix,"_post"),
                 log=paste0(prefix,".log"))
      write.table(z,cfg$z,quote=F,col.names=F)
      return(cfg)
    }

    #' Run CAVIAR
    #' https://github.com/fhormoz/caviar
    run_caviar <- function(z, X, args = "", prefix="data")
    {
      cfg = write_caviar_sumstats(z, prefix)
      LD_file = paste0(prefix, '.ld')
      write.table(cor(X), LD_file,quote=F,col.names=F,row.names=F)
      cmd = paste("CAVIAR", "-z", cfg$z, "-l", LD_file, "-o", prefix, args)
      dscrutils::run_cmd(cmd)
      if(!all(file.exists(cfg$post, cfg$set, cfg$log))) {
          stop("Cannot find one of the post, set, and log files")
      }

      log <- readLines(cfg$log)

      # read output tables
      snp <- read.delim(cfg$post)
      stopifnot(ncol(snp) == 3)
      names(snp) <- c("snp", "snp_prob_set", "snp_prob")
      snp$snp <- as.character(snp$snp)
      snp <- rank_snp(snp)

      # `set` of snps
      set <- readLines(cfg$set)
      set_ordered <- left_join(data_frame(snp = set), snp, by = "snp") %>% 
        arrange(rank) %$% snp
      return(list(snp=snp, set=set_ordered))
    }

    rank_snp <- function(snp) {
      snp <- arrange(snp, -snp_prob) %>%
        mutate(
            rank = seq(1, n()),
            snp_prob_cumsum = cumsum(snp_prob) / sum(snp_prob)) %>%
        select(rank, snp, snp_prob, snp_prob_cumsum, snp_prob_set)
      return(snp)    
    }
    #
    susie_res = readRDS(${_input:r})
    dat = readRDS(susie_res$input)[[susie_res$idx]]
    # CAVIAR CODE
    caviar = run_caviar(dat$z_score, dat$X, args = '${caviar_args}', prefix = paste0(${_output:nr}, '.tmp'))
    saveRDS(list(susie_res=${_input:r}, pip=caviar$snp, sets=caviar$set), ${_output:r})
    system('rm -f ${_output:n}.tmp.*')
In [ ]:
# Plot CAVIAR vs SuSiE results
[CAVIAR_follow_up_2]
input: group_by = 1, concurrent = True
output: f'{_input:n}.png'
R: expand = '${ }'
    library(susieR)
    caviar_res = readRDS(${_input:r})
    susie_res = readRDS(caviar_res$susie_res)
    z_score = readRDS(susie_res$input)[[susie_res$idx]]$z_score
    b = rep(0,length(z_score))
    b[which.max(abs(z_score))] = 1
    # re-organize caviar
    caviar_pip = caviar_res$pip$snp_prob
    names(caviar_pip) = caviar_res$pip$snp
    caviar_pip = caviar_pip[names(z_score)]
    png(${_output:r}, 8, 8, units = 'in', res = 500)
    par(mfrow=c(2,2),mar=c(4,4,4,4))
    susie_pplot(z_score, dtype='z', b=b, main = 'Per-feature regression p-values' )
    susie_pplot(susie_res$pip, res=susie_res$fitted, CS=susie_res$sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE CS purity:', paste(round(susie_res$sets$purity[,1],2), collapse=';')))
    plot(susie_res$pip, caviar_pip, pch = 16, xlab='SuSiE_PIP', ylab = 'CAVIAR_PIP')
    abline(a=0,b=1,col='red')
    susie_pplot(caviar_pip, dtype='PIP', b=b, main = 'CAVIAR PIP')
    dev.off()

Follow-up: verify with multiple regression

For results where SuSiE reports 1 or more CS, we can fit a multiple regression model using the variables (or just the top signal) from each CS, take the residuals, and regress them on the SNP with the top z-score to check whether the effect at that SNP disappears.
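
In plain R, the check for a single unit amounts to the following (a sketch; cs_vars denotes the column indices chosen from the SuSiE CS, and dat$X, dat$y, dat$z_score are loaded as in the workflow step below):

# Conditional regression check for one unit (cs_vars is assumed given).
residual <- residuals(lm(dat$y ~ dat$X[, cs_vars, drop = FALSE]))
top_z    <- which.max(abs(dat$z_score))
summary(lm(residual ~ dat$X[, top_z]))$coef[2, 4]  # conditional p-value of the top-z SNP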

sos run analysis/20180704_MolecularQTL_Workflow.ipynb lm_follow_up \
    --x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
    --max-dist 100000
In [ ]:
# Run linear regression conditional on SuSiE CS
[lm_follow_up_1]
# Only use top pip from each cluster rather than using all variables from the cluster
parameter: use_top_pip = 'TRUE'
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:dd}/lm_follow_up/{_input:bn}.lm.rds'
R: expand = '${ }'
  # Load data
  susie_res = readRDS(${_input:r})
  dat = readRDS(susie_res$input)[[susie_res$idx]]
  # lm code
  if (${use_top_pip}) {
      top_susie = sapply(1:length(susie_res$sets$cs), function(i) susie_res$sets$cs[[i]][which.max(susie_res$pip[susie_res$sets$cs[[i]]])])
  } else {
      top_susie = unique(unlist(susie_res$sets$cs))
  }
  residual = residuals(lm(dat$y~dat$X[,top_susie,drop=F]))
  top_z = which.max(abs(dat$z_score))  # top signal by absolute z-score
  top_pval = summary(lm(residual~dat$X[,top_z]))$coef[2,4]
  names(top_pval) = names(dat$z_score)[top_z]
  top_pip = susie_res$pip[top_susie]
  names(top_pip) = names(dat$z_score)[top_susie]
  saveRDS(list(top_pip=top_pip, top_pval=top_pval), ${_output:r})
In [ ]:
# Summarize the outcome
[lm_follow_up_2]
parameter: pval_cutoff = "1E-4"
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/lm_follow_up/top_cond_pval.rds'
R: expand = '${ }'
    input = c(${_input:r,})
    pvals = sapply(1:length(input), function(i) readRDS(input[i])$top_pval)
    saveRDS(pvals, ${_output:r})
    png("${_output:n}.png", 6, 6, units = 'in', res = 500)
    plot(-log10(pvals),pch=16,main = paste("Conditional on SuSiE results", sum(pvals<${pval_cutoff}), "/", length(pvals), "report p-value < ${pval_cutoff}"))
    abline(h=-log10(${pval_cutoff}),col='red')
    dev.off()

Here are the top z-score SNPs that remain significant after conditioning on the SuSiE CS in the linear regression model:

In [1]:
%preview /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/lm_follow_up/top_cond_pval.png
> /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/lm_follow_up/top_cond_pval.png (168.2 KiB):

Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago