%revisions -s -n 10
Molecular QTL data from Yang et al (2016) Science. Inputs are genotypes of ~100 YRI samples along with their molecular QTL phenotypes measured in LCLs. Alternative splicing (AS) data is of primary interest here, although this workflow can be used to analyze all phenotypes.
Genotype data for YRI is in conventional VCF format, but genotypes are coded as dosages.
Phenotype data is in fastqtl format:
#Chr start end ID 18486 18487 18488 18489 18498 18499
chr1 880180 880422 chr1:880180:880422:clu_15502 0.201694364955 0.665990212763 -1.21881815589 -0.342480185427 0.165404160483 -1.58524292941
The first 4 columns are genomic coordinate information; the remaining columns are molecular phenotype values per sample. The original data (e.g. the alternative splicing data) may not come in this exact format, but it is easy to convert -- in fact Kevin has already formatted all these data to fastqtl format, so I will just take them from there.
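A quick sanity check of this layout in R (using the AS phenotype file configured below; full path omitted here):
dat = read.table('fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz',
                 header=T, comment.char='', check.names=F, nrows=5)
dat[,1:6]  # '#Chr', 'start', 'end', 'ID', then the first two sample columns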
!sos run 20180704_MolecularQTL_Workflow.ipynb -h
[global]
# X data, the genotype VCF file path
parameter: x_data = path()
# Y data, the phenotype file paths
parameter: y_data = path()
# Trait name
parameter: trait = "unnamed_phenotype"
# Specify work / output directory
parameter: cwd = f'{y_data:d}/{trait}_output'
# Maximum distance to site of interest, e.g. 100Kb or 1Mb up/downstream of the start site of the analysis unit
parameter: max_dist = 100000
fail_if(not x_data.is_file(), msg = 'Please provide valid ``--x-data``!')
fail_if(not y_data.is_file(), msg = 'Please provide valid ``--y-data``!')
pop = 'YRI'
And some bash variables:
xdata=/project/compbio/jointLCLs/genotype/hg19/YRI/genotypesYRI.gen.txt.gz
ydata=/project/compbio/jointLCLs/phenotypes/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz
ncpu=16
trait=AS
cwd=/project/compbio/jointLCLs/results/SuSiE
The preprocessing pipeline can be executed locally; it takes about 3 hours on my 40-thread machine:
sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess \
--x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
--max-dist 100000 --num-pcs 3
There may well be better approaches to control for covariates, but here is the workflow from the authors, which was deemed sufficient. See the supplemental table of Yang et al 2016 for how many PCs to use for each type of molecular QTL.
We need to cope with missing phenotype data here: see the na.omit call on the input to prcomp, and na.action=na.exclude in the per-feature regression.
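A toy contrast of the two behaviors:
y = c(1.2, NA, 0.7, 2.1)
x = c(0.5, 1.0, 1.5, 2.0)
nrow(na.omit(cbind(y, x)))                          # 3: incomplete rows dropped, as before PCA
length(residuals(lm(y ~ x, na.action=na.exclude)))  # 4: residuals padded with NA, keeping sample alignment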
# PCA on phenotype and remove top PCs
[preprocess_1 (Remove top phenotype PC)]
# Num. PC to remove from phenotype
parameter: num_pcs = 3 # Table S2 of NIHMS835311-supplement-supplement.pdf
# column name pattern for `grep` in R to select phenotype columns
# eg. "^NA[0-9]+" to extract sample names
parameter: colname_pattern = '^[0-9]+'
input: y_data
output: f"{cwd}/{_input:bn}.PC{num_pcs}.removed.gz"
R: expand = "${ }", workdir = cwd, stdout = f"{_output:n}.stdout"
num_pcs = ${num_pcs}
dat <- read.table(${_input:r}, header=T, comment.char='', check.names=F)
phenotype.matrix <- dat[,5:ncol(dat)]
# extract columns of interest
phenotype.matrix <- phenotype.matrix[,grep("${colname_pattern}", colnames(phenotype.matrix), value = T)]
# perform principal component analysis
pca <- prcomp(na.omit(phenotype.matrix))
# PCA summary
print(summary(pca))
cat("output", num_pcs, "PCs \n")
# remove top PC from phenotype; takes a while
cov_pcs <- pca$rotation[, 1:num_pcs]
new.phenotype.matrix <- do.call(rbind, lapply(1:nrow(phenotype.matrix), function(i) residuals(lm(t(phenotype.matrix[i,]) ~ as.matrix(cov_pcs), na.action=na.exclude))))
colnames(new.phenotype.matrix) <- colnames(phenotype.matrix)
new.dat <- cbind(dat[,1:4], new.phenotype.matrix)
colnames(new.dat)[1] <- 'chr'
write.table(new.dat, gzfile(${_output:r}), sep="\t", quote=F, col.names=T, row.names=F)
# this step provides VCF file index
[index_vcf: provides = '{filename}.gz.tbi']
depends: executable('tabix')
input: f"{filename}.gz"
bash: expand=True
tabix -p vcf {_input}
# Extract cis-SNPs and make fine-mapping datasets
[preprocess_2 (Get per-unit dataset): shared = {'unit_dataset': "step_output"}]
depends: Py_Module('pysam'), Py_Module('pandas'), Py_Module('feather'), f"{x_data}.tbi"
chroms = [f'chr{x+1}' for x in range(22)]
input: for_each = 'chroms', concurrent = True
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/{_chroms}/MANIFEST'
python: workdir = cwd, expand = "${ }"
def read_header(gzfile):
import gzip
with gzip.open(gzfile,'r') as f:
for line in f:
res = [x.decode() for x in line.split()]
break
return res
chrom = "${_chroms}"
phenotype_id = [f'${pop}_{x}' for x in read_header(${_input:r})[4:]]
vcf_id = [f'${pop}_{x}' for x in read_header(${x_data:r})[9:]]
from pathlib import Path
import pysam
tbx = pysam.TabixFile(${x_data:r})
import pandas as pd, numpy as np
from feather import write_dataframe
qts = pd.read_csv(${_input:r}, sep = '\t')
qts = qts.loc[qts['chr'] == chrom]
#
import re, time
start_time = time.time()
i = 0
out_files = []
for site in sorted(set(qts['start'].tolist())):
if(i % 100 == 0):
print('[chrom %s percent completed] %.1f (%.1f sec elapsed)' % (chrom, (float(i+1)/qts.shape[0])*100, time.time() - start_time))
unit = qts.loc[qts['start'] == site]
i += unit.shape[0]
basename = f'${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/{chrom}/{chrom}_{site}_{max(unit["end"].tolist())}'
start = max(site - ${max_dist}, 0)
end = site + ${max_dist}
genotypes = np.array([row for row in tbx.fetch(chrom, start, end, parser=pysam.asTuple())])
if len(genotypes) == 0:
continue
Y_data = unit.drop(["chr", "start", "end", "ID"], axis=1).T
Y_data.columns = ["${trait}_" + re.sub('[^0-9a-zA-Z]+', '_', x).strip('_') for x in unit['ID']]
Y_data.index = phenotype_id
X_data = pd.DataFrame(genotypes[:,9:].T,
columns = ['_'.join(x) for x in genotypes[:,[2,0,1,3,4]]],
index = vcf_id)
merged = Y_data.join(X_data, how='inner').astype(np.float32)
Path(f'${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/{chrom}').mkdir(exist_ok=True, parents=True)
# FIXME: this will be a large file because feather format is not yet compressed ...
# This is being discussed by the feather development group and hopefully compression will be supported in later 2018
write_dataframe(merged, basename + '.feather')
out_files.append(basename + '.feather')
with open(${_output:r}, 'w') as f:
f.write('\n'.join(out_files))
We want to estimate PVE from the top signal in each unit, to get an idea of how the SuSiE prior variance should be configured. We can also compute summary statistics at this stage; of these, we save only the z-scores.
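For reference, the top-signal PVE computed below is var(x_top * betahat_top) / var(y). A minimal simulated illustration, using lm in place of susieR:::univariate_regression:
set.seed(1)
n = 100; x = rnorm(n); y = 0.5 * x + rnorm(n)
bhat = coef(lm(y ~ x))[2]  # marginal effect estimate at the (only) SNP
var(x * bhat) / var(y)     # fraction of phenotypic variance explained, ~0.2 here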
# Summary statistics and PVE estimates
[preprocess_3 (PVE)]
depends: R_library('feather'), R_library('susieR')
input: paths([get_output(f'cat {x}').strip().split('\n') for x in unit_dataset]), group_by = 1, concurrent = True
output: f'{_input:n}.rds'
R: expand = '${ }'
library(feather)
dat = read_feather(${_input:r})
n_y = length(grep("^${trait}", colnames(dat), value = T))
Y = as.matrix(dat[,1:n_y,drop=F])
X = as.matrix(dat[,(n_y+1):ncol(dat),drop=F])
storage.mode(X) = 'double'
storage.mode(Y) = 'double'
bad = which(sapply(1:ncol(X), function(i) all(is.na(X[,i]))))
if (length(bad) >= 1) {
snps = colnames(X)[-bad]
X = X[,-bad,drop=F]
} else {
snps = colnames(X)
}
res = list()
for (r in 1:ncol(Y)) {
keep_rows = which(!is.na(Y[,r]))
x = X[keep_rows,,drop=F]
y = Y[,r][keep_rows]
reg = susieR:::univariate_regression(x,y)
z_score = reg$betahat/reg$sebetahat
names(z_score) = snps
top_idx = which.max(abs(z_score))
pve = var(x[,top_idx] * reg$betahat[top_idx]) / var(y)
res[[r]] = list(X=x, y=y, z_score=z_score, pve=pve, uname=colnames(Y)[[r]])
}
saveRDS(res, ${_output:r})
_input.zap()
PVE can be assessed by
> files = Sys.glob('chr*/*.rds')
> length(files)
[1] 44068
> pve = sapply(1:length(files), function(i) readRDS(files[i])[[1]]$pve)
> summary(pve)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.008189 0.065382 0.085535 0.096619 0.111102 0.956613
I make two sets of analyses: one with --prior_var set to the region-level PVE-based value, and another with --prior_var 0.096, the average value computed above.
This step takes about 3 hours on my 40-thread machine:
sos run analysis/20180704_MolecularQTL_Workflow.ipynb SuSiE \
--x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
--max-dist 100000 --prior_var 0.096
Not really recommended, but what I have tried anyway is to use the region-level PVE-based value (this will overwrite output from the previous command, so please empty the previous output first if you would like to try it!):
sos run analysis/20180704_MolecularQTL_Workflow.ipynb SuSiE \
--x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
--max-dist 100000
We only plot the units for which SuSiE reports at least 1 CS:
sos run analysis/20180704_MolecularQTL_Workflow.ipynb SuSiE_summary \
--x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
--max-dist 100000
sos run analysis/20180704_MolecularQTL_Workflow.ipynb signal_summary \
--x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
--max-dist 100000
# Run SuSiE
[SuSiE (SuSiE analysis)]
depends: R_library('susieR')
# SuSiE parameter: L
parameter: maxL = 5
# SuSiE parameter: prior variance; set to 0 to use per-dataset PVE estimates
parameter: prior_var = 0.0
input: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/chr*/*.rds', group_by = 1, concurrent = True
# here I do not keep track of output because it is otherwise too computationally
# intensive for dynamic output (SoS issue #1028)
#output: dynamic(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_*/*.rds')
#task: trunk_workers = 1, queue = 'midway2_head', walltime = '10m', trunk_size = 1000, mem = '2G', cores = 1, workdir = cwd, concurrent = True
R: expand = "${ }"
library(susieR)
set.seed(1)
dat = readRDS(${_input:r})
for (r in 1:length(dat)) {
fitted = susie(dat[[r]]$X, dat[[r]]$y,
L=${maxL},
estimate_residual_variance=TRUE,
scaled_prior_variance=${prior_var if prior_var > 0 else "dat[[r]]$pve"},
tol=1e-3)
sets = susie_get_CS(fitted,
coverage = 0.95,
X = dat[[r]]$X,
min_abs_corr = 0.4)
pip = susie_get_PIP(fitted, sets$cs_index)
names(pip) = colnames(dat[[r]]$X)
dirname = paste0('${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/SuSiE_CS_', length(sets$cs_index), '/')
system(paste("mkdir -p", dirname))
if (length(sets$cs_index) > 0) {
saveRDS(list(input=${_input:r},idx=r,fitted=fitted,sets=sets,pip=pip), paste0(dirname, dat[[r]]$uname, '.rds'))
} else {
saveRDS(list(input=${_input:r},idx=r,fitted=fitted), paste0(dirname, dat[[r]]$uname, '.rds'))
}
}
# Make SuSiE result plots for significant results
[SuSiE_summary (plot SuSiE results)]
depends: R_library('susieR')
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:n}.png'
R: expand = '${ }', stdout = f'{_output:n}.log'
library(susieR)
dat = readRDS(${_input:r})
z_score = readRDS(dat$input)[[dat$idx]]$z_score
b = rep(0,length(z_score))
b[which.max(abs(z_score))] = 1
png(${_output:r}, 12, 6, units = 'in', res = 500)
par(mfrow=c(1,2))
susie_pplot(z_score, dtype='z', b=b, main = 'Per-feature regression p-values')
susie_pplot(dat$pip, res=dat$fitted, CS=dat$sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE CS purity:', paste(round(dat$sets$purity[,1],2), collapse=';')))
dev.off()
# Count discoveries
[signal_summary]
pip_cutoff = 0.8
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds')
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_Summary.rds'
R: expand = '${ }'
files = c(${_input:r,})
results = lapply(1:length(files), function(i) readRDS(files[i])[c("sets","pip")])
cs = lapply(1:length(results), function(i) results[[i]]$sets$cs)
L = sapply(1:length(results), function(i) length(cs[[i]]))
size = sapply(1:length(results), function(i) sapply(1:length(cs[[i]]), function(j) length(cs[[i]][[j]])))
purity = lapply(1:length(results), function(i) results[[i]]$sets$purity[,1])
high_pips = sapply(1:length(results), function(i) sum(results[[i]]$pip > ${pip_cutoff}))
saveRDS(list(L=L, size=size, purity=purity,high_pips=high_pips,cs=cs),${_output:r})
%cd /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb
dat = readRDS('SuSiE_CS_Summary.rds')
print(length(dat$L))
cnt = 0
for (i in 1:6) {
tmp = sum(dat$L==i)
print(tmp)
cnt = cnt + tmp * i
}
print(cnt)
sum(unlist(dat$size) == 1)
sum(unlist(dat$size) == 2)
median(unlist(dat$size))
median(unlist(dat$purity))
mean(unlist(dat$purity))
sum(dat$high_pips)
There are no overlapping CS in our results, as verified by:
for (s in 1:length(dat$cs)) {
for (i in 1:length(dat$cs[[s]])) {
for (j in 1:i) {
if (i==j) next
status = intersect(dat$cs[[s]][[i]], dat$cs[[s]][[j]])
if (length(status)) print(status)
}
}
}
print(length(dat$cs))
I made this for Kevin's analysis. Here is the list of analyses I have run:
sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess+DAP \
--x-data $xdata \
--y-data ~/GIT/LargeFiles/JointLCL/fastqtl_m6A_merged_peaks_IPcounts_YangVCF.txt.gz \
--max-dist 100000 --num-pcs 7 --trait m6A -j 38 \
-b ~/GIT/github/mvarbvs/dsc/modules/linux/
sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess+DAP \
--x-data $xdata \
--y-data ~/GIT/LargeFiles/JointLCL/fastqtl_ribo_normExpr_YangVCF.txt.gz \
--max-dist 100000 --num-pcs 2 --trait ribo -j 38 \
-b ~/GIT/github/mvarbvs/dsc/modules/linux/
sos run analysis/20180704_MolecularQTL_Workflow.ipynb DAP \
--x-data $xdata \
--y-data ~/GIT/LargeFiles/JointLCL/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
--max-dist 100000 --num-pcs 3 --trait AS -j 38 \
-b ~/GIT/github/mvarbvs/dsc/modules/linux/
sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess+DAP \
--x-data $xdata \
--y-data ~/GIT/LargeFiles/JointLCL/fastqtl_apa_distProxRatio_YangVCF.txt.gz \
--max-dist 100000 --num-pcs 7 --trait APA -j 38 \
-b ~/GIT/github/mvarbvs/dsc/modules/linux/
# Convert to DAP format
[DAP_1 (convert to DAP data format)]
input: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/chr*/*.rds', group_by = 1, concurrent = True
output: dynamic(f'{_input:dd}/DAP-g/*.{_input:bn}.gz')
R: expand = '${ }'
dat = readRDS(${_input:r})
dirname = paste0(${_input:ddr}, '/DAP-g/')
system(paste('mkdir -p', dirname))
for (r in 1:length(dat)) {
pheno = c('pheno', '${y_data:bnn}', dat[[r]]$uname, dat[[r]]$y)
geno = cbind(rep('geno', ncol(dat[[r]]$X)), colnames(dat[[r]]$X), rep(dat[[r]]$uname, ncol(dat[[r]]$X)), t(dat[[r]]$X))
write.table(rbind(pheno, geno), gzfile(paste0(dirname, dat[[r]]$uname, '.${_input:bn}.gz')), quote=F,col.names=F,row.names=F)
}
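The resulting per-unit file has one pheno row followed by one geno row per SNP, schematically (names are placeholders):
pheno  <dataset_name>  <unit_name>  y_1   y_2   ...  y_n
geno   <snp_1>         <unit_name>  x_11  x_12  ...  x_1n
geno   <snp_2>         <unit_name>  x_21  x_22  ...  x_2n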
# Run DAP-g
[DAP_2 (run DAP-g)]
depends: Py_Module('dsc'), Py_Module('rpy2'), executable('dap-g')
input: group_by = 1, concurrent = True
output: f'{_input:n}.output.rds'
bash: expand = True, stderr = f'{_output:n}.log', workdir = '/tmp'
dap-g -d <(zcat {_input}) -o {_output:n} -t 1 --all
exit 0
python: expand = '${ }'
import pandas as pd
def extract_dap_output(fn):
out = [x.strip().split() for x in open(fn).readlines()]
pips = []
clusters = []
still_pip = True
for line in out:
if len(line) == 0:
continue
if len(line) > 2 and line[2] == 'cluster_pip':
still_pip = False
continue
if still_pip and (not line[0].startswith('((')):
continue
if still_pip:
pips.append([line[1], float(line[2]), float(line[3]), int(line[4])])
else:
clusters.append([len(clusters) + 1, float(line[2]), float(line[3])])
pips = pd.DataFrame(pips, columns = ['snp', 'snp_prob', 'snp_log10bf', 'cluster'])
clusters = pd.DataFrame(clusters, columns = ['cluster', 'cluster_prob', 'cluster_avg_r2'])
clusters = pd.merge(clusters, pips.groupby(['cluster'])['snp'].apply(','.join).reset_index(), on = 'cluster')
return {'snp': pips, 'set': clusters}
from dsc.dsc_io import save_rds
save_rds(extract_dap_output(${_output:nr}), ${_output:r})
_input.zap()
I found ~1600 splicing QTLs. Only about 150 report more than 1 CS. But most of these 2+ CS cases come from a scenario where there seemingly is one signal cluster in the z-scores, yet SuSiE decides to use two CS to capture that cluster.
Another observation from the SuSiE analysis is that in a number of cases the SNP with the smallest p-value is not picked up.
To verify how many signals are really present when there visually seems to be one cluster, we run CAVIAR on those data. Inputs for CAVIAR are z-scores and LD matrices.
sos run analysis/20180704_MolecularQTL_Workflow.ipynb CAVIAR_follow_up \
--x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
--max-dist 100000 \
-b ~/GIT/github/mvarbvs/dsc/modules/linux/
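Concretely, for each dataset the wrapper in the step below writes out a z-score file and a SNP correlation matrix; a minimal sketch (file names and the RDS path are illustrative):
dat = readRDS('chr1/chr1_880180_880422.rds')[[1]]  # one per-unit dataset from preprocess_3
write.table(dat$z_score, 'unit.z', quote=F, col.names=F)               # SNP name + z-score
write.table(cor(dat$X), 'unit.ld', quote=F, col.names=F, row.names=F)  # LD matrix
# then e.g.: CAVIAR -z unit.z -l unit.ld -o unit -c 2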
# Run CAVIAR on SuSiE results of interest
[CAVIAR_follow_up_1]
parameter: caviar_args = '-c 2'
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[2-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:dd}/CAVIAR_follow_up/{_input:bn}.CAVIAR.rds'
R: expand = '${ }', stdout = False, stderr = False
library('dplyr')
library('magrittr')
#' CAVIAR I/O
write_caviar_sumstats <- function(z, prefix) {
cfg = list(z=paste0(prefix,".z"),
set=paste0(prefix,"_set"),
post=paste0(prefix,"_post"),
log=paste0(prefix,".log"))
write.table(z,cfg$z,quote=F,col.names=F)
return(cfg)
}
#' Run CAVIAR
#' https://github.com/fhormoz/caviar
run_caviar <- function(z, X, args = "", prefix="data")
{
cfg = write_caviar_sumstats(z, prefix)
LD_file = paste0(prefix, '.ld')
write.table(cor(X), LD_file,quote=F,col.names=F,row.names=F)
cmd = paste("CAVIAR", "-z", cfg$z, "-l", LD_file, "-o", prefix, args)
dscrutils::run_cmd(cmd)
if(!all(file.exists(cfg$post, cfg$set, cfg$log))) {
stop("Cannot find one of the post, set, and log files")
}
log <- readLines(cfg$log)
# read output tables
snp <- read.delim(cfg$post)
stopifnot(ncol(snp) == 3)
names(snp) <- c("snp", "snp_prob_set", "snp_prob")
snp$snp <- as.character(snp$snp)
snp <- rank_snp(snp)
# `set` of snps
set <- readLines(cfg$set)
set_ordered <- left_join(data_frame(snp = set), snp, by = "snp") %>%
arrange(rank) %$% snp
return(list(snp=snp, set=set_ordered))
}
rank_snp <- function(snp) {
snp <- arrange(snp, -snp_prob) %>%
mutate(
rank = seq(1, n()),
snp_prob_cumsum = cumsum(snp_prob) / sum(snp_prob)) %>%
select(rank, snp, snp_prob, snp_prob_cumsum, snp_prob_set)
return(snp)
}
#
susie_res = readRDS(${_input:r})
dat = readRDS(susie_res$input)[[susie_res$idx]]
# CAVIAR CODE
caviar = run_caviar(dat$z_score, dat$X, args = '${caviar_args}', prefix = paste0(${_output:nr}, '.tmp'))
saveRDS(list(susie_res=${_input:r}, pip=caviar$snp, sets=caviar$set), ${_output:r})
system('rm -f ${_output:n}.tmp.*')
# Plot CAVIAR vs SuSiE results
[CAVIAR_follow_up_2]
input: group_by = 1, concurrent = True
output: f'{_input:n}.png'
R: expand = '${ }'
library(susieR)
caviar_res = readRDS(${_input:r})
susie_res = readRDS(caviar_res$susie_res)
z_score = readRDS(susie_res$input)[[susie_res$idx]]$z_score
b = rep(0,length(z_score))
b[which.max(abs(z_score))] = 1
# re-organize caviar
caviar_pip = caviar_res$pip$snp_prob
names(caviar_pip) = caviar_res$pip$snp
caviar_pip = caviar_pip[names(z_score)]
png(${_output:r}, 8, 8, units = 'in', res = 500)
par(mfrow=c(2,2),mar=c(4,4,4,4))
susie_pplot(z_score, dtype='z', b=b, main = 'Per-feature regression p-values' )
susie_pplot(susie_res$pip, res=susie_res$fitted, CS=susie_res$sets, dtype='PIP', b=b, max_cs=0.4, main = paste('SuSiE CS purity:', paste(round(susie_res$sets$purity[,1],2), collapse=';')))
plot(susie_res$pip, caviar_pip, pch = 16, xlab='SuSiE_PIP', ylab = 'CAVIAR_PIP')
abline(a=0,b=1,col='red')
susie_pplot(caviar_pip, dtype='PIP', b=b, main = 'CAVIAR PIP')
dev.off()
For units where SuSiE reports 1 or more CS, we can construct a multiple regression model using the variables (or just the top signal) from each CS, take the residuals, and regress them on the SNP with the top z-score, to see whether the effect at the top SNP disappears.
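A toy sketch of this conditional test on simulated data:
set.seed(1)
n = 200
cs_snp = rnorm(n)                               # representative SNP from a SuSiE CS
top_snp = 0.9 * cs_snp + sqrt(0.19) * rnorm(n)  # top-z SNP, in strong LD with it
y = 0.5 * cs_snp + rnorm(n)
residual = residuals(lm(y ~ cs_snp))            # regress out the CS signal
summary(lm(residual ~ top_snp))$coef[2,4]       # p-value should be large: the top SNP effect disappears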
sos run analysis/20180704_MolecularQTL_Workflow.ipynb lm_follow_up \
--x-data $xdata --y-data $ydata --trait $trait --cwd $cwd -j $ncpu \
--max-dist 100000
# Run linear regression conditional on SuSiE CS
[lm_follow_up_1]
# Only use top pip from each cluster rather than using all variables from the cluster
parameter: use_top_pip = 'TRUE'
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:dd}/lm_follow_up/{_input:bn}.lm.rds'
R: expand = '${ }'
# Load data
susie_res = readRDS(${_input:r})
dat = readRDS(susie_res$input)[[susie_res$idx]]
# lm code
if (${use_top_pip}) {
top_susie = sapply(1:length(susie_res$sets$cs), function(i) susie_res$sets$cs[[i]][which.max(susie_res$pip[susie_res$sets$cs[[i]]])])
} else {
top_susie = unique(unlist(susie_res$sets$cs))
}
residual = residuals(lm(dat$y~dat$X[,top_susie,drop=F]))
top_z = which.max(abs(dat$z_score))
top_pval = summary(lm(residual~dat$X[,top_z]))$coef[2,4]
names(top_pval) = names(dat$z_score)[top_z]
top_pip = susie_res$pip[top_susie]
names(top_pip) = names(dat$z_score)[top_susie]
saveRDS(list(top_pip=top_pip, top_pval=top_pval), ${_output:r})
# Summarize the outcome
[lm_follow_up_2]
parameter: pval_cutoff = "1E-4"
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/lm_follow_up/top_cond_pval.rds'
R: expand = '${ }'
input = c(${_input:r,})
pvals = sapply(1:length(input), function(i) readRDS(input[i])$top_pval)
saveRDS(pvals, ${_output:r})
png("${_output:n}.png", 6, 6, units = 'in', res = 500)
plot(-log10(pvals),pch=16,main = paste("Conditional on SuSiE results", sum(pvals<${pval_cutoff}), "/", length(pvals), "reports p-value < ${pval_cutoff}"))
abline(a=4,b=0,col='red')
dev.off()
Here are the p-values of the top z-score SNPs after conditioning on the SuSiE CS variables in the linear regression model:
%preview /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/lm_follow_up/top_cond_pval.png