For the molecular QTL analysis results we obtained, we would like to test whether SuSiE credible sets (CS) are enriched with functional annotations. We focus only on the results that have one or more CS identified, and use the types of annotations listed below.
I will first test for enrichment with the primary CS (defined as the set with the highest purity), then test the non-primary CS. The idea is to discover enrichment of annotations in the primary CS, then also show that the additional CS have a similar (or interestingly different!) pattern of enrichment.
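For reference, a minimal R sketch of how "primary" is determined, assuming a susieR fit object fit whose fit$sets records each CS and its purity (the same ordering logic appears in the extract_sumstats step later in this notebook):
# Identify the primary CS: the credible set with the highest purity,
# i.e. the largest minimum absolute correlation among its member variants.
primary_cs = function(fit) {
    ordering = order(fit$sets$purity$min.abs.corr, decreasing = TRUE)
    names(fit$sets$cs)[ordering[1]]  # e.g. "L1"
}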
%revisions -s -n 10
! sos run 20180712_Enrichment_Workflow.ipynb -h
Annotation files are in BED format (for example, Coding_UCSC.bed):
chr1 69090 70008
chr1 367658 368597
chr1 621095 622034
......
chr9 141121357 141121553
chr9 141124188 141124276
chr9 141134069 141134172
There are many annotations one can use. For this workflow one should prepare a list of annotations for input, like this:
# my annotation
Coding_UCSC
CTCF_binding
E116-DNase.macs2
E116-H2A.Z
E116-H3K27ac
E116-H3K27me3
E116-H3K36me3
E116-H3K4me1
E116-H3K4me2
E116-H3K4me3
E116-H3K79me2
E116-H3K9ac
E116-H3K9me3
E116-H4K20me1
E117-DNase.macs2
Intron_UCSC
PolII_binding
The comment symbol # is allowed.
We need to format the data as:
Variant_ID PIP z_score CS_ID
where CS_ID is 0 if the variant is not in a SuSiE CS, 1 if it is in CS 1, 2 if in CS 2, and so on. Variant_ID carries chromosome and position information, e.g. rs10131831_chr14_20905250_G_A.
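Chromosome and position can be pulled straight out of the variant ID; a one-line R sketch (the zscore2bed step below does the same with stringr):
# Fields 2 and 3 of the underscore-delimited variant ID are chrom and pos.
strsplit("rs10131831_chr14_20905250_G_A", "_")[[1]][2:3]
# [1] "chr14"    "20905250"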
[global]
import glob  # used by `extract_sumstats_1` below to collect per-locus RDS files
# Y data, the phenotype file paths
parameter: y_data = path()
# Trait name
parameter: trait = None
# Specify work / output directory
parameter: cwd = path(f'{y_data:d}/{trait}_output')
# Path to directory of annotation files
parameter: annotation_dir = path('/project/compbio/jointLCLs/annotations/susie_paper_annotations')
# Path to list of single annotations to use
parameter: single_annot = path() #parameter: single_annot = path("data/all_annotations.txt")
# Maximum distance to site of interest, set to eg. 100Kb or 1Mb up/downstream to start site of analysis unit
parameter: max_dist = 100000
fail_if(not y_data.is_file(), msg = 'Please provide valid ``--y-data``!')
parameter: z_score = path(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/SuSiE_loci.sumstats.gz')
out_dir = f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/{z_score:bn}'.replace('.', '_')
try:
single_anno = [f"{annotation_dir}/{x.split()[0]}.bed" for x in open(single_annot).readlines() if x.strip() and not x.startswith('#')]
except (FileNotFoundError, IsADirectoryError):
single_anno = []
We also set some bash variables:
xdata=/project/compbio/jointLCLs/genotype/hg19/YRI/genotypesYRI.gen.txt.gz
ydata=/project/compbio/jointLCLs/phenotypes/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz
ncpu=16
trait=AS
cwd=/project/compbio/jointLCLs/results/SuSiE
Initially I had all my tools dockerized. Unfortunately it is not straightforward to run Docker (or Singularity, yet) on the UChicago cluster, so I add notes on installing the needed software in markdown cells near the relevant workflow steps.
Histone marks:
sos run analysis/20180712_Enrichment_Workflow.ipynb download_histone_annotation \
--trait $trait --y-data $ydata
We need to install bedops:
conda install -c bioconda bedops
[download_histone_annotation]
depends: executable('sort-bed')
dataset = paths(["E116-DNase.macs2.narrowPeak.gz",
"E116-H2A.Z.narrowPeak.gz",
"E116-H3K4me1.narrowPeak.gz",
"E116-H3K4me2.narrowPeak.gz",
"E116-H3K4me3.narrowPeak.gz",
"E116-H3K9ac.narrowPeak.gz",
"E116-H3K9me3.narrowPeak.gz",
"E116-H3K27ac.narrowPeak.gz",
"E116-H3K27me3.narrowPeak.gz",
"E116-H3K36me3.narrowPeak.gz",
"E116-H3K79me2.narrowPeak.gz",
"E116-H4K20me1.narrowPeak.gz"])
input: for_each = 'dataset', concurrent = True
output: f'{annotation_dir}/{_dataset:nn}.bed'
download: dest_dir = annotation_dir, expand = True
https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/{_dataset}
bash: expand = True, workdir = annotation_dir
zcat {_dataset} | cut -f1-3 | grep -v chrM | sort-bed - > {_output}
CTCF binding:
sos run analysis/20180712_Enrichment_Workflow.ipynb download_ctcf_annotation \
--trait $trait --y-data $ydata
[download_ctcf_annotation]
depends: executable('sort-bed')
output: f'{annotation_dir}/CTCF_binding.bed'
download: dest_dir = annotation_dir
https://www.ebi.ac.uk/birney-srv/CTCF-QTL/phenotypes/bindings.csv
bash: expand = True, workdir = annotation_dir
cut -f1-3 -d "," bindings.csv | tail -n+2 | sed 's/,/\t/g' | sed 's/^/chr/g' | grep -v chrM | sort-bed - > {_output}
Pol II binding:
sos run analysis/20180712_Enrichment_Workflow.ipynb download_polII_annotation \
--trait $trait --y-data $ydata
We need liftOver:
conda config --add channels bioconda
conda install ucsc-liftover
[download_polII_annotation]
depends: executable('sort-bed'), executable('liftOver')
dataset = "GSM487431_GM12878_Pol2_narrowPeak.bed.gz"
output: f'{annotation_dir}/PolII_binding.bed'
download: dest_dir = annotation_dir, expand=True
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM487nnn/GSM487431/suppl/{dataset}
download: dest_dir = annotation_dir
http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18ToHg19.over.chain.gz
bash: expand = True, workdir = annotation_dir
zcat {dataset} | cut -f1-3 | grep -v chrM | sort-bed - > {_output:n}.hg18.bed
bash: expand = True, workdir = annotation_dir
liftOver {_output:n}.hg18.bed hg18ToHg19.over.chain.gz {_output} unmapped
General annotations:
sos run analysis/20180712_Enrichment_Workflow.ipynb download_general_annotation --trait $trait \
--y-data $ydata
[download_general_annotation]
output: [f'{annotation_dir}/{x}.bed' for x in ['Coding_UCSC', 'Intron_UCSC']]
download: dest_dir = annotation_dir, decompress = True
https://stephenslab.github.io/susie-paper/downloads/Coding_UCSC.bed.gz
https://stephenslab.github.io/susie-paper/downloads/Intron_UCSC.bed.gz
Gene regions:
sos run analysis/20180712_Enrichment_Workflow.ipynb gene_regions \
--trait $trait --y-data $ydata
[gene_regions_refgene]
depends: Py_Module('pandas')
output: f'{annotation_dir}/genes.bed'
download: dest_dir = annotation_dir
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
python: expand = '${ }', workdir = annotation_dir
import pandas as pd
ref_gene = pd.read_table('refGene.txt.gz', compression="gzip", sep="\t",
header = None, usecols=(1,2,4,5,12),
names = ["tx_name", "chrom", "tx_start", "tx_end", "gene_name"])
ref_gene = ref_gene.drop_duplicates(subset=("chrom", "tx_start", "tx_end"))
ref_gene = pd.DataFrame(ref_gene.groupby(['chrom','gene_name']).agg({'tx_start': 'min', 'tx_end': 'max'}).to_records())
ref_gene.iloc[:,[0,2,3]].to_csv(${_output:r}, sep = "\t", header = False, index = False)
[gene_regions_1]
output: f'{annotation_dir}/gene_map.txt'
download: dest_dir = annotation_dir, decompress = True
https://stephenslab.github.io/susie-paper/downloads/gene_map.txt.gz
[gene_regions_2]
depends: executable('sort-bed')
output: f'{_input:n}.bed'
bash: expand = True, workdir = annotation_dir
cut {_input} -f2-4 | sed 's/^/chr/g' | grep -v chrM | sort-bed - > {_output}
Extended splice sites:
sos run analysis/20180712_Enrichment_Workflow.ipynb extended_splice_site \
--trait $trait --y-data $ydata
[extended_splice_site]
depends: executable('sort-bed')
input: f"{annotation_dir}/Intron_UCSC.bed"
output: f"{annotation_dir}/Extended_splice_site.bed"
python: expand = "${ }", workdir = annotation_dir
lines = [x.strip().split() for x in open(${_input:r}).readlines()]
out = []
for line in lines:
    # +/-5 bp around the intron start and +/-15 bp around the intron end
    out.append(f'{line[0]}\t{int(line[1])-5}\t{int(line[1])+5}')
    out.append(f'{line[0]}\t{int(line[2])-15}\t{int(line[2])+15}')
with open(${_output:r}, 'w') as f:
    f.write('\n'.join(out) + '\n')
bash: expand = True, workdir = annotation_dir
sort-bed {_output} > {_output:n}.sorted.txt
mv {_output:n}.sorted.txt {_output}
Extract summary statistics:
sos run analysis/20180712_Enrichment_Workflow.ipynb extract_sumstats \
--trait $trait --y-data $ydata -j $ncpu --cwd $cwd
# Extract summary stats from RDS files to plain text
[extract_sumstats_1]
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:dd}/enrichment/{_input:bn}.sumstats.gz'
R: expand = '${ }'
dat = readRDS(${_input:r})
pip = dat$pip
# the fit object records the path to its input data and the index of this locus
input_data = readRDS(dat$input)[[dat$idx]]
zscore = input_data$z_score
names = names(zscore)
# Reorder CS by decreasing purity (needed for earlier versions of susieR)
ordering = order(dat$sets$purity$min.abs.corr, decreasing=T)
dat$sets$cs = dat$sets$cs[ordering]
dat$sets$purity = dat$sets$purity[ordering,]
cs_id = rep(0, length(pip))
j = 1
for (i in names(dat$sets$cs)) {
cs_id[dat$sets$cs[[i]]] = j
j = j + 1
}
write.table(cbind(names,pip,zscore,cs_id), gzfile(${_output:r}), quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")
# Consolidate results to one file
[extract_sumstats_2]
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/SuSiE_loci.sumstats.gz'
bash: expand = True
zcat {_input} | gzip --best > {_output}
_input.zap()
[MW] zcat /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.gz | wc -l
1370381
[MW] zcat /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.gz | cut -f 4 | grep 1 | wc -l
38557
[MW] zcat /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.gz | cut -f 4 | grep 2 | wc -l
1868
[MW] zcat /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.gz | cut -f 4 | grep 3 | wc -l
85
So we have a total of 1370381 variants: 38557 in CS 1, 1868 in CS 2, and so on.
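Note the grep-based counts match any line containing the digit, which works here because CS labels are single digits. For an exact per-CS tally one can tabulate column 4 in R (a sketch, assuming the working directory holds the consolidated file above):
# Exact count of variants per CS label (column 4 is CS_ID; 0 = not in any CS).
dat = read.table(gzfile("SuSiE_loci.sumstats.gz"), sep = "\t")
table(dat[[4]])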
# Auxiliary step to get variant in bed format based on variant ID in z-score file
[zscore2bed_1]
depends: R_library('readr'), R_library('stringr'), R_library('dplyr')
parameter: in_file = path()
parameter: chr_prefix = ""
input: in_file
output: f'{_input:n}.bed.unsorted'
R: expand = "${ }", workdir = cwd, stdout = f'{_output:n}.stdout'
library(readr)
library(stringr)
library(dplyr)
var_file <- ${_input:r}
out_file <- ${_output:r}
variants <- read_tsv(var_file)
colnames(variants) = c('variant', 'pip', 'zscore', 'cs')
var_info <- str_split(variants$variant, "_")
variants <- mutate(variants, chr = paste0("${chr_prefix}", sapply(var_info, function(x){x[2]})),
pos = sapply(var_info, function(x){x[3]})) %>%
mutate(start = as.numeric(pos), stop=as.numeric(pos) + 1) %>%
select(chr, start, stop, variant)
options(scipen=1000) # so that positions are always fully written out
write.table(variants, file=out_file, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")
[zscore2bed_2]
depends: executable('sort-bed')
output: f'{_input:n}'
bash: expand = True, workdir = cwd
sort-bed {_input} > {_output}
_input.zap()
[get_variants: provides = '{data}.bed']
output: f'{data}.bed'
sos_run('zscore2bed', in_file = f'{_output:n}.gz')
Overlap variants with annotation regions:
sos run analysis/20180712_Enrichment_Workflow.ipynb range2var_annotation \
--trait $trait --y-data $ydata -j $ncpu \
--single-annot data/annotation.list
# Get variants in data that fall in target regions
[range2var_annotation_1]
depends: executable('bedops')
depends: f'{z_score:n}.bed'
input: set(paths(single_anno)), group_by = 1, concurrent = True
output: f'{out_dir}/{_input:bn}.{z_score:bn}.bed'
bash: expand = True, workdir = cwd
bedops -e {z_score:n}.bed {_input} > {_output}
# Make binary annotation file
[range2var_annotation_2]
depends: z_score, R_library('readr'), R_library('stringr'), R_library('dplyr')
input: group_by = 1, concurrent = True
output: f'{_input:n}.gz'
R: expand = "${ }", workdir = cwd, stderr = f'{_output:n}.stderr'
library(readr)
library(dplyr)
library(stringr)
variant_tsv <- ${z_score:r}
annotation_var_bed <- ${_input:r}
annot_name <- ${_input:bnr} %>% str_replace(paste0(".",${z_score:bnr}), "")
out_name <- ${_output:r}
vars <- read_tsv(variant_tsv, col_names=FALSE)[,1]
annot_vars = read_tsv(annotation_var_bed, col_names=FALSE)
names(vars) <- "SNP"
vars <- vars %>%
mutate(annot_d = case_when(SNP %in% annot_vars$X4 ~ 1,
TRUE ~ 0))
# remove possible duplicate rows
vars <- unique(vars)
names(vars)[2] <- annot_name
write.table(vars, file=gzfile(out_name),
col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)
To properly perform enrichment analysis we want to match the control SNPs with the SNPs of interest -- that is, SNPs inside CS -- in terms of LD, distance to the nearest gene, and MAF. The GREGOR software can generate lists of such matched SNPs. I will use SNPs inside CS as input and expect a list of output SNPs matched to these inputs.
GREGOR is released under a University of Michigan license, so I will not make it into a docker image; the path to a GREGOR installation is required instead. We also need reference files, prepared by:
cat \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01 \
> GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
tar zxvf GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
MD5SUM check:
AFR.part.0 ( MD5: 9926904128dd58d6bf1ad4f1e90638af )
AFR.part.1 ( MD5: c1d30aff89a584bfa8c1fa1bdc197f21 )
The input file can be a list of rs IDs or chr:pos identifiers. In SuSiE_loci.sumstats.gz we have this information; the step below extracts chr:pos from the variant IDs.
sos run analysis/20180712_Enrichment_Workflow.ipynb gregor --cs 1 \
--trait $trait --y-data $ydata -j $ncpu \
--single-annot data/annotation.list
sos run analysis/20180712_Enrichment_Workflow.ipynb gregor --cs 2 \
--trait $trait --y-data $ydata -j $ncpu \
--single-annot data/annotation.list
[gregor_1 (make SNP index)]
# Which set to test: 1 for the primary CS, any other value for the non-primary CS
parameter: cs = 1
input: z_score
output: f'{_input:nn}.set{cs}.txt', f'{_input:nn}.annotations.list'
bash: expand = '${ }'
zcat ${_input} | awk '{if (${"$4==1 && $2 > 0.05" if cs == 1 else "$4>1 && $2 > 0.05"}) print $0}' | cut -f 2,3 -d "_" | sed 's/_/:/g' | sort -u > ${_output[0]}
with open(_output[1], 'w') as f:
    f.write('\n'.join(single_anno))
[gregor_2 (make configuration file)]
parameter: gregor_db = path('/data/GREGOR_DB')
parameter: pop = 'AFR'
parameter: ncpu = 10
output: f'{_input[0]:n}.gregor.conf'
report: output = f'{_output}', expand = True
##############################################################################
# CHIPSEQ ENRICHMENT CONFIGURATION FILE
# This configuration file contains run-time configuration of
# CHIP_SEQ ENRICHMENT
###############################################################################
## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
###############################################################################
INDEX_SNP_FILE = {_input[0]}
BED_FILE_INDEX = {_input[1]}
REF_DIR = {gregor_db}
R2THRESHOLD = 0.7 ## must be greater than 0.7
LDWINDOWSIZE = 10000 ## must be less than 1MB; these two values define LD buddies
OUT_DIR = {_output:nn}_gregor_output
MIN_NEIGHBOR_NUM = 10 ## define the size of neighborhood
BEDFILE_IS_SORTED = true ## false, if the bed files are not sorted
POPULATION = {pop} ## define the population, you can specify EUR, AFR, AMR or ASN
TOPNBEDFILES = 2
JOBNUMBER = {ncpu}
###############################################################################
#BATCHTYPE = mosix ## submit jobs on MOSIX
#BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
###############################################################################
#BATCHTYPE = slurm ## submit jobs on SLURM
#BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
###############################################################################
BATCHTYPE = local ## run jobs on local machine
bash: expand = True
sed -i '/^$/d' {_output}
GREGOR is written in Perl. Some libraries are required to run it:
sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl
[gregor_3 (run gregor)]
parameter: gregor_path = path('/opt/GREGOR')
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True
rm -rf {_input:nn}_gregor_output && perl {gregor_path}/script/GREGOR.pl --conf {_input} && touch {_output}
[gregor_4 (format output)]
output: f'{_input:n}.csv'
bash: expand = True
sed 's/\t/,/g' {_input} > {_output}
dat = read.table('/home/gaow/Documents/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci_gregor_output/StatisticSummaryFile.csv', head=T, sep=',')
dat[order(dat$PValue),]
In the Li 2016 manuscript the sQTL analysis did not match control SNPs by LD, MAF, and distance to TSS. The only constraint on control SNPs was that they should be within intron clusters. To focus the analysis on these SNPs:
sos run analysis/20180712_Enrichment_Workflow.ipynb overlap_cluster \
--trait $trait --y-data $ydata
# Overlap z-score variants with intron clusters (here via coding + intron regions)
[overlap_cluster]
depends: executable('bedops')
input: f"{z_score:n}.bed"
output: f'{out_dir}/{y_data:bnn}.{z_score:bn}.overlap.bed'
bash: expand = True, workdir = cwd
# bedops -e {_input} {annotation_dir}/gene_map.bed | sort -u > {_output}
bedops -e {_input} <(cat {annotation_dir}/Coding_UCSC.bed {annotation_dir}/Intron_UCSC.bed | sort-bed -) | sort -u > {_output}
We ended up with 369K variants:
[GW] wc -l /home/gaow/Documents/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci_sumstats/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.SuSiE_loci.sumstats.overlap.bed
369134
Test whether PIP is larger inside an annotation than outside it:
sos run analysis/20180712_Enrichment_Workflow.ipynb pip_rank_test \
--trait $trait --y-data $ydata -j $ncpu \
--single-annot data/annotation.list
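A toy sketch of the rank test used in pip_rank_test_2 below, on simulated PIPs (illustrative only; the beta draws are arbitrary):
# Wilcoxon rank-sum test: are PIPs inside the annotation (GROUP = 1) larger?
set.seed(1)
toy = data.frame(PIP = c(rbeta(500, 1, 9), rbeta(500, 2, 8)),
                 GROUP = rep(0:1, each = 500))
wilcox.test(PIP ~ GROUP, data = toy, conf.int = TRUE)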
# Subset data
[pip_rank_test_1, cs_fisher_test_1]
depends: z_score, f'{out_dir}/{y_data:bnn}.{z_score:bn}.overlap.bed', R_library('readr'), R_library('stringr'), R_library('dplyr')
input_files = [f'{out_dir}/{value:bn}.{z_score:bn}.gz' for value in paths(single_anno)]
input: input_files, group_by = 1, concurrent = True
output: f'{_input:n}.data.rds'
R: expand = '${ }', workdir = cwd, stderr = f'{_output:n}.stderr'
set.seed(1)
library(readr)
library(dplyr)
variants <- read_tsv(${z_score:r}, col_names=FALSE)
colnames(variants) = c('SNP', 'PIP', 'Z', 'CS')
annotation <- read_tsv(${_input:r}, col_names=TRUE)
name = colnames(annotation)[2]
colnames(annotation) = c('SNP', 'GROUP')
cluster_subset <- read_tsv('${out_dir}/${y_data:bnn}.${z_score:bn}.overlap.bed', col_names=FALSE)
colnames(cluster_subset) = c("CHR", "START", "END", "SNP")
# restrict analysis to within genes
annotation = inner_join(cluster_subset, annotation, by = "SNP")[,c("SNP", "GROUP")]
# remove duplicates in variants but keep the largest CS label when possible
variants = as.data.frame(variants)
variants = variants[order(variants[,'SNP'],-variants[,'CS']),]
variants = variants[!duplicated(variants[,'SNP']),]
variants = inner_join(annotation, variants, by = "SNP")
saveRDS(list(name=name, variants=variants), ${_output:r})
# Test if PIP is larger in annotations
[pip_rank_test_2]
input: group_by = 1, concurrent = True
output: f'{_input:n}.pip_rank_test.csv'
R: expand = '${ }', workdir = cwd, stdout = f'{_output:n}.stdout'
dat = readRDS(${_input:r})
variants = dat$variants
name = dat$name
if (length(unique(variants$GROUP)) == 1) {
write(paste(name, NA, NA, NA, NA, sep=','), file = ${_output:r})
} else {
test = wilcox.test(PIP ~ GROUP, data=variants, conf.int=TRUE)
write(paste(name, test$p.value, test$estimate, test$conf.int[1], test$conf.int[2], sep=','), file = ${_output:r})
}
# Consolidate results to one file
[pip_rank_test_3,cs_fisher_test_3]
output: f"{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/SuSiE_loci.sumstats.{step_name.rsplit('_', 1)[0]}.csv"
bash: expand = True
cat {_input} > {_output}
_input.zap()
dat = read.table('/home/gaow/Documents/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.pip_rank_test.csv', head=F, sep=',')
colnames(dat) = c('annotation', 'p.value', 'estimate', 'conf.int.1', "conf.int.2")
dat[order(dat[,2]),]
Test for enrichment of annotations in CS:
sos run analysis/20180712_Enrichment_Workflow.ipynb cs_fisher_test \
--trait $trait --y-data $ydata -j $ncpu \
--single-annot data/annotation.list
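The test below contrasts CS membership with annotation membership in a 2x2 contingency table. A toy R sketch of the same structure, on simulated labels (purely illustrative, not the workflow data):
# Toy illustration of the Fisher test used below:
# rows = in/out of CS, columns = in/out of annotation.
set.seed(1)
toy = data.frame(CS = rbinom(1000, 1, 0.1), GROUP = rbinom(1000, 1, 0.3))
fisher.test(table(toy$CS, toy$GROUP))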
# Test if CS is enriched with annotations
[cs_fisher_test_2]
parameter: pip_cutoff = 0.2
input: group_by = 1, concurrent = True
output: f'{_input:n}.cs_fisher_test.csv'
R: expand = '${ }', workdir = cwd, stdout = f'{_output:n}.stdout'
run_test = function(dat) {
# primary CS vs. outside any CS: keep variants with CS label 0 or 1
d1 = dat[which(dat$CS < 2),]
# drop low-PIP variants inside the CS; logical indexing avoids dropping
# every row when no variant matches (unlike d1[-which(...),])
d1 = d1[!(d1$CS == 1 & d1$PIP < ${pip_cutoff}), ]
d1$PIP = NULL
d1 = table(d1)
test.d1 = fisher.test(d1)
# test for non-first CS only
d2 = dat[which(dat$CS != 1),]
d2$CS[which(d2$CS>0)] = 1
d2 = d2[!(d2$CS == 1 & d2$PIP < ${pip_cutoff}), ]
d2$PIP = NULL
d2 = table(d2)
test.d2 = fisher.test(d2)
return(c(test.d1$p.value, test.d1$estimate, test.d1$conf.int[1], test.d1$conf.int[2], test.d2$p.value, test.d2$estimate, test.d2$conf.int[1], test.d2$conf.int[2]))
}
dat = readRDS(${_input:r})
variants = dat$variants
name = dat$name
if (length(unique(variants$GROUP)) == 1) {
write(paste(name, NA, NA, NA, NA, NA, NA, NA, NA, sep=','), file = ${_output:r})
} else {
# test against all CS
test = run_test(variants[,c('CS', 'GROUP', 'PIP')])
write(paste(c(name, test), collapse=','), file = ${_output:r})
}
# Plot enrichment results
[cs_fisher_test_4]
max_or = 10
min_or = 0
depends: R_library("ggplot2")
output: f"{_input:n}.png"
pval_cutoff = 0.0001
R: expand = '${ }', workdir = cwd
library(ggplot2)
library(cowplot)
plot_or = function(dat, title, xlab= "Odds ratio", ylab="Annotations",color='#800000') {
p1 <- ggplot(dat, aes(y=annotation)) +
geom_point(aes(x=est), colour=color) + #plot OR
geom_segment(aes(x=lower,xend=upper,yend=annotation), alpha=0.6,color=color) + #plot confint
geom_text(aes(x=est,label=pvalue),hjust=-0.2,vjust=0,size=3)
p1 <- p1 + geom_vline(xintercept=1) + theme_bw() + labs(title=title, x=xlab, y=ylab) + xlim(${min_or},${max_or})
p1
}
dat = read.table(${_input:r}, head=F, sep=',')
colnames(dat) = c('annotation', 'pvalue.1', 'est.1', 'conf.1a', "conf.1b", 'pvalue.2', 'est.2', 'conf.2a', "conf.2b")
dat$annotation = gsub("_", " ", dat$annotation)
dat = dat[which(dat$pvalue.1 < ${pval_cutoff}), ]
dat1 = dat[, c('annotation', 'pvalue.1', 'est.1', 'conf.1a', "conf.1b")]
dat1 = dat1[order(dat1$pvalue.1),]
colnames(dat1) = c('annotation', 'pvalue', 'est', 'lower', "upper")
dat1$lower[which(dat1$lower < ${min_or})] = ${min_or}
dat1$upper[which(dat1$upper > ${max_or})] = ${max_or}
dat1$pvalue = paste0("p=", format.pval(dat1$pvalue,digits=1))
dat2 = dat[, c('annotation', 'pvalue.2', 'est.2', 'conf.2a', "conf.2b")]
dat2 = dat2[order(dat2$pvalue.2),]
colnames(dat2) = c('annotation', 'pvalue', 'est', 'lower', "upper")
dat2$lower[which(dat2$lower < ${min_or})] = ${min_or}
dat2$upper[which(dat2$upper > ${max_or})] = ${max_or}
dat2$pvalue = paste0("p=", format.pval(dat2$pvalue,digits=1))
png("${_output:n}_1.png", 4, 4, units = 'in', res = 500)
print(plot_or(dat1, "Primary signals"))
dev.off()
png("${_output:n}_2.png", 4, 4, units = 'in', res = 500)
print(plot_or(dat2, "Secondary signals", ylab="",color="#002b36"))
dev.off()
system("convert +append ${_output:n}_1.png ${_output:n}_2.png ${_output}")
# Plot enrichment results
[cs_fisher_test_new_attempt]
depends: R_library("ggplot2")
output: f"{_input:n}.png"
pval_cutoff = 0.0001
R: expand = '${ }', workdir = cwd
library(ggplot2)
library(cowplot)
plot_or = function(dat, title, xlab= "Odds ratio", ylab="Annotations") {
p1 <- ggplot(dat, aes(y=annotation)) +
geom_point(aes(x=est,color=cs)) + #plot OR
geom_segment(aes(x=lower,xend=upper,yend=annotation,color=cs),alpha=0.8) + #plot confint
geom_text(aes(x=est,label=pvalue),hjust=-0.2,vjust=0,size=3)
p1 <- p1 + geom_vline(xintercept=1) + theme_cowplot() + labs(title=title, x=xlab, y=ylab)
p1
}
dat = read.table(${_input:r}, head=F, sep=',')
colnames(dat) = c('annotation', 'pvalue.1', 'est.1', 'conf.1a', "conf.1b", 'pvalue.2', 'est.2', 'conf.2a', "conf.2b")
dat$annotation = gsub("_", " ", dat$annotation)
dat = dat[which(dat$pvalue.1 < ${pval_cutoff}), ]
dat1 = dat[, c('annotation', 'pvalue.1', 'est.1', 'conf.1a', "conf.1b")]
dat1 = dat1[order(dat1$pvalue.1),]
dat1 = cbind(dat1, 'Primary')
colnames(dat1) = c('annotation', 'pvalue', 'est', 'lower', "upper", "cs")
dat1$pvalue = paste0("p=", format.pval(dat1$pvalue,digits=1))
dat2 = dat[, c('annotation', 'pvalue.2', 'est.2', 'conf.2a', "conf.2b")]
dat2 = dat2[order(dat2$pvalue.2),]
dat2 = cbind(dat2, 'Secondary')
colnames(dat2) = c('annotation', 'pvalue', 'est', 'lower', "upper", "cs")
dat2$pvalue = paste0("p=", format.pval(dat2$pvalue,digits=1))
pdf("${_output:n}.pdf", width=5, height=5, pointsize=15)
print(plot_or(rbind(dat1, dat2), "", ylab=""))
dev.off()
system("convert -density 120 ${_output:n}.pdf ${_output}")
dat = read.table('/project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.cs_fisher_test.csv', head=F, sep=',')
colnames(dat) = c('annotation', 'p.value.1', 'est.1', 'conf.1a', "conf.1b", 'p.value.2', 'est.2', 'conf.2a', "conf.2b")
dat[order(dat[,2]),]
%preview /project/compbio/jointLCLs/results/SuSiE/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.cs_fisher_test.png
Download the Li 2016 results from the paper's companion data website and check their overlap with our SuSiE variants:
sos run analysis/20180712_Enrichment_Workflow.ipynb signal_overlap \
--trait $trait --y-data $ydata -j $ncpu --cwd $cwd
[signal_overlap_1 (download Li 2016 results)]
output: f'{cwd}/output_ASintron_PC3.txt'
download: dest_dir = f"{_output:d}"
http://eqtl.uchicago.edu/jointLCL/output_ASintron_PC3.txt
[signal_overlap_2]
depends: z_score
parameter: fdr = 0.0001
python: expand = "${ }"
from sos.utils import get_output
susie = set(get_output("zcat ${z_score} | cut -f2,3 -d'_' | sed 's/_/./g'").strip().split("\n"))
yang = get_output("awk '{if ($5 < ${fdr}) print $1,$2}' ${_input}").strip().split('\n')[1:]
yang_dict = dict()
for line in yang:
    line = line.split()
    if line[1] not in yang_dict:
        yang_dict[line[1]] = []
    yang_dict[line[1]].append(line[0])
intersects = []
for key in yang_dict:
    common = susie.intersection(yang_dict[key])
    intersects.append(len(common) > 0)
print(len(intersects))
print(sum(intersects))