This workflow document contains several pipelines, all written in the SoS Workflow language. The torus pipeline is based on snakemake pipelines originally written by Jean.
%revisions -s -n 10
sos run gwas_enrichment.ipynb -h
git clone https://github.com/gaow/fine-mapping.git
bed format annotation
sos run workflow/gwas_enrichment.ipynb range2var_annotation --z-score ... --single-annot ... --multi-annot ...
sos run workflow/gwas_enrichment.ipynb enrichment --z-score ... --single-annot ... --multi-annot ...
SoS
SoS introduction webpage
Installation:
pip install sos sos-notebook
SoS basic usage webpage
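If the SoS kernel does not show up in Jupyter after installation, registering it manually usually works (the standard sos-notebook post-install step):
python -m sos_notebook.install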
docker
Some steps in this pipeline use docker images we prepared. If you do not have docker installed, you can install it via:
curl -fsSL get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
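Note that the group change takes effect after you log out and log back in. To verify that docker works without sudo:
docker run --rm hello-world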
In this pipeline, bedops and torus are executed via a docker image we prepared and hosted. The pipeline should automatically download the image and run a docker container instance.
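If you prefer to fetch the images ahead of time, the two images used by the steps below can be pulled manually:
docker pull gaow/atac-gwas
docker pull gaow/deepann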
bed format annotation input
If you use this type of annotation file, you need to run steps 1 and 2.
Annotation files are in bed
format (for example: Promoter_UCSC.bed).
chr1 9873 16361
chr1 32610 36610
chr1 67090 71090
...
torus format annotation input
If you use this type of annotation file, you only need to run step 2.
Annotations are 0/1 binary indicators; for example, a few lines from Coding_UCSC_annotation.torus.gz:
SNP Coding_UCSC_d
9:140192349:G:C 0
9:140192576:C:T 0
9:140194020:G:A 0
9:140194735:G:T 1
9:140194793:A:G 1
9:140195019:C:A 1
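As a quick sanity check on such a file, you can count how many SNPs carry the annotation (a sketch using the example file above; the awk filter skips the header and keeps rows whose indicator is 1):
zcat Coding_UCSC_annotation.torus.gz | awk 'NR > 1 && $2 == 1' | wc -l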
There are many annotations one can use. For this workflow you should prepare two lists:
Annotations to be analyzed one at a time by torus: put their file name prefixes in data/single_annotations.txt
Annotations to be analyzed jointly by torus: put their file name prefixes in data/multi_annotations.txt
If you want to remove an annotation, just use # to comment it out in the list files.
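For illustration, a hypothetical data/single_annotations.txt could look as follows; each line is a file name prefix (the pipeline appends .bed itself), and the commented entry is skipped:
Promoter_UCSC
#Coding_UCSC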
%preview ../data/general_annotations.txt -l 10
Example GWAS data (European)
The last column indicates the LD chunk. LD chunk boundaries depend on chromosome and position; there are 1703 LD chunks in the European human genome.
chr1 10583 1892607 1
chr1 1892607 3582736 2
...
chr22 49824534 51243298 1703
Format: chr:pos:alt:ref ld_label z-score. 3 columns in total: chr:pos:alt:ref (chromosome, position, alternative allele and reference allele), ld_label (LD chunk label) and z-score (summary statistics).
Example data:
1:10583:A:G 1 0.116319
...
22:51228910:A:G 1703 -0.866894
All GWAS summary statistics have to be converted to this format before use.
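As a rough sketch of such a conversion, assume a hypothetical tab-separated input mygwas.sumstats.gz with a header line and columns chromosome, position, alternative allele, reference allele, LD chunk label and z-score (adjust the column indices to your own file):
zcat mygwas.sumstats.gz | tail -n +2 | awk -v OFS='\t' '{print $1":"$2":"$3":"$4, $5, $6}' | gzip > mygwas.zscore.gz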
GWAS SNP map, containing exactly the same SNPs as the GWAS file. Format: chrom.pos chr pos. 3 columns in total: chrom.pos (example: chr1.729679), chr and pos.
chr1.729679 1 729679
chr1.731718 1 731718
chr1.734349 1 734349
...
GWAS example link
1:729679:C:G 1 -1.2583
1:731718:T:C 1 0.9745
1:734349:T:C 1 1.0247
...
SNP map example link
chr1.729679 1 729679
chr1.731718 1 731718
chr1.734349 1 734349
...
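Because the SNP map contains exactly the same SNPs as the z-score file, it can be derived from it. A minimal sketch, assuming the z-score format above (the output name SCZGWAS.snpmap.gz is hypothetical):
zcat SCZGWAS.zscore.gz | awk -F'[:\t ]' -v OFS='\t' '{print "chr"$1"."$2, $1, $2}' | gzip > SCZGWAS.snpmap.gz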
Example: Promoter_UCSC.bed (download link). 22,436 lines in total. The first few lines are as follows:
chr1 9873 16361
chr1 32610 36610
chr1 67090 71090
...
You can edit and change the following 5 directories and file paths. First, edit and run the following 5 commands to define bash variables.
work_dir=/home/min/Documents/GWAS_ATAC
anno_dir=/home/min/Documents/GWAS_ATAC/bed_annotations
z_file=/home/min/Documents/GWAS_ATAC/SCZGWAS_zscore/SCZGWAS.zscore.gz
single=/home/min/GIT/fine-mapping/data/general_annotations.txt
git_path=/home/min/GIT/fine-mapping
Then run the following commands.
cd $git_path
sos run workflow/gwas_enrichment.ipynb range2var_annotation --cwd $work_dir --annotation_dir $anno_dir --z-score $z_file --single-annot $single
sos run workflow/gwas_enrichment.ipynb enrichment --cwd $work_dir --annotation_dir $anno_dir --z-score $z_file --single-annot $single
[global]
# working directory
parameter: cwd = path() # 1
# hg19 path
parameter: hg19 = path()
# Deepsea model for use with `deepsea_apply`
parameter: deepsea_model = path()
# Path to directory of annotation files
parameter: annotation_dir = path() # 2
# Path to z-score file
parameter: z_score = path() # 3
# Path to list of single annotations to use
parameter: single_annot = path() # 4
# Path to lists of multi-annotations to use
parameter: multi_annot = paths() # 5
try:
    single_anno = [f"{annotation_dir}/{x.split()[0]}.bed" for x in open(single_annot).readlines() if not x.startswith('#')]
    multi_anno = dict([(f'{y:bn}', [f"{annotation_dir}/{x.split()[0]}.bed" for x in open(y).readlines() if not x.startswith('#')]) for y in multi_annot])
except (FileNotFoundError, IsADirectoryError):
    single_anno = []
    multi_anno = dict()
out_dir = f'{cwd}/{z_score:bn}'.replace('.', '_')
# Auxiliary step to get variant in bed format based on variant ID in z-score file
[zscore2bed_1]
parameter: in_file = path()
input: in_file
output: f'{_input:n}.bed.unsorted'
R: expand = "${ }", container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(readr)
library(stringr)
library(dplyr)
var_file <- ${_input:r}
out_file <- ${_output:r}
variants <- read_tsv(var_file, col_names=FALSE)
colnames(variants) = c('variant', 'ld', 'zscore')
var_info <- str_split(variants$variant, ":")
variants <- mutate(variants, chr = paste0("chr", sapply(var_info, function(x){x[1]})),
pos = sapply(var_info, function(x){x[2]})) %>%
mutate(start = as.numeric(pos), stop=as.numeric(pos) + 1) %>%
select(chr, start, stop, variant)
options(scipen=1000) # So that positions are always fully written out
write.table(variants, file=out_file, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")
[zscore2bed_2]
output: f'{_input:n}'
bash: expand = True, container = 'gaow/atac-gwas', workdir = cwd
sort-bed {_input} > {_output}
[get_variants: provides = '{data}.bed']
output: f'{data}.bed'
sos_run('zscore2bed', in_file = f'{_output:n}.gz')
# Auxiliary step to merge multiple annotations
[merge_annotations]
parameter: out_file=path()
parameter: data_files=paths()
input: data_files
output: out_file
R: expand = '${ }', container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(readr)
library(dplyr)
library(stringr)
out_name <- ${_output:r}
annots <- c(${_input:r,})
var_annot <- read_tsv(annots[1], col_names=T)
for(i in 2:length(annots)){
var_annot2 <- read_tsv(annots[i], col_names=T)
stopifnot(all(var_annot$SNP == var_annot2$SNP))
var_annot <- cbind(var_annot, var_annot2[,2])
}
write.table(var_annot, file=gzfile(out_name),
row.names=FALSE, quote=FALSE, sep="\t")
torus format binary annotations
# Get variants in data that fall in target regions
[range2var_annotation_1]
depends: f'{z_score:n}.bed'
input: set(paths(single_anno + list(multi_anno.values()))), group_by = 1
output: f'{out_dir}/{_input:bn}.{z_score:bn}.bed'
bash: expand = True, container = 'gaow/atac-gwas', workdir = cwd
bedops -e {z_score:n}.bed {_input} > {_output}
# Make binary annotation file
[range2var_annotation_2]
parameter: discrete = 1
depends: z_score
output: f'{_input:n}.gz'
R: expand = "${ }", container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(readr)
library(dplyr)
library(stringr)
variant_tsv <- ${z_score:r}
annotation_var_bed <- ${_input:r}
annot_name <- ${_input:bnr} %>% str_replace(paste0(".",${z_score:bnr}), "")
out_name <- ${_output:r}
vars <- read_tsv(variant_tsv, col_names=FALSE)[,1]
annot_vars = read_tsv(annotation_var_bed, col_names=FALSE)
names(vars) <- "SNP"
vars <- vars %>%
mutate(annot_d = case_when(SNP %in% annot_vars$X4 ~ 1,
TRUE ~ 0))
names(vars)[2] <- paste0(annot_name, '${"_d" if discrete else "_c"}')
write.table(vars, file=gzfile(out_name),
col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)
# Obtain multiple annotations per SNP for enrichment analysis
[range2var_annotation_3]
for k, value in multi_anno.items():
    sos_run('merge_annotations', out_file = f'{out_dir}/{k}.{z_score:bn}.gz', data_files = [f'{out_dir}/{v:bn}.{z_score:bn}.gz' for v in paths(value)])
torus
# Run torus with annotation + z-score file
[enrichment_1 (run torus)]
depends: z_score
parameter: gmap = path(f"{cwd}/gene_map/gmap.gz")
input_files = [f'{out_dir}/{value:bn}.{z_score:bn}.gz' for value in paths(single_anno)] + [f'{out_dir}/{k}.{z_score:bn}.gz' for k in multi_anno]
fail_if(len(input_files) == 0, msg = "No annotations to use! Please use ``--single-annot`` and ``--multi-annot`` to pass annotation lists")
input: input_files, group_by = 1
output: f'{_input:n}.torus'
bash: container = 'gaow/atac-gwas', workdir = cwd, expand = True, stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'
rm -rf {_output:n}_prior
torus -d {z_score} -smap {_input:d}/smap.gz -gmap {gmap} -annot {_input} -est --load_zval -dump_prior {_output:n}_prior > {_output}
# Consolidate all torus result into one table
[enrichment_2 (make output table)]
input: group_by = 'all'
output: f'{out_dir}/{z_score:bn}.torus.merged.gz'
R: expand = '${ }', container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(dplyr)
library(readr)
library(stringr)
output_file <- ${_output:r}
names <- c(${_input:bnr,})
file_names <- c(${_input:r,})
tab <- data.frame("name"=names)
data <- lapply(file_names, function(f){
lns <- read_lines(f)
x <- unlist(str_split(lns[3], " "))
x <- x[str_length(x) > 0]
x})
tab[,"log odds ratio"] <- sapply(data, function(x){x[2]})
tab[,"95% ci"] <- sapply(data, function(x){paste0("(", x[3], ", ", x[4], ")")})
write.table(tab, file=gzfile(output_file), sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)
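To eyeball the consolidated table, something like the following works, assuming the example z-score file name (out_dir is derived in the global section by replacing dots with underscores):
zcat $work_dir/SCZGWAS_zscore/SCZGWAS.zscore.torus.merged.gz | column -t -s $'\t'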
Apply deepsea model to variants
Credits to Yanyu Liang. The required input is a variant list in the following format:
chr1 68090 68091 G T
chr1 68090 68091 G A
chr1 68090 68091 G C
chr1 68091 68092 C G
chr1 68091 68092 C T
chr1 68091 68092 C A
chr1 68092 68093 T A
chr1 68092 68093 T C
chr1 68092 68093 T G
chr1 68093 68094 A C
To run this workflow:
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants /path/to/variant/list.file
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/scz.swe.var.txt
It requires a lot of memory, so we tested with the first 10K lines of variants and it worked.
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/test.var.txt
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/test100K.var.txt
Break the 9,898,079 SNPs into two halves and run deepsea separately:
less scz.swe.var.txt | head -4949039 | gzip > half1.var.txt
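The second half can be produced analogously; a sketch that takes the remaining lines 4949040 through 9898079:
less scz.swe.var.txt | tail -n +4949040 | gzip > half2.var.txt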
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/half1.var.txt
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/half2.var.txt
[deepsea_apply_1 (prepare config file)]
parameter: variants = path()
input: variants
output: f'{_input:n}.config'
report: expand = True, output = _output
var_list: {_input:r}
pred_model:
  path: '{deepsea_model}'
  label:
    CN: 1
    DN: 2
    GA: 3
    ips: 4
    NSC: 5
reference_genome: '{hg19}'
script_dir: '/opt/deepann'
out_dir: {_output:dr}
out_prefix: '{_output:bn}_deepsea'
##### some default setup #####
#### usually don't change ####
check_allele: False
design: '499-1-500'
[deepsea_apply_2 (apply deepsea weights)]
output: f'{_input:n}.deepsea.list'
bash: expand = True, container = 'gaow/deepann', volumes = [f'{hg19:d}:{hg19:d}', f'{deepsea_model:d}:{deepsea_model:d}', f'{os.path.expanduser("~")}:/home/$USER'], extra_args = "-e HOME=/home/$USER -e USER=$USER"
snakemake --snakefile /opt/deepann/Snakefile --configfile {_input} && ls {_input:n}_deepsea/output__*.gz > {_output}
To properly perform enrichment analysis we want to match the control SNPs with the SNPs of interest -- that is, SNPs inside CS -- in terms of LD, distance to nearest gene and MAF. The GREGOR software can generate a list of matched SNPs. I will use SNPs inside CS as input and expect a list of output SNPs matching these inputs.
GREGOR is released under a University of Michigan license, so I will not make it into a docker image; the path to a local GREGOR directory is required. We also need reference files, prepared by:
cat \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01 \
> GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
tar zxvf GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
MD5SUM check:
AFR.part.0 ( MD5: 9926904128dd58d6bf1ad4f1e90638af )
AFR.part.1 ( MD5: c1d30aff89a584bfa8c1fa1bdc197f21 )
Do the same for the EUR data set.
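To verify the downloaded parts against the MD5 sums listed above:
md5sum GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01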
PGC3
sos run workflow/gwas_enrichment.ipynb gregor --gregor-input ~/Documents/GWAS_ATAC/scz2_zscore/gregor.scz2.txt.gz --single-annot data/gregor.txt
PGC2
sos run workflow/gwas_enrichment.ipynb gregor --gregor-input ~/Documents/GWAS_ATAC/pgc2_zscore/gregor.pgc2.txt.gz --single-annot data/gregor.txt
[gregor_1 (make SNP index)]
depends: executable('zcat')
parameter: gregor_input = path()
input: gregor_input
output: f'{_input:nn}.rsid.txt', f'{_input:nn}.annotations.list'
bash: expand = '${ }'
    zcat ${_input} | cut -f 2,3 -d "_" | sed 's/_/:/g' > ${_output[0]}
with open(_output[1], 'w') as f:
    f.write('\n'.join(single_anno))
[gregor_2 (make configuration file)]
depends: executable('sed')
parameter: gregor_db = path('~/Documents/hg19/GREGOR_DB')
parameter: pop = 'EUR'
output: f'{_input[0]:nn}.gregor.conf'
report: output = f'{_output}', expand = True
##############################################################################
# CHIPSEQ ENRICHMENT CONFIGURATION FILE
# This configuration file contains run-time configuration of
# CHIP_SEQ ENRICHMENT
###############################################################################
## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
###############################################################################
INDEX_SNP_FILE = {_input[0]}
BED_FILE_INDEX = {_input[1]}
REF_DIR = {gregor_db}
R2THRESHOLD = 0.7 ## must be greater than 0.7
LDWINDOWSIZE = 10000 ## must be less than 1MB; these two values define LD buddies
OUT_DIR = {_output:nn}_gregor_output
MIN_NEIGHBOR_NUM = 10 ## define the size of neighborhood
BEDFILE_IS_SORTED = true ## false, if the bed files are not sorted
POPULATION = {pop} ## define the population, you can specify EUR, AFR, AMR or ASN
TOPNBEDFILES = 2
JOBNUMBER = 10
###############################################################################
#BATCHTYPE = mosix ## submit jobs on MOSIX
#BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
###############################################################################
#BATCHTYPE = slurm ## submit jobs on SLURM
#BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
###############################################################################
BATCHTYPE = local ## run jobs on local machine
bash: expand = True
sed -i '/^$/d' {_output}
GREGOR is written in perl. Some libraries are required before one can run GREGOR:
sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl
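To confirm the modules are importable afterwards:
perl -MDBI -MSwitch -MDBD::SQLite -e 'print "ok\n"'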
[gregor_3 (run gregor)]
depends: executable('perl')
parameter: gregor_path = path('~/Documents/GREGOR')
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True
perl {gregor_path}/script/GREGOR.pl --conf {_input} && touch {_output}
[gregor_4 (format output)]
depends: executable('sed')
output: f'{_input:n}.csv'
bash: expand = True
sed 's/\t/,/g' {_input} > {_output}
Exported from workflow/gwas_enrichment.ipynb
committed by minqiao on Fri Jul 12 11:48:55 2019 revision 9, c02e3e9