This workflow document contains several pipelines, all written in the SoS Workflow language. The torus pipeline is based on snakemake pipelines originally written by Jean.
%revisions -s -n 10
sos run gwas_enrichment.ipynb -h
git clone https://github.com/gaow/fine-mapping.git
bed format annotation
sos run workflow/gwas_enrichment.ipynb range2var_annotation --z-score ... --single-annot ... --multi-annot ...
sos run workflow/gwas_enrichment.ipynb enrichment --z-score ... --single-annot ... --multi-annot ...
SoS
SoS introduction webpage
Installation:
pip install sos sos-notebook
SoS basic usage webpage
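If the SoS kernel does not show up in Jupyter after installation, registering it manually usually works (the standard sos-notebook post-install step):
python -m sos_notebook.install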
docker
Some steps in this pipeline use docker images we prepared. If you do not have docker installed, you can install it via:
curl -fsSL get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
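Note that the group change takes effect after you log out and log back in. To verify that docker works without sudo:
docker run --rm hello-world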
In this pipeline, bedops and torus are executed via a docker image we prepared and hosted. The pipeline should automatically download the image and run a docker container instance.
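If you prefer to fetch the images ahead of time, the two images used by the steps below can be pulled manually:
docker pull gaow/atac-gwas
docker pull gaow/deepann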
bed format annotation input
If you use this type of annotation file, you need to run steps 1 and 2.
Annotation files are in bed
format (for example: Promoter_UCSC.bed).
chr1 9873 16361
chr1 32610 36610
chr1 67090 71090
...
torus format annotation input
If you use this type of annotation file, you only need to run step 2.
Annotations are 0/1 binary indicators; for example, a few lines from Coding_UCSC_annotation.torus.gz:
SNP Coding_UCSC_d
9:140192349:G:C 0
9:140192576:C:T 0
9:140194020:G:A 0
9:140194735:G:T 1
9:140194793:A:G 1
9:140195019:C:A 1
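As a quick sanity check on such a file, you can count how many SNPs carry the annotation (a sketch using the example file above; the awk filter skips the header and keeps rows whose indicator is 1):
zcat Coding_UCSC_annotation.torus.gz | awk 'NR > 1 && $2 == 1' | wc -l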
There are many annotations one can use. For this workflow you should prepare two lists:
Annotations to be analyzed one at a time by torus: put their file name prefixes in data/single_annotations.txt
Annotations to be analyzed jointly by torus: put their file name prefixes in data/multi_annotations.txt
If you want to remove an annotation, just use # to comment it out in the list files.
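For illustration, a hypothetical data/single_annotations.txt could look as follows; each line is a file name prefix (the pipeline appends .bed itself), and the commented entry is skipped:
Promoter_UCSC
#Coding_UCSC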
%preview ../data/general_annotations.txt -l 10
Example GWAS data (European)
The last column indicates the LD chunk. LD chunk boundaries depend on chromosome and position; there are 1703 LD chunks in the European human genome.
chr1 10583 1892607 1
chr1 1892607 3582736 2
...
chr22 49824534 51243298 1703
Format: chr:pos:alt:ref ld_label z-score. 3 columns in total: chr:pos:alt:ref (chromosome, position, alternative allele and reference allele), ld_label (LD chunk label) and z-score (summary statistics).
Example data:
1:10583:A:G 1 0.116319
...
22:51228910:A:G 1703 -0.866894
All GWAS summary statistics have to be converted to this format before use.
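As a rough sketch of such a conversion, assume a hypothetical tab-separated input mygwas.sumstats.gz with a header line and columns chromosome, position, alternative allele, reference allele, LD chunk label and z-score (adjust the column indices to your own file):
zcat mygwas.sumstats.gz | tail -n +2 | awk -v OFS='\t' '{print $1":"$2":"$3":"$4, $5, $6}' | gzip > mygwas.zscore.gz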
GWAS SNP map, containing exactly the same SNPs as the GWAS file. Format: chrom.pos chr pos. 3 columns in total: chrom.pos (example: chr1.729679), chr and pos.
chr1.729679 1 729679
chr1.731718 1 731718
chr1.734349 1 734349
...
GWAS example link
1:729679:C:G 1 -1.2583
1:731718:T:C 1 0.9745
1:734349:T:C 1 1.0247
...
SNP map example link
chr1.729679 1 729679
chr1.731718 1 731718
chr1.734349 1 734349
...
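Because the SNP map contains exactly the same SNPs as the z-score file, it can be derived from it. A minimal sketch, assuming the z-score format above (the output name SCZGWAS.snpmap.gz is hypothetical):
zcat SCZGWAS.zscore.gz | awk -F'[:\t ]' -v OFS='\t' '{print "chr"$1"."$2, $1, $2}' | gzip > SCZGWAS.snpmap.gz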
Example: Promoter_UCSC.bed (download link). 22,436 lines in total. The first few lines are as follows:
chr1 9873 16361
chr1 32610 36610
chr1 67090 71090
...
You can edit and change the following 5 directories and file paths. First, edit and run the following 5 commands to define bash variables.
work_dir=/home/min/Documents/GWAS_ATAC
anno_dir=/home/min/Documents/GWAS_ATAC/bed_annotations
z_file=/home/min/Documents/GWAS_ATAC/SCZGWAS_zscore/SCZGWAS.zscore.gz
single=/home/min/GIT/fine-mapping/data/general_annotations.txt
git_path=/home/min/GIT/fine-mapping
Then run the following commands.
cd $git_path
sos run workflow/gwas_enrichment.ipynb range2var_annotation --cwd $work_dir --annotation_dir $anno_dir --z-score $z_file --single-annot $single
sos run workflow/gwas_enrichment.ipynb enrichment --cwd $work_dir --annotation_dir $anno_dir --z-score $z_file --single-annot $single
[global]
# working directory
parameter: cwd = path() # 1
# hg19 path
parameter: hg19 = path()
# Deepsea model for use with `deepsea_apply`
parameter: deepsea_model = path()
# Path to directory of annotation files
parameter: annotation_dir = path() # 2
# Path to z-score file
parameter: z_score = path() # 3
# Path to list of single annotations to use
parameter: single_annot = path() # 4
# Path to lists of multi-annotations to use
parameter: multi_annot = paths() # 5
try:
    single_anno = [f"{annotation_dir}/{x.split()[0]}.bed" for x in open(single_annot).readlines() if not x.startswith('#')]
    multi_anno = dict([(f'{y:bn}', [f"{annotation_dir}/{x.split()[0]}.bed" for x in open(y).readlines() if not x.startswith('#')]) for y in multi_annot])
except (FileNotFoundError, IsADirectoryError):
    single_anno = []
    multi_anno = dict()
out_dir = f'{cwd}/{z_score:bn}'.replace('.', '_')
# Auxiliary step to get variant in bed format based on variant ID in z-score file
[zscore2bed_1]
parameter: in_file = path()
input: in_file
output: f'{_input:n}.bed.unsorted'
R: expand = "${ }", container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(readr)
library(stringr)
library(dplyr)
var_file <- ${_input:r}
out_file <- ${_output:r}
variants <- read_tsv(var_file, col_names=FALSE)
colnames(variants) = c('variant', 'ld', 'zscore')
var_info <- str_split(variants$variant, ":")
variants <- mutate(variants, chr = paste0("chr", sapply(var_info, function(x){x[1]})),
pos = sapply(var_info, function(x){x[2]})) %>%
mutate(start = as.numeric(pos), stop=as.numeric(pos) + 1) %>%
select(chr, start, stop, variant)
options(scipen=1000) # So that positions are always fully written out
write.table(variants, file=out_file, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")
[zscore2bed_2]
output: f'{_input:n}'
bash: expand = True, container = 'gaow/atac-gwas', workdir = cwd
sort-bed {_input} > {_output}
[get_variants: provides = '{data}.bed']
output: f'{data}.bed'
sos_run('zscore2bed', in_file = f'{_output:n}.gz')
# Auxiliary step to merge multiple annotations
[merge_annotations]
parameter: out_file=path()
parameter: data_files=paths()
input: data_files
output: out_file
R: expand = '${ }', container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(readr)
library(dplyr)
library(stringr)
out_name <- ${_output:r}
annots <- c(${_input:r,})
var_annot <- read_tsv(annots[1], col_names=T)
for(i in 2:length(annots)){
var_annot2 <- read_tsv(annots[i], col_names=T)
stopifnot(all(var_annot$SNP == var_annot2$SNP))
var_annot <- cbind(var_annot, var_annot2[,2])
}
write.table(var_annot, file=gzfile(out_name),
row.names=FALSE, quote=FALSE, sep="\t")
torus format binary annotations
# Get variants in data that fall in target regions
[range2var_annotation_1]
depends: f'{z_score:n}.bed'
input: set(paths(single_anno + list(multi_anno.values()))), group_by = 1
output: f'{out_dir}/{_input:bn}.{z_score:bn}.bed'
bash: expand = True, container = 'gaow/atac-gwas', workdir = cwd
bedops -e {z_score:n}.bed {_input} > {_output}
# Make binary annotation file
[range2var_annotation_2]
parameter: discrete = 1
depends: z_score
output: f'{_input:n}.gz'
R: expand = "${ }", container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(readr)
library(dplyr)
library(stringr)
variant_tsv <- ${z_score:r}
annotation_var_bed <- ${_input:r}
annot_name <- ${_input:bnr} %>% str_replace(paste0(".",${z_score:bnr}), "")
out_name <- ${_output:r}
vars <- read_tsv(variant_tsv, col_names=FALSE)[,1]
annot_vars = read_tsv(annotation_var_bed, col_names=FALSE)
names(vars) <- "SNP"
vars <- vars %>%
mutate(annot_d = case_when(SNP %in% annot_vars$X4 ~ 1,
TRUE ~ 0))
names(vars)[2] <- paste0(annot_name, '${"_d" if discrete else "_c"}')
write.table(vars, file=gzfile(out_name),
col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)
# Obtain multiple annotations per SNP for enrichment analysis
[range2var_annotation_3]
for k, value in multi_anno.items():
    sos_run('merge_annotations', out_file = f'{out_dir}/{k}.{z_score:bn}.gz', data_files = [f'{out_dir}/{v:bn}.{z_score:bn}.gz' for v in paths(value)])
torus
# Run torus with annotation + z-score file
[enrichment_1 (run torus)]
depends: z_score
parameter: gmap = path(f"{cwd}/gene_map/gmap.gz")
input_files = [f'{out_dir}/{value:bn}.{z_score:bn}.gz' for value in paths(single_anno)] + [f'{out_dir}/{k}.{z_score:bn}.gz' for k in multi_anno]
fail_if(len(input_files) == 0, msg = "No annotations to use! Please use ``--single-annot`` and ``--multi-annot`` to pass annotation lists")
input: input_files, group_by = 1
output: f'{_input:n}.torus'
bash: container = 'gaow/atac-gwas', workdir = cwd, expand = True, stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'
rm -rf {_output:n}_prior
torus -d {z_score} -smap {_input:d}/smap.gz -gmap {gmap} -annot {_input} -est --load_zval -dump_prior {_output:n}_prior > {_output}
# Consolidate all torus result into one table
[enrichment_2 (make output table)]
input: group_by = 'all'
output: f'{out_dir}/{z_score:bn}.torus.merged.gz'
R: expand = '${ }', container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
library(dplyr)
library(readr)
library(stringr)
output_file <- ${_output:r}
names <- c(${_input:bnr,})
file_names <- c(${_input:r,})
tab <- data.frame("name"=names)
data <- lapply(file_names, function(f){
lns <- read_lines(f)
x <- unlist(str_split(lns[3], " "))
x <- x[str_length(x) > 0]
x})
tab[,"log odds ratio"] <- sapply(data, function(x){x[2]})
tab[,"95% ci"] <- sapply(data, function(x){paste0("(", x[3], ", ", x[4], ")")})
write.table(tab, file=gzfile(output_file), sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)
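To eyeball the consolidated table, something like the following works, assuming the example z-score file name (out_dir is derived in the global section by replacing dots with underscores):
zcat $work_dir/SCZGWAS_zscore/SCZGWAS.zscore.torus.merged.gz | column -t -s $'\t'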
Apply deepsea model to variants
Credits to Yanyu Liang. The required input is a variant list in the following format:
chr1 68090 68091 G T
chr1 68090 68091 G A
chr1 68090 68091 G C
chr1 68091 68092 C G
chr1 68091 68092 C T
chr1 68091 68092 C A
chr1 68092 68093 T A
chr1 68092 68093 T C
chr1 68092 68093 T G
chr1 68093 68094 A C
To run this workflow:
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants /path/to/variant/list.file
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/scz.swe.var.txt
It requires a lot of memory, so we tested with the first 10K lines of variants and it worked.
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/test.var.txt
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/test100K.var.txt
Break the 9,898,079 SNPs into two halves and run deepsea separately:
less scz.swe.var.txt | head -4949039 | gzip > half1.var.txt
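The second half can be produced analogously; a sketch that takes the remaining lines 4949040 through 9898079:
less scz.swe.var.txt | tail -n +4949040 | gzip > half2.var.txt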
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/half1.var.txt
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/half2.var.txt
[deepsea_apply_1 (prepare config file)]
parameter: variants = path()
input: variants
output: f'{_input:n}.config'
report: expand = True, output = _output
var_list: {_input:r}
pred_model:
  path: '{deepsea_model}'
  label:
    CN: 1
    DN: 2
    GA: 3
    ips: 4
    NSC: 5
reference_genome: '{hg19}'
script_dir: '/opt/deepann'
out_dir: {_output:dr}
out_prefix: '{_output:bn}_deepsea'
##### some default setup #####
#### usually don't change ####
check_allele: False
design: '499-1-500'
[deepsea_apply_2 (apply deepsea weights)]
output: f'{_input:n}.deepsea.list'
bash: expand = True, container = 'gaow/deepann', volumes = [f'{hg19:d}:{hg19:d}', f'{deepsea_model:d}:{deepsea_model:d}', f'{os.path.expanduser("~")}:/home/$USER'], extra_args = "-e HOME=/home/$USER -e USER=$USER"
snakemake --snakefile /opt/deepann/Snakefile --configfile {_input} && ls {_input:n}_deepsea/output__*.gz > {_output}
To properly perform enrichment analysis we want to match the control SNPs with the SNPs of interest -- that is, SNPs inside CS -- in terms of LD, distance to nearest gene and MAF. The GREGOR software can generate a list of matched SNPs. I will use SNPs inside CS as input and expect a list of output SNPs matching these inputs.
GREGOR is released under a University of Michigan license, so I will not make it into a docker image; the path to a local GREGOR directory is required. We also need reference files, prepared by:
cat \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01 \
> GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
tar zxvf GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
MD5SUM check:
AFR.part.0 ( MD5: 9926904128dd58d6bf1ad4f1e90638af )
AFR.part.1 ( MD5: c1d30aff89a584bfa8c1fa1bdc197f21 )
Do the same for the EUR data set.
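To verify the downloaded parts against the MD5 sums listed above:
md5sum GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01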
PGC3
sos run workflow/gwas_enrichment.ipynb gregor --gregor-input ~/Documents/GWAS_ATAC/scz2_zscore/gregor.scz2.txt.gz --single-annot data/gregor.txt
PGC2
sos run workflow/gwas_enrichment.ipynb gregor --gregor-input ~/Documents/GWAS_ATAC/pgc2_zscore/gregor.pgc2.txt.gz --single-annot data/gregor.txt
[gregor_1 (make SNP index)]
depends: executable('zcat')
parameter: gregor_input = path()
input: gregor_input
output: f'{_input:nn}.rsid.txt', f'{_input:nn}.annotations.list'
bash: expand = '${ }'
    zcat ${_input} | cut -f 2,3 -d "_" | sed 's/_/:/g' > ${_output[0]}
with open(_output[1], 'w') as f:
    f.write('\n'.join(single_anno))
[gregor_2 (make configuration file)]
depends: executable('sed')
parameter: gregor_db = path('~/Documents/hg19/GREGOR_DB')
parameter: pop = 'EUR'
output: f'{_input[0]:nn}.gregor.conf'
report: output = f'{_output}', expand = True
##############################################################################
# CHIPSEQ ENRICHMENT CONFIGURATION FILE
# This configuration file contains run-time configuration of
# CHIP_SEQ ENRICHMENT
###############################################################################
## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
###############################################################################
INDEX_SNP_FILE = {_input[0]}
BED_FILE_INDEX = {_input[1]}
REF_DIR = {gregor_db}
R2THRESHOLD = 0.7 ## must be greater than 0.7
LDWINDOWSIZE = 10000 ## must be less than 1MB; these two values define LD buddies
OUT_DIR = {_output:nn}_gregor_output
MIN_NEIGHBOR_NUM = 10 ## define the size of neighborhood
BEDFILE_IS_SORTED = true ## false, if the bed files are not sorted
POPULATION = {pop} ## define the population, you can specify EUR, AFR, AMR or ASN
TOPNBEDFILES = 2
JOBNUMBER = 10
###############################################################################
#BATCHTYPE = mosix ## submit jobs on MOSIX
#BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
###############################################################################
#BATCHTYPE = slurm ## submit jobs on SLURM
#BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
###############################################################################
BATCHTYPE = local ## run jobs on local machine
bash: expand = True
sed -i '/^$/d' {_output}
GREGOR is written in perl. Some libraries are required before one can run GREGOR:
sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl
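To confirm the modules are importable afterwards:
perl -MDBI -MSwitch -MDBD::SQLite -e 'print "ok\n"'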
[gregor_3 (run gregor)]
depends: executable('perl')
parameter: gregor_path = path('~/Documents/GREGOR')
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True
perl {gregor_path}/script/GREGOR.pl --conf {_input} && touch {_output}
[gregor_4 (format output)]
depends: executable('sed')
output: f'{_input:n}.csv'
bash: expand = True
sed 's/\t/,/g' {_input} > {_output}
Exported from workflow/gwas_enrichment.ipynb
committed by minqiao on Fri Jul 12 11:48:55 2019 revision 9, c02e3e9