fine-mapping

Enrichment analysis workflows

This document contains several pipelines, all written in the SoS Workflow language. The torus pipeline is based on snakemake pipelines originally written by Jean.

In [1]:
%revisions -s -n 10
Revision Author Date Message
43ca235 Min Qiao 2019-07-09 update directories info
53345f3 Min Qiao 2019-07-09 remove unnecessary depends
b6dec46 Gao Wang 2019-07-09 Replace torus instruction with docker installation
f7760f4 Gao Wang 2019-07-09 Update torus installation instruction
9840669 Gao Wang 2019-07-09 Build website
4e3cda4 minqiao 2019-07-09 update pipeline
8200a37 Gao Wang 2019-07-07 Add enrichment workflow
In [3]:
sos run gwas_enrichment.ipynb -h
usage: sos run gwas_enrichment.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  zscore2bed
  get_variants
  merge_annotations
  range2var_annotation
  enrichment
  deepsea_apply
  gregor

Global Workflow Options:
  --cwd . (as path)
                        working directory
  --hg19 . (as path)
                        hg19 path
  --deepsea-model . (as path)
                        Deepsea model for use with `deepsea_apply`
  --annotation-dir . (as path)
                        Path to directory of annotation files
  --z-score . (as path)
                        Path to z-score file
  --single-annot . (as path)
                        Path to list of single annotations to use
  --multi-annot  paths() # 5

                        Path to lists of multi-annotations to use

Sections
  zscore2bed_1:         Auxiliary step to get variants in bed format based on
                        variant IDs in z-score file
    Workflow Options:
      --in-file . (as path)
  zscore2bed_2:
  get_variants:
  merge_annotations:    Auxiliary step to merge multiple annotations
    Workflow Options:
      --out-file . (as path)
      --data-files paths()

  range2var_annotation_1: Get variants in data that fall within the target region
  range2var_annotation_2: Make binary annotation file
    Workflow Options:
      --discrete 1 (as int)
  range2var_annotation_3: Obtain multiple annotations per SNP for enrichment
                        analysis
  enrichment_1:         Run torus with annotation + z-score file
    Workflow Options:
      --gmap  path(f"{cwd}/gene_map/gmap.gz") 

  enrichment_2:         Consolidate all torus results into one table
  deepsea_apply_1:
    Workflow Options:
      --variants . (as path)
  deepsea_apply_2:
  gregor_1:
    Workflow Options:
      --gregor-input . (as path)
  gregor_2:
    Workflow Options:
      --gregor-db /home/min/Documents/hg19/GREGOR_DB (as path)
      --pop EUR
  gregor_3:
    Workflow Options:
      --gregor-path /home/min/Documents/GREGOR (as path)
  gregor_4:

Clone this git repository

git clone https://github.com/gaow/fine-mapping.git

To run enrichment analysis (overview)

Step 1 (start from bed format annotation)

sos run workflow/gwas_enrichment.ipynb range2var_annotation --z-score ... --single-annot ... --multi-annot ...

Step 2 (start from binary annotation for SNPs)

sos run workflow/gwas_enrichment.ipynb enrichment --z-score ... --single-annot ... --multi-annot ...

Software installation

SoS

SoS introduction webpage

Installation:

pip install sos sos-notebook

SoS basic usage webpage

docker

Some steps in this pipeline use docker images we have prepared. If you do not have docker installed, you can install it as follows:

  • Run commands below:
curl -fsSL get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
  • Log out and log back in (no need to reboot the computer).

In this pipeline, bedops and torus are executed via a docker image we have prepared and hosted. The pipeline will automatically download the image and run a docker container instance.
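
To verify that docker works for your user account, you can run the standard hello-world test image:

docker run hello-world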

Reference data preparation

hg19.fa

  • Needed only for the deepsea and gregor steps; if you only run the enrichment analysis, you do not need to download it.
  • Genome Reference: hg19 (GRCh37) or hg38 (GRCh38).
  • Download link

Data specification

Annotation specification

bed format annotation input

If you use this type of annotation file, you need to run steps 1 and 2.

Annotation files are in bed format (for example: Promoter_UCSC.bed).

    chr1    9873    16361
    chr1    32610   36610
    chr1    67090   71090
    ...
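
bedops, which the pipeline runs inside the gaow/atac-gwas docker image, expects sorted bed input. As a hedged sketch (file names are illustrative), an unsorted annotation file can be sorted with sort-bed from the same image:

docker run --rm -v $PWD:$PWD -w $PWD gaow/atac-gwas \
    sort-bed Promoter_UCSC.bed > Promoter_UCSC.sorted.bed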

Binary annotations of interest

If you use this type of annotation file, you only need to run step 2.

Annotations are 0/1 binary indicators. For example, the first few lines of Coding_UCSC_annotation.torus.gz:

SNP     Coding_UCSC_d
9:140192349:G:C    0
9:140192576:C:T    0
9:140194020:G:A    0
9:140194735:G:T    1
9:140194793:A:G    1
9:140195019:C:A    1


There are many annotations one can use. For this workflow you should prepare two lists:

  • For the annotations you'd like to use in single-variable torus analysis, put their file name prefixes in data/single_annotations.txt
  • For the annotations you'd like to use in multi-variable torus analysis, put their file name prefixes in data/multi_annotations.txt

To exclude an annotation, simply comment it out with # in the list files.

In [4]:
%preview ../data/general_annotations.txt -l 10
> ../data/general_annotations.txt (100 B):
6 lines
Coding_UCSC
Promoter_UCSC
Repressed_Hoffman
Conserved_LindbladToh
DHS_peaks_Trynka
#FetalDHS_Trynka

GWAS z-scores specification

GWAS data of interest

Example GWAS data (European)

SCZ Sweden data

PTSD European Ancestry

LD block files

The last column indicates the LD chunk. LD chunks are defined by chromosome and position; there are 1703 LD chunks in the European human genome.

            chr1    10583       1892607     1
            chr1    1892607     3582736     2
            ...
            chr22   49824534    51243298    1703

GWAS input format

Format chr:pos:alt:ref ld_label z-score, 3 columns in total: chr:pos:alt:ref (chromosome, position, alternative allele and reference allele), ld_label (LD chunk label) and z-score (the summary statistic).

Example data:

1:10583:A:G        1      0.116319
...
22:51228910:A:G    1703   -0.866894

All GWAS summary statistics have to be converted to this format before use.
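
As a hedged illustration only (not part of the pipeline), suppose summary statistics come as a 5-column file sumstats.txt with columns chr pos ref alt z; this layout and the file names are assumptions for the example. LD labels can then be assigned using the LD block file shown above:

# illustrative sketch: SNPs outside any block are dropped; a linear scan per
# SNP is slow for millions of variants (an interval tool such as bedops
# would be faster)
awk 'BEGIN{OFS="\t"}
     NR==FNR {c[NR]=$1; s[NR]=$2; e[NR]=$3; l[NR]=$4; n=NR; next}
     {for (i=1; i<=n; i++)
        if ("chr"$1==c[i] && $2>=s[i] && $2<e[i]) {
          print $1":"$2":"$4":"$3, l[i], $5; break}}' \
    ld_blocks.bed sumstats.txt | gzip > zscore.gz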

SNP map

GWAS SNP map, containing the same number of SNPs as the GWAS input. Note that the enrichment_1 step reads this map as smap.gz from the output directory (it is passed to torus via -smap).

Format chrom.pos chr pos. 3 columns in total: chrom.pos (example: chr1.729679), chr and pos.

chr1.729679 1   729679
chr1.731718 1   731718
chr1.734349 1   734349
...
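
As a hedged sketch, given a z-score file in the format above (file names are placeholders), the SNP map can be derived directly from the variant IDs:

# illustrative: split chr:pos:alt:ref IDs and emit chrom.pos, chr, pos;
# place the result as smap.gz in the pipeline's output directory
zcat zscore.gz \
    | awk -F'[:\t]' 'BEGIN{OFS="\t"}{print "chr"$1"."$2, $1, $2}' \
    | gzip > smap.gz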

Minimal working example

GWAS

GWAS example link

1:729679:C:G    1   -1.2583
1:731718:T:C    1   0.9745
1:734349:T:C    1   1.0247
...

SNP map

SNP map example link

chr1.729679 1   729679
chr1.731718 1   731718
chr1.734349 1   734349
...

Functional annotation

Example: Promoter_UCSC.bed download link. 22,436 lines in total. First few lines are as follows:

    chr1    9873    16361
    chr1    32610   36610
    chr1    67090   71090
    ...

LD blocks

  • LD block reference paper link.

  • Download link. Used to identify LD chunk for each GWAS SNP.

Workflow

You can edit and change the following 5 directory and file paths. First, edit and run the following 5 commands to define bash variables.

work_dir=/home/min/Documents/GWAS_ATAC
anno_dir=/home/min/Documents/GWAS_ATAC/bed_annotations
z_file=/home/min/Documents/GWAS_ATAC/SCZGWAS_zscore/SCZGWAS.zscore.gz
single=/home/min/GIT/fine-mapping/data/general_annotations.txt
git_path=/home/min/GIT/fine-mapping

Then run the following commands.

cd $git_path
sos run workflow/gwas_enrichment.ipynb range2var_annotation --cwd $work_dir --annotation_dir $anno_dir --z-score $z_file --single-annot $single
sos run workflow/gwas_enrichment.ipynb enrichment --cwd $work_dir --annotation_dir $anno_dir --z-score $z_file --single-annot $single
In [6]:
[global]
# working directory
parameter: cwd = path() # 1
# hg19 path
parameter: hg19 = path()
# Deepsea model for use with `deepsea_apply`
parameter: deepsea_model = path()
# Path to directory of annotation files
parameter: annotation_dir = path() # 2
# Path to z-score file
parameter: z_score = path() # 3
# Path to list of single annotations to use
parameter: single_annot = path() # 4
# Path to lists of multi-annotations to use
parameter: multi_annot = paths() # 5
# read annotation lists, skipping entries commented out with '#';
# fall back to empty lists if a list file is absent
try:
    single_anno = [f"{annotation_dir}/{x.split()[0]}.bed" for x in open(single_annot).readlines() if not x.startswith('#')]
    multi_anno = dict([(f'{y:bn}', [f"{annotation_dir}/{x.split()[0]}.bed" for x in open(y).readlines() if not x.startswith('#')]) for y in multi_annot])
except (FileNotFoundError, IsADirectoryError):
    single_anno = []
    multi_anno = dict()
# output directory is named after the z-score file, with '.' replaced by '_'
# (e.g. z-score SCZGWAS.zscore.gz gives out_dir {cwd}/SCZGWAS_zscore)
out_dir = f'{cwd}/{z_score:bn}'.replace('.', '_')

Utility steps

Convert variants from z-score file to bed format

In [ ]:
# Auxiliary step to get variants in bed format based on variant IDs in z-score file
[zscore2bed_1]
parameter: in_file = path()
input: in_file
output: f'{_input:n}.bed.unsorted'
R: expand = "${ }", container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
    library(readr)
    library(stringr)
    library(dplyr)
    var_file <- ${_input:r}
    out_file <- ${_output:r}

    variants <- read_tsv(var_file, col_names=FALSE)
    colnames(variants) = c('variant', 'ld', 'zscore')
    var_info <- str_split(variants$variant, ":")
    variants <- mutate(variants, chr = paste0("chr", sapply(var_info, function(x){x[1]})), 
                                 pos = sapply(var_info, function(x){x[2]})) %>%
                mutate(start = as.numeric(pos), stop=as.numeric(pos) + 1) %>%
                select(chr, start, stop, variant)
    options(scipen=1000) # so that positions are always written out in full
    write.table(variants, file=out_file, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")

[zscore2bed_2]
output: f'{_input:n}'
bash: expand = True, container = 'gaow/atac-gwas', workdir = cwd
     sort-bed {_input} > {_output}

[get_variants: provides = '{data}.bed']
output: f'{data}.bed'
sos_run('zscore2bed', in_file = f'{_output:n}.gz')

Merge annotations for joint torus analysis

In [ ]:
# Auxiliary step to merge multiple annotations
[merge_annotations]
parameter: out_file=path()
parameter: data_files=paths()
input: data_files
output: out_file
R: expand = '${ }', container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'

    library(readr)
    library(dplyr)
    library(stringr)

    out_name <- ${_output:r}
    annots <- c(${_input:r,})
    var_annot <- read_tsv(annots[1], col_names=T)
    for(i in 2:length(annots)){
        var_annot2 <- read_tsv(annots[i], col_names=T)
        stopifnot(all(var_annot$SNP == var_annot2$SNP))
        var_annot <- cbind(var_annot, var_annot2[,2])
    }
    write.table(var_annot, file=gzfile(out_name),
                row.names=FALSE, quote=FALSE, sep="\t")

Prepare torus format binary annotations

In [ ]:
# Get variants in the data that fall within the target region
[range2var_annotation_1]
depends: f'{z_score:n}.bed'
input: set(paths(single_anno + list(multi_anno.values()))), group_by = 1
output: f'{out_dir}/{_input:bn}.{z_score:bn}.bed'
bash: expand = True, container = 'gaow/atac-gwas', workdir = cwd
    bedops -e {z_score:n}.bed {_input} > {_output}
In [ ]:
# Make binary annotation file
[range2var_annotation_2]
parameter: discrete = 1
depends: z_score
output: f'{_input:n}.gz'
R: expand = "${ }", container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
    library(readr)
    library(dplyr)
    library(stringr)

    variant_tsv <- ${z_score:r}
    annotation_var_bed <- ${_input:r}
    annot_name <- ${_input:bnr} %>% str_replace(paste0(".",${z_score:bnr}), "")
    out_name <- ${_output:r}

    vars <- read_tsv(variant_tsv, col_names=FALSE)[,1]
    annot_vars = read_tsv(annotation_var_bed, col_names=FALSE)
    names(vars) <- "SNP"
    vars <- vars %>%
            mutate(annot_d = case_when(SNP %in% annot_vars$X4 ~ 1,
                                       TRUE ~ 0))
    names(vars)[2] <- paste0(annot_name, '${"_d" if discrete else "_c"}')
    write.table(vars, file=gzfile(out_name),
                col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)
In [ ]:
# Obtain multiple annotations per SNP for enrichment analysis
[range2var_annotation_3]
for k, value in multi_anno.items():
    sos_run('merge_annotations', out_file = f'{out_dir}/{k}.{z_score:bn}.gz', data_files = [f'{out_dir}/{v:bn}.{z_score:bn}.gz' for v in paths(value)])

Enrichment analysis via torus

In [1]:
# Run torus with annotation + z-score file
[enrichment_1 (run torus)]
depends: z_score
parameter: gmap = path(f"{cwd}/gene_map/gmap.gz") 
input_files = [f'{out_dir}/{value:bn}.{z_score:bn}.gz' for value in paths(single_anno)] + [f'{out_dir}/{k}.{z_score:bn}.gz' for k in multi_anno]
fail_if(len(input_files) == 0, msg = "No annotations to use! Please use ``--single-annot`` and ``--multi-annot`` to pass annotation lists")
input: input_files, group_by = 1
output: f'{_input:n}.torus'

bash: container = 'gaow/atac-gwas', workdir = cwd, expand = True, stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'
    rm -rf {_output:n}_prior 
    torus -d {z_score} -smap {_input:d}/smap.gz -gmap {gmap} -annot {_input} -est --load_zval -dump_prior {_output:n}_prior > {_output}
In [2]:
# Consolidate all torus results into one table
[enrichment_2 (make output table)]
input: group_by = 'all'
output: f'{out_dir}/{z_score:bn}.torus.merged.gz'
R: expand = '${ }', container = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
    library(dplyr)
    library(readr)
    library(stringr)
    output_file <- ${_output:r}
    names <- c(${_input:bnr,})
    file_names <- c(${_input:r,})
    tab <- data.frame("name"=names)

    data <- lapply(file_names, function(f){
                    lns <- read_lines(f)
                    x <- unlist(str_split(lns[3], " "))
                    x <- x[str_length(x) > 0]
                    x})

    tab[,"log odds ratio"] <- sapply(data, function(x){x[2]})
    tab[,"95% ci"] <- sapply(data, function(x){paste0("(", x[3], ", ", x[4], ")")})
    write.table(tab, file=gzfile(output_file), sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)

Apply pre-trained deepsea model to variants

Credits to Yanyu Liang. Required inputs are:

  • Path to hg19 reference genome
  • Path to model HDF5 file
  • Path to list of variants, in the format of:
chr1    68090   68091   G   T
chr1    68090   68091   G   A
chr1    68090   68091   G   C
chr1    68091   68092   C   G
chr1    68091   68092   C   T
chr1    68091   68092   C   A
chr1    68092   68093   T   A
chr1    68092   68093   T   C
chr1    68092   68093   T   G
chr1    68093   68094   A   C
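
As a hedged sketch (not part of the workflow), such a variant list can be derived from the GWAS z-score file, following the same start = pos, stop = pos + 1 convention as the zscore2bed step above; file names are placeholders:

# illustrative: split chr:pos:alt:ref IDs into bed-style rows with ref, alt
zcat zscore.gz | awk -F'[:\t]' 'BEGIN{OFS="\t"}{print "chr"$1, $2, $2+1, $4, $3}' > variants.txt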

To run this workflow:

sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants /path/to/variant/list.file
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/scz.swe.var.txt

This step requires a large amount of memory, so we first tested it on the first 10K and 100K lines of variants, which worked:

sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/test.var.txt
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/test100K.var.txt

To process all 9,898,079 SNPs, break them into two halves and run deepsea on each half separately:

head -n 4949039 scz.swe.var.txt | gzip > half1.var.txt
tail -n +4949040 scz.swe.var.txt | gzip > half2.var.txt
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/half1.var.txt
sos run workflow/gwas_enrichment.ipynb deepsea_apply --variants ~/Documents/GWAS_ATAC/GWAS_data/SCZSweden/half2.var.txt
In [ ]:
[deepsea_apply_1 (prepare config file)]
parameter: variants = path()
input: variants
output: f'{_input:n}.config'
report: expand = True, output = _output
    var_list: {_input:r}
    pred_model:
      path: '{deepsea_model}'
      label:
        CN: 1
        DN: 2
        GA: 3
        ips: 4
        NSC: 5
    reference_genome: '{hg19}'
    script_dir: '/opt/deepann'
    out_dir: {_output:dr}
    out_prefix: '{_output:bn}_deepsea'

    ##### some default setup #####
    #### usually don't change ####
    check_allele: False
    design: '499-1-500'
In [ ]:
[deepsea_apply_2 (apply deepsea weights)]
output: f'{_input:n}.deepsea.list'
bash: expand = True, container = 'gaow/deepann', volumes = [f'{hg19:d}:{hg19:d}', f'{deepsea_model:d}:{deepsea_model:d}', f'{os.path.expanduser("~")}:/home/$USER'], extra_args = "-e HOME=/home/$USER -e USER=$USER"
    snakemake --snakefile /opt/deepann/Snakefile --configfile {_input} && ls {_input:n}_deepsea/output__*.gz > {_output}

Enrichment analysis via GREGOR

To properly perform enrichment analysis we want to match the control SNPs to the SNPs of interest -- that is, SNPs inside credible sets (CS) -- in terms of LD, distance to the nearest gene, and MAF. The GREGOR software can generate such a list of matched SNPs: we use SNPs inside CS as input and expect a list of output SNPs matched to these inputs.

GREGOR is released under a University of Michigan license, so we do not package it into a docker image; a path to a local GREGOR installation is required instead. We also need reference files, prepared as follows:

cat \
    GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 \
    GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01 \
    > GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
tar zxvf GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz

MD5SUM check:

AFR.part.0  ( MD5: 9926904128dd58d6bf1ad4f1e90638af )
AFR.part.1  ( MD5: c1d30aff89a584bfa8c1fa1bdc197f21 )
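
To check the integrity of the downloaded parts against these values, use the standard md5sum tool:

md5sum GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 \
       GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01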

Do the same for the EUR data set.

To run GREGOR

PGC3

sos run workflow/gwas_enrichment.ipynb gregor --gregor-input ~/Documents/GWAS_ATAC/scz2_zscore/gregor.scz2.txt.gz --single-annot data/gregor.txt

PGC2

sos run workflow/gwas_enrichment.ipynb gregor --gregor-input ~/Documents/GWAS_ATAC/pgc2_zscore/gregor.pgc2.txt.gz --single-annot data/gregor.txt
In [ ]:
[gregor_1 (make SNP index)]
depends: executable('zcat')
parameter: gregor_input = path()
input: gregor_input
output: f'{_input:nn}.rsid.txt', f'{_input:nn}.annotations.list'
bash: expand = '${ }'
    zcat ${_input} | cut -f 2,3 -d "_" | sed 's/_/:/g' > ${_output[0]}

with open(_output[1], 'w') as f:
    f.write('\n'.join(single_anno))

[gregor_2 (make configuration file)]
depends: executable('sed')
parameter: gregor_db = path('~/Documents/hg19/GREGOR_DB')
parameter: pop = 'EUR'
output: f'{_input[0]:nn}.gregor.conf'
report: output = f'{_output}', expand = True
    ##############################################################################
    # CHIPSEQ ENRICHMENT CONFIGURATION FILE
    # This configuration file contains run-time configuration of
    # CHIP_SEQ ENRICHMENT
    ###############################################################################
    ## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
    ###############################################################################
    INDEX_SNP_FILE = {_input[0]}
    BED_FILE_INDEX = {_input[1]} 
    REF_DIR = {gregor_db}
    R2THRESHOLD = 0.7 ## must be greater than 0.7
    LDWINDOWSIZE = 10000 ## must be less than 1MB; these two values define LD buddies
    OUT_DIR = {_output:nn}_gregor_output
    MIN_NEIGHBOR_NUM = 10 ## define the size of neighborhood
    BEDFILE_IS_SORTED = true  ## false, if the bed files are not sorted
    POPULATION = {pop}  ## define the population, you can specify EUR, AFR, AMR or ASN
    TOPNBEDFILES = 2 
    JOBNUMBER = 10
    ###############################################################################
    #BATCHTYPE = mosix ##  submit jobs on MOSIX
    #BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
    ###############################################################################
    #BATCHTYPE = slurm   ##  submit jobs on SLURM
    #BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
    ###############################################################################
    BATCHTYPE = local ##  run jobs on local machine

bash: expand = True
    sed -i '/^$/d' {_output}

GREGOR is written in Perl. Some Perl libraries are required before one can run it:

sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl
In [1]:
[gregor_3 (run gregor)]
depends: executable('perl')
parameter: gregor_path = path('~/Documents/GREGOR')
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True
    perl {gregor_path}/script/GREGOR.pl --conf {_input} && touch {_output}

[gregor_4 (format output)]
depends: executable('sed')
output: f'{_input:n}.csv'
bash: expand = True
    sed 's/\t/,/g' {_input} > {_output}

Copyright © 2018 Gao Wang et al at Stephens Lab, University of Chicago

Exported from workflow/gwas_enrichment.ipynb committed by minqiao on Fri Jul 12 11:48:55 2019 revision 9, c02e3e9