This pipeline extracts loci of interest from association analysis summary statistics, intersects them with genotype data to compute correlation matrices, and outputs one data-set per locus with summary statistics and genotype correlations matched. Additionally it extracts the prior inclusion probability (annotation score) for each variant in the data-set, when that information is available.
This pipeline was devised by Gao Wang and implemented by Min Qiao at The University of Chicago. It can be downloaded from here.
Summary statistics file, gzip compressed, containing 6 or 7 columns of core information: chr, bp, ref, alt, then either beta and se or z. The input file does not have to follow this ordering; the column order can be specified by the --columns parameter later. Additionally it is possible to input a locus ID column using --loci-column. This is relevant for QTL analysis, where the same position can belong to different loci.
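For illustration, a hypothetical 7-column file with columns chr, bp, ref, alt, locus ID, beta, se (values invented for this example) could contain lines like:
chr22 45000123 A G 1699 0.0231 0.0145
chr22 45000456 T C 1699 -0.0112 0.0150
and would be declared with --columns 1 2 3 4 6 7 --loci-column 5, as in the eQTL example run at the end of this notebook.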
Loci file, similar in flavor to bed files. The 1st column is chr; the 2nd is the chunk start position; the 3rd is the chunk end position; the 4th is the locus identifier.
chr22 44995308 46470495 1699
chr22 46470495 47596318 1700
chr22 47596318 48903703 1701
chr22 48903703 49824534 1702
chr22 49824534 51243298 1703
If the last column is not available, the first 3 columns will be concatenated to become the locus identifier. Note that a natural default choice of loci is LD blocks; for example, European genomes are typically divided into 1703 LD chunks (excluding the X chromosome).
Prior inclusion probability, for example:
The 1st column format is chr:bp:ref:alt; the 2nd is the prior.
1:1847856:T:G 1.4413e-04
1:1847979:T:C 7.3716e-05
1:1848109:C:G 1.4413e-04
1:1848160:A:G 1.4413e-04
1:1848734:A:G 7.3716e-05
This is the default format from the enrichment analysis tool torus.
Genotype data reference panel (or panels, if split by chromosome), in VCF format. Ideally this is the genotype data used to generate the summary statistics, but an external reference panel can also be used. A popular choice is 1000 Genomes (data download), subset to genotypes of a given population (e.g. EUR). For genotype data in multiple VCF files, the input has to be a 'manifest' file in the same directory as the VCF files, which reads like:
1 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz
2 ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz
...
where each line is white-space separated, the first field being the chromosome number and the second the VCF file name.
For UChicago midway users: reference genotype data for 1000 Genomes EUR samples were extracted using the workflow prepare_1KG_reference below. The output can be found under /project2/xinhe/mqiao/SuSiE_GWAS/1KG_EUR/. The manifest file 1KG_EUR.manifest was prepared by:
for i in {1..22} X; do echo $i ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz; done > 1KG_EUR.manifest
The end of this workflow notebook has some example commands running on toy data. The toy data (500MB; not trimmed any smaller, out of laziness) can be downloaded from here if you want to reproduce these commands.
For each locus, extract the matching rows of the summary statistics; denote this matrix as $S_1$. Then extract the corresponding genotypes from the reference panel in VCF format for the locus. Denote this genotype matrix as $G_1$: rows are SNPs, columns are samples from the panel. Phased VCF genotypes are coded as allele counts:
0|0 -> 0
1|0 -> 1
1|1 -> 2
2|0 -> 1
2|1 -> 2
2|2 -> 2
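A minimal sketch of this coding with cyvcf2 (file name and region are hypothetical; the pipeline step below additionally handles multi-allelic sites by counting copies of each ALT allele separately):
# Sketch of the genotype coding table above: code each phased genotype
# by its number of non-reference alleles (hypothetical file and region).
from cyvcf2 import VCF

vcf = VCF("panel.vcf.gz", gts012=False)      # hypothetical panel file
for variant in vcf("22:44995308-46470495"):  # hypothetical locus region
    # each entry of variant.genotypes is [allele1, allele2, phased];
    # drop the trailing phase flag, then count non-reference alleles,
    # so 0|0 -> 0, 1|0 -> 1, 2|1 -> 2 as in the table
    # (missing alleles, coded -1, count as 0 here)
    dosages = [sum(a > 0 for a in g[:-1]) for g in variant.genotypes]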
Find the overlapping SNPs of $S_1$ and $G_1$, then extract a new genotype matrix from $G_1$ excluding non-overlaps, denoted $G_2$, and a new summary statistics matrix from $S_1$ excluding non-overlaps, denoted $S_2$.
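In pandas terms this intersection is just two isin filters on the shared ID column (a sketch; the actual step below also drops duplicates):
import pandas as pd

def overlap(S1: pd.DataFrame, G1: pd.DataFrame):
    # keep only SNPs whose ID appears in both matrices
    S2 = S1[S1["ID"].isin(G1["ID"])]
    G2 = G1[G1["ID"].isin(S1["ID"])]
    return S2, G2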
Compare $G_2$ and $S_2$'s reference and alternative alleles, flipping the coding as necessary to generate a new summary statistics matrix $S_3$ and a new genotype matrix $G_3$. There are several situations, summarized in the sketch below:
1. Exact match of ref/alt, excluding the ambiguous pairs A/T, T/A, G/C and C/G: keep the record as is.
2. Ref and alt swapped between $S_2$ and $G_2$, again excluding A/T, T/A, G/C and C/G: switch ref and alt of $S_2$ and give the opposite sign to z score and beta.
3. Strand flip, i.e. ref/alt in $S_2$ are the base complements of ref/alt in $G_2$, for the unambiguous pairs A/G, G/A, A/C, C/A, T/C, C/T, T/G and G/T: flip the strand and keep the sign of z score and beta; if ref and alt are additionally swapped, also switch them and give the opposite sign to z score and beta.
4. A/T, T/A, G/C or C/G for ref/alt: consider this situation last. When ref and alt appear swapped, switch ref and alt of $S_2$ and give the opposite sign to z score and beta; however, if strand flips have been applied elsewhere in this LD block, a swapped ambiguous pair cannot be distinguished from a strand flip, so these variants are removed instead.
Notice that if the genotype data used to generate the summary statistics is available as the reference panel, there should be no need for step 4. But it is a good double-check anyway to do it -- we therefore provide an option to specify whether we expect flips to be unnecessary; if so, the pipeline throws an error when it finds a discrepancy, instead of trying to fix it.
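As a compact summary of these rules, here is a small stand-alone classifier (a sketch of the logic above, not the pipeline's exact code path below):
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def classify(ref_s, alt_s, ref_g, alt_g):
    # classify how a summary statistics record relates to the panel:
    # 'equal' keeps everything; 'swap' switches ref/alt and negates
    # z/beta; 'strand' complements both alleles and keeps the sign;
    # 'strand+swap' does both; 'ambiguous' is handled last (case 4)
    if ref_s == COMPLEMENT[alt_s]:
        return "ambiguous"                     # A/T, T/A, G/C or C/G
    if (ref_s, alt_s) == (ref_g, alt_g):
        return "equal"
    if (ref_s, alt_s) == (alt_g, ref_g):
        return "swap"
    if (COMPLEMENT[ref_s], COMPLEMENT[alt_s]) == (ref_g, alt_g):
        return "strand"
    if (COMPLEMENT[ref_s], COMPLEMENT[alt_s]) == (alt_g, ref_g):
        return "strand+swap"
    return "mismatch"

assert classify("A", "G", "T", "C") == "strand"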
The per-locus summary statistics output has 4 columns, chr:pos beta se ID, or 3 columns, chr:pos z ID.
Software prerequisites:
cyvcf2: conda install -c bioconda cyvcf2
tabix and bcftools: conda install -c bioconda tabix bcftools
%cd ~/GIT/github/fine-mapping
sos run workflow/summary_statistics_wrangler.ipynb -h
[global]
# Reference panel: either a VCF file, or a manifest file where each line is "chrom number <white space> VCF file name";
# the manifest file has to be in the same folder as the VCF files it refers to.
parameter: reference = path()
# Loci file
parameter: loci = path()
# summary statistics
parameter: ss_data = path()
# Use --no-strand-flip to set it to false
# if you are sure there is no strand flip involved
parameter: strand_flip = True
# Use --no-ref-flip to set it to false
# if you are sure there is no reference / alternative mismatch involved
parameter: ref_flip = True
# For cluster jobs, number of loci to analyze per job
parameter: job_size = 80
# Decide whether or not reference panel matches summary statistics
if not ref_flip and not strand_flip:
strictly_match = True
else:
strictly_match = False
# Check if files exist
fail_if(not reference.is_file(), msg = 'Please specify valid path for --reference')
fail_if(not loci.is_file(), msg = 'Please specify valid path for --loci')
fail_if(not ss_data.is_file(), msg = 'Please specify valid path for --ss-data')
For 1000 Genomes data, use the integrated_call_samples_v3.20130502.ALL.panel.txt file to extract sample IDs for a given population (e.g. EUR) and subset the data. This only needs to be run once. If a genotype manifest is provided, the workflow generates both a new manifest file and the subset VCF files it lists.
Example usage:
sos run workflow/summary_statistics_wrangler.ipynb prepare_1KG_reference
# Subset the reference panel samples for a specified population
[prepare_1KG_reference]
depends: executable("bcftools"), executable("tabix")
# Race identifier in 1000 genomes
parameter: race = "EUR"
# Path to `integrated_call_samples_v3.20130502.ALL.panel.txt`
parameter: panel_meta = path()
stop_if(not panel_meta.is_file(), msg = 'Please specify valid path for --panel-meta')
if str(reference).endswith('vcf.gz'):
chroms = ['0']
genotypes = [reference]
else:
manifest = [x.strip().split() for x in open(f'{reference:a}').readlines()]
chroms = [x[0] for x in manifest]
genotypes = [x[1] for x in manifest]
input: genotypes, group_by = 1, paired_with = dict(chrom=chroms)
output: f"{_input:nn}.{race}.vcf.gz"
task: trunk_workers = 1, trunk_size = 1, walltime = '5m', mem = '2G', cores = 1, tags = f'{step_name}_{_output:bn}'
bash: expand = True
grep -w "{race}" {panel_meta} | cut -f 1 > {_output:nn}_extracted.txt
bcftools view -S {_output:nn}_extracted.txt {_input} -Oz > {_output}
tabix -p vcf {_output}
rm {_output:nn}_extracted.txt
if not str(reference).endswith('vcf.gz'):
with open(f"{reference:n}.{race}.{reference:x}", 'w') as f:
for item in _output:
f.write(f'{item.chrom} {item}\n')
For step 2, we obtain genotype matrix $G_2$ and summary statistics $S_2$, compare ref/alt in $S_2$ and $G_2$ to obtain $S_3$ and $G_3$, then calculate $R = \mathrm{row\_corr}(G_3)$ and save it to disk.
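Here row_corr is just the Pearson correlation between rows (SNPs) of $G_3$; a minimal numpy illustration with invented dosages:
import numpy as np

# np.corrcoef treats each row (SNP) as a variable, so the result is
# the SNP-by-SNP LD matrix, exactly what the step below writes out
G3 = np.array([[0, 1, 2, 1, 0],   # hypothetical dosages, SNP 1
               [0, 1, 2, 2, 0],   # SNP 2
               [2, 1, 0, 1, 2]])  # SNP 3
R = np.corrcoef(G3)               # shape (3, 3), entries in [-1, 1]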
# Get summary stats
[default_1 (get summary statistics)]
depends: Py_Module('cyvcf2')
# Have to be 5 or 6 numbers indicating the required columns
# to be extracted from summary statistics file
# 6 numbers for ['chr','bp','ref','alt','beta','se']
# 5 numbers for ['chr','bp','ref','alt','z']
parameter: columns = [1,2,3,4,5,6]
# Locus ID column index, optional (0 means no locus ID column)
parameter: loci_column = 0
# use --no-header to indicate that input summary statistics file does not have a header
parameter: header = True
# when set to N (typically 0, 1 or -1) the genomic position
# in the panel will be adjusted by N, i.e. position = position + N
# This is useful when summary stats and reference panel have different coordinates (off by 1 position)
parameter: adjust_panel_position = 1
# Either empty, or `chr`
parameter: panel_chrom_prefix = ''
import os
fail_if(len(columns) not in (5,6), msg = 'Input column ID has to be of length 5 or 6.')
if str(reference).endswith('vcf.gz'):
genotype = reference
else:
genotype = dict([tuple(x.strip().split()) for x in open(f'{reference:a}').readlines()])
for k in genotype:
genotype[k] = os.path.join(f'{reference:ad}', genotype[k])
chunks = [x.strip().split() for x in open(f'{loci:a}').readlines() if x.strip() and not x.strip().startswith('#')]
prefix = paths([f"{ss_data:n}/{c[-1] if len(c) > 3 else '%s_%s_%s' % (c[0], c[1], c[2])}" for c in chunks])
input: for_each = 'chunks'
output: summary_stats = f"{prefix[_index]}/{prefix[_index]:b}.summary_stats.gz",
ld_matrix = f"{prefix[_index]}/{prefix[_index]:b}.LD.gz"
task: trunk_workers = 1, trunk_size = job_size, walltime = '5m', mem = '5G', cores = 1, tags = f'{step_name}_{_output["summary_stats"]:bn}'
python3: expand = "${ }", stdout = f'{_output["summary_stats"]:nn}.allele_flip.log'
from cyvcf2 import VCF
import pandas as pd, numpy as np
import datetime, sys
def exit_if_empty(x):
is_empty = (len(x) == 0) if isinstance(x, list) else x.empty
if is_empty:
open("${_output['summary_stats']}", 'w').close()
open("${_output['ld_matrix']}", 'w').close()
sys.exit(0)
# Step 1: extract relevant summary statistics
print(datetime.datetime.now(),"\n")
locus = ${_chunks}
col_to_use = ${[x-1 for x in columns]}
loci_col = ${loci_column}
if loci_col > 0:
col_to_use.append(loci_col-1)
col_ranks = [x for x in sorted(range(len(col_to_use)), key=col_to_use.__getitem__)]
S1 = pd.read_table(${ss_data:ar}, compression='gzip', header = ${0 if header else None}, index_col=False, usecols = col_to_use)
if loci_col > 0:
columns = ['chr','bp','ref','alt','beta','se','locus_id'] if S1.shape[1] == 7 else ['chr','bp','ref','alt','z','locus_id']
else:
columns = ['chr','bp','ref','alt','beta','se'] if S1.shape[1] == 6 else ['chr','bp','ref','alt','z']
S1.columns = [columns[r] for r in col_ranks]
S1 = S1[columns]
S1["chr"] = S1["chr"].apply(lambda x: str(x).replace("chr", ""))
locus[0] = str(locus[0]).replace("chr", "")
if "locus_id" in S1.columns and len(locus) == 4:
S1 = S1[(S1["chr"] == locus[0]) & (S1["bp"] >= int(locus[1])) & (S1["bp"] < int(locus[2])) & (S1["locus_id"] == locus[3])]
else:
S1 = S1[(S1["chr"] == locus[0]) & (S1["bp"] >= int(locus[1])) & (S1["bp"] < int(locus[2]))]
exit_if_empty(S1)
print("SAMPLE: %s\tLENGTH:%d" % ("S1",len(S1)))
S1["ID"] = S1["chr"] + ":" + S1["bp"].astype(str)
# Step 2: get genotypes
chromosome = '${panel_chrom_prefix}'+'${_chunks[0].replace("chr","")}'
panel = ${path(genotype[_chunks[0].replace("chr","")] if isinstance(genotype, dict) else genotype):ar}
# loci region
queryid = chromosome + ":" + "${_chunks[1]}" + "-" + "${_chunks[2]}"
off_set = ${adjust_panel_position}
# scan VCF chunk
vcf = VCF(panel, gts012=False)
res = []
n_G0 = 0
var_ids = set(S1["ID"].tolist())
## Attention: cyvcf2 reports variant.start 0-based while summary statistics positions are 1-based, so set off_set=1 for variant.start
for variant in vcf(queryid):
n_G0 += 1
var_id = f'{variant.CHROM.replace("chr","")}:{variant.start+off_set}'
if var_id not in var_ids:
continue
for i in range(len(variant.ALT)):
line = [var_id, variant.REF, variant.ALT[i]] + \
[x[:-1].count(i+1) for x in variant.genotypes]
if len(set(line[3:])) == 1:
# remove non-variant site in reference panel
continue
res.append(line)
print("SAMPLE: %s\tLENGTH:%d" % ("G0",n_G0))
exit_if_empty(res)
G1 = pd.DataFrame(res, columns = ['ID', 'ref', 'alt'] + vcf.samples)
print("SAMPLE: %s\tLENGTH:%d" % ("G1",len(G1)))
# Step 3: overlap genotypes and summary statistics for LD matrix
# S2
# have to drop duplicates in S2 because in some files there are duplicates!
S2 = S1[S1["ID"].isin(G1["ID"])].drop_duplicates()
exit_if_empty(S2)
#S2["ID"] = S2["ID"].fillna
print("SAMPLE: %s\tLENGTH:%d" % ("S2",len(S2)))
# G2
# have to drop duplicates in G2 because for some reason there are duplicates in the reference panel
G2 = G1[G1["ID"].isin(S1["ID"])].drop_duplicates()
exit_if_empty(G2)
print("SAMPLE: %s\tLENGTH:%d" % ("G2",len(G2)))
# Compare S2 and G2 reference and alternative alleles
# gt() flags swapped ambiguous pairs (A/T vs T/A, G/C vs C/G) when strand_flip is on
def gt(s1,s2,s3,s4):
if (s1+s2 == "AT" and s3+s4 == "TA") or (s1+s2 == "GC" and s3+s4 == "CG" ) or (s1+s2 == "TA" and s3+s4 == "AT") or (s1+s2 == "CG" and s3+s4 == "GC"):
return ${0 if strand_flip else 1}
else:
return 1
# flip
def atcg(inp):
if inp == "A":
return "T"
elif inp == "T":
return "A"
elif inp == "G":
return "C"
elif inp == "C":
return "G"
# S3
FLIPD = []
# adjusted
ADJ = 0
# equal
EQUAL = 0
# flipped by strand
FLIPNUM = 0
# flipped by reference and alternative
REFALTNUM = 0
# keep at/ta/cg/gc line numbers
at_cg_id = []
for i in range(len(S2)):
old_id = S2.iloc[i,0] + ":" + str(S2.iloc[i,1]) + ":" + S2.iloc[i,2] + ":" + S2.iloc[i,3]
line = S2.iloc[i,:-1].tolist() + [old_id]
G2S2 = G2[G2["ID"]==S2.iloc[i,-1]]
if all([set([S2.iloc[i,2],S2.iloc[i,3]]) != set([G2S2["ref"].iloc[idx], G2S2["alt"].iloc[idx]]) for idx in range(G2S2.shape[0])]):
ADJ += 1
for idx in range(G2S2.shape[0]):
# not at/ta/gc/cg
if gt(S2.iloc[i,2],S2.iloc[i,3],G2S2["ref"].iloc[idx],G2S2["alt"].iloc[idx])==1:
if G2S2["ref"].iloc[idx] == S2.iloc[i,2] and G2S2["alt"].iloc[idx] == S2.iloc[i,3]:
EQUAL+=1
FLIPD.append(line)
elif G2S2["alt"].iloc[idx] == S2.iloc[i,2] and G2S2["ref"].iloc[idx] == S2.iloc[i,3]:
REFALTNUM+=1
line[2], line[3] = line[3], line[2]
line[4] = -line[4]
FLIPD.append(line)
elif atcg(S2.iloc[i,2]) == G2S2["ref"].iloc[idx] and atcg(S2.iloc[i,3]) == G2S2["alt"].iloc[idx]:
FLIPNUM+=1
line[2], line[3] = atcg(line[2]), atcg(line[3])
FLIPD.append(line)
elif atcg(S2.iloc[i,2]) == G2S2["alt"].iloc[idx] and atcg(S2.iloc[i,3]) == G2S2["ref"].iloc[idx]:
REFALTNUM+=1
FLIPNUM+=1
line[2], line[3] = atcg(line[3]), atcg(line[2])
line[4] = -line[4]
FLIPD.append(line)
else:
# allele combination that cannot be reconciled; the variant is
# dropped and counted in MISMATCH below
pass
# at/ta/cg/gc only do ref/alt flip
elif gt(S2.iloc[i,2],S2.iloc[i,3],G2S2["ref"].iloc[idx],G2S2["alt"].iloc[idx])==0:
if G2S2["ref"].iloc[idx] == S2.iloc[i,2] and G2S2["alt"].iloc[idx] == S2.iloc[i,3]:
EQUAL+=1
FLIPD.append(line)
at_cg_id.append(len(FLIPD) - 1)
elif G2S2["alt"].iloc[idx] == S2.iloc[i,2] and G2S2["ref"].iloc[idx] == S2.iloc[i,3]:
REFALTNUM+=1
line[2], line[3] = line[3], line[2]
line[4] = -line[4]
FLIPD.append(line)
at_cg_id.append(len(FLIPD) - 1)
else:
# unresolvable ambiguous record; dropped and counted in MISMATCH below
pass
if ${strictly_match} and (FLIPNUM > 0 or REFALTNUM > 0):
raise ValueError(f"Strict panel match failed for locus {locus}.")
#
MISMATCH = len(S2) - len(FLIPD)
if FLIPNUM > 0:
# flips involved, we need to remove at/ta/cg/gc
FLIPD = [i for j, i in enumerate(FLIPD) if j not in at_cg_id]
else:
at_cg_id = []
columns = [x for x in S1.columns.tolist() if x != 'ID'] + ["original_ID"]
S3 = pd.DataFrame(FLIPD, columns=columns)
S3["ID"] = S3["chr"] +":"+S3["bp"].astype(str) + ":" + S3["ref"] + ":" + S3["alt"]
G2["ID"] = G2[['ID', 'ref', 'alt']].apply(lambda x: ':'.join(x), axis=1)
# G3
G3 = G2[G2["ID"].isin(S3["ID"])]
assert G3.shape[0] == S3.shape[0]
exit_if_empty(S3)
# recover original ID
S3["ID"] = S3.pop("original_ID")
S3 = S3.sort_values(by = ["chr", "bp"]).drop(columns = [x for x in S3.columns if not x in ('ID', 'z', 'beta', 'se')])
columns = S3.columns.tolist()
columns.insert(0, columns.pop(columns.index('ID')))
S3[columns].to_csv("${_output['summary_stats']}",sep="\t",index=False,header=None, compression = 'gzip')
print("SAMPLE: %s\tLENGTH:%d" % ("S3",len(S3)))
print("equal:%s\nflipped by strand:%d\nflipped by reference and alternative:%d\ntotal adjusted:%d\ntotal mismatch:%d\ntotal A/T and C/G removed:%d" % (EQUAL,FLIPNUM,REFALTNUM,ADJ,MISMATCH,len(at_cg_id)))
# R = row_corr(G3), written to the LD matrix output
np.savetxt("${_output['ld_matrix']}", np.corrcoef(G3.drop(columns=['ID', "ref","alt"]).values) if G3.shape[0] > 1 else np.array([[1]]), fmt = '%.5f')
# Get annotations
[default_2 (get annotations)]
parameter: annotation = "name:/path/to/annotation/file"
anno, prior_file = annotation.split(':')
prior_file = path(prior_file)
stop_if(not prior_file.is_file(), msg = 'Quit because annotation file is not found')
output: f"{_input['summary_stats']:nn}.{anno}.gz"
skip_if(_input['summary_stats'].stat().st_size == 0 or _input['ld_matrix'].stat().st_size == 0)
task: trunk_workers = 1, trunk_size = job_size, walltime = '10m', mem = '2G', cores = 1, tags = f'{step_name}_{_output:bn}'
python3: expand="${ }"
import pandas as pd
A1 = pd.read_table("${prior_file}", compression='gzip', sep=r"\s+", header=None)
A1.columns = ["ID","VALUE"]
S3 = pd.read_table("${_input['summary_stats']}",sep="\t", header=None, compression='gzip')
# have to drop duplicates in A2 because in some files there are duplicates!
A2 = A1[A1["ID"].isin(S3.iloc[:,0])][["ID","VALUE"]].drop_duplicates()
if set(S3.iloc[:,0]) != set(A2["ID"]):
raise ValueError("SNPs in summary statistics and annotation are not identical.")
else:
A2 = A2.set_index("ID")
A2 = A2.reindex(S3.iloc[:,0])
A2 = A2.reset_index()
A2.to_csv("${_output}",sep="\t",index=False,header=None, compression='gzip')
eQTL data example run:
sos run workflow/summary_statistics_wrangler.ipynb --reference /home/gaow/tmp/19-Dec-2018/1KG_EUR/test.manifest \
--loci /home/gaow/tmp/19-Dec-2018/metasoft/one_loci.txt \
--ss-data /home/gaow/tmp/19-Dec-2018/metasoft/gtex_metasoft_summary_stats.gz \
--columns 1 2 3 4 6 7 --loci-column 5 --no-header \
-q none -j 8
GWAS data example run (with annotations):
sos run workflow/summary_statistics_wrangler.ipynb --reference /home/gaow/tmp/19-Dec-2018/1KG_EUR/test.manifest \
--loci /home/gaow/tmp/19-Dec-2018/SCZ/chunks.list \
--ss-data /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics.gz \
--columns 1 2 3 4 6 7 --adjust-panel-position 1 \
--annotation atac-seq:/home/gaow/tmp/19-Dec-2018/SCZ/Annotation_atac-seq.gz \
-q none -j 8
For UChicago midway users: to run these pipelines on the RCC cluster, please take a look at the midway2.yml file (found in this repository), modify it as you see fit, and replace -q none -j 8 with:
-c workflow/midway2.yml -q midway2 -J 40
%preview /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.allele_flip.log -n --limit 20
%preview /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.summary_stats.txt -n -l 10
%preview /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.Annotation_atac-seq.txt -n -l 10
import pandas as pd
pd.read_table('/home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.LD.txt', sep = ' ', header = None, nrows = 5)
Exported from workflow/summary_statistics_wrangler.ipynb
committed by minqiao on Mon Jul 1 18:51:56 2019 revision 36, 69d4049