This pipeline extracts loci of interest from association analysis summary statistics, intersects them with genotype data to compute correlation matrices, and outputs one data-set per locus with summary statistics and genotype correlations matched. Additionally it extracts the prior inclusion probability (annotation score) for each variant in the data-set, when that information is available.
This pipeline was devised by Gao Wang and implemented by Min Qiao at The University of Chicago. It can be downloaded from here.
Summary statistics file, gzip compressed, containing 6 or 7 columns of core information: chr, bp, ref, alt, then either beta and se or z. The input file does not have to follow this ordering; the column order can be specified by the --columns parameter later. Additionally it is possible to input a locus ID column using --loci-column. This is relevant for QTL analysis, where the same position can belong to different loci.
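For illustration, a hypothetical 7-column file with columns chr, bp, ref, alt, locus ID, beta, se (values invented for this example) could contain lines like:
chr22 45000123 A G 1699 0.0231 0.0145
chr22 45000456 T C 1699 -0.0112 0.0150
and would be declared with --columns 1 2 3 4 6 7 --loci-column 5, as in the eQTL example run at the end of this notebook.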
Loci file, similar in flavor to bed files. The 1st column is chr; the 2nd is the chunk start position; the 3rd is the chunk end position; the 4th is the locus identifier.
chr22 44995308 46470495 1699
chr22 46470495 47596318 1700
chr22 47596318 48903703 1701
chr22 48903703 49824534 1702
chr22 49824534 51243298 1703
If the last column is not available, the first 3 columns will be concatenated to become the locus identifier. Note that a natural default choice of loci is LD blocks; for example, European genomes are typically divided into 1703 LD chunks (excluding the X chromosome).
Prior inclusion probability, for example:
The 1st column format is chr:bp:ref:alt; the 2nd is the prior.
1:1847856:T:G 1.4413e-04
1:1847979:T:C 7.3716e-05
1:1848109:C:G 1.4413e-04
1:1848160:A:G 1.4413e-04
1:1848734:A:G 7.3716e-05
This is the default format from the enrichment analysis tool torus.
Genotype data reference panel (or panels, if split by chromosome), in VCF format. Ideally this is the genotype data used to generate the summary statistics, but an external reference panel can also be used. A popular choice is 1000 Genomes (data download), subset to genotypes of a given population (e.g. EUR). For genotype data in multiple VCF files, the input has to be a 'manifest' file in the same directory as the VCF files, which reads like:
1 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz
2 ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz
...
where each line is white-space separated, the first field being the chromosome number and the second the VCF file name.
For UChicago midway users: reference genotype data for 1000 Genomes EUR samples were extracted using the workflow prepare_1KG_reference below. The output can be found under /project2/xinhe/mqiao/SuSiE_GWAS/1KG_EUR/. The manifest file 1KG_EUR.manifest was prepared by:
for i in {1..22} X; do echo $i ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz; done > 1KG_EUR.manifest
The end of this workflow notebook has some example commands running on toy data. The toy data (500MB; not trimmed any smaller, out of laziness) can be downloaded from here if you want to reproduce these commands.
For each locus, extract the matching rows of the summary statistics; denote this matrix as $S_1$. Then extract the corresponding genotypes from the reference panel in VCF format for the locus. Denote this genotype matrix as $G_1$: rows are SNPs, columns are samples from the panel. Phased VCF genotypes are coded as allele counts:
0|0 -> 0
1|0 -> 1
1|1 -> 2
2|0 -> 1
2|1 -> 2
2|2 -> 2
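A minimal sketch of this coding with cyvcf2 (file name and region are hypothetical; the pipeline step below additionally handles multi-allelic sites by counting copies of each ALT allele separately):
# Sketch of the genotype coding table above: code each phased genotype
# by its number of non-reference alleles (hypothetical file and region).
from cyvcf2 import VCF

vcf = VCF("panel.vcf.gz", gts012=False)      # hypothetical panel file
for variant in vcf("22:44995308-46470495"):  # hypothetical locus region
    # each entry of variant.genotypes is [allele1, allele2, phased];
    # drop the trailing phase flag, then count non-reference alleles,
    # so 0|0 -> 0, 1|0 -> 1, 2|1 -> 2 as in the table
    # (missing alleles, coded -1, count as 0 here)
    dosages = [sum(a > 0 for a in g[:-1]) for g in variant.genotypes]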
Find the overlapping SNPs of $S_1$ and $G_1$, then extract a new genotype matrix from $G_1$ excluding non-overlaps, denoted $G_2$, and a new summary statistics matrix from $S_1$ excluding non-overlaps, denoted $S_2$.
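In pandas terms this intersection is just two isin filters on the shared ID column (a sketch; the actual step below also drops duplicates):
import pandas as pd

def overlap(S1: pd.DataFrame, G1: pd.DataFrame):
    # keep only SNPs whose ID appears in both matrices
    S2 = S1[S1["ID"].isin(G1["ID"])]
    G2 = G1[G1["ID"].isin(S1["ID"])]
    return S2, G2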
Compare $G_2$ and $S_2$'s reference and alternative alleles, flipping the coding as necessary to generate a new summary statistics matrix $S_3$ and a new genotype matrix $G_3$. There are several situations, summarized in the sketch below:
1. Exact match of ref/alt, excluding the ambiguous pairs A/T, T/A, G/C and C/G: keep the record as is.
2. Ref and alt swapped between $S_2$ and $G_2$, again excluding A/T, T/A, G/C and C/G: switch ref and alt of $S_2$ and give the opposite sign to z score and beta.
3. Strand flip, i.e. ref/alt in $S_2$ are the base complements of ref/alt in $G_2$, for the unambiguous pairs A/G, G/A, A/C, C/A, T/C, C/T, T/G and G/T: flip the strand and keep the sign of z score and beta; if ref and alt are additionally swapped, also switch them and give the opposite sign to z score and beta.
4. A/T, T/A, G/C or C/G for ref/alt: consider this situation last. When ref and alt appear swapped, switch ref and alt of $S_2$ and give the opposite sign to z score and beta; however, if strand flips have been applied elsewhere in this LD block, a swapped ambiguous pair cannot be distinguished from a strand flip, so these variants are removed instead.
Notice that if the genotype data used to generate the summary statistics is available as the reference panel, there should be no need for step 4. But it is a good double-check anyway to do it -- we therefore provide an option to specify whether we expect flips to be unnecessary; if so, the pipeline throws an error when it finds a discrepancy, instead of trying to fix it.
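As a compact summary of these rules, here is a small stand-alone classifier (a sketch of the logic above, not the pipeline's exact code path below):
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def classify(ref_s, alt_s, ref_g, alt_g):
    # classify how a summary statistics record relates to the panel:
    # 'equal' keeps everything; 'swap' switches ref/alt and negates
    # z/beta; 'strand' complements both alleles and keeps the sign;
    # 'strand+swap' does both; 'ambiguous' is handled last (case 4)
    if ref_s == COMPLEMENT[alt_s]:
        return "ambiguous"                     # A/T, T/A, G/C or C/G
    if (ref_s, alt_s) == (ref_g, alt_g):
        return "equal"
    if (ref_s, alt_s) == (alt_g, ref_g):
        return "swap"
    if (COMPLEMENT[ref_s], COMPLEMENT[alt_s]) == (ref_g, alt_g):
        return "strand"
    if (COMPLEMENT[ref_s], COMPLEMENT[alt_s]) == (alt_g, ref_g):
        return "strand+swap"
    return "mismatch"

assert classify("A", "G", "T", "C") == "strand"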
The per-locus summary statistics output has 4 columns, chr:pos beta se ID, or 3 columns, chr:pos z ID.
Software prerequisites:
cyvcf2: conda install -c bioconda cyvcf2
tabix and bcftools: conda install -c bioconda tabix bcftools
%cd ~/GIT/github/fine-mapping
sos run workflow/summary_statistics_wrangler.ipynb -h
[global]
# Reference panel: either a VCF file, or a manifest file where each line is "chrom number <white space> VCF file name";
# the manifest file has to be in the same folder as the VCF files it refers to.
parameter: reference = path()
# Loci file
parameter: loci = path()
# summary statistics
parameter: ss_data = path()
# Use --no-strand-flip to set it to false
# if you are sure there is no strand flip involved
parameter: strand_flip = True
# Use --no-ref-flip to set it to false
# if you are sure there is no reference / alternative mismatch involved
parameter: ref_flip = True
# For cluster jobs, number of loci to analyze per job
parameter: job_size = 80
# Decide whether or not reference panel matches summary statistics
if not ref_flip and not strand_flip:
strictly_match = True
else:
strictly_match = False
# Check if files exist
fail_if(not reference.is_file(), msg = 'Please specify valid path for --reference')
fail_if(not loci.is_file(), msg = 'Please specify valid path for --loci')
fail_if(not ss_data.is_file(), msg = 'Please specify valid path for --ss-data')
For 1000 Genomes data, use the integrated_call_samples_v3.20130502.ALL.panel.txt file to extract sample IDs for a given population (e.g. EUR) and subset the data. This only needs to be run once. If a genotype manifest is provided, the workflow generates both a new manifest file and the subset VCF files it lists.
Example usage:
sos run workflow/summary_statistics_wrangler.ipynb prepare_1KG_reference
# Subset the reference panel samples for a specified population
[prepare_1KG_reference]
depends: executable("bcftools"), executable("tabix")
# Race identifier in 1000 genomes
parameter: race = "EUR"
# Path to `integrated_call_samples_v3.20130502.ALL.panel.txt`
parameter: panel_meta = path()
stop_if(not panel_meta.is_file(), msg = 'Please specify valid path for --panel-meta')
if str(reference).endswith('vcf.gz'):
chroms = ['0']
genotypes = [reference]
else:
manifest = [x.strip().split() for x in open(f'{reference:a}').readlines()]
chroms = [x[0] for x in manifest]
genotypes = [x[1] for x in manifest]
input: genotypes, group_by = 1, paired_with = dict(chrom=chroms)
output: f"{_input:nn}.{race}.vcf.gz"
task: trunk_workers = 1, trunk_size = 1, walltime = '5m', mem = '2G', cores = 1, tags = f'{step_name}_{_output:bn}'
bash: expand = True
grep -w "{race}" {panel_meta} | cut -f 1 > {_output:nn}_extracted.txt
bcftools view -S {_output:nn}_extracted.txt {_input} -Oz > {_output}
tabix -p vcf {_output}
rm {_output:nn}_extracted.txt
if not str(reference).endswith('vcf.gz'):
with open(f"{reference:n}.{race}.{reference:x}", 'w') as f:
for item in _output:
f.write(f'{item.chrom} {item}\n')
For step 2, we obtain genotype matrix $G_2$ and summary statistics $S_2$, compare ref/alt in $S_2$ and $G_2$ to obtain $S_3$ and $G_3$, then calculate $R = \mathrm{row\_corr}(G_3)$ and save it to disk.
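Here row_corr is just the Pearson correlation between rows (SNPs) of $G_3$; a minimal numpy illustration with invented dosages:
import numpy as np

# np.corrcoef treats each row (SNP) as a variable, so the result is
# the SNP-by-SNP LD matrix, exactly what the step below writes out
G3 = np.array([[0, 1, 2, 1, 0],   # hypothetical dosages, SNP 1
               [0, 1, 2, 2, 0],   # SNP 2
               [2, 1, 0, 1, 2]])  # SNP 3
R = np.corrcoef(G3)               # shape (3, 3), entries in [-1, 1]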
# Get summary stats
[default_1 (get summary statistics)]
depends: Py_Module('cyvcf2')
# Have to be 5 or 6 numbers indicating the required columns
# to be extracted from summary statistics file
# 6 numbers for ['chr','bp','ref','alt','beta','se']
# 5 numbers for ['chr','bp','ref','alt','z']
parameter: columns = [1,2,3,4,5,6]
# Locus ID column index, optional (0 means no locus ID column)
parameter: loci_column = 0
# use --no-header to indicate that input summary statistics file does not have a header
parameter: header = True
# when set to N (typically 0, 1 or -1) the genomic position
# in the panel will be adjusted by N, i.e. position = position + N
# This is useful when summary stats and reference panel have different coordinates (off by 1 position)
parameter: adjust_panel_position = 1
# Either empty, or `chr`
parameter: panel_chrom_prefix = ''
import os
fail_if(len(columns) not in (5,6), msg = 'Input column ID has to be of length 5 or 6.')
if str(reference).endswith('vcf.gz'):
genotype = reference
else:
genotype = dict([tuple(x.strip().split()) for x in open(f'{reference:a}').readlines()])
for k in genotype:
genotype[k] = os.path.join(f'{reference:ad}', genotype[k])
chunks = [x.strip().split() for x in open(f'{loci:a}').readlines() if x.strip() and not x.strip().startswith('#')]
prefix = paths([f"{ss_data:n}/{c[-1] if len(c) > 3 else '%s_%s_%s' % (c[0], c[1], c[2])}" for c in chunks])
input: for_each = 'chunks'
output: summary_stats = f"{prefix[_index]}/{prefix[_index]:b}.summary_stats.gz",
ld_matrix = f"{prefix[_index]}/{prefix[_index]:b}.LD.gz"
task: trunk_workers = 1, trunk_size = job_size, walltime = '5m', mem = '5G', cores = 1, tags = f'{step_name}_{_output["summary_stats"]:bn}'
python3: expand = "${ }", stdout = f'{_output["summary_stats"]:nn}.allele_flip.log'
from cyvcf2 import VCF
import pandas as pd, numpy as np
import datetime, sys
def exit_if_empty(x):
is_empty = (len(x) == 0) if isinstance(x, list) else x.empty
if is_empty:
open("${_output['summary_stats']}", 'w').close()
open("${_output['ld_matrix']}", 'w').close()
sys.exit(0)
# Step 1: extract relevant summary statistics
print(datetime.datetime.now(),"\n")
locus = ${_chunks}
col_to_use = ${[x-1 for x in columns]}
loci_col = ${loci_column}
if loci_col > 0:
col_to_use.append(loci_col-1)
col_ranks = [x for x in sorted(range(len(col_to_use)), key=col_to_use.__getitem__)]
S1 = pd.read_table(${ss_data:ar}, compression='gzip', header = ${0 if header else None}, index_col=False, usecols = col_to_use)
if loci_col > 0:
columns = ['chr','bp','ref','alt','beta','se','locus_id'] if S1.shape[1] == 7 else ['chr','bp','ref','alt','z','locus_id']
else:
columns = ['chr','bp','ref','alt','beta','se'] if S1.shape[1] == 6 else ['chr','bp','ref','alt','z']
S1.columns = [columns[r] for r in col_ranks]
S1 = S1[columns]
S1["chr"] = S1["chr"].apply(lambda x: str(x).replace("chr", ""))
locus[0] = str(locus[0]).replace("chr", "")
if "locus_id" in S1.columns and len(locus) == 4:
S1 = S1[(S1["chr"] == locus[0]) & (S1["bp"] >= int(locus[1])) & (S1["bp"] < int(locus[2])) & (S1["locus_id"] == locus[3])]
else:
S1 = S1[(S1["chr"] == locus[0]) & (S1["bp"] >= int(locus[1])) & (S1["bp"] < int(locus[2]))]
exit_if_empty(S1)
print("SAMPLE: %s\tLENGTH:%d" % ("S1",len(S1)))
S1["ID"] = S1["chr"] + ":" + S1["bp"].astype(str)
# Step 2: get genotypes
chromosome = '${panel_chrom_prefix}'+'${_chunks[0].replace("chr","")}'
panel = ${path(genotype[_chunks[0].replace("chr","")] if isinstance(genotype, dict) else genotype):ar}
# loci region
queryid = chromosome + ":" + "${_chunks[1]}" + "-" + "${_chunks[2]}"
off_set = ${adjust_panel_position}
# scan VCF chunk
vcf = VCF(panel, gts012=False)
res = []
n_G0 = 0
var_ids = set(S1["ID"].tolist())
## Attention: cyvcf2 reports variant.start 0-based while summary statistics positions are 1-based, so set off_set=1 for variant.start
for variant in vcf(queryid):
n_G0 += 1
var_id = f'{variant.CHROM.replace("chr","")}:{variant.start+off_set}'
if var_id not in var_ids:
continue
for i in range(len(variant.ALT)):
line = [var_id, variant.REF, variant.ALT[i]] + \
[x[:-1].count(i+1) for x in variant.genotypes]
if len(set(line[3:])) == 1:
# remove non-variant site in reference panel
continue
res.append(line)
print("SAMPLE: %s\tLENGTH:%d" % ("G0",n_G0))
exit_if_empty(res)
G1 = pd.DataFrame(res, columns = ['ID', 'ref', 'alt'] + vcf.samples)
print("SAMPLE: %s\tLENGTH:%d" % ("G1",len(G1)))
# Step 3: overlap genotypes and summary statistics for LD matrix
# S2
# have to drop duplicates in S2 because in some files there are duplicates!
S2 = S1[S1["ID"].isin(G1["ID"])].drop_duplicates()
exit_if_empty(S2)
#S2["ID"] = S2["ID"].fillna
print("SAMPLE: %s\tLENGTH:%d" % ("S2",len(S2)))
# G2
# have to drop duplicates in G2 because for some reason there are duplicates in the reference panel
G2 = G1[G1["ID"].isin(S1["ID"])].drop_duplicates()
exit_if_empty(G2)
print("SAMPLE: %s\tLENGTH:%d" % ("G2",len(G2)))
# Compare S2 and G2 reference and alternative alleles
# gt() flags swapped ambiguous pairs (A/T vs T/A, G/C vs C/G) when strand_flip is on
def gt(s1,s2,s3,s4):
if (s1+s2 == "AT" and s3+s4 == "TA") or (s1+s2 == "GC" and s3+s4 == "CG" ) or (s1+s2 == "TA" and s3+s4 == "AT") or (s1+s2 == "CG" and s3+s4 == "GC"):
return ${0 if strand_flip else 1}
else:
return 1
# flip
def atcg(inp):
if inp == "A":
return "T"
elif inp == "T":
return "A"
elif inp == "G":
return "C"
elif inp == "C":
return "G"
# S3
FLIPD = []
# adjusted
ADJ = 0
# equal
EQUAL = 0
# flipped by strand
FLIPNUM = 0
# flipped by reference and alternative
REFALTNUM = 0
# keep at/ta/cg/gc line numbers
at_cg_id = []
for i in range(len(S2)):
old_id = S2.iloc[i,0] + ":" + str(S2.iloc[i,1]) + ":" + S2.iloc[i,2] + ":" + S2.iloc[i,3]
line = S2.iloc[i,:-1].tolist() + [old_id]
G2S2 = G2[G2["ID"]==S2.iloc[i,-1]]
if all([set([S2.iloc[i,2],S2.iloc[i,3]]) != set([G2S2["ref"].iloc[idx], G2S2["alt"].iloc[idx]]) for idx in range(G2S2.shape[0])]):
ADJ += 1
for idx in range(G2S2.shape[0]):
# not at/ta/gc/cg
if gt(S2.iloc[i,2],S2.iloc[i,3],G2S2["ref"].iloc[idx],G2S2["alt"].iloc[idx])==1:
if G2S2["ref"].iloc[idx] == S2.iloc[i,2] and G2S2["alt"].iloc[idx] == S2.iloc[i,3]:
EQUAL+=1
FLIPD.append(line)
elif G2S2["alt"].iloc[idx] == S2.iloc[i,2] and G2S2["ref"].iloc[idx] == S2.iloc[i,3]:
REFALTNUM+=1
line[2], line[3] = line[3], line[2]
line[4] = -line[4]
FLIPD.append(line)
elif atcg(S2.iloc[i,2]) == G2S2["ref"].iloc[idx] and atcg(S2.iloc[i,3]) == G2S2["alt"].iloc[idx]:
FLIPNUM+=1
line[2], line[3] = atcg(line[2]), atcg(line[3])
FLIPD.append(line)
elif atcg(S2.iloc[i,2]) == G2S2["alt"].iloc[idx] and atcg(S2.iloc[i,3]) == G2S2["ref"].iloc[idx]:
REFALTNUM+=1
FLIPNUM+=1
line[2], line[3] = atcg(line[3]), atcg(line[2])
line[4] = -line[4]
FLIPD.append(line)
else:
# allele combination that cannot be reconciled; the variant is
# dropped and counted in MISMATCH below
pass
# at/ta/cg/gc only do ref/alt flip
elif gt(S2.iloc[i,2],S2.iloc[i,3],G2S2["ref"].iloc[idx],G2S2["alt"].iloc[idx])==0:
if G2S2["ref"].iloc[idx] == S2.iloc[i,2] and G2S2["alt"].iloc[idx] == S2.iloc[i,3]:
EQUAL+=1
FLIPD.append(line)
at_cg_id.append(len(FLIPD) - 1)
elif G2S2["alt"].iloc[idx] == S2.iloc[i,2] and G2S2["ref"].iloc[idx] == S2.iloc[i,3]:
REFALTNUM+=1
line[2], line[3] = line[3], line[2]
line[4] = -line[4]
FLIPD.append(line)
at_cg_id.append(len(FLIPD) - 1)
else:
# unresolvable ambiguous record; dropped and counted in MISMATCH below
pass
if ${strictly_match} and (FLIPNUM > 0 or REFALTNUM > 0):
raise ValueError(f"Strict panel match failed for locus {locus}.")
#
MISMATCH = len(S2) - len(FLIPD)
if FLIPNUM > 0:
# flips involved, we need to remove at/ta/cg/gc
FLIPD = [i for j, i in enumerate(FLIPD) if j not in at_cg_id]
else:
at_cg_id = []
columns = [x for x in S1.columns.tolist() if x != 'ID'] + ["original_ID"]
S3 = pd.DataFrame(FLIPD, columns=columns)
S3["ID"] = S3["chr"] +":"+S3["bp"].astype(str) + ":" + S3["ref"] + ":" + S3["alt"]
G2["ID"] = G2[['ID', 'ref', 'alt']].apply(lambda x: ':'.join(x), axis=1)
# G3
G3 = G2[G2["ID"].isin(S3["ID"])]
assert G3.shape[0] == S3.shape[0]
exit_if_empty(S3)
# recover original ID
S3["ID"] = S3.pop("original_ID")
S3 = S3.sort_values(by = ["chr", "bp"]).drop(columns = [x for x in S3.columns if not x in ('ID', 'z', 'beta', 'se')])
columns = S3.columns.tolist()
columns.insert(0, columns.pop(columns.index('ID')))
S3[columns].to_csv("${_output['summary_stats']}",sep="\t",index=False,header=None, compression = 'gzip')
print("SAMPLE: %s\tLENGTH:%d" % ("S3",len(S3)))
print("equal:%s\nflipped by strand:%d\nflipped by reference and alternative:%d\ntotal adjusted:%d\ntotal mismatch:%d\ntotal A/T and C/G removed:%d" % (EQUAL,FLIPNUM,REFALTNUM,ADJ,MISMATCH,len(at_cg_id)))
# R = row_corr(G3), written to the LD matrix output
np.savetxt("${_output['ld_matrix']}", np.corrcoef(G3.drop(columns=['ID', "ref","alt"]).values) if G3.shape[0] > 1 else np.array([[1]]), fmt = '%.5f')
# Get annotations
[default_2 (get annotations)]
parameter: annotation = "name:/path/to/annotation/file"
anno, prior_file = annotation.split(':')
prior_file = path(prior_file)
stop_if(not prior_file.is_file(), msg = 'Quit because annotation file is not found')
output: f"{_input['summary_stats']:nn}.{anno}.gz"
skip_if(_input['summary_stats'].stat().st_size == 0 or _input['ld_matrix'].stat().st_size == 0)
task: trunk_workers = 1, trunk_size = job_size, walltime = '10m', mem = '2G', cores = 1, tags = f'{step_name}_{_output:bn}'
python3: expand="${ }"
import pandas as pd
A1 = pd.read_table("${prior_file}", compression='gzip', sep=r"\s+", header=None)
A1.columns = ["ID","VALUE"]
S3 = pd.read_table("${_input['summary_stats']}",sep="\t", header=None, compression='gzip')
# have to drop duplicates in A2 because in some files there are duplicates!
A2 = A1[A1["ID"].isin(S3.iloc[:,0])][["ID","VALUE"]].drop_duplicates()
if set(S3.iloc[:,0]) != set(A2["ID"]):
raise ValueError("SNPs in summary statistics and annotation are not identical.")
else:
A2 = A2.set_index("ID")
A2 = A2.reindex(S3.iloc[:,0])
A2 = A2.reset_index()
A2.to_csv("${_output}",sep="\t",index=False,header=None, compression='gzip')
eQTL data example run:
sos run workflow/summary_statistics_wrangler.ipynb --reference /home/gaow/tmp/19-Dec-2018/1KG_EUR/test.manifest \
--loci /home/gaow/tmp/19-Dec-2018/metasoft/one_loci.txt \
--ss-data /home/gaow/tmp/19-Dec-2018/metasoft/gtex_metasoft_summary_stats.gz \
--columns 1 2 3 4 6 7 --loci-column 5 --no-header \
-q none -j 8
GWAS data example run (with annotations):
sos run workflow/summary_statistics_wrangler.ipynb --reference /home/gaow/tmp/19-Dec-2018/1KG_EUR/test.manifest \
--loci /home/gaow/tmp/19-Dec-2018/SCZ/chunks.list \
--ss-data /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics.gz \
--columns 1 2 3 4 6 7 --adjust-panel-position 1 \
--annotation atac-seq:/home/gaow/tmp/19-Dec-2018/SCZ/Annotation_atac-seq.gz \
-q none -j 8
For UChicago midway users: to run these pipelines on the RCC cluster, please take a look at the midway2.yml file (found in this repository), modify it as you see fit, and replace -q none -j 8 with:
-c workflow/midway2.yml -q midway2 -J 40
%preview /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.allele_flip.log -n --limit 20
%preview /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.summary_stats.txt -n -l 10
%preview /home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.Annotation_atac-seq.txt -n -l 10
import pandas as pd
pd.read_table('/home/gaow/tmp/19-Dec-2018/SCZ/Summary_statistics/chr9_84630941_84813641/chr9_84630941_84813641.LD.txt', sep = ' ', header = None, nrows = 5)
Exported from workflow/summary_statistics_wrangler.ipynb
committed by minqiao on Mon Jul 1 18:51:56 2019 revision 36, 69d4049