This notebook investigates the impact of annotations on fine-mapping performance.
The simulation study compares fine-mapping with and without the use of annotations, in terms of the "top hit rate" defined in the overview below.
We use DAP-G to analyze the data. We run and compare two versions of DAP-G: one uses the "oracle" priors from the enrichment-based simulation; the other uses uniform priors. To speed up DAP computations we set its -msize parameter to the oracle number $L$.
We simulate fine-mapping analysis units of 500 variants each, with LD structure estimated from European genotypes of the GTEx project (version 8 release). Each unit contains more than 2 LD blocks for this data-set. We generate expected GWAS z-scores under a polygenic model for SCZ, that is, a total PVE of 0.8 contributed by ~6,000 variants. We assume the effect sizes of causal variants are identically distributed, but the location of causal variants is associated with genomic annotations: an ATAC-seq annotation and an ASCA annotation. From previous results, the enrichment estimate (odds ratio) is 3 to 6 for ATAC-seq peaks of five neuron cell types (cell types to be specified), which in total cover 13.36% of the genome. ASCA in Glucose transporter and NP cells involves 5% of variants in ATAC-seq peaks at an FDR cut-off of 5%, yet has an enrichment odds ratio of 55. We assume one causal variant per analysis unit, and position it such that the expected enrichment is 3 for the ATAC-seq annotation and 55 for the ASCA annotation.

Observed GWAS z-scores are then drawn from the simulated expected z-scores, taking into consideration the convolution effect due to LD. We simulate studies of sample sizes 50K, 80K and 150K, and replicate the procedure over 2,090 such simulated units. We fine-map the simulated data with DAP-G and compare performance between DAP-G using the simulated annotations as priors vs. DAP-G ignoring annotations. Specifically, we look at the "top hit rate": the proportion of times that the strongest signal identified by DAP-G is the causal signal in our simulation replicates.
The general idea is to simulate summary statistics enriched in genomic annotations, since most studies nowadays work with summary statistics. The quantities we need are derived below.
We generate data from the standard linear regression model,
\begin{align} y=X\beta + E, \quad E \sim N(0, \sigma_a^2), \end{align}where both $X$ and $y$ are centered, so the intercept term is absent. The percentage of variance explained (PVE), assuming additive polygenic effects, is:
\begin{align} PVE & = \frac{var(X\beta)}{var(y)} \\ & = \sum_{j\in C} \frac{var(X_j\beta_j)}{var(y)} \\ & = \sum_{j\in C} PVE_j, \end{align}where $C$ is the causal set whose effects are independent, and for SCZ
\begin{align} PVE_j \approx \frac{0.8}{1~\text{million} \times 0.006} = 1.34\times 10^{-4}. \end{align}Assume $y$ is scaled to have unit variance. Then $PVE_j = var(X_j\beta_j)$. For causal SNP $j$,
\begin{align} var(X_j\beta_j) &= \beta_j^2var(X_j) \\ &= 2f_j(1-f_j)\beta_j^2 \end{align}where $f_j$ is the causal allele frequency. The effect size is then
\begin{align} \beta_j &= (-1)^{\mathbb{1}[f_j>0.5]}\sqrt{\frac{PVE_j}{2f_j(1-f_j)}} \end{align}that is, under the conventional genotype coding of 0 for the major and 1 for the minor allele, the direction of the effect size $\beta_j$ is positive if the causal allele is a minor allele.
If the columns of $X$ are independent, the estimated effect size follows the distribution $\hat{\beta}_j \sim N(\beta_j, \sigma_j^2)$, where the variance for the BLUE (best linear unbiased estimator) of $\beta_j$ is
\begin{align} \sigma_j^2 &= \frac{var(y)}{Nvar({X_j})} \\ &= \frac{1}{Nvar({X_j})} \\ &= \frac{1}{2Nf_j(1-f_j)} \end{align}The sample size $N$ for the SCZ GWAS study here is 80K. Hence the z-score summary statistic for causal SNP $j$ is set to
\begin{align} z_j &= \frac{\beta_j}{\sigma_j} \\ &= (-1)^{\mathbb{1}[f_j>0.5]} \sqrt{N \times PVE_j} \\ &= (-1)^{\mathbb{1}[f_j>0.5]} \times 3.27 \end{align}z-scores are zero for non-causal SNPs $j \not\in C$.
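As a quick numeric check of these formulas (a sketch in Python; the allele frequency `f_j` is an arbitrary example value):

import numpy as np
N = 80000          # SCZ GWAS sample size
pve_j = 1.34e-4    # per-SNP PVE under the polygenic model
f_j = 0.3          # example causal (minor) allele frequency
# effect size under the 0/1/2 genotype coding, where var(X_j) = 2f(1-f);
# positive here since f_j < 0.5
beta_j = np.sqrt(pve_j / (2 * f_j * (1 - f_j)))
# standard error of the BLUE, and the implied expected z-score
sigma_j = np.sqrt(1 / (2 * N * f_j * (1 - f_j)))
z_j = beta_j / sigma_j
print(z_j, np.sqrt(N * pve_j))  # both ~= 3.27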
Update: the proper way to model the polygenic effect is to use a random-effects model for SNPs, along the lines of the GCTA paper, with summary statistics along the lines of the RSS paper. Let $z_j$ be the true (standardized) effect size of SNP $j$; the prior distribution is:
$$z_j \sim (1-\pi) \delta_0 + \pi N(0, \sigma^2)$$For a causal SNP, the parameter $\sigma^2$ is related to PVE. Assuming all causal SNPs contribute equally,
$$PVE_j = \frac{\sigma^2}{N}$$so for our data $\sigma = \sqrt{PVE_j \times N} = 3.27$. In my implementation below, for $j \in C$, instead of setting the expected z-score to the fixed value $\pm 3.27$, I draw it from $N(0, 3.27^2)$.
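A minimal sketch of drawing expected z-scores from this spike-and-slab prior (the per-SNP probabilities `pi_j` are hypothetical here; the actual per-SNP priors come from the annotation model below):

import numpy as np
M = 500                            # SNPs per analysis unit
sigma = np.sqrt(80000 * 1.34e-4)   # ~= 3.27
pi_j = np.full(M, 1 / M)           # hypothetical priors, summing to L = 1
causal = np.random.binomial(1, pi_j)             # spike: 0 with probability 1 - pi_j
z_true = causal * np.random.normal(0, sigma, M)  # slab: N(0, sigma^2) for causal SNPs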
In practice the columns of $X$ are correlated due to LD. Observed GWAS z-scores $\hat{z}_j$ vary randomly about the expected z-scores $z_j$ derived above, with variance 1 and correlation between SNPs $j_1$ and $j_2$ equal to $cor(X_{j_1}, X_{j_2})$ (the Methods section of the CAVIAR paper has a justification, for example). We simulate observed z-scores from a multivariate normal distribution, $$\hat{Z} \sim MVN(RZ, R)$$ where the $R$ matrix, the correlation between SNPs in the simulated genomic region, is the LD matrix.
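In code this is a single multivariate normal draw; a toy sketch with a fabricated 5-SNP AR(1) LD matrix:

import numpy as np
rho = 0.6
R = rho ** np.abs(np.subtract.outer(np.arange(5), np.arange(5)))
z_true = np.array([0., 0., 3.27, 0., 0.])  # one causal SNP
# observed z-scores: mean R @ z_true, covariance R
z_hat = np.random.multivariate_normal(R @ z_true, R)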
For simplicity we create 2,090 arbitrary chunks of genomic regions (fine-mapping analysis units), each containing 500 SNPs. That is, 2,090 simulations with different underlying LD structures. Each unit should have one or more LD blocks. We assume one causal variant per unit. The exact positions at which these causal variants occur are associated with genomic annotations.
Suppose we have $K$ non-overlapping, independent annotation regions. Let $p_k$ and $p_0$ denote the causal probability of SNPs inside and outside annotation $k$'s region, respectively.
\begin{align} \gamma_k & = \frac{p_k/(1-p_k)}{p_0/(1-p_0)} \\ L & = \left[\sum_k q_kp_k + \left(1-\sum_k q_k\right)p_0\right] \times M \end{align}where $\gamma_k$ is the odds ratio for annotation $k$ obtained from enrichment analysis, $L$ is the number of causal variants in the region ($L = 1$), $M$ is the length of the region ($M=500$), and $q_k$ is the proportion of the region overlapping with functional annotation $k$ (see the choice of $q$ below for the two types of annotations).
From a previous enrichment analysis of 5 ATAC-seq annotations, we estimate enrichment of GWAS signals in these regions with odds ratios ranging from 3.70 to 6.02 (mean 4.74). We can try $\gamma_1 = 3, 5, 7$. The 5 annotations, when combined, physically cover a total of 13.36% of the genome. The average size of the combined peaks is 576; thus, if we assume peaks are uniformly distributed over the 1 million uniformly distributed variants we test, each consecutive annotated region should contain fewer than one variant. In that case we simply assign 13.36% of the variants in an analysis unit to the annotation, without worrying about their contiguity.
From a previous enrichment analysis of ASCA, we have an odds ratio of $\gamma_2 = 55$ for ASCA variants at FDR < 0.05. On average, 5% of variants in each ATAC-seq peak are ASCA. To add in this annotation, for simplicity we now assume 12.69% of variants carry the ATAC-seq annotation and 0.66% the ASCA annotation.
The simulation study is implemented in the DSC framework. The input data are simply matrices of genotypes.
To run the benchmark, first click the "restart kernel and run all" button inside the SoS notebook. Then run the exported scripts with 8 CPU threads:
dsc master.dsc -c 8
zzz.dsc
%save -f modules/zzz.dsc
%include modules/simulate_region
%include modules/simulate_z
%include modules/fit
%include modules/evaluate
master.dsc
%save -f master.dsc
#!/usr/bin/env dsc
%include modules/zzz
DSC:
define:
fit: dap, dapa
    # FIXME: I should run 2 separate pipelines for dap and dapa,
    # because dap only runs one scenario but dapa runs multiple.
    # Now I in fact run the same dap multiple times; but I can live with it,
    # because dap is really fast in this simulation and the result is easier to parse
    # with this DSC logic.
run: sample_region * simulate_prior * simulate_z * fit * evaluate
exec_path: modules
global:
data_file: gtex-manifest.txt
n_units: 150
output: xh_grant
simulate_region.dsc
%save -f modules/simulate_region.dsc
sample_region: sim_region.R + R(data = readRDS(dataset);
X = get_loci(data$X, M);
R = lapply(1:length(X), function(i) round(cor(X[[i]]),4));
eff_sign = get_sign(X);
status = lapply(1:length(R), function(i) write.table(R[[i]],paste0(ld_file, '.', i),quote=F,col.names=F,row.names=F)))
dataset: Shell{head -${n_units} ${data_file}}
M: 500
$eff_sign: eff_sign
$R: R
$ld_file: file(ld)
simulate_prior: sim_region.R + R(prior = get_prior(nrow(read.table(paste0(ld_file, '.1'))), eparam[[1]], eparam[[2]]))
ld_file: $ld_file
# enrichment parameters:
  # should be (gamma, q) where gamma is a vector of odds ratios, q the corresponding vector of proportions of annotated variants
eparam: (3, 0.1336), ((3, 55), (0.1269, 0.00668))
$prior: prior$prior
$annotation: prior$annotation
simulate_z.dsc
%save -f modules/simulate_z.dsc
simulate_z: sim_z.py + Python(z, z_true, L = simulate(R, N, pve, eff_sign, n_signal, prior))
R: $R
eff_sign: $eff_sign
prior: $prior
N: 50000, 80000, 150000
n_signal: 1
pve: 1.34E-4
$z: z
$z_true: z_true
$L: L
fit.dsc
%save -f modules/fit.dsc
dap: fit_dap.py + Python(posterior = dap_batch_z(z, ld, L, cache, None, args))
ld: $ld_file
z: $z
L: $L
args: "-ld_control 0.20 --all"
cache: file(DAP)
$posterior: posterior
dapa(dap): fit_dap.py + Python(posterior = dap_batch_z(z, ld, L, cache, prior, args))
prior: $prior
evaluate.dsc
%save -f modules/evaluate.dsc
evaluate: evaluate_dap.py
z_true: $z_true
posterior: $posterior
$is_recovered: is_recovered
$is_cs_true: is_cs_true
$is_top_true: is_top_true
$size: size
$purity: purity
sim_region.R
To simulate priors for every SNP we need to determine $p_0$ and $p_k$. From the equations above, for $L=1$ we derive:
\begin{align} p_k = \frac{\gamma_k p_0}{1 - p_0 + \gamma_k p_0} \\ \left[\sum_k q_k \frac{\gamma_k p_0}{1 - p_0 + \gamma_k p_0} + \left(1-\sum_k q_k\right)p_0\right] \times M - 1 = 0 \end{align}We can solve this numerically in R, e.g.:
g = c(4.5, 55)
M = 500
q = c(0.1269, 0.0066)
foo = function(x) M * (sum(q * g * x / (1 - x + g * x)) + (1 - sum(q)) * x) - 1
p0 = uniroot(foo, lower=0, upper=1, tol = .Machine$double.eps^0.8)$root
p1 = g * p0 / (1-p0+g*p0)
print(c(p0, p1))
# verify it:
print(p1/(1-p1) / (p0 / (1-p0)))
print(M * (sum(q * p1) + (1-sum(q)) * p0))
These probabilities recover the odds ratios, so we should be good. I'll use this in my code below.
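The same root can be cross-checked in Python (a sketch, assuming scipy is available):

import numpy as np
from scipy.optimize import brentq
g = np.array([4.5, 55])
q = np.array([0.1269, 0.0066])
M = 500
def foo(x):
    # expected number of causal variants in the unit, minus L = 1
    p1 = g * x / (1 - x + g * x)
    return M * (np.sum(q * p1) + (1 - np.sum(q)) * x) - 1
p0 = brentq(foo, 1e-12, 1 - 1e-12)
p1 = g * p0 / (1 - p0 + g * p0)
print(p0, p1)  # should match the uniroot solution above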
%save -f modules/sim_region.R
# make a correlation matrix positive semi-definite by clipping negative
# eigenvalues and rescaling the diagonal back to 1
CorrectCM <- function(CM)
{
  n <- nrow(CM)
  E <- eigen(CM)
  CM1 <- E$vectors %*% tcrossprod(diag(pmax(E$values, 0), n), E$vectors)
  Balance <- diag(1/sqrt(diag(CM1)))
  CM2 <- Balance %*% CM1 %*% Balance
  return(CM2)
}
# split genotype matrix X into consecutive, non-overlapping blocks of N columns
get_loci = function(X, N) {
  segs = floor(ncol(X) / N)
  lapply(1:segs, function(i) X[, ((i - 1) * N + 1):(i * N)])
}
# given unit length M, odds ratios g and annotated proportions q,
# solve for per-SNP priors: p0 outside annotations, p1 inside each annotation
get_prior = function(M, g, q) {
  foo = function(x) M * (sum(q * g * x / (1 - x + g * x)) + (1 - sum(q)) * x) - 1
  p0 = uniroot(foo, lower=0, upper=1, tol = .Machine$double.eps^0.8)$root
  p1 = g * p0 / (1-p0+g*p0)
  n_anns = floor(M * q)
  # scatter the annotated positions randomly within the unit
  annotated = lapply(1:length(n_anns), function(i) sample(c(rep(1,n_anns[i]), rep(0, M - n_anns[i]))))
  prior = rep(p0, M)
  annotation = rep(0, M)
  for (i in 1:length(p1)) {
    prior[which(annotated[[i]] == 1)] = p1[i]
    annotation = annotation + annotated[[i]]
  }
  list(prior=prior, annotation=annotation)
}
# effect sign per SNP: +1 if the coded allele is minor, -1 if major
get_sign = function(X) {
  lapply(1:length(X), function(i) apply(X[[i]], 2, function(x) (-1)^as.integer((mean(x)/2) > 0.5)))
}
sim_z.py
%save -f modules/sim_z.py
import numpy as np
def sim_gwas_z(R, N, pve, eff_sign, n_signal, prior):
np.random.seed(int(np.sum(np.absolute(np.array(R[:,1])))))
    # get expected z-scores assuming all SNPs are causal;
    # the sign of Z does not matter for our purpose,
    # because it is only a matter of allele coding ...
    z_true = np.random.normal(0, np.sqrt(N * pve)) * eff_sign
    # sparsify expected z-scores to allow for n_signal causal SNPs
    # FIXME: draws might overlap if some priors are very high,
    # so exactly n_signal non-zeros is not guaranteed;
    # but this function is quite convenient to call
    z_true *= np.random.multinomial(n_signal, prior)
# get observed z-scores
# FIXME: there is a PSD warning I cannot get rid of for now
# simply ignore it ...
z = np.random.multivariate_normal(np.ravel(R @ z_true), R, check_valid = 'ignore')
return z, z_true
def simulate(R, N, pve, eff_sign, n_signal, prior):
z_true = {k:[] for k in R}
z = {k:[] for k in R}
for k in R:
z[k], z_true[k] = sim_gwas_z(R[k], N, pve, eff_sign[k], n_signal, prior)
return z, z_true, {k: sum(z_true[k]!=0) for k in z_true}
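A toy usage example for `sim_gwas_z` (a sketch with a fabricated 5-SNP LD matrix and a uniform prior summing to $L=1$):

import numpy as np
rho = 0.6
R = rho ** np.abs(np.subtract.outer(np.arange(5), np.arange(5)))
eff_sign = np.ones(5)        # pretend all causal alleles are minor
prior = np.full(5, 0.2)      # uniform prior over 5 SNPs
z, z_true = sim_gwas_z(R, 80000, 1.34e-4, eff_sign, 1, prior)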
fit_dap.py
Below is example output for DAP-G:
Posterior expected model size: 0.500 (sd = 0.500)
LogNC = -0.30685 ( Log10NC = -0.133 )
Posterior inclusion probability
((1)) 7492 6.68581e-05 0.000 1
((2)) 7490 6.68581e-05 0.000 1
... 7 lines
((8)) 7491 6.68046e-05 0.000 2
((9)) 7483 6.68046e-05 0.000 2
((10)) 7485 6.68046e-05 0.000 2
... 13 lines
((20)) 7459 6.68046e-05 0.000 2
((21)) 7482 6.67422e-05 0.000 -1
((22)) 7489 6.67422e-05 0.000 -1
... other lines until below ...
Independent association signal clusters
cluster member_snp cluster_pip average_r2
{1} 7 4.680e-04 0.951 0.951 0.037
{2} 13 8.685e-04 0.623 0.037 0.623
%save -f modules/fit_dap.py
import subprocess
import pandas as pd
import numpy as np
def write_dap_z(z, prefix):
    '''z-score version of dap input is the same as FINEMAP'''
ids = np.array([str(i+1) for i in range(z.shape[0])])
with open(f'{prefix}.z', 'w') as f:
np.savetxt(f, np.vstack((ids, z)).T, fmt = '%s', delimiter = ' ')
def write_prior(prior, prefix):
with open(f'{prefix}.prior', 'w') as f:
f.write('\n'.join([f'{i+1}\t{p}' for i, p in enumerate(prior)]))
def run_dap_z(ld, r, L, prior, prefix, args):
cmd = ['dap-g', '-d_z', f'{prefix}.z', '-d_ld', f'{ld}.{r}', '-o', f'{prefix}.result', '--output_all'] + ' '.join(args).split()
if prior is not None:
cmd.extend(['-p', f'{prefix}.prior'])
if L > 0:
cmd.extend(['-msize', str(L)])
subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
def extract_dap_output(prefix):
out = [x.strip().split() for x in open(f'{prefix}.result').readlines()]
pips = []
clusters = []
still_pip = True
for line in out:
if len(line) == 0:
continue
if len(line) > 2 and line[2] == 'cluster_pip':
still_pip = False
continue
if still_pip and (not line[0].startswith('((')):
continue
if still_pip:
pips.append([line[1], float(line[2]), float(line[3]), int(line[4])])
else:
clusters.append([len(clusters) + 1, float(line[2]), float(line[3])])
pips = pd.DataFrame(pips, columns = ['snp', 'snp_prob', 'snp_log10bf', 'cluster'])
clusters = pd.DataFrame(clusters, columns = ['cluster', 'cluster_prob', 'cluster_avg_r2'])
clusters = pd.merge(clusters, pips.groupby(['cluster'])['snp'].apply(','.join).reset_index(), on = 'cluster')
return {'snp': pips, 'set': clusters}
def dap_single_z(z, ld, L, prefix, r, prior, args):
write_dap_z(z,prefix)
if prior is not None:
write_prior(prior, prefix)
run_dap_z(ld, r, L, prior, prefix, args)
return extract_dap_output(prefix)
def dap_batch_z(Z, ld, L, prefix, prior, *args):
return dict([(k, dap_single_z(Z[k], ld, L[k], f'{prefix}_condition_{k}', k, prior, args)) for k in Z])
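Example usage mirroring the DSC `dap` and `dapa` modules (a sketch; it assumes `dap-g` is on the path, an LD file `ld_prefix.1` exists, and `z` and `prior` come from the simulation above):

# uniform priors (module dap): pass None for prior
posterior = dap_batch_z({1: z}, 'ld_prefix', {1: 1}, 'dap_cache', None, '-ld_control 0.20 --all')
# annotation-informed priors (module dapa)
posterior_a = dap_batch_z({1: z}, 'ld_prefix', {1: 1}, 'dapa_cache', prior, '-ld_control 0.20 --all')
posterior[1]['snp'].head()  # per-SNP PIP table for unit 1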
evaluate_dap.py
%save -f modules/evaluate_dap.py
def dap_summary(cluster, snp, z_true):
    # expected causal signals: ids (1-based) of SNPs with non-zero true z-scores
    signal_expected = [f'{idx+1}' for idx, i in enumerate(z_true) if i != 0]
    # keep 95% signal clusters only
    cluster = cluster.loc[cluster['cluster_prob'] > 0.95]
    if cluster.shape[0] == 0:
        return ["failed"] * 5
    # to return: within-cluster average r2, as a purity measure
    purity = cluster['cluster_avg_r2'].tolist()
cluster = [x.split(',') for x in cluster['snp']]
signal_detected = sum(cluster, [])
# to return
size = [len(c) for c in cluster]
snp = [snp.loc[snp['snp'].isin(c)] for c in cluster]
top_snp = [s.loc[s['snp_prob'] == max(s['snp_prob'])]['snp'].tolist()[0] for s in snp]
# to return
is_top_true = [1 if x in signal_expected else 0 for x in top_snp]
# to return
is_recovered = [1 if x in signal_detected else 0 for x in signal_expected]
# to return
is_cs_true = [1 if len(set(x).intersection(signal_expected)) else 0 for x in cluster]
return is_recovered, is_cs_true, is_top_true, size, purity
is_recovered = dict()
is_cs_true = dict()
is_top_true = dict()
size = dict()
purity = dict()
for k in posterior:
is_recovered[k], is_cs_true[k], is_top_true[k], size[k], purity[k] = dap_summary(posterior[k]['set'], posterior[k]['snp'], z_true[k])
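The "top hit rate" from the overview can then be aggregated across units, e.g. (a sketch; units with no 95% cluster, flagged 'failed', are excluded):

import numpy as np
valid = [k for k in is_top_true if is_top_true[k][0] != 'failed']
# a unit counts as a success if the top SNP of any reported cluster is the causal variant
rate = np.mean([int(any(x == 1 for x in is_top_true[k])) for k in valid])
print(f'top hit rate: {rate:.3f} over {len(valid)} units')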
Exported from analysis/dsc.ipynb
committed by Gao Wang on Sun Oct 20 10:11:36 2019 revision 27, 9959df5