Annotation enhanced genetic fine-mapping¶

This notebook investigates the impact of annotation to fine-mapping performance.

Aim¶

This simulation study compares fine-mapping with use of annotations / without it, in terms of:

Improvements in power: the top signal from each candidate fine-mapping cluster is more likely to be the true causal signal when annotations are used.
- power: measures how many times the simulated signals ("singals" hereafter) are captured by fine-mapping clusters ("clusters" hereafter).
- false discovery porportion: measures how many clusters do not capture any signal.
Improvements in fine-mapping resolution: with the use of annotations we expect to provide smaller sets of candidate SNPs than without them.
- top hit rate: measures how many times the top association from each cluster is in fact a signal.
- size: median size of clusters, smaller size means higher resolution.
- purity: average $r^2$ within clusters, higher $r^2$ means higher resolution.

Fine-mapping method outline¶

We use DAP-G to analyze the data. We run and compare two versions of DAP-G: one that uses the "oracle" prior from enrichment based simulation, one uses uniform priors.

To speed up DAP computations we set its -msize parameter to the oracle number $L$.

Simulation methods overview¶

We simulate fine-mapping analysis units of 500 variants, with LD structure estimated from European genotypes of GTEx project (version 8 release). Each unit contains >2 LD blocks for this data-set. We generate expected GWAS z-scores under a polygenic model for SCZ, that is, a total of PVE 0.8 contributed by ~6,000 variants. We assume effect size of causal variants are identically distributed, but the location of causal variants is associated with genomic annotations: atac-seq annotation and ASCA annotation. From previous results, enrichment estimate (odds ratio) is 3 to 6 for atac-seq peaks of five neuron cell types (what are they??), which in total covers 13.36% of the genome. ASCA in Glucose transporter and NP cells involves 5% variants in atac-seq peaks at FDR cut-off of 5%, yet has an enrichment odds ratio 55. We assume one causal variant per analysis unit, and position it such that the expected enrichment is 3 for an atac-seq annotation and 55 for ASCA annotation. Observed GWAS z-scores are then drawn from simulated expected z-score taking into consideration the convoluted effect due to LD. We simulate studies of different sample sizes of 50K, 80K and 150K. We replicates the procedure over 2,090 such simulated units. We fine-map the simulated data with DAP-G and compare performance between DAP-G that uses the simulated annotation as priors vs. DAP-G that ignores annotations. Specifically we look at the "top hit rate", defined as the proportion of times that the strongest signal identified by DAP-G is the causal signal in our simulation replicates.

Simulation methods details¶

The general idea is to simulate summary statistics enriched in genomic annotations -- because nowadays most studies work with summary statistics. We need these quantities:

GWAS sample size, MAF and LD matrix, which can be estimated from real genotype data
- We estimate it from GTEx project of ~600 European samples. We use sample size of 80,000 to reflect the scale of available SCZ data.
PVE and polygenicity information from literature for the phenotype of interest
- For SCZ the total PVE is 0.8; for the 1 million variants typically analyzed the proportion of causal is 0.006
Enrichment of possibly causal variance in different genomic annotations of relevance
- We evaluate 5 (or 7?) atac-seq based annotations, enrichment test performed by Min Qiao

Simulation of expected z-scores¶

We generate from standard univariate linear regression,

\begin{align} y=X\beta + E, E \sim N(0, \sigma_a^2), \end{align}

where both $X$ and $y$ are centered thus absence of the intercept term. Percentage of variance explained, assuming additive polygenic effects:

\begin{align} PVE & = \frac{var(X\beta)}{var(y)} \\ & = \sum_{j\in C} \frac{var(X_j\beta_j)}{var(y)} \\ & = \sum_{j\in C} PVE_j, \end{align}

where $C$ is the causal set whose effects are independent, and for SCZ

\begin{align} PVE_j \approx \frac{0.8}{1~\text{million} \times 0.006} = 1.34\times 10^{-4}. \end{align}

Assume $y$ is scaled to have unit variance. Then $PVE_j = var(X_j\beta_j)$. For causal SNP $j$,

\begin{align} var(X_j\beta_j) &= \beta_j^2var(X_j) \\ &= 2f_j(1-f_j)\beta_j^2 \end{align}

where $f_j$ is the causal allele frequency. The effect size is then

\begin{align} \beta_j &= -1^{\mathbb{1}[f_j<0.5]}\sqrt{\frac{PVE_j}{2f_j(1-f_j)}} \end{align}

that is, under conventional genotype coding of 0 for major and 1 for minor allele, the direction of effect size $\beta_j$ is positive if the causal allele is a minor allele.

If columns of $X$ are independent, estimate of effect size follows distribution $\hat{\beta}_j \sim N(\beta_j, \sigma_j^2)$, where the variance for BLUE of $\beta$,

\begin{align} \sigma_j^2 &= \frac{var(y)}{Nvar({X_j})} \\ &= \frac{1}{Nvar({X_j})} \\ &= \frac{1}{2Nf_j(1-f_j)} \end{align}

The sample size $N$, for SCZ GWAS study here is 80K. Hence the z-score summary statistic for causal SNP $j$ is set to

\begin{align} z_j &= \frac{\hat{\beta}_j}{\sigma_j} \\ &= -1^{\mathbb{1}[f_j<0.5]} \sqrt{N \times PVE_j} \\ &= -1^{\mathbb{1}[f_j<0.5]} \times 3.27 \end{align}

z-scores are zero for non-causal SNPs $j \not\in C$.

Update: the proper way to model polygenic effect is to use random effect model for SNPs, along the lines of GCTA paper, and summary statistics along the lines of RSS paper. Let $z_j$ be the true (standardized) effect size of SNP $j$, we have the prior distribution:

$$z_j \sim (1-\pi) \delta_0 + \pi N(0, \sigma^2)$$

For a causal SNP, the parameter $\sigma^2$ is related to PVE. Assuming all SNPs contribute equally,

$$PVE_j = \frac{\sigma^2}{N}$$

so for our data $\sigma = \sqrt{PVE_j \times N} = 3.27$. In my implementation below, for $j \in C$ instead of setting it to fixed value $\pm 3.27$, I draw it from $N(0, 3.27^2)$.

Simulation of observed z-scores¶

In practice columns of $X$ are correlated due to LD. Observed GWAS z-scores $\hat{z}_j$ vary randomly about expected z-scores $z_j$ (that we derived above) with variance 1 and correlation between $j_1$ and $j_2$ equal to $cor(X_{j_1}, X_{j_2})$ (Methods section of CAVIAR paper has a justification, for example). We simulate multiple observed z-scores from a multivariate normal distrubtion, $$\hat{Z} \sim MVN(RZ, RI)$$ where the $R$ matrix, correlation between SNPs for the simulated genomic region, is the LD matrix. $I$ is identity matrix.

Simulation of causal signals¶

For simplicity we create 2,090 arbitrary chunks of genomic regions (fine-mapping analysis units) each containing 500 SNPs. That is, 2,090 simulations of different underlying LD structure. Each unit should have one or more LD blocks. We assume one causal variant per unit. The exact positions these causal variants occur are associated with genomic annotations.

Suppose we have $K$ non-overlapping, independent annotation regions. Let $p_k$ and $p_0$ denote causal probability of SNPs inside and outside annotation $k$'s region.

\begin{align} \gamma_k & = \frac{p_k/1-p_k}{p_0/1-p_0} \\ L & = [\sum_k q_kp_k + (1-\sum_k q_k)p_0] \times M \end{align}

where $\gamma_k$ is the odds ratio for annotation $k$ obtained from enrichment analysis, $L$ is the number of causal variants in the region ($L = 1$), $M$ is length of the region ($M=500$), $q$ is proportion of region overl apping with functional annotations (see choice of $q$ below for two types of annotations).

atac-seq annotations¶

From previous enrichment analysis of 5 atac-seq annotations we estimate enrichment of GWAS signals in these regions with odds ratios ranging from 3.70 to 6.02, with mean 4.74. We can try $\gamma_1 = 3, 5, 7$. The 5 annotations, when combined, physically cover a total of 13.36% of the genome. The average size of the combined peaks is 576; thus iff we assume uniform distribution of peaks over the uniformally distributed 1 million variants we test the consective annotated region should be less than 1. In that case we'll simply assign 13.36% variants in an analysis unit to fall in the annotation, without worrying about their consectiveness.

Allele-specific chromatin accessability (ASCA) annotations¶

From previous enrichment analysis of ASCA we have odds ratio of $\gamma_2 = 55$ for ASCA variants with FDR < 0.05. On average 5% variants in each atac-seq peak are ASCA. To add in this annotation, for simplicity we now assume 12.69% variants are atac-seq annotations, 0.66% are ASCA.

DSC benchmark¶

Simulate study is implemented in DSC framework. Input data are just matrices of genotypes.

To run the benchmark, first click inside SoS notebook "restart kernel and run all" botton. Then run the exported scripts with 8 CPU threads:

dsc master.dsc -c 8

`zzz.dsc`¶

%save -f modules/zzz.dsc
%include modules/simulate_region
%include modules/simulate_z
%include modules/fit
%include modules/evaluate

`master.dsc`¶

%save -f master.dsc
#!/usr/bin/env dsc
%include modules/zzz

DSC:
    define:
        fit: dap, dapa
    # FIXME: I should run 2 separate pipelines for dap and dap
    # because dap only runs one scenario but dapa runs multiple
    # now I in fact run the same dap multiple times; but i can live with it
    # because dap is really fast in this simulation and the result is easier to parse
    # with this DSC logic.
    run: sample_region * simulate_prior * simulate_z * fit * evaluate
    exec_path: modules
    global:
        data_file: gtex-manifest.txt
        n_units: 150
    output: xh_grant

`simulate_region.dsc`¶

%save -f modules/simulate_region.dsc

sample_region: sim_region.R + R(data = readRDS(dataset);
                                X = get_loci(data$X, M);
                                R = lapply(1:length(X), function(i) round(cor(X[[i]]),4));
                                eff_sign = get_sign(X);
                                status = lapply(1:length(R), function(i) write.table(R[[i]],paste0(ld_file, '.', i),quote=F,col.names=F,row.names=F)))                           
  dataset: Shell{head -${n_units} ${data_file}}
  M: 500
  $eff_sign: eff_sign
  $R: R
  $ld_file: file(ld)

simulate_prior: sim_region.R + R(prior = get_prior(nrow(read.table(paste0(ld_file, '.1'))), eparam[[1]], eparam[[2]]))
  ld_file: $ld_file
  # enrichment parameters:
  # should be (gamma, p) where gamma is a vector of odds ratios, p is the corresponding vector of proportions of annotated regions
  eparam: (3, 0.1336), ((3, 55), (0.1269, 0.00668))
  $prior: prior$prior
  $annotation: prior$annotation

`simulate_z.dsc`¶

%save -f modules/simulate_z.dsc
simulate_z: sim_z.py + Python(z, z_true, L = simulate(R, N, pve, eff_sign, n_signal, prior))
  R: $R
  eff_sign: $eff_sign
  prior: $prior
  N: 50000, 80000, 150000
  n_signal: 1
  pve: 1.34E-4
  $z: z
  $z_true: z_true
  $L: L

`fit.dsc`¶

%save -f modules/fit.dsc

dap: fit_dap.py + Python(posterior = dap_batch_z(z, ld, L, cache, None, args))
  ld: $ld_file
  z: $z
  L: $L
  args: "-ld_control 0.20 --all"
  cache: file(DAP)
  $posterior: posterior

dapa(dap): fit_dap.py + Python(posterior = dap_batch_z(z, ld, L, cache, prior, args))
  prior: $prior

`evaluate.dsc`¶

%save -f modules/evaluate.dsc

evaluate: evaluate_dap.py
  z_true: $z_true
  posterior: $posterior
  $is_recovered: is_recovered
  $is_cs_true: is_cs_true
  $is_top_true: is_top_true
  $size: size
  $purity: purity

Code for Simulation¶

`sim_region.R`¶

To simulate priors for every SNP we need to determine $p_0$ and $p_k$. From equations above, for $L=1$ we derive:

\begin{align} p_k = \frac{\gamma_k p_0}{1 - p_0 + \gamma_k p_0} \\ [\sum_k q_k \frac{\gamma_k p_0}{1 - p_0 + \gamma_k p_0} + (1-\sum_k q_k)p_0] \times M - 1 = 0 \end{align}

We can solve this numerically in R, eg:

g = c(4.5, 55)
M = 500
q = c(0.1269, 0.0066)
foo = function(x) M * (sum(q * g * x / (1 - x + g * x)) + (1 - sum(q)) * x) - 1
p0 = uniroot(foo, lower=0, upper=1, tol = .Machine$double.eps^0.8)$root
p1 = g * p0 / (1-p0+g*p0)
print(c(p0, p1))

[1] 0.001125165 0.005043379 0.058339412

# verify it:
print(p1/(1-p1) / (p0 / (1-p0)))
print(M * (sum(q * p1) + (1-sum(q)) * p0))

[1]  4.5 55.0
[1] 1

These probabilities recover the odds ratio, so we should be good. I'll use it in my codes below.

%save -f modules/sim_region.R
CorrectCM <- function(CM)
  {
  n <- dim(var(CM))[1L]
  E <- eigen(CM)
  CM1 <- E$vectors %*% tcrossprod(diag(pmax(E$values, 0), n), E$vectors)
  Balance <- diag(1/sqrt(diag(CM1)))
  CM2 <- Balance %*% CM1 %*% Balance  
  return(CM2)
}

get_loci = function(X, N) {
    segs = floor(ncol(X) / N)
    lapply(1:segs, function(i) X[,i:(i+N-1)])
}
get_prior = function(M, g, q) {
    foo = function(x) M * (sum(q * g * x / (1 - x + g * x)) + (1 - sum(q)) * x) - 1
    p0 = uniroot(foo, lower=0, upper=1, tol = .Machine$double.eps^0.8)$root
    p1 = g * p0 / (1-p0+g*p0)
    n_anns = floor(M * q)
    annotated = lapply(1:length(n_anns), function(i) sample(c(rep(1,n_anns[i]), rep(0, M - n_anns[i]))))
    prior = rep(p0, M)
    annotation = rep(0, M)
    for (i in 1:length(p1)) {
        prior[which(annotated[[i]] == 1)] = p1[i]
        annotation = annotation + annotated[[i]]
    }
    list(prior=prior, annotation=annotation)   
}
get_sign = function(X) {
    lapply(1:length(X), function(i) apply(X[[i]], 2, function(x) (-1)^as.integer((mean(x)/2) > 0.5)))
}

`sim_z.py`¶

%save -f modules/sim_z.py
import numpy as np

def sim_gwas_z(R, N, pve, eff_sign, n_signal, prior):
    np.random.seed(int(np.sum(np.absolute(np.array(R[:,1])))))
    # get expected z-score assuming all causal
    # it really does not matter to our purpose whether Z is positive or negative
    # because it is only a matter of allele coding ...
    z_true = np.random.normal(0, np.sqrt(N * pve)) * eff_sign
    # sparsify expected z-score to allow for n_signal causal
    # FIXME: might overlap if some prior is very high; 
    # thus not gauranteed to have exactly n_signal non-zeros
    # but it is quite convenient to call this function
    z_true *= np.random.multinomial(n_signal, prior)
    # get observed z-scores
    # FIXME: there is a PSD warning I cannot get rid of for now
    # simply ignore it ...
    z = np.random.multivariate_normal(np.ravel(R @ z_true), R, check_valid = 'ignore')
    return z, z_true

def simulate(R, N, pve, eff_sign, n_signal, prior):
    z_true = {k:[] for k in R}
    z = {k:[] for k in R}
    for k in R:
        z[k], z_true[k] = sim_gwas_z(R[k], N, pve, eff_sign[k], n_signal, prior)
    return z, z_true, {k: sum(z_true[k]!=0) for k in z_true}

Fine-mapping with DAP¶

`fit_dap.py`¶

Below is example output for DAP-G:

Posterior expected model size: 0.500 (sd = 0.500)
LogNC = -0.30685 ( Log10NC = -0.133 )
Posterior inclusion probability

((1))              7492 6.68581e-05       0.000 1
((2))              7490 6.68581e-05       0.000 1
... 7 lines
((8))              7491 6.68046e-05       0.000 2
((9))              7483 6.68046e-05       0.000 2
((10))             7485 6.68046e-05       0.000 2
... 13 lines
((20))             7459 6.68046e-05       0.000 2
((21))             7482 6.67422e-05       0.000 -1
((22))             7489 6.67422e-05       0.000 -1
... other lines until below ...

Independent association signal clusters

     cluster         member_snp      cluster_pip      average_r2
       {1}              7            4.680e-04          0.951                 0.951   0.037
       {2}             13            8.685e-04          0.623                 0.037   0.623

%save -f modules/fit_dap.py
import subprocess
import pandas as pd
import numpy as np

def write_dap_z(z, prefix):
    '''z-score vesion of dap input is the same as FINEMAP'''
    ids = np.array([str(i+1) for i in range(z.shape[0])])
    with open(f'{prefix}.z', 'w') as f:
        np.savetxt(f,  np.vstack((ids, z)).T, fmt = '%s', delimiter = ' ')

def write_prior(prior, prefix):
    with open(f'{prefix}.prior', 'w') as f:
        f.write('\n'.join([f'{i+1}\t{p}' for i, p in enumerate(prior)]))

def run_dap_z(ld, r, L, prior, prefix, args):
    cmd = ['dap-g', '-d_z', f'{prefix}.z', '-d_ld', f'{ld}.{r}', '-o', f'{prefix}.result', '--output_all'] + ' '.join(args).split()
    if prior is not None:
        cmd.extend(['-p', f'{prefix}.prior'])
    if L > 0:
        cmd.extend(['-msize', str(L)])
    subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()    
    
def extract_dap_output(prefix):
    out = [x.strip().split() for x in open(f'{prefix}.result').readlines()]
    pips = []
    clusters = []
    still_pip = True
    for line in out:
        if len(line) == 0:
            continue
        if len(line) > 2 and line[2] == 'cluster_pip':
            still_pip = False
            continue
        if still_pip and (not line[0].startswith('((')):
            continue
        if still_pip:
            pips.append([line[1], float(line[2]), float(line[3]), int(line[4])])
        else:
            clusters.append([len(clusters) + 1, float(line[2]), float(line[3])])
    pips = pd.DataFrame(pips, columns = ['snp', 'snp_prob', 'snp_log10bf', 'cluster'])
    clusters = pd.DataFrame(clusters, columns = ['cluster', 'cluster_prob', 'cluster_avg_r2'])
    clusters = pd.merge(clusters, pips.groupby(['cluster'])['snp'].apply(','.join).reset_index(), on = 'cluster')
    return {'snp': pips, 'set': clusters}

def dap_single_z(z, ld, L, prefix, r, prior, args):
    write_dap_z(z,prefix)
    if prior is not None:
        write_prior(prior, prefix)
    run_dap_z(ld, r, L, prior, prefix, args)
    return extract_dap_output(prefix)


def dap_batch_z(Z, ld, L, prefix, prior, *args):
    return dict([(k, dap_single_z(Z[k], ld, L[k], f'{prefix}_condition_{k}', k, prior, args)) for k in Z])

Evaluate results¶

`evaluate.py`¶

%save -f modules/evaluate_dap.py
def dap_summary(cluster, snp, z_true):
    signal_expected = [f'{idx+1}' for idx, i in enumerate(z_true) if i != 0]
    cluster = cluster.loc[cluster['cluster_prob'] > 0.95]
    if cluster.shape[0] == 0:
        return ["failed"] * 5
    # to return
    purity = cluster['cluster_prob'].tolist()
    cluster = [x.split(',') for x in cluster['snp']]
    signal_detected = sum(cluster, [])
    # to return
    size = [len(c) for c in cluster]
    snp = [snp.loc[snp['snp'].isin(c)] for c in cluster]
    top_snp = [s.loc[s['snp_prob'] == max(s['snp_prob'])]['snp'].tolist()[0] for s in snp]
    # to return
    is_top_true = [1 if x in signal_expected else 0 for x in top_snp]
    # to return
    is_recovered = [1 if x in signal_detected else 0 for x in signal_expected]
    # to return
    is_cs_true = [1 if len(set(x).intersection(signal_expected)) else 0 for x in cluster]
    return is_recovered, is_cs_true, is_top_true, size, purity

is_recovered = dict()
is_cs_true = dict()
is_top_true = dict()
size = dict()
purity = dict()
for k in posterior:
    is_recovered[k], is_cs_true[k], is_top_true[k], size[k], purity[k] = dap_summary(posterior[k]['set'], posterior[k]['snp'], z_true[k])