A summary of genotype sample LD in GTEx data¶

This is partially in response to the reviewer response for SuSiE paper.

Here I'd like to find out for a randomly chosen variant, how many others is it in LD >=0.8, 0.9,0.95 , 0.99 and 1.

To run the workflow on RCC:

sos run analysis/20190422_Genotype_LD_Survey.ipynb -c src/sos.config.yml -q midway2_head -J 150

[global]
parameter: datadir = path('~/Documents/GTExV8/Toys/')
parameter: categories = [0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.99,1]
parameter: nvar = 100
# Just in case some regions has more than 10K variants let's ignore them because they are too big
parameter: maxvar = 20000
parameter: workdir = path('~/GIT/LargeFiles/SuSiE_Revision_1/LD_Summary')

[1]
# load all genotypes and computer LD
# For computational reasons let's not save them ... instead lets survey for each variant the different LD catagories
import glob
input: glob.glob(f"{datadir:a}/*.RDS"), group_by = 1, concurrent = True
output: f"{workdir:a}/{_input:bn}.LD.rds"
task: trunk_workers = 1, trunk_size = 10, walltime = '10m', mem = '4G', cores = 1, tags = f'{step_name}_{_output:bn}'
R: workdir=workdir, expand = '${ }'
  set.seed(999)
  X = readRDS(${_input:r})$X
  if (ncol(X) > ${maxvar}) X = X[,1:${maxvar}]
  muffled_corr = function(x)
  withCallingHandlers(cor(x),
                    warning = function(w) {
                      if (grepl("the standard deviation is zero", w$message))
                        invokeRestart("muffleWarning")
                    } )
  R = abs(muffled_corr(X))
  # mask self-correlation out to zero
  diag(R) = 0
  if (${nvar} > nrow(R)) { 
      nvar = nrow(R)
  } else {
      nvar = ${nvar}
  }
  var_id = sort(sample(1:nrow(R), size = nvar))
  res = list()
  i = 1
  categories = c(${",".join(map(str,categories))})
  for (item in categories) {
      res[[i]] = rowSums(R[var_id, ] >= item)
      i = i + 1
  }
  res = do.call(cbind, res)
  colnames(res) = as.character(categories)
  saveRDS(res, ${_output:r})

[2]
input: group_by = 'all'
output: f"{workdir:a}/LD_Summary.rds"
R: workdir = workdir, expand = '${ }'
  res = NULL
  for (item in c(${_input:ar,})) {
      if (is.null(res)) res = readRDS(item)
      else res = rbind(res, readRDS(item))
  }
  saveRDS(res, ${_output:r})

Here is what I got with nvar=100,

> dim(dat)
[1] 110300     10

The result is:

> summary(dat)
      0.4              0.5              0.6               0.7         
 Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.00  
 1st Qu.:  40.0   1st Qu.:  21.0   1st Qu.:  13.00   1st Qu.:   7.00  
 Median : 101.0   Median :  60.0   Median :  39.00   Median :  26.00  
 Mean   : 163.3   Mean   : 106.6   Mean   :  76.21   Mean   :  55.79  
 3rd Qu.: 211.0   3rd Qu.: 136.0   3rd Qu.:  94.00   3rd Qu.:  66.00  
 Max.   :2940.0   Max.   :2876.0   Max.   :2841.00   Max.   :2815.00  
      0.8               0.85              0.9               0.95        
 Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   Min.   :   0.00  
 1st Qu.:   4.00   1st Qu.:   3.00   1st Qu.:   2.00   1st Qu.:   1.00  
 Median :  16.00   Median :  12.00   Median :   8.00   Median :   4.00  
 Mean   :  40.26   Mean   :  33.03   Mean   :  25.62   Mean   :  17.69  
 3rd Qu.:  45.00   3rd Qu.:  36.00   3rd Qu.:  27.00   3rd Qu.:  17.00  
 Max.   :2772.00   Max.   :2711.00   Max.   :2563.00   Max.   :2505.00  
      0.99                1          
 Min.   :   0.000   Min.   :  0.000  
 1st Qu.:   0.000   1st Qu.:  0.000  
 Median :   1.000   Median :  0.000  
 Mean   :   7.966   Mean   :  2.116  
 3rd Qu.:   5.000   3rd Qu.:  0.000  
 Max.   :2226.000   Max.   :494.000