This is partially in response to the reviewer response for SuSiE paper.
Here I'd like to find out for a randomly chosen variant, how many others is it in LD >=0.8, 0.9,0.95 , 0.99 and 1.
To run the workflow on RCC:
sos run analysis/20190422_Genotype_LD_Survey.ipynb -c src/sos.config.yml -q midway2_head -J 150
[global]
parameter: datadir = path('~/Documents/GTExV8/Toys/')
parameter: categories = [0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.99,1]
parameter: nvar = 100
# Just in case some regions has more than 10K variants let's ignore them because they are too big
parameter: maxvar = 20000
parameter: workdir = path('~/GIT/LargeFiles/SuSiE_Revision_1/LD_Summary')
[1]
# load all genotypes and computer LD
# For computational reasons let's not save them ... instead lets survey for each variant the different LD catagories
import glob
input: glob.glob(f"{datadir:a}/*.RDS"), group_by = 1, concurrent = True
output: f"{workdir:a}/{_input:bn}.LD.rds"
task: trunk_workers = 1, trunk_size = 10, walltime = '10m', mem = '4G', cores = 1, tags = f'{step_name}_{_output:bn}'
R: workdir=workdir, expand = '${ }'
set.seed(999)
X = readRDS(${_input:r})$X
if (ncol(X) > ${maxvar}) X = X[,1:${maxvar}]
muffled_corr = function(x)
withCallingHandlers(cor(x),
warning = function(w) {
if (grepl("the standard deviation is zero", w$message))
invokeRestart("muffleWarning")
} )
R = abs(muffled_corr(X))
# mask self-correlation out to zero
diag(R) = 0
if (${nvar} > nrow(R)) {
nvar = nrow(R)
} else {
nvar = ${nvar}
}
var_id = sort(sample(1:nrow(R), size = nvar))
res = list()
i = 1
categories = c(${",".join(map(str,categories))})
for (item in categories) {
res[[i]] = rowSums(R[var_id, ] >= item)
i = i + 1
}
res = do.call(cbind, res)
colnames(res) = as.character(categories)
saveRDS(res, ${_output:r})
[2]
input: group_by = 'all'
output: f"{workdir:a}/LD_Summary.rds"
R: workdir = workdir, expand = '${ }'
res = NULL
for (item in c(${_input:ar,})) {
if (is.null(res)) res = readRDS(item)
else res = rbind(res, readRDS(item))
}
saveRDS(res, ${_output:r})
Here is what I got with nvar=100
,
> dim(dat)
[1] 110300 10
The result is:
> summary(dat)
0.4 0.5 0.6 0.7
Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 40.0 1st Qu.: 21.0 1st Qu.: 13.00 1st Qu.: 7.00
Median : 101.0 Median : 60.0 Median : 39.00 Median : 26.00
Mean : 163.3 Mean : 106.6 Mean : 76.21 Mean : 55.79
3rd Qu.: 211.0 3rd Qu.: 136.0 3rd Qu.: 94.00 3rd Qu.: 66.00
Max. :2940.0 Max. :2876.0 Max. :2841.00 Max. :2815.00
0.8 0.85 0.9 0.95
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 4.00 1st Qu.: 3.00 1st Qu.: 2.00 1st Qu.: 1.00
Median : 16.00 Median : 12.00 Median : 8.00 Median : 4.00
Mean : 40.26 Mean : 33.03 Mean : 25.62 Mean : 17.69
3rd Qu.: 45.00 3rd Qu.: 36.00 3rd Qu.: 27.00 3rd Qu.: 17.00
Max. :2772.00 Max. :2711.00 Max. :2563.00 Max. :2505.00
0.99 1
Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000
Median : 1.000 Median : 0.000
Mean : 7.966 Mean : 2.116
3rd Qu.: 5.000 3rd Qu.: 0.000
Max. :2226.000 Max. :494.000