Multivariate Bayesian variable selection regression

BIMBAM plots using Matthew's code 2007

Matthew Stephens July 30, 2007

The utility.multiSNP.R contains R code to plot a summary ofbimbamoutputwhen computing multi-SNP BFs.

The problem this code attempts to solve is that whenperforming a multi-SNP analysis there may be large numbers of different combinations of SNPs that could be associated with the phenotype, and sifting through and interpretingthese results can be challenging.

The approach we take is to

  • Summarise each SNP by its (marginal) posterior probability of affecting phenotype.
  • Group SNPs together into clusters of SNPs that tend not to occur together in themodel. (this can be thought of as analagous to grouping SNPs into bins of SNPs that are good proxies for one another (i.e. highly correlated), since when two SNPsare highly correlated parsimony tends to favour not including both in the model.
  • Output, for each cluster, the probability that at least one SNP from that cluster affects phenotype.

Note that this code is all based on the assumption that thereis at least oneSNP in the region affecting phenotype. That is, it is attempting to explain observed associations, rather than test for the presence of associations. The overall BF for the region (in the summary file), and the individual SNP BFs, should be used to check that there is actually good evidence for something interesting in the region before over-interpreting the results of this analysis!

The code contains two functions:

  • cluster.multiSNPtakes as input the .multi and a .single file, produced bybimbam, and performs some calculations to produce an object.
  • plot.multiSNP plots this object.
In [3]:
nnonmissing = function(x){return(sum(!is.na(x)))}
normalise = function(x){return(x/sum(x))}

cluster.multiSNP=function(multifile,singlefile){
x = read.table(multifile,header=T)
rs = read.table(singlefile,header=T)
pos = rs[,2]

nsnp = apply(x,1,nnonmissing)-1
L = max(nsnp)
NSNP = max(x[,-(1:2)],na.rm=T)
prior = normalise((0.5^(1:L))[nsnp]/choose(NSNP,nsnp))
post = normalise(10^x[,1]*prior)

#print(lapply(split(post,nsnp),sum)) # posterior on different numbers of SNPs


pairp = matrix(0,nrow=NSNP,ncol=NSNP)
contains = matrix(0,nrow=NSNP,ncol=length(post))
for(i in 1:NSNP){
contains[i,] = apply(x[,-(1:2)]==i,1,sum,na.rm=T)
}
for(i in 1:NSNP){
for(j in 1:NSNP){
pairp[i,j] = sum(post[contains[i,] & contains[j,]])
}
}

marginalp = diag(pairp)
indepp = marginalp %*% t(marginalp)

#marginalp = rep(0,NSNP)
#for(j in 1:NSNP){
#marginalp[j] = sum(post[rowSums(x[,-(1:2)]==j & !is.na(x[,-(1:2)]))==1])
#}
#now focus on only those SNPs with appreciable marginal prob
#Nums = 1:length(marginalp)
#cu_5 = sort(marginalp, decreasing = T)[5]
#Nums = Nums[marginalp > min(0.01,cu_5)]

#for(i in Nums){
#containsi = apply(x[,-(1:2)]==i,1,sum,na.rm=T)
#sub = x[containsi,]
#posti = post[containsi]
#for(j in Nums){
#containsj = apply(sub[,-(1:2)]==j,1,sum,na.rm=T)
#pairp[i,j] = sum(posti[containsj])
#}
#}
#pairp = pairp[Nums,Nums]


#d1 = dor(i,j) - dindep(i,j)
# where dor is big if i and j tend to occur together
# and dindep is big if i and j do not tend to occur together (ie less often than expected by chance)
#so d is small if i and j tend not to occur together
d = pairp^2 - 4*(indepp > pairp)*(indepp - pairp)^2
#d = pairp - 4*(indepp > pairp) * (indepp - pairp)

diag(d) = 0

a= cutree(hclust(as.dist(d),method="complete"),h=0)
nclust = max(a)
clusterprobs = rep(0,nclust)
for(i in 1:nclust){
members = (1:NSNP)[a==i]
present = apply(x[,-(1:2)],2,is.element,members)
clusterprobs[i] = sum(post[apply(present,1,max)==1])
}
return(list(id = as.character(rs[,1]),pos=pos,marginalp=marginalp,cluster = (rank(-unlist(clusterprobs)))[a],clusterprobs=sort(clusterprobs,decreasing=TRUE)))
}

plot.multiSNP=function(mSNP,minp=0.05,interactive=FALSE,names=FALSE,...){
#par(mfcol=c(2,1))
plot(mSNP$pos,mSNP$marginalp,type="h",lwd=2,...,col=mSNP$cluster,ylab="Posterior Prob", xlab="Position",main="Marginal Probability of Each SNP")
if(names){
text(mSNP$pos[mSNP$marginalp>minp],mSNP$marginalp[mSNP$marginalp>minp],mSNP$id[mSNP$marginalp>minp],srt=90,cex=0.8,pos=2,col=mSNP$cluster[mSNP$marginalp>minp])
}
if(interactive==TRUE){identify(mSNP$pos,mSNP$marginalp,mSNP$id)}
#barplot(mSNP$clusterprobs,col=1:length(mSNP$clusterprobs),ylim=c(0,1),main="Probability for each cluster of SNPs")
}

Now I (Gao Wang) use this code to make plot for my results obtained here.

In [ ]:
mSNP = cluster.multiSNP("output/pref4.multi.txt","output/pref4.single.txt")
Keyboard Interrupt

This script takes long time and non-trivial memory (40GB) to run.


Copyright © 2016-2020 Gao Wang et al at Stephens Lab, University of Chicago