Linear Mixed Model

Linear Mixed Model#

When we want to study a specific genetic variant as a fixed effect but know there are countless unmeasured factors affecting our trait, linear mixed models offer a practical solution: use a random effect based on genetic similarity to absorb all that unmeasured heterogeneity.

Graphical Summary#

Fig

Key Formula#

In the linear mixed model (mixed because it incorporate both the fixed and the random effects), which accurately represent non-independent data structures for a variant,

\[ \mathbf{Y} = \mathbf{X}\beta + \mathbf{g} + \boldsymbol{\epsilon} \]

where:

\(\mathbf{Y}\) is the \(N \times 1\) vector of phenotypes
\(\mathbf{X}\) is the \(N \times 1\) design matrix for fixed effects (e.g., the specific genetic variant we’re testing, plus measurable covariates like age and sex)
\(\beta\) is the scalar of fixed genetic effect coefficient (unknown, to be estimated)
\(\mathbf{g}\) is the \(N \times 1\) vector of random effects that captures all the genetic factors we can’t or don’t want to model explicitly - population structure, polygenic background, family relationships, and countless unknown genetic influences
\(\boldsymbol{\epsilon}\) is the \(N \times 1\) vector of residual errors, where \(\boldsymbol{\epsilon} \sim N(0, \sigma^2_e\mathbf{I})\)

Technical Details#

Random Effects Decomposition: \(\mathbf{g}\) and \(\mathbf{Zu}\)#

The key insight is that for a genetic variant, there are countless things to account for: age, sex, ancestry, thousands of genes and unknown ones. We can never write the complete model:

\[ \mathbf{Y} = \text{genetic variant} + \text{age} + \text{sex} + \text{ancestry} + \text{gene}_1 + \ldots + \text{gene}_{20000} + \text{unknown factors} + \boldsymbol{\epsilon} \]

Instead, we approximate it as:

\[ \mathbf{Y} = \text{genetic variant} + \text{age} + \text{sex} + \mathbf{g} + \boldsymbol{\epsilon} \]

where \(\mathbf{g}\) soaks up the effects of ancestry, polygenic background, and all the other genetic factors we can’t explicitly model. It’s not as precise as conditioning on everything, but it’s the best we can do when complete conditioning is impossible.

What information should \(\mathbf{g}\) capture? Ideally, \(\mathbf{g}\) should include:

Population structure and ancestry differences
Polygenic effects from thousands of small-effect variants
Family relationships and shared genetic background
Gene-gene interactions we haven’t discovered
Any other source of genetic similarity between individuals

The random effects term g can be decomposed as:

\[ \mathbf{g} = \mathbf{Z} \mathbf{u} \]

where:

\(\mathbf{g}\) is the \(N \times 1\) vector of random effects for \(N\) individuals, where \(\mathbf{g} \sim N(\mathbf{0}, \sigma^2_g\mathbf{G})\)
\(\mathbf{G}\) is the genetic relationship matrix (GRM) that measures genetic similarity between individuals, effectively capturing all the unmeasured genetic factors through genome-wide similarity patterns
\(\mathbf{Z}\) is the \(N \times M\) genotype matrix for \(M\) genetic variants
\(\mathbf{u} \sim N(0, \sigma_u^2\mathbf{I})\) is a \(M \times 1\) vector of random SNP effects
Formally known as the infinitesmal model

Popular LMM Methods in Statistical Genetics#

Method	Purpose	Key Innovation	Scale/Application
GCTA	Heritability estimation and association testing	Uses genome-wide SNPs to construct genetic relationship matrix (GRM)	\(h^2 = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e}\) estimation
GEMMA	Fast genome-wide association studies with population structure control	Efficient algorithms for large-scale GWAS with kinship correction	Computationally efficient for large-scale data
BOLT-LMM	Ultra-fast mixed model association testing	Bayesian sparse linear mixed model approach	Hundreds of thousands of individuals
SAIGE	Association testing for binary and quantitative traits	Handles unbalanced case-control studies efficiently	Saddlepoint approximation for computational efficiency
REGENIE	Whole genome regression with prediction	Two-step procedure combining whole-genome regression to solve mixed models	Use ridge regression instead of conventional LMM

Example#

What happens when we acknowledge that our trait is influenced not just by one specific variant, but also by a “genetic background” from all the other variants we’re not explicitly testing? Let’s see how linear mixed models capture this reality.

We’ll use the same 5 individuals, but now we’ll model two sources of genetic influence: a fixed effect from one specific variant we’re studying, plus a random polygenic effect that represents the combined influence of all variants acting as genetic background. How much does each component contribute to the total genetic influence on our trait?

# Clear the environment
rm(list = ls())
set.seed(13)
library(MASS) # For mvrnorm function
# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)

# Reshape into a matrix
N = 5
M = 3
geno_matrix <- matrix(genotypes, nrow = N, ncol = M, byrow = TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")

# Convert to raw genotype matrix using the additive model
Xraw_additive <- matrix(0, nrow = N, ncol = M) # count number of non-reference alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i,j], "")[[1]]
    Xraw_additive[i,j] <- sum(alleles == alt_alleles[j])
  }
}

X <- scale(Xraw_additive, center = TRUE, scale = TRUE)

We generate the data according to a linear mixed model:

# Linear Mixed Model: Y = X*beta + g + epsilon
# where g = Z*u represents random polygenic effects

# Fixed effect for variant 1
beta <- 0.8  # Fixed effect size for variant 1

# Create Genetic Relationship Matrix (GRM) using all variants
# GRM = (1/M) * X_raw * X_raw^T, where X_raw is the standardized genotype matrix
GRM <- (1/M) * X %*% t(X)
cat("Genetic Relationship Matrix (GRM):\n")
GRM

# Generate random polygenic effects: g ~ N(0, sigma_u^2 * GRM)
sigma_u <- 0.5  # Standard deviation of random effects
g <- mvrnorm(n = 1, mu = rep(0, N), Sigma = sigma_u^2 * GRM)
g <- as.vector(g)

# Residual errors
sigma_e <- 0.3  # Standard deviation of residual errors
epsilon <- rnorm(N, mean = 0, sd = sigma_e)

# Generate phenotype with both fixed and random effects
Y <- X[, 1] * beta + g + epsilon

Genetic Relationship Matrix (GRM):

A matrix: 5 x 5 of type dbl
	Individual 1	Individual 2	Individual 3	Individual 4	Individual 5
Individual 1	0.23571429	-0.5261905	-0.18095238	-0.02619048	0.4976190
Individual 2	-0.52619048	1.2714286	0.30714286	0.10476190	-1.1571429
Individual 3	-0.18095238	0.3071429	0.23571429	-0.02619048	-0.3357143
Individual 4	-0.02619048	0.1047619	-0.02619048	0.60476190	-0.6571429
Individual 5	0.49761905	-1.1571429	-0.33571429	-0.65714286	1.6523810

We can estimate the PVE again by each component:

# In LMM, total genetic variance = fixed effects variance + random effects variance
# Fixed genetic component from variant 1
G_fixed <- X[, 1] * beta

# Random genetic component (polygenic background)
G_random <- g

# Total genetic component
G_total <- G_fixed + G_random

# Calculate variance components
var_G_fixed <- var(G_fixed)
var_G_random <- var(G_random)
var_G_total <- var(G_total)
var_Y <- var(Y)

# PVE calculations
PVE_fixed <- var_G_fixed / var_Y           # PVE from fixed effects only
PVE_random <- var_G_random / var_Y         # PVE from random effects only  
PVE_total <- var_G_total / var_Y           # Total PVE

cat("PVE Components:\n")
cat("PVE_fixed (variant 1) =", round(PVE_fixed, 4), "\n")
cat("PVE_random (polygenic) =", round(PVE_random, 4), "\n")
cat("PVE_total =", round(PVE_total, 4), "\n\n")

PVE Components:
PVE_fixed (variant 1) = 0.8187 
PVE_random (polygenic) = 0.1212 
PVE_total = 0.67 

Note that the same framework applies when the genetic effect \(\beta\) is modeled as a random effect rather than fixed.