This document demonstrates permutation testing for assessing gene enrichment on a specific chromosome. The key question is: Are differentially expressed genes (DEGs) over-represented on a particular chromosome compared to what we’d expect by chance?
Permutation testing is a non-parametric approach that: - Makes no assumptions about the underlying distribution - Creates an empirical null distribution through random sampling - Provides an alternative to parametric tests like Fisher’s exact test or Chi-square test
Imagine you have:
Question: Is Chromosome 1 enriched for DEGs, or could this pattern occur by random chance?
First, we define the complete set of genes in our analysis, classified by their chromosomal location.
# Create vector representing all genes in the study
# 40 genes on Chromosome 1, 160 on other chromosomes
a <- c(rep("On Other Chromosomes", 160), rep("On Chromosome 1", 40))
# Calculate totals
total.all <- length(a) # Total genes (200)
total.chr <- sum(a == "On Chromosome 1") # Genes on Chromosome 1 (40)
total.other <- sum(a == "On Other Chromosomes") # Genes on other chromosomes (160)
# Verify totals add up correctly
all.equal(total.all, total.chr + total.other)
## [1] TRUE
cat("Total genes:", total.all, "\n")
## Total genes: 200
cat("Genes on Chromosome 1:", total.chr, "(", round(100*total.chr/total.all, 1), "%)\n")
## Genes on Chromosome 1: 40 ( 20 %)
cat("Genes on other chromosomes:", total.other, "(", round(100*total.other/total.all, 1), "%)\n")
## Genes on other chromosomes: 160 ( 80 %)
Next, we define which genes are differentially expressed in our experiment.
# Selected genes (DEGs): 50 total
# 15 on Chromosome 1, 35 on other chromosomes
b <- c(rep("On Other Chromosomes", 35), rep("On Chromosome 1", 15))
selected.chr <- sum(b == "On Chromosome 1") # DEGs on Chromosome 1 (15)
selected.other <- sum(b == "On Other Chromosomes") # DEGs on other chromosomes (35)
cat("Total DEGs:", length(b), "\n")
## Total DEGs: 50
cat("DEGs on Chromosome 1:", selected.chr, "(", round(100*selected.chr/length(b), 1), "%)\n")
## DEGs on Chromosome 1: 15 ( 30 %)
cat("DEGs on other chromosomes:", selected.other, "(", round(100*selected.other/length(b), 1), "%)\n")
## DEGs on other chromosomes: 35 ( 70 %)
Observation: While only 20% of all genes are on Chromosome 1, 30% of DEGs are on Chromosome 1. This suggests possible enrichment.
A 2×2 contingency table summarizes the relationship between chromosomal location and differential expression status.
# Create 2x2 contingency table
# DEGs not DEGs
# On Chromosome 1 15 25 (40 total)
# Other Chromosomes 35 125 (160 total)
# (50) (150) (200 total)
mtx <- matrix(c(selected.chr,
selected.other,
total.chr - selected.chr,
total.other - selected.other),
ncol = 2,
dimnames = list(c("On Chromosome 1", "On Other Chromosomes"),
c("DEGs", "not DEGs")))
print(mtx)
## DEGs not DEGs
## On Chromosome 1 15 25
## On Other Chromosomes 35 125
cat("\nTotal genes (verification):", sum(mtx), "\n")
##
## Total genes (verification): 200
Reading the table:
Fisher’s exact test calculates the exact probability of observing this contingency table (or more extreme) under the null hypothesis of independence.
fisher_p <- fisher.test(mtx)$p.value
cat("Fisher's exact test p-value:", fisher_p, "\n")
## Fisher's exact test p-value: 0.06439416
if (fisher_p < 0.05) {
cat("Result: Significant enrichment detected (p < 0.05)\n")
} else {
cat("Result: No significant enrichment (p >= 0.05)\n")
}
## Result: No significant enrichment (p >= 0.05)
Interpretation: Fisher’s exact test is the gold standard for small sample contingency tables and provides an exact p-value without relying on asymptotic approximations.
The Chi-square test of independence assesses whether the two categorical variables (chromosomal location and DEG status) are independent.
chisq_p <- chisq.test(mtx)$p.value
cat("Chi-square test p-value:", chisq_p, "\n")
## Chi-square test p-value: 0.06619258
if (chisq_p < 0.05) {
cat("Result: Significant association detected (p < 0.05)\n")
} else {
cat("Result: No significant association (p >= 0.05)\n")
}
## Result: No significant association (p >= 0.05)
Note: Chi-square test uses asymptotic approximations and may be less accurate for small sample sizes, but gives similar results to Fisher’s test in most cases.
Instead of relying on theoretical distributions, we create an empirical null distribution by: 1. Randomly selecting 50 genes (same size as our DEG set) from all 200 genes 2. Calculating the Chi-square statistic for each random selection 3. Repeating this process many times (10,000 permutations) 4. Comparing our observed statistic to this empirical distribution
Null Hypothesis: DEGs are randomly distributed with respect to chromosomal location.
set.seed(2) # For reproducibility
# Number of permutations
n <- 1:10000
# Observed test statistic from our actual data
estimate <- chisq.test(mtx)$statistic
cat("Observed Chi-square statistic:", round(estimate, 3), "\n\n")
## Observed Chi-square statistic: 3.375
# Generate permutation distribution
estimate.rand <- sapply(n, function(x) {
# Randomly select 50 genes from all 200 genes (without replacement)
b.rand <- sample(a, size = length(b))
# Count how many randomly selected genes are on each chromosome
selected.chr.rand <- sum(b.rand == "On Chromosome 1")
selected.other.rand <- sum(b.rand == "On Other Chromosomes")
# Create permutation contingency table
mtx.rand <- matrix(c(selected.chr.rand,
selected.other.rand,
total.chr - selected.chr.rand,
total.other - selected.other.rand),
ncol = 2)
# Calculate Chi-square statistic for this permutation
chisq.test(mtx.rand)$statistic
})
# Calculate permutation p-value
perm_p_chisq <- (sum(estimate.rand >= estimate) + 1) / (length(n) + 1)
cat("Permutation p-value (Chi-square method):", perm_p_chisq, "\n")
## Permutation p-value (Chi-square method): 0.06429357
# Plot histogram of permutation distribution
hist(estimate.rand,
breaks = 50,
main = "Permutation Distribution of Chi-square Statistics",
xlab = "Chi-square Statistic",
col = "lightblue",
border = "white")
# Add observed statistic as vertical line
abline(v = estimate, col = "red", lwd = 2, lty = 2)
# Add legend
legend("topright",
legend = c("Permutation distribution",
paste("Observed statistic =", round(estimate, 2))),
col = c("lightblue", "red"),
lwd = c(10, 2),
lty = c(1, 2))
# Add text with p-value
text(x = max(estimate.rand) * 0.7,
y = max(hist(estimate.rand, plot = FALSE)$counts) * 0.9,
labels = paste("Permutation p-value =", perm_p_chisq),
cex = 1.2)
Interpretation: The red line shows our observed Chi-square statistic. The permutation p-value represents the proportion of random permutations that produced a statistic as extreme or more extreme than what we observed.
A simpler approach focuses directly on the count of DEGs on Chromosome 1:
set.seed(1) # For reproducibility
n <- 1:10000
# Observed count of DEGs on Chromosome 1
estimate <- selected.chr
cat("Observed DEGs on Chromosome 1:", estimate, "\n")
## Observed DEGs on Chromosome 1: 15
cat("Expected DEGs on Chromosome 1 (under null):",
round(length(b) * total.chr / total.all, 1), "\n\n")
## Expected DEGs on Chromosome 1 (under null): 10
# Generate permutation distribution of counts
estimate.rand <- sapply(n, function(x) {
# Randomly select 50 genes from all 200 genes
b.rand <- sample(a, size = length(b))
# Count how many are on Chromosome 1
sum(b.rand == "On Chromosome 1")
})
# Calculate permutation p-value
# Count how often random selection gives >= observed count
perm_p_count <- sum(estimate.rand >= estimate) / length(n)
cat("Permutation p-value (count method):", perm_p_count, "\n")
## Permutation p-value (count method): 0.0355
# Plot histogram of permutation counts
hist(estimate.rand,
breaks = seq(min(estimate.rand) - 0.5, max(estimate.rand) + 0.5, by = 1),
main = "Permutation Distribution of DEG Counts on Chromosome 1",
xlab = "Number of DEGs on Chromosome 1",
col = "lightgreen",
border = "white")
# Add observed count as vertical line
abline(v = estimate, col = "red", lwd = 2, lty = 2)
# Add expected count as vertical line
expected_count <- length(b) * total.chr / total.all
abline(v = expected_count, col = "blue", lwd = 2, lty = 3)
# Add legend
legend("topright",
legend = c("Permutation distribution",
paste("Observed count =", estimate),
paste("Expected count =", round(expected_count, 1))),
col = c("lightgreen", "red", "blue"),
lwd = c(10, 2, 2),
lty = c(1, 2, 3))
# Add text with p-value
text(x = max(estimate.rand) * 0.85,
y = max(hist(estimate.rand, plot = FALSE)$counts) * 0.9,
labels = paste("Permutation p-value =", perm_p_count),
cex = 1.2)
Interpretation: The distribution shows what we’d expect by random chance. The red line (observed = 15) is notably higher than the blue line (expected ≈ 10), indicating enrichment.
results <- data.frame(
Method = c("Fisher's Exact Test",
"Chi-square Test",
"Permutation (Chi-square)",
"Permutation (Count)"),
P_Value = c(fisher_p, chisq_p, perm_p_chisq, perm_p_count),
Significant = c(fisher_p < 0.05,
chisq_p < 0.05,
perm_p_chisq < 0.05,
perm_p_count < 0.05)
)
print(results)
## Method P_Value Significant
## 1 Fisher's Exact Test 0.06439416 FALSE
## 2 Chi-square Test 0.06619258 FALSE
## 3 Permutation (Chi-square) 0.06429357 FALSE
## 4 Permutation (Count) 0.03550000 TRUE
cat("Summary of Permutation Distribution (Count Method):\n")
## Summary of Permutation Distribution (Count Method):
cat("Mean:", round(mean(estimate.rand), 2), "\n")
## Mean: 9.98
cat("Median:", median(estimate.rand), "\n")
## Median: 10
cat("SD:", round(sd(estimate.rand), 2), "\n")
## SD: 2.46
cat("95% CI:", quantile(estimate.rand, c(0.025, 0.975)), "\n")
## 95% CI: 5 15
cat("\nObserved value:", estimate, "\n")
##
## Observed value: 15
cat("Z-score:", round((estimate - mean(estimate.rand)) / sd(estimate.rand), 2), "\n")
## Z-score: 2.05
In this example, all methods (Fisher’s exact, Chi-square, and both permutation approaches) give similar results, suggesting that:
The permutation approach is particularly valuable because it: - Provides an independent validation of traditional tests - Offers intuitive visualization of the null distribution - Works well even with small sample sizes or violated assumptions
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.0.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.53
## [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [9] lifecycle_1.0.4 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
## [13] compiler_4.5.1 rstudioapi_0.17.1 tools_4.5.1 evaluate_1.0.5
## [17] bslib_0.9.0 yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0