This document demonstrates statistical hypothesis testing concepts in R, including two-sample t-tests, the multiple testing problem, family-wise error rate (FWER) corrections, and false discovery rate (FDR) control, with an application to DNA methylation data using the limma package.
Here we demonstrate a basic two-sample t-test comparing two groups of measurements. This test assesses whether the means of two groups are statistically different from each other.
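Formally, the hypotheses being tested are:

\[H_0: \mu_1 = \mu_2 \qquad \text{versus} \qquad H_1: \mu_1 \neq \mu_2\]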
# Create two groups of data
group1 <- c(2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7)
group2 <- c(1974.6, 2027.6, 1914.8, 1955.8, 1963, 2025.5, 1865.1, 1922.4)
# Check sample sizes
length(group1)
## [1] 8
length(group2)
## [1] 8
# Calculate summary statistics
mean(group1)
## [1] 2038.513
mean(group2)
## [1] 1956.1
var(group1)
## [1] 7051.284
var(group2)
## [1] 3062.991
The t-statistic is calculated manually using the formula for independent samples with potentially unequal variances (Welch’s t-test):
\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]
# Calculate t-value manually
t.value <- (mean(group1) - mean(group2)) /
  sqrt(var(group1)/length(group1) + var(group2)/length(group2))
t.value
## [1] 2.317772
# Perform t-test using R's function
t.test(group1, group2)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = 2.3178, df = 12.116, p-value = 0.03873
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.023166 159.801834
## sample estimates:
## mean of x mean of y
## 2038.513 1956.100
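The df = 12.116 reported above is not the pooled-variance value n1 + n2 − 2 = 14; Welch's test instead approximates the degrees of freedom with the Welch-Satterthwaite formula:

\[\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}\]

A quick manual check of this formula (it reproduces the 12.116 above):

# Welch-Satterthwaite degrees of freedom
v1 <- var(group1) / length(group1)
v2 <- var(group2) / length(group2)
(v1 + v2)^2 / (v1^2 / (length(group1) - 1) + v2^2 / (length(group2) - 1))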
Here we visualize the t-distribution with the appropriate degrees of freedom and calculate the p-value manually.
# Create sequence for plotting
t <- seq(-3.5, 3.5, by = 0.01)
# Plot the t-distribution
plot(t, dt(t, df = 12.116), type = "l",
     main = "T-Distribution (df = 12.116)",
     xlab = "t-value", ylab = "Density")
abline(v = t.value, col = "red", lty = 2)
# Calculate p-value manually
# One-tailed p-value
pt(t.value, df = 12.116, lower.tail = FALSE)
## [1] 0.01936709
# Two-tailed p-value
2 * pt(t.value, df = 12.116, lower.tail = FALSE)
## [1] 0.03873418
An equivalent way to perform the t-test using a formula interface with a grouping variable:
# Create grouping variable (0 for group1, 1 for group2)
group <- rep(0:1, each = 8)
# Combine all data
all <- c(group1, group2)
# Perform t-test using formula notation
t.test(all ~ group)
##
## Welch Two Sample t-test
##
## data: all by group
## t = 2.3178, df = 12.116, p-value = 0.03873
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 5.023166 159.801834
## sample estimates:
## mean in group 0 mean in group 1
## 2038.513 1956.100
# Extract p-value from test results
out <- t.test(all ~ group)
names(out)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
out$p.value
## [1] 0.03873408
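The other components listed by names(out) can be extracted the same way, for example:

# Confidence interval and t-statistic from the same htest object
out$conf.int
out$statistic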
When performing many statistical tests simultaneously, the probability of false positives (Type I errors) increases dramatically. This section demonstrates this issue through simulation.
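Under independence, the probability of at least one false positive across m tests, each performed at level α, is

\[P(\text{at least one false positive}) = 1 - (1 - \alpha)^m,\]

which already exceeds 0.99 for m = 100 tests at α = 0.05.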
We generate 10,000 features (e.g., genes) with no true difference between groups. Under the null hypothesis, we expect approximately 5% false positives at α = 0.05.
# Create matrix for 10,000 tests with 16 samples each
data <- matrix(NA, ncol = 16, nrow = 10000)
# Generate null data: same distribution for all samples
for (i in 1:10000) {
  data[i, ] <- rnorm(16, mean = 2000, sd = 100)
}
# Show first row as example
data[1, ]
## [1] 1968.278 1965.951 1924.416 2061.096 1971.210 1981.762 1962.739 1886.672
## [9] 1905.874 2073.810 2035.566 2025.355 2073.500 2064.784 2007.549 2006.342
# Perform a t-test for each feature, preallocating the result vector
p.value <- numeric(10000)
for (i in 1:10000) {
  p.value[i] <- t.test(data[i, ] ~ group)$p.value
}
# Visualize p-value distribution (should be uniform under null)
hist(p.value, breaks = 50, main = "Distribution of P-values Under Null",
     xlab = "P-value", col = "lightblue")
# Count significant tests at α = 0.05
sum(p.value < 0.05)
## [1] 497
Interpretation: Even though there are no true differences (all data comes from the same distribution), we still observe approximately 500 “significant” results (5% of 10,000 tests). This demonstrates the need for multiple testing correction.
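As a preview of the corrections introduced below, applying a Bonferroni threshold of α/m = 0.05/10,000 to these null p-values should leave essentially no significant results (a quick check; the exact count varies from run to run since no seed was set):

# Bonferroni threshold applied to the null simulation
sum(p.value < 0.05 / 10000)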
We now generate 10,000 features (e.g., genes) where 1,000 have a true difference between groups and the remaining 9,000 are null (no difference). This simulates a realistic differential expression scenario.
set.seed(123) # For reproducibility
# Define group structure
group <- rep(0:1, each = 8)
# Create matrix for 10,000 tests with 16 samples each
data <- matrix(NA, ncol = 16, nrow = 10000)
# Number of truly differentially expressed features
n_true_diff <- 1000
# Randomly select which features will be differentially expressed
true_diff_indices <- sample(1:10000, n_true_diff)
# Generate data for each feature
for (i in 1:10000) {
  if (i %in% true_diff_indices) {
    # Differentially expressed: different means between groups
    # Group 0 (samples 1-8): mean = 2000
    # Group 1 (samples 9-16): mean = 2100 (difference of 100)
    data[i, group == 0] <- rnorm(8, mean = 2000, sd = 100)
    data[i, group == 1] <- rnorm(8, mean = 2100, sd = 100)
  } else {
    # Not differentially expressed: same distribution for all samples
    data[i, ] <- rnorm(16, mean = 2000, sd = 100)
  }
}
# Show examples
cat("Example of null feature (row 1):\n")
## Example of null feature (row 1):
print(round(data[1, ], 1))
## [1] 1943.2 2009.8 1955.7 1819.2 2024.5 2167.3 1861.7 2115.0 2200.6 2139.3
## [11] 2011.4 1962.8 2252.1 2157.2 2124.0 2129.3
cat("\nMean group 0:", round(mean(data[1, group == 0]), 1))
##
## Mean group 0: 1987
cat("\nMean group 1:", round(mean(data[1, group == 1]), 1))
##
## Mean group 1: 2122.1
cat("\n\nExample of differentially expressed feature (row", true_diff_indices[1], "):\n")
##
##
## Example of differentially expressed feature (row 2463 ):
print(round(data[true_diff_indices[1], ], 1))
## [1] 1939.6 1957.0 1720.3 1898.8 1943.3 1948.5 1932.6 1913.1 2179.7 2049.3
## [11] 2146.9 2148.8 2327.7 1859.9 2233.7 2017.0
cat("\nMean group 0:", round(mean(data[true_diff_indices[1], group == 0]), 1))
##
## Mean group 0: 1906.7
cat("\nMean group 1:", round(mean(data[true_diff_indices[1], group == 1]), 1))
##
## Mean group 1: 2120.4
# Store true differential expression status for later evaluation
true_status <- rep("null", 10000)
true_status[true_diff_indices] <- "DE"
table(true_status)
## true_status
## DE null
## 1000 9000
# Perform a t-test for each feature, preallocating the result vector
p.value <- numeric(10000)
for (i in 1:10000) {
  p.value[i] <- t.test(data[i, ] ~ group)$p.value
}
# Visualize the p-value distribution: a uniform background from the 9,000 null
# features plus a spike near zero from the 1,000 DE features
hist(p.value, breaks = 50, main = "Distribution of P-values (Null + DE Mixture)",
     xlab = "P-value", col = "lightblue")
# Count significant tests at α = 0.05
sum(p.value < 0.05)
## [1] 890
# Create confusion matrix to evaluate performance
predicted_sig <- p.value < 0.05
table(True = true_status, Predicted = ifelse(predicted_sig, "Significant", "Not Sig"))
## Predicted
## True Not Sig Significant
## DE 540 460
## null 8570 430
# Calculate performance metrics
true_positives <- sum(predicted_sig & true_status == "DE")
false_positives <- sum(predicted_sig & true_status == "null")
false_negatives <- sum(!predicted_sig & true_status == "DE")
true_negatives <- sum(!predicted_sig & true_status == "null")
cat("\nPerformance at α = 0.05 (no correction):\n")
##
## Performance at α = 0.05 (no correction):
cat("True Positives:", true_positives, "\n")
## True Positives: 460
cat("False Positives:", false_positives, "\n")
## False Positives: 430
cat("False Negatives:", false_negatives, "\n")
## False Negatives: 540
cat("True Negatives:", true_negatives, "\n")
## True Negatives: 8570
cat("Sensitivity (Power):", round(true_positives / 1000, 3), "\n")
## Sensitivity (Power): 0.46
cat("False Discovery Rate:", round(false_positives / sum(predicted_sig), 3), "\n")
## False Discovery Rate: 0.483
Interpretation: With 1,000 truly differentially expressed features and 9,000 null features, we can now evaluate the performance of different correction methods. Without correction, we detect many true positives but also have a high false discovery rate (many false positives among the significant results).
FWER methods control the probability of making at least one false positive across all tests.
g <- 10000
alpha <- 0.05
# Bonferroni correction: most conservative
bonferroni <- alpha / g
bonferroni
## [1] 5e-06
# Šidák single-step correction: slightly less conservative
SidakSS <- 1 - (1 - alpha)^(1/g)
SidakSS
## [1] 5.129316e-06
# Count significant results
sum(p.value < bonferroni)
## [1] 1
sum(p.value < SidakSS)
## [1] 1
# Evaluate performance with Bonferroni correction
bonf_sig <- p.value < bonferroni
cat("\nBonferroni Performance:\n")
##
## Bonferroni Performance:
cat("Significant features:", sum(bonf_sig), "\n")
## Significant features: 1
cat("True Positives:", sum(bonf_sig & true_status == "DE"), "\n")
## True Positives: 1
cat("False Positives:", sum(bonf_sig & true_status == "null"), "\n")
## False Positives: 0
cat("Sensitivity:", round(sum(bonf_sig & true_status == "DE") / 1000, 3), "\n")
## Sensitivity: 0.001
if (sum(bonf_sig) > 0) {
  cat("FDR:", round(sum(bonf_sig & true_status == "null") / sum(bonf_sig), 3), "\n")
}
## FDR: 0
Step-down methods are more powerful than single-step methods as they adjust significance thresholds based on the rank of each p-value.
# Holm's and Šidák's step-down thresholds, one per p-value rank
holm <- numeric(g)
SidakSD <- numeric(g)
for (i in 1:g) {
  holm[i] <- alpha / (g - i + 1)
  SidakSD[i] <- 1 - (1 - alpha)^(1/(g - i + 1))
}
# Show first 10 thresholds
head(holm, 10)
## [1] 5.000000e-06 5.000500e-06 5.001000e-06 5.001500e-06 5.002001e-06
## [6] 5.002501e-06 5.003002e-06 5.003502e-06 5.004003e-06 5.004504e-06
head(SidakSD, 10)
## [1] 5.129316e-06 5.129829e-06 5.130342e-06 5.130856e-06 5.131369e-06
## [6] 5.131882e-06 5.132396e-06 5.132909e-06 5.133423e-06 5.133937e-06
# Count significant results with step-down methods; cumprod() enforces the
# sequential stopping rule (reject in order of increasing p-value, stopping
# at the first p-value that exceeds its threshold)
sum(cumprod(sort(p.value) < holm))
## [1] 1
sum(cumprod(sort(p.value) < SidakSD))
## [1] 1
# Evaluate Holm's method performance (again applying the sequential stopping rule)
holm_sig <- cumprod(sort(p.value) < holm) > 0
holm_features <- order(p.value)[holm_sig]
cat("\nHolm's Method Performance:\n")
##
## Holm's Method Performance:
cat("Significant features:", sum(holm_sig), "\n")
## Significant features: 1
cat("True Positives:", sum(true_status[holm_features] == "DE"), "\n")
## True Positives: 1
cat("False Positives:", sum(true_status[holm_features] == "null"), "\n")
## False Positives: 0
cat("Sensitivity:", round(sum(true_status[holm_features] == "DE") / 1000, 3), "\n")
## Sensitivity: 0.001
if (sum(holm_sig) > 0) {
  cat("FDR:", round(sum(true_status[holm_features] == "null") / sum(holm_sig), 3), "\n")
}
## FDR: 0
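For routine use, R's built-in p.adjust() implements these corrections directly, including Holm's sequential logic; the counts below should match the manual threshold comparisons above:

# Built-in equivalents of the manual FWER corrections
sum(p.adjust(p.value, method = "bonferroni") < alpha)
sum(p.adjust(p.value, method = "holm") < alpha)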
This example uses DNA methylation data to test for differences between males and females across 1,490 CpG sites.
library(Biobase)
# Load methylation data (update path as needed)
load("/Users/mdozmorov/Documents/Work/Teaching/Fall 2015/07_Differential Expression/Methyl.RData")
# Explore the data
ls()
methyl # ExpressionSet with 1490 features, 95 samples
# View expression values
exprs(methyl)[1:5, 1:2]
# View sample information
pData(methyl)
# Visualize distribution of first sample
plot(density(exprs(methyl)[, 1]),
     main = "Distribution of Methylation Values",
     xlab = "Methylation Beta Value")
# Calculate adjusted alpha levels for 1490 tests
g <- 1490
alpha <- 0.05
bonferroni <- alpha / g
SidakSS <- 1 - (1 - alpha)^(1/g)
holm <- numeric(g)
SidakSD <- numeric(g)
for (i in 1:g) {
  holm[i] <- alpha / (g - i + 1)
  SidakSD[i] <- 1 - (1 - alpha)^(1/(g - i + 1))
}
# Compare first 10 thresholds
cbind(holm, SidakSD)[1:10, ]
We now test each CpG site for differential methylation between males and females.
# Check gender distribution
table(pData(methyl)$Gender.rep)
# Create binary gender variable (0 = female, 1 = male)
gender <- pData(methyl)$Gender.rep - 1
table(gender)
# Perform a t-test for each CpG site, preallocating the result vector
pvalue <- numeric(nrow(exprs(methyl)))
for (i in seq_len(nrow(exprs(methyl)))) {
  exprs_i <- as.numeric(exprs(methyl)[i, ])
  pvalue[i] <- t.test(exprs_i[gender == 0], exprs_i[gender == 1])$p.value
}
# Count significant results with FWER methods
sum(pvalue < bonferroni)
sum(pvalue < SidakSS)
# Step-down counts again use the sequential stopping rule
sum(cumprod(sort(pvalue) < holm))
sum(cumprod(sort(pvalue) < SidakSD))
FDR methods control the expected proportion of false positives among rejected hypotheses, providing more power than FWER methods.
The Benjamini-Hochberg (BH) procedure is the most widely used FDR control method.
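With ordered p-values p(1) ≤ … ≤ p(m), BH rejects the hypotheses corresponding to the k* smallest p-values, where

\[k^* = \max\left\{k : p_{(k)} \leq \frac{k\,\alpha}{m}\right\}\]

(here m = g = 1490). Note that all k* top-ranked hypotheses are rejected, even if some of their p-values individually exceed their own threshold; the manual implementation below reflects this.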
# Manual implementation of the BH step-up procedure
k <- 1:g
sorted.p <- sort(pvalue)
# Find the largest rank k with p_(k) <= alpha * k / g; that many hypotheses
# are rejected (all smaller ranks are rejected even if individually above
# their own threshold)
below <- sorted.p <= (alpha * k) / g
if (any(below)) max(which(below)) else 0
# BH adjustment using the built-in p.adjust() from the stats package
# (attached by default)
bh.adj <- p.adjust(pvalue, method = "BH")
sum(bh.adj < alpha)
The q-value approach provides an alternative FDR estimation method.
library(qvalue)
# Calculate q-values
qval <- qvalue(pvalue)
names(qval)
# Count significant features at FDR < 0.05
sum(qval$qvalues < 0.05)
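The qvalue object also contains pi0, the estimated overall proportion of true null hypotheses, which the q-value method uses to gain power over BH:

# Estimated proportion of true null features
qval$pi0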
The limma package provides a sophisticated approach to differential expression analysis using moderated t-statistics, which “borrows information” across features to improve power.
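In brief (following Smyth's empirical Bayes formulation), limma shrinks each feature's sample variance toward a pooled prior value estimated from all features:

\[\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}\]

where s_g^2 and d_g are feature g's sample variance and residual degrees of freedom, and s_0^2 and d_0 are the prior variance and prior degrees of freedom. The moderated t-statistic uses this shrunken variance in place of the feature's own, stabilizing inference when sample sizes are small.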
library(limma)
# Create design matrix with one column per group (no intercept)
f <- factor(gender, levels = c(0, 1))
design <- model.matrix(~0 + f)
# Fit linear model
fit <- lmFit(methyl, design = design)
# Define contrast (difference between genders)
contrast.matrix <- makeContrasts(deg = f1 - f0, levels = design)
# Fit contrasts
fit2 <- contrasts.fit(fit, contrast.matrix)
# Apply empirical Bayes moderation
fit3 <- eBayes(fit2)
# Significant sites with no multiple testing correction
out <- topTable(fit3, adjust.method = "none", number = Inf, sort.by = "none")
sign <- rownames(out)[out$adj.P.Val < 0.05]
length(sign)
# Significant sites with BH (FDR) adjustment
out <- topTable(fit3, adjust.method = "BH", number = Inf, sort.by = "none")
sign <- rownames(out)[out$adj.P.Val < 0.05]
length(sign)
# Significant sites with Holm (FWER) adjustment
out <- topTable(fit3, adjust.method = "holm", number = Inf, sort.by = "none")
sign <- rownames(out)[out$adj.P.Val < 0.05]
length(sign)
This document demonstrated two-sample t-tests (manual calculation and t.test()), how multiple testing inflates false positives, FWER corrections (Bonferroni and Šidák single-step; Holm and Šidák step-down), FDR control (Benjamini-Hochberg and q-values), and moderated t-statistics with limma.
Key Takeaway: Always correct for multiple testing when performing many simultaneous hypothesis tests. Choose FWER methods when strict control of false positives is critical, or FDR methods when greater power is needed and some false positives are acceptable.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.0.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.53
## [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [9] lifecycle_1.0.4 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
## [13] compiler_4.5.1 rstudioapi_0.17.1 tools_4.5.1 evaluate_1.0.5
## [17] bslib_0.9.0 yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0