Introduction

This document demonstrates statistical hypothesis testing concepts in R, including:

  1. Two-sample t-tests
  2. Multiple hypothesis testing problems
  3. Family-wise error rate (FWER) correction methods
  4. False discovery rate (FDR) methods
  5. Differential expression analysis using limma

Two-Sample T-Test

Basic T-Test Example

Here we demonstrate a basic two-sample t-test comparing two groups of measurements. This test assesses whether the means of two groups are statistically different from each other.

# Create two groups of data
group1 <- c(2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7)
group2 <- c(1974.6, 2027.6, 1914.8, 1955.8, 1963, 2025.5, 1865.1, 1922.4)

# Check sample sizes
length(group1)
## [1] 8
length(group2)
## [1] 8
# Calculate summary statistics
mean(group1)
## [1] 2038.513
mean(group2)
## [1] 1956.1
var(group1)
## [1] 7051.284
var(group2)
## [1] 3062.991

Manual T-Test Calculation

The t-statistic is calculated manually using the formula for independent samples with potentially unequal variances (Welch’s t-test):

\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

# Calculate t-value manually
t.value <- (mean(group1) - mean(group2)) / 
           sqrt(var(group1)/length(group1) + var(group2)/length(group2))
t.value
## [1] 2.317772
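Welch's test also approximates the degrees of freedom via the Welch–Satterthwaite formula. A minimal sketch, with `welch_df` a small helper defined here (not part of base R):

```r
# Welch-Satterthwaite approximation of the degrees of freedom
# for two independent samples with unequal variances
welch_df <- function(x, y) {
  v1 <- var(x) / length(x)  # squared standard error contribution of sample 1
  v2 <- var(y) / length(y)  # squared standard error contribution of sample 2
  (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))
}

welch_df(group1, group2)  # ~12.116, matching the df reported by t.test()
```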

Using R’s Built-in T-Test Function

# Perform t-test using R's function
t.test(group1, group2)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = 2.3178, df = 12.116, p-value = 0.03873
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##    5.023166 159.801834
## sample estimates:
## mean of x mean of y 
##  2038.513  1956.100

Visualizing the T-Distribution

Here we visualize the t-distribution with the appropriate degrees of freedom and calculate the p-value manually.

# Create sequence for plotting
t <- seq(-3.5, 3.5, by = 0.01)

# Plot the t-distribution
plot(t, dt(t, df = 12.116), type = "l", 
     main = "T-Distribution (df = 12.116)",
     xlab = "t-value", ylab = "Density")
abline(v = t.value, col = "red", lty = 2)

# Calculate p-value manually
# One-tailed p-value
pt(t.value, df = 12.116, lower.tail = FALSE)
## [1] 0.01936709
# Two-tailed p-value
2 * pt(t.value, df = 12.116, lower.tail = FALSE)
## [1] 0.03873418

Alternative T-Test Formulation

An equivalent way to perform the t-test using a formula interface with a grouping variable:

# Create grouping variable (0 for group1, 1 for group2)
group <- rep(0:1, each = 8)

# Combine all data
all <- c(group1, group2)

# Perform t-test using formula notation
t.test(all ~ group)
## 
##  Welch Two Sample t-test
## 
## data:  all by group
## t = 2.3178, df = 12.116, p-value = 0.03873
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##    5.023166 159.801834
## sample estimates:
## mean in group 0 mean in group 1 
##        2038.513        1956.100
# Extract p-value from test results
out <- t.test(all ~ group)
names(out)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
out$p.value
## [1] 0.03873408

Multiple Hypothesis Testing

The Multiple Testing Problem

When performing many statistical tests simultaneously, the probability of false positives (Type I errors) increases dramatically. This section demonstrates this issue through simulation.
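The scale of the problem is easy to quantify: for m independent tests each at α = 0.05, the probability of at least one false positive is 1 − (1 − α)^m. A quick illustration:

```r
# Probability of at least one false positive among m independent tests
# at alpha = 0.05; this climbs toward 1 very quickly
m <- c(1, 10, 100, 1000)
round(1 - (1 - 0.05)^m, 3)  # 0.050 0.401 0.994 1.000
```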

Simulating Null Data

We generate 10,000 features (e.g., genes) with no true difference between groups. Under the null hypothesis, we expect approximately 5% false positives at α = 0.05.

# Create matrix for 10,000 tests with 16 samples each
data <- matrix(NA, ncol = 16, nrow = 10000)

# Generate null data: same distribution for all samples
for (i in 1:10000) {
  data[i, ] <- rnorm(16, mean = 2000, sd = 100)
}

# Show first row as example
data[1, ]
##  [1] 1968.278 1965.951 1924.416 2061.096 1971.210 1981.762 1962.739 1886.672
##  [9] 1905.874 2073.810 2035.566 2025.355 2073.500 2064.784 2007.549 2006.342

Performing Multiple T-Tests

# Perform a t-test for each feature (preallocate the result vector)
p.value <- numeric(10000)
for (i in 1:10000) {
  p.value[i] <- t.test(data[i, ] ~ group)$p.value
}

# Visualize p-value distribution (should be uniform under null)
hist(p.value, breaks = 50, main = "Distribution of P-values Under Null",
     xlab = "P-value", col = "lightblue")

# Count significant tests at α = 0.05
sum(p.value < 0.05)
## [1] 497

Interpretation: Even though there are no true differences (all data comes from the same distribution), we still observe approximately 500 “significant” results (5% of 10,000 tests). This demonstrates the need for multiple testing correction.
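As an aside, the explicit loop above can be written more compactly with `apply` (assuming `data` and `group` as defined above); the result is the same vector of p-values:

```r
# One t-test per row of the data matrix
p.value <- apply(data, 1, function(x) t.test(x ~ group)$p.value)
```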

Simulating Data with True Differences

We generate 10,000 features (e.g., genes) where approximately 1,000 have true differences between groups, while 9,000 are null (no difference). This simulates a realistic differential expression scenario.

set.seed(123)  # For reproducibility

# Define group structure
group <- rep(0:1, each = 8)

# Create matrix for 10,000 tests with 16 samples each
data <- matrix(NA, ncol = 16, nrow = 10000)

# Number of truly differentially expressed features
n_true_diff <- 1000

# Randomly select which features will be differentially expressed
true_diff_indices <- sample(1:10000, n_true_diff)

# Generate data for each feature
for (i in 1:10000) {
  if (i %in% true_diff_indices) {
    # Differentially expressed: different means between groups
    # Group 0 (samples 1-8): mean = 2000
    # Group 1 (samples 9-16): mean = 2100 (difference of 100)
    data[i, group == 0] <- rnorm(8, mean = 2000, sd = 100)
    data[i, group == 1] <- rnorm(8, mean = 2100, sd = 100)
  } else {
    # Not differentially expressed: same distribution for all samples
    data[i, ] <- rnorm(16, mean = 2000, sd = 100)
  }
}

# Show examples
cat("Example feature (row 1):\n")
## Example feature (row 1):
print(round(data[1, ], 1))
##  [1] 1943.2 2009.8 1955.7 1819.2 2024.5 2167.3 1861.7 2115.0 2200.6 2139.3
## [11] 2011.4 1962.8 2252.1 2157.2 2124.0 2129.3
cat("\nMean group 0:", round(mean(data[1, group == 0]), 1))
## 
## Mean group 0: 1987
cat("\nMean group 1:", round(mean(data[1, group == 1]), 1))
## 
## Mean group 1: 2122.1
cat("\n\nExample of differentially expressed feature (row", true_diff_indices[1], "):\n")
## 
## 
## Example of differentially expressed feature (row 2463 ):
print(round(data[true_diff_indices[1], ], 1))
##  [1] 1939.6 1957.0 1720.3 1898.8 1943.3 1948.5 1932.6 1913.1 2179.7 2049.3
## [11] 2146.9 2148.8 2327.7 1859.9 2233.7 2017.0
cat("\nMean group 0:", round(mean(data[true_diff_indices[1], group == 0]), 1))
## 
## Mean group 0: 1906.7
cat("\nMean group 1:", round(mean(data[true_diff_indices[1], group == 1]), 1))
## 
## Mean group 1: 2120.4
# Store true differential expression status for later evaluation
true_status <- rep("null", 10000)
true_status[true_diff_indices] <- "DE"
table(true_status)
## true_status
##   DE null 
## 1000 9000

Performing Multiple T-Tests

# Perform a t-test for each feature (preallocate the result vector)
p.value <- numeric(10000)
for (i in 1:10000) {
  p.value[i] <- t.test(data[i, ] ~ group)$p.value
}

# Visualize p-value distribution: uniform for the null features,
# with an excess near zero contributed by the truly DE features
hist(p.value, breaks = 50, main = "Distribution of P-values (Null + DE Mixture)",
     xlab = "P-value", col = "lightblue")

# Count significant tests at α = 0.05
sum(p.value < 0.05)
## [1] 890
# Create confusion matrix to evaluate performance
predicted_sig <- p.value < 0.05
table(True = true_status, Predicted = ifelse(predicted_sig, "Significant", "Not Sig"))
##       Predicted
## True   Not Sig Significant
##   DE       540         460
##   null    8570         430
# Calculate performance metrics
true_positives <- sum(predicted_sig & true_status == "DE")
false_positives <- sum(predicted_sig & true_status == "null")
false_negatives <- sum(!predicted_sig & true_status == "DE")
true_negatives <- sum(!predicted_sig & true_status == "null")

cat("\nPerformance at α = 0.05 (no correction):\n")
## 
## Performance at α = 0.05 (no correction):
cat("True Positives:", true_positives, "\n")
## True Positives: 460
cat("False Positives:", false_positives, "\n")
## False Positives: 430
cat("False Negatives:", false_negatives, "\n")
## False Negatives: 540
cat("True Negatives:", true_negatives, "\n")
## True Negatives: 8570
cat("Sensitivity (Power):", round(true_positives / 1000, 3), "\n")
## Sensitivity (Power): 0.46
cat("False Discovery Rate:", round(false_positives / sum(predicted_sig), 3), "\n")
## False Discovery Rate: 0.483

Interpretation: With 1,000 truly differentially expressed features and 9,000 null features, we can now evaluate the performance of different correction methods. Without correction, we detect many true positives but also have a high false discovery rate (many false positives among the significant results).

Family-Wise Error Rate (FWER) Corrections

FWER methods control the probability of making at least one false positive across all tests.

Single-Step Methods

g <- 10000
alpha <- 0.05

# Bonferroni correction: most conservative
bonferroni <- alpha / g
bonferroni
## [1] 5e-06
# Šidák single-step correction: slightly less conservative
SidakSS <- 1 - (1 - alpha)^(1/g)
SidakSS
## [1] 5.129316e-06
# Count significant results
sum(p.value < bonferroni)
## [1] 1
sum(p.value < SidakSS)
## [1] 1
# Evaluate performance with Bonferroni correction
bonf_sig <- p.value < bonferroni
cat("\nBonferroni Performance:\n")
## 
## Bonferroni Performance:
cat("Significant features:", sum(bonf_sig), "\n")
## Significant features: 1
cat("True Positives:", sum(bonf_sig & true_status == "DE"), "\n")
## True Positives: 1
cat("False Positives:", sum(bonf_sig & true_status == "null"), "\n")
## False Positives: 0
cat("Sensitivity:", round(sum(bonf_sig & true_status == "DE") / 1000, 3), "\n")
## Sensitivity: 0.001
if(sum(bonf_sig) > 0) {
  cat("FDR:", round(sum(bonf_sig & true_status == "null") / sum(bonf_sig), 3), "\n")
}
## FDR: 0
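Comparing raw p-values to α/g is equivalent to comparing Bonferroni-adjusted p-values to α; the built-in `p.adjust` makes this a one-liner (assuming `p.value` as above):

```r
# p.adjust multiplies each p-value by the number of tests (capped at 1),
# so comparing the adjusted values to alpha reproduces p < alpha/g
sum(p.adjust(p.value, method = "bonferroni") < 0.05)
```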

Step-Down Methods

Step-down methods are more powerful than single-step methods as they adjust significance thresholds based on the rank of each p-value.
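One caveat about the per-rank counting used below: Holm's procedure is sequential and stops at the first non-rejection, so simply counting sorted p-values below their rank-specific thresholds can overcount when the rejections are not contiguous. A strict step-down sketch, with `holm_reject` a helper defined here:

```r
# Strict Holm step-down: walk through the sorted p-values and stop at the
# first one that fails its threshold alpha / (g - i + 1)
holm_reject <- function(p, alpha = 0.05) {
  g <- length(p)
  thresholds <- alpha / (g - seq_len(g) + 1)
  pass <- sort(p) <= thresholds
  first_fail <- which(!pass)[1]
  if (is.na(first_fail)) g else first_fail - 1  # number of rejections
}
```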

# Holm's step-down method
holm <- numeric()
SidakSD <- numeric()

for (i in 1:g) {
  holm[i] <- alpha / (g - i + 1)
  SidakSD[i] <- 1 - (1 - alpha)^(1/(g - i + 1))
}

# Show first 10 thresholds
head(holm, 10)
##  [1] 5.000000e-06 5.000500e-06 5.001000e-06 5.001500e-06 5.002001e-06
##  [6] 5.002501e-06 5.003002e-06 5.003502e-06 5.004003e-06 5.004504e-06
head(SidakSD, 10)
##  [1] 5.129316e-06 5.129829e-06 5.130342e-06 5.130856e-06 5.131369e-06
##  [6] 5.131882e-06 5.132396e-06 5.132909e-06 5.133423e-06 5.133937e-06
# Count significant results with step-down methods
sum(sort(p.value) < holm)
## [1] 1
sum(sort(p.value) < SidakSD)
## [1] 1
# Evaluate Holm's method performance
holm_sig <- sort(p.value) < holm
holm_features <- order(p.value)[holm_sig]
cat("\nHolm's Method Performance:\n")
## 
## Holm's Method Performance:
cat("Significant features:", sum(holm_sig), "\n")
## Significant features: 1
cat("True Positives:", sum(true_status[holm_features] == "DE"), "\n")
## True Positives: 1
cat("False Positives:", sum(true_status[holm_features] == "null"), "\n")
## False Positives: 0
cat("Sensitivity:", round(sum(true_status[holm_features] == "DE") / 1000, 3), "\n")
## Sensitivity: 0.001
if(sum(holm_sig) > 0) {
  cat("FDR:", round(sum(true_status[holm_features] == "null") / sum(holm_sig), 3), "\n")
}
## FDR: 0

Real-World Example: DNA Methylation Data

Loading and Exploring the Data

This example uses DNA methylation data to test for differences between males and females across 1,490 CpG sites.

library(Biobase)

# Load methylation data (update path as needed)
load("/Users/mdozmorov/Documents/Work/Teaching/Fall 2015/07_Differential Expression/Methyl.RData")

# Explore the data
ls()
methyl  # ExpressionSet with 1490 features, 95 samples

# View expression values
exprs(methyl)[1:5, 1:2]

# View sample information
pData(methyl)

# Visualize distribution of first sample
plot(density(exprs(methyl)[, 1]), 
     main = "Distribution of Methylation Values",
     xlab = "Methylation Beta Value")

FWER Corrections for Real Data

# Calculate adjusted alpha levels for 1490 tests
g <- 1490
alpha <- 0.05

bonferroni <- alpha / g
SidakSS <- 1 - (1 - alpha)^(1/g)

holm <- numeric()
SidakSD <- numeric()

for (i in 1:g) {
  holm[i] <- alpha / (g - i + 1)
  SidakSD[i] <- 1 - (1 - alpha)^(1/(g - i + 1))
}

# Compare first 10 thresholds
cbind(holm, SidakSD)[1:10, ]

Differential Methylation Analysis

Testing each CpG site for differential methylation between males and females.

# Check gender distribution
table(pData(methyl)$Gender.rep)

# Create binary gender variable (assumes Gender.rep is coded 1/2; recoded to 0/1)
gender <- pData(methyl)$Gender.rep - 1
table(gender)

# Perform a t-test for each CpG site (preallocate the result vector)
pvalue <- numeric(nrow(exprs(methyl)))
for (i in seq_len(nrow(exprs(methyl)))) {
  exprs_i <- as.numeric(exprs(methyl)[i, ])
  pvalue[i] <- t.test(exprs_i[gender == 0], exprs_i[gender == 1])$p.value
}

# Count significant results with FWER methods
sum(pvalue < bonferroni)
sum(pvalue < SidakSS)
sum(sort(pvalue) < holm)
sum(sort(pvalue) < SidakSD)

False Discovery Rate (FDR) Methods

FDR methods control the expected proportion of false positives among rejected hypotheses, providing more power than FWER methods.

Benjamini-Hochberg Procedure

The Benjamini-Hochberg (BH) procedure is the most widely used FDR control method.

# Manual implementation of the BH procedure
k <- 1:g
sorted.p <- sort(pvalue)

# Find the largest rank k with sorted.p[k] <= alpha * k / g;
# BH rejects that many of the smallest p-values
below <- sorted.p <= (alpha * k) / g
n.reject <- if (any(below)) max(which(below)) else 0
n.reject

Using p.adjust Function

# BH adjustment using the built-in p.adjust() (in the always-attached stats package);
# the result is named p.bh rather than `by`, which would shadow base::by
p.bh <- p.adjust(pvalue, method = "BH")
sum(p.bh < 0.05)

Q-value Method

The q-value approach provides an alternative FDR estimation method.

library(qvalue)

# Calculate q-values
qval <- qvalue(pvalue)
names(qval)

# Count significant features at FDR < 0.05
sum(qval$qvalues < 0.05)
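One useful extra in the `qvalue` output is π₀, the estimated proportion of true null hypotheses; q-values scale the BH-style FDR estimate by this quantity, which is where the method's additional power comes from:

```r
# Estimated proportion of true nulls; values near 1 indicate few real signals
qval$pi0
```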

Differential Expression with Limma

The limma package provides a sophisticated approach to differential expression analysis using moderated t-statistics, which “borrows information” across features to improve power.

Setting Up the Linear Model

library(limma)

# Create design matrix with one column per gender level
f <- factor(gender, levels = c(0, 1))  # explicit level order for predictable column names
design <- model.matrix(~0 + f)

# Fit linear model
fit <- lmFit(methyl, design = design)

# Define contrast (difference between genders)
contrast.matrix <- makeContrasts(deg = f1 - f0, levels = design)

# Fit contrasts
fit2 <- contrasts.fit(fit, contrast.matrix)

# Apply empirical Bayes moderation
fit3 <- eBayes(fit2)

Comparing Adjustment Methods in Limma

No Adjustment

out <- topTable(fit3, adjust.method = "none", number = Inf, sort.by = "none")
# With adjust.method = "none", adj.P.Val equals the raw P.Value
sign <- rownames(out)[out$adj.P.Val < 0.05]
length(sign)

Benjamini-Hochberg FDR

out <- topTable(fit3, adjust.method = "BH", number = Inf, sort.by = "none")
sign <- rownames(out)[out$adj.P.Val < 0.05]
length(sign)

Holm’s FWER Method

out <- topTable(fit3, adjust.method = "holm", number = Inf, sort.by = "none")
sign <- rownames(out)[out$adj.P.Val < 0.05]
length(sign)

Summary

This document demonstrated:

  • T-tests: Both manual calculation and built-in functions for comparing two groups
  • Multiple testing problem: How conducting many tests inflates Type I error rates
  • FWER corrections: Conservative methods (Bonferroni, Šidák, Holm) that control family-wise error
  • FDR methods: More powerful approaches (BH, q-value) that control false discovery rate
  • Limma package: Advanced differential expression analysis with moderated t-statistics

Key Takeaway: Always correct for multiple testing when performing many simultaneous hypothesis tests. Choose FWER methods when strict control of false positives is critical, or FDR methods when greater power is needed and some false positives are acceptable.
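In practice, `p.adjust` makes it easy to compare the methods side by side on a single vector of p-values (a sketch, assuming `pvalue` as in the methylation example):

```r
# Count of features significant at 0.05 under each adjustment method
methods <- c("none", "bonferroni", "holm", "BH", "BY")
sapply(methods, function(m) sum(p.adjust(pvalue, method = m) < 0.05))
```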

Session Information

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.0.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.6.1          fastmap_1.2.0     xfun_0.53        
##  [5] cachem_1.1.0      knitr_1.50        htmltools_0.5.8.1 rmarkdown_2.29   
##  [9] lifecycle_1.0.4   cli_3.6.5         sass_0.4.10       jquerylib_0.1.4  
## [13] compiler_4.5.1    rstudioapi_0.17.1 tools_4.5.1       evaluate_1.0.5   
## [17] bslib_0.9.0       yaml_2.3.10       rlang_1.1.6       jsonlite_2.0.0