Single linkage clustering merges the two clusters whose minimum pairwise distance is the smallest among all pairs. When clusters have multiple members, the distance between them is defined as the minimum distance between any pair of points across clusters.
We start with a simple dataset of 8 individuals with weight (kg) and height (cm).
weight.kg <- c(15, 49, 13, 45, 85, 66, 12, 10)
height.cm <- c(95, 156, 95, 160, 178, 176, 90, 78)
data <- matrix(cbind(weight.kg, height.cm), nrow = 8)
dimnames(data)[[2]] <- c("weight.kg", "height.cm")
data
## weight.kg height.cm
## [1,] 15 95
## [2,] 49 156
## [3,] 13 95
## [4,] 45 160
## [5,] 85 178
## [6,] 66 176
## [7,] 12 90
## [8,] 10 78
plot(data[,1], data[,2], type = "n",
xlab = "Weight (kg)", ylab = "Height (cm)",
main = "Sample points (1–8)")
text(data[,1], data[,2], labels = 1:8)
This scatter plot helps visualize the relative proximity of the data points.
We compute the Euclidean distances between all pairs of observations.
# Pairwise distances
dist_matrix <- dist(data, diag = TRUE, upper = TRUE)
round(dist_matrix, 2)
## 1 2 3 4 5 6 7 8
## 1 0.00 69.84 2.00 71.59 108.58 95.72 5.83 17.72
## 2 69.84 0.00 70.83 5.66 42.19 26.25 75.66 87.21
## 3 2.00 70.83 0.00 72.45 109.88 96.80 5.10 17.26
## 4 71.59 5.66 72.45 0.00 43.86 26.40 77.39 89.16
## 5 108.58 42.19 109.88 43.86 0.00 19.10 114.34 125.00
## 6 95.72 26.25 96.80 26.40 19.10 0.00 101.55 112.87
## 7 5.83 75.66 5.10 77.39 114.34 101.55 0.00 12.17
## 8 17.72 87.21 17.26 89.16 125.00 112.87 12.17 0.00
dist() calculates the Euclidean distance by default. You
can verify distances manually:
sqrt((49 - 45)^2 + (156 - 160)^2)
## [1] 5.656854
The hclust() function performs hierarchical clustering.
We specify method = "single" to use minimum
inter-cluster distance.
hclust.single <- hclust(dist(data), method = "single")
# Structure of result
names(hclust.single)
## [1] "merge" "height" "order" "labels" "method"
## [6] "call" "dist.method"
hclust.single$height
## [1] 2.000000 5.099020 5.656854 12.165525 19.104973 26.248809 69.835521
$height gives the distances (linkage heights) at which
clusters merge.
plot(hclust.single, main = "Single Linkage Dendrogram",
xlab = "Observation", ylab = "Distance")
This plot shows how clusters are formed iteratively based on minimum distance.
We can manually verify the algorithm. First, compute all pairwise distances and find the minimum.
dist <- dist(data, diag = TRUE)
round(dist, 2)
## 1 2 3 4 5 6 7 8
## 1 0.00
## 2 69.84 0.00
## 3 2.00 70.83 0.00
## 4 71.59 5.66 72.45 0.00
## 5 108.58 42.19 109.88 43.86 0.00
## 6 95.72 26.25 96.80 26.40 19.10 0.00
## 7 5.83 75.66 5.10 77.39 114.34 101.55 0.00
## 8 17.72 87.21 17.26 89.16 125.00 112.87 12.17 0.00
min(as.dist(dist)[as.dist(dist) != 0]) # smallest nonzero distance
## [1] 2
The smallest distance corresponds to the pair (1,3). These two points will be merged first.
We manually update the distance matrix using the single linkage rule — distance between clusters = minimum pairwise distance.
old.dist <- as.matrix(dist)
new.dist <- old.dist
# Set intra-cluster distance to 0 for merged points
new.dist[1,3] <- new.dist[3,1] <- 0
# Update distances for cluster (1,3)
for (i in 1:8) {
new.dist[1,i] <- new.dist[i,1] <- min(old.dist[1,i], old.dist[3,i])
new.dist[3,i] <- new.dist[i,3] <- min(old.dist[1,i], old.dist[3,i])
}
# Verify matrix symmetry
all.equal(new.dist, t(new.dist))
## [1] TRUE
round(as.dist(new.dist), 2)
## 1 2 3 4 5 6 7
## 2 69.84
## 3 0.00 69.84
## 4 71.59 5.66 71.59
## 5 108.58 42.19 108.58 43.86
## 6 95.72 26.25 95.72 26.40 19.10
## 7 5.10 75.66 5.10 77.39 114.34 101.55
## 8 17.26 87.21 17.26 89.16 125.00 112.87 12.17
We continue by finding the next smallest distance and merging.
min(as.dist(new.dist)[as.dist(new.dist) != 0]) # 5.099
## [1] 5.09902
# Merge cluster (1,3) with point 7
new.dist2 <- new.dist
for (i in 1:8) {
new.dist2[1,i] <- new.dist2[i,1] <-
min(old.dist[1,i], old.dist[3,i], old.dist[7,i])
}
all.equal(new.dist2, t(new.dist2))
## [1] TRUE
round(as.dist(new.dist2), 3)
## 1 2 3 4 5 6 7
## 2 69.836
## 3 0.000 69.836
## 4 71.589 5.657 71.589
## 5 108.577 42.190 108.577 43.863
## 6 95.718 26.249 95.718 26.401 19.105
## 7 0.000 75.664 5.099 77.389 114.337 101.548
## 8 12.166 87.207 17.263 89.157 125.000 112.872 12.166
Similarly, we merge points (2,4) and update distances.
min(as.dist(new.dist2)[as.dist(new.dist2) != 0]) # 5.657
## [1] 5.09902
new.dist3 <- new.dist2
for (i in 1:8) {
new.dist3[2,i] <- new.dist3[i,2] <-
min(old.dist[2,i], old.dist[4,i])
}
round(as.dist(new.dist3), 3)
## 1 2 3 4 5 6 7
## 2 69.836
## 3 0.000 70.831
## 4 71.589 0.000 71.589
## 5 108.577 42.190 108.577 43.863
## 6 95.718 26.249 95.718 26.401 19.105
## 7 0.000 75.664 5.099 77.389 114.337 101.548
## 8 12.166 87.207 17.263 89.157 125.000 112.872 12.166
The smallest distance after this step corresponds to the next merge (8 with cluster 1,3,7).
hclust.single$height
## [1] 2.000000 5.099020 5.656854 12.165525 19.104973 26.248809 69.835521
The merge heights from hclust() match our manual
computation: 2.00, 5.10, 5.66, 12.17, ...
We can “cut” the dendrogram at a specific number of clusters
(k) or a given height (h).
# Cut into 2 clusters
cut.single <- cutree(hclust.single, k = 2)
cut.single
## [1] 1 2 1 2 2 2 1 1
# Alternatively, cut at height threshold
cut.single <- cutree(hclust.single, h = 20)
cut.single
## [1] 1 2 1 2 3 3 1 1
These assignments group samples based on single linkage distance.
hclust() reproduces manual results efficiently.