Introduction

Single linkage clustering merges the two clusters whose minimum pairwise distance is the smallest among all pairs. When clusters have multiple members, the distance between them is defined as the minimum distance between any pair of points across clusters.

1. Setting up the data

We start with a simple dataset of 8 individuals with weight (kg) and height (cm).

weight.kg <- c(15, 49, 13, 45, 85, 66, 12, 10)
height.cm <- c(95, 156, 95, 160, 178, 176, 90, 78)

data <- matrix(cbind(weight.kg, height.cm), nrow = 8)
dimnames(data)[[2]] <- c("weight.kg", "height.cm")
data
##      weight.kg height.cm
## [1,]        15        95
## [2,]        49       156
## [3,]        13        95
## [4,]        45       160
## [5,]        85       178
## [6,]        66       176
## [7,]        12        90
## [8,]        10        78

2. Visualizing the data

plot(data[,1], data[,2], type = "n",
     xlab = "Weight (kg)", ylab = "Height (cm)",
     main = "Sample points (1–8)")
text(data[,1], data[,2], labels = 1:8)

This scatter plot helps visualize the relative proximity of the data points.

3. Pairwise distances

We compute the Euclidean distances between all pairs of observations.

# Pairwise distances
dist_matrix <- dist(data, diag = TRUE, upper = TRUE)
round(dist_matrix, 2)
##        1      2      3      4      5      6      7      8
## 1   0.00  69.84   2.00  71.59 108.58  95.72   5.83  17.72
## 2  69.84   0.00  70.83   5.66  42.19  26.25  75.66  87.21
## 3   2.00  70.83   0.00  72.45 109.88  96.80   5.10  17.26
## 4  71.59   5.66  72.45   0.00  43.86  26.40  77.39  89.16
## 5 108.58  42.19 109.88  43.86   0.00  19.10 114.34 125.00
## 6  95.72  26.25  96.80  26.40  19.10   0.00 101.55 112.87
## 7   5.83  75.66   5.10  77.39 114.34 101.55   0.00  12.17
## 8  17.72  87.21  17.26  89.16 125.00 112.87  12.17   0.00

dist() calculates the Euclidean distance by default. You can verify distances manually:

sqrt((49 - 45)^2 + (156 - 160)^2)
## [1] 5.656854

4. Hierarchical clustering (Single linkage)

The hclust() function performs hierarchical clustering. We specify method = "single" to use minimum inter-cluster distance.

hclust.single <- hclust(dist(data), method = "single")

# Structure of result
names(hclust.single)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"
hclust.single$height
## [1]  2.000000  5.099020  5.656854 12.165525 19.104973 26.248809 69.835521

$height gives the distances (linkage heights) at which clusters merge.

5. Dendrogram

plot(hclust.single, main = "Single Linkage Dendrogram",
     xlab = "Observation", ylab = "Distance")

This plot shows how clusters are formed iteratively based on minimum distance.

6. Manual first iteration

We can manually verify the algorithm. First, compute all pairwise distances and find the minimum.

dist <- dist(data, diag = TRUE)
round(dist, 2)
##        1      2      3      4      5      6      7      8
## 1   0.00                                                 
## 2  69.84   0.00                                          
## 3   2.00  70.83   0.00                                   
## 4  71.59   5.66  72.45   0.00                            
## 5 108.58  42.19 109.88  43.86   0.00                     
## 6  95.72  26.25  96.80  26.40  19.10   0.00              
## 7   5.83  75.66   5.10  77.39 114.34 101.55   0.00       
## 8  17.72  87.21  17.26  89.16 125.00 112.87  12.17   0.00
min(as.dist(dist)[as.dist(dist) != 0])  # smallest nonzero distance
## [1] 2

The smallest distance corresponds to the pair (1,3). These two points will be merged first.

7. Updating distances after merging (Clusters 1 & 3)

We manually update the distance matrix using the single linkage rule — distance between clusters = minimum pairwise distance.

old.dist <- as.matrix(dist)
new.dist <- old.dist

# Set intra-cluster distance to 0 for merged points
new.dist[1,3] <- new.dist[3,1] <- 0

# Update distances for cluster (1,3)
for (i in 1:8) {
  new.dist[1,i] <- new.dist[i,1] <- min(old.dist[1,i], old.dist[3,i])
  new.dist[3,i] <- new.dist[i,3] <- min(old.dist[1,i], old.dist[3,i])
}

# Verify matrix symmetry
all.equal(new.dist, t(new.dist))
## [1] TRUE
round(as.dist(new.dist), 2)
##        1      2      3      4      5      6      7
## 2  69.84                                          
## 3   0.00  69.84                                   
## 4  71.59   5.66  71.59                            
## 5 108.58  42.19 108.58  43.86                     
## 6  95.72  26.25  95.72  26.40  19.10              
## 7   5.10  75.66   5.10  77.39 114.34 101.55       
## 8  17.26  87.21  17.26  89.16 125.00 112.87  12.17

8. Next iteration (Merging 7 with cluster 1–3)

We continue by finding the next smallest distance and merging.

min(as.dist(new.dist)[as.dist(new.dist) != 0])  # 5.099
## [1] 5.09902
# Merge cluster (1,3) with point 7
new.dist2 <- new.dist
for (i in 1:8) {
  new.dist2[1,i] <- new.dist2[i,1] <- 
    min(old.dist[1,i], old.dist[3,i], old.dist[7,i])
}
all.equal(new.dist2, t(new.dist2))
## [1] TRUE
round(as.dist(new.dist2), 3)
##         1       2       3       4       5       6       7
## 2  69.836                                                
## 3   0.000  69.836                                        
## 4  71.589   5.657  71.589                                
## 5 108.577  42.190 108.577  43.863                        
## 6  95.718  26.249  95.718  26.401  19.105                
## 7   0.000  75.664   5.099  77.389 114.337 101.548        
## 8  12.166  87.207  17.263  89.157 125.000 112.872  12.166

9. Merge (2,4) cluster

Similarly, we merge points (2,4) and update distances.

min(as.dist(new.dist2)[as.dist(new.dist2) != 0])  # 5.657
## [1] 5.09902
new.dist3 <- new.dist2
for (i in 1:8) {
  new.dist3[2,i] <- new.dist3[i,2] <- 
    min(old.dist[2,i], old.dist[4,i])
}
round(as.dist(new.dist3), 3)
##         1       2       3       4       5       6       7
## 2  69.836                                                
## 3   0.000  70.831                                        
## 4  71.589   0.000  71.589                                
## 5 108.577  42.190 108.577  43.863                        
## 6  95.718  26.249  95.718  26.401  19.105                
## 7   0.000  75.664   5.099  77.389 114.337 101.548        
## 8  12.166  87.207  17.263  89.157 125.000 112.872  12.166

The smallest distance after this step corresponds to the next merge (8 with cluster 1,3,7).

10. Compare manual and automated results

hclust.single$height
## [1]  2.000000  5.099020  5.656854 12.165525 19.104973 26.248809 69.835521

The merge heights from hclust() match our manual computation: 2.00, 5.10, 5.66, 12.17, ...


11. Define clusters

We can “cut” the dendrogram at a specific number of clusters (k) or a given height (h).

# Cut into 2 clusters
cut.single <- cutree(hclust.single, k = 2)
cut.single
## [1] 1 2 1 2 2 2 1 1
# Alternatively, cut at height threshold
cut.single <- cutree(hclust.single, h = 20)
cut.single
## [1] 1 2 1 2 3 3 1 1

These assignments group samples based on single linkage distance.

Summary