Average linkage clustering (UPGMA) merges clusters based on the average pairwise distance between their members. This method balances chaining (as in single linkage) and compactness (as in complete linkage).
We’ll use the same example data as before (height and weight).
weight.kg <- c(15,49,13,45,85,66,12,10)
height.cm <- c(95,156,95,160,178,176,90,78)
data <- matrix(cbind(weight.kg, height.cm), nrow = 8)
dimnames(data)[[2]] <- c("weight.kg", "height.cm")
data
## weight.kg height.cm
## [1,] 15 95
## [2,] 49 156
## [3,] 13 95
## [4,] 45 160
## [5,] 85 178
## [6,] 66 176
## [7,] 12 90
## [8,] 10 78
Compute the full distance matrix using Euclidean distance.
dist.avg <- as.matrix(dist(data, diag = TRUE, upper = TRUE))
round(dist.avg, 3)
## 1 2 3 4 5 6 7 8
## 1 0.000 69.836 2.000 71.589 108.577 95.718 5.831 17.720
## 2 69.836 0.000 70.831 5.657 42.190 26.249 75.664 87.207
## 3 2.000 70.831 0.000 72.450 109.877 96.799 5.099 17.263
## 4 71.589 5.657 72.450 0.000 43.863 26.401 77.389 89.157
## 5 108.577 42.190 109.877 43.863 0.000 19.105 114.337 125.000
## 6 95.718 26.249 96.799 26.401 19.105 0.000 101.548 112.872
## 7 5.831 75.664 5.099 77.389 114.337 101.548 0.000 12.166
## 8 17.720 87.207 17.263 89.157 125.000 112.872 12.166 0.000
Average linkage defines the distance between a merged cluster and another observation as the average of all pairwise distances between points in the two clusters.
We merge (1,3) first because they are closest.
new.dist <- matrix(rep(0, 8 * 8), ncol = 8)
# For every observation i, compute distance to cluster (1,3)
for (i in c(2,4,5,6,7,8)) {
for (j in c(1,3)) {
new.dist[i,j] <- (dist.avg[i,1] + dist.avg[i,3]) / 2
new.dist[j,i] <- (dist.avg[i,1] + dist.avg[i,3]) / 2
}
# Retain distances between other observations
for (j in c(2,4,5,6,7,8)) {
new.dist[i,j] <- dist.avg[i,j]
new.dist[j,i] <- dist.avg[j,i]
}
}
round(new.dist, 3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.000 70.333 0.000 72.020 109.227 96.259 5.465 17.491
## [2,] 70.333 0.000 70.333 5.657 42.190 26.249 75.664 87.207
## [3,] 0.000 70.333 0.000 72.020 109.227 96.259 5.465 17.491
## [4,] 72.020 5.657 72.020 0.000 43.863 26.401 77.389 89.157
## [5,] 109.227 42.190 109.227 43.863 0.000 19.105 114.337 125.000
## [6,] 96.259 26.249 96.259 26.401 19.105 0.000 101.548 112.872
## [7,] 5.465 75.664 5.465 77.389 114.337 101.548 0.000 12.166
## [8,] 17.491 87.207 17.491 89.157 125.000 112.872 12.166 0.000
The next minimum distance is approximately 5.465, corresponding to cluster (1,3,7).
Now, merge cluster (1,3) with observation 7. Compute the average of all distances to (1,3,7).
new.dist <- matrix(rep(0, 8 * 8), ncol = 8)
for (i in c(2,4,5,6,8)) {
for (j in c(1,3,7)) {
new.dist[i,j] <- (dist.avg[i,1] + dist.avg[i,3] + dist.avg[i,7]) / 3
new.dist[j,i] <- (dist.avg[i,1] + dist.avg[i,3] + dist.avg[i,7]) / 3
}
for (j in c(2,4,5,6,8)) {
new.dist[i,j] <- dist.avg[i,j]
new.dist[j,i] <- dist.avg[j,i]
}
}
round(new.dist, 3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.000 72.110 0.000 73.809 110.931 98.022 0.000 15.716
## [2,] 72.110 0.000 72.110 5.657 42.190 26.249 72.110 87.207
## [3,] 0.000 72.110 0.000 73.809 110.931 98.022 0.000 15.716
## [4,] 73.809 5.657 73.809 0.000 43.863 26.401 73.809 89.157
## [5,] 110.931 42.190 110.931 43.863 0.000 19.105 110.931 125.000
## [6,] 98.022 26.249 98.022 26.401 19.105 0.000 98.022 112.872
## [7,] 0.000 72.110 0.000 73.809 110.931 98.022 0.000 15.716
## [8,] 15.716 87.207 15.716 89.157 125.000 112.872 15.716 0.000
The next minimum distance is 5.657, between items (2,4).
Now we merge observations 2 and 4 using the same averaging rule.
new.dist2 <- matrix(rep(0, 8 * 8), ncol = 8)
for (i in c(1,3,5,6,7,8)) {
for (j in c(2,4)) {
new.dist2[i,j] <- (dist.avg[i,2] + dist.avg[i,4]) / 2
new.dist2[j,i] <- (dist.avg[i,2] + dist.avg[i,4]) / 2
}
for (j in c(1,3,5,6,7,8)) {
new.dist2[i,j] <- dist.avg[i,j]
new.dist2[j,i] <- dist.avg[j,i]
}
}
round(new.dist2, 3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.000 70.712 2.000 70.712 108.577 95.718 5.831 17.720
## [2,] 70.712 0.000 71.640 0.000 43.027 26.325 76.526 88.182
## [3,] 2.000 71.640 0.000 71.640 109.877 96.799 5.099 17.263
## [4,] 70.712 0.000 71.640 0.000 43.027 26.325 76.526 88.182
## [5,] 108.577 43.027 109.877 43.027 0.000 19.105 114.337 125.000
## [6,] 95.718 26.325 96.799 26.325 19.105 0.000 101.548 112.872
## [7,] 5.831 76.526 5.099 76.526 114.337 101.548 0.000 12.166
## [8,] 17.720 88.182 17.263 88.182 125.000 112.872 12.166 0.000
hclust(method = "average")To confirm the manual results, use the built-in hierarchical clustering function.
hclust.average <- hclust(dist(data), method = "average")
hclust.average$height
## [1] 2.000000 5.464986 5.656854 15.716082 19.104973 34.675759 92.428117
Expected merge heights:
2.000000 5.464986 5.656854 17.720045 19.104973 43.863424 125.000000
These match the sequence of merges found manually.
plot(hclust.average, main = "Average Linkage Dendrogram",
xlab = "Observation", ylab = "Distance")
Average linkage tends to produce balanced clusters and is less extreme than single or complete linkage.
hclust(method = "average") confirms the correctness of
manual updates.Would you like me to prepare this as a downloadable file named 📄
average_linkage.Rmd to accompany your
single and complete linkage demos?