Introduction

Clustering validation evaluates whether the clusters an algorithm finds reflect genuine structure in the data rather than artifacts of the algorithm or the chosen number of clusters. This tutorial demonstrates a range of clustering validation metrics in R using several packages and the classic iris dataset.

Key concepts covered:

  • Internal validation measures (Silhouette, Dunn Index, Connectivity)
  • Determining optimal number of clusters
  • Stability measures
  • Visual validation methods

Load Required Packages

# Install packages if needed
# install.packages(c("cluster", "factoextra", "clValid", "NbClust", "ggplot2"))

library(cluster)      # For silhouette, pam, and clustering functions
library(factoextra)   # For visualization functions
library(clValid)      # For comprehensive validation
library(NbClust)      # For multiple indices at once
library(ggplot2)      # For additional plotting

Prepare Data

We’ll use the iris dataset, removing the Species column so the clustering is unsupervised, and standardizing the four measurements with scale() (zero mean, unit variance) so that no single variable dominates the distance calculations.

# Load and prepare data
data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Remove species column and scale the data
iris_data <- iris[, -5]
iris_scaled <- scale(iris_data)

# Check the scaled data
head(iris_scaled)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667

Part 1: Determining Optimal Number of Clusters

1.1 Elbow Method (Within-Cluster Sum of Squares)

The elbow method plots the within-cluster sum of squares (WSS) against the number of clusters. The “elbow” point indicates a good number of clusters.

# Elbow method using factoextra
fviz_nbclust(iris_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2, color = "red") +
  labs(title = "Elbow Method for Optimal k",
       subtitle = "Look for the 'elbow' in the curve",
       x = "Number of clusters (k)",
       y = "Total within-cluster sum of squares") +
  theme_minimal()

Interpretation: The total WSS decreases monotonically as k increases; look for the point where the rate of decrease levels off (the “elbow”). For the iris data this occurs around k = 3.
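
For reference, here is a minimal sketch of the quantity that fviz_nbclust(method = "wss") plots; kmeans() reports it as tot.withinss.

# Manual elbow computation: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  set.seed(123)
  kmeans(iris_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")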

1.2 Silhouette Method

The silhouette method measures how similar an observation is to its own cluster compared with the nearest other cluster. For observation i, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average distance to the other members of its own cluster and b(i) is the average distance to the members of the nearest other cluster. A higher average silhouette width indicates better clustering.

# Silhouette method
fviz_nbclust(iris_scaled, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for Optimal k",
       subtitle = "Higher average silhouette width = better clustering",
       x = "Number of clusters (k)",
       y = "Average Silhouette Width") +
  theme_minimal()

Interpretation: The peak of the curve indicates the optimal number of clusters. Silhouette values range from -1 to 1:

  • Close to 1: Well-clustered observations
  • Close to 0: Borderline between two clusters
  • Negative: Likely assigned to wrong cluster

1.3 Gap Statistic Method

The gap statistic compares the observed within-cluster variation with its expectation under a null reference distribution with no cluster structure: Gap(k) = E*[log(Wk)] − log(Wk), where Wk is the total within-cluster sum of squares for k clusters and E* is estimated from B Monte Carlo reference datasets.

# Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris_scaled, FUN = kmeans, nstart = 25, 
                    K.max = 10, B = 50)

# Print results
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = iris_scaled, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 3
##           logW   E.logW       gap     SE.sim
##  [1,] 4.534565 4.755428 0.2208634 0.02534324
##  [2,] 4.021316 4.490212 0.4688953 0.02670070
##  [3,] 3.806577 4.293793 0.4872159 0.02124741
##  [4,] 3.699263 4.140237 0.4409736 0.02177507
##  [5,] 3.589284 4.051459 0.4621749 0.01882154
##  [6,] 3.522810 3.975009 0.4521993 0.01753073
##  [7,] 3.448288 3.908834 0.4605460 0.01774025
##  [8,] 3.379870 3.852475 0.4726054 0.01727207
##  [9,] 3.310088 3.803931 0.4938436 0.01649671
## [10,] 3.278659 3.759003 0.4803440 0.01576050
# Visualize gap statistic
fviz_gap_stat(gap_stat) +
  labs(title = "Gap Statistic Method for Optimal k",
       subtitle = "Choose k where gap statistic is largest") +
  theme_minimal()

Interpretation: With method = "firstmax", the suggested k is the first local maximum of the gap statistic (k = 3 here), not necessarily the global maximum (which falls at k = 9 in the table above). Tibshirani et al. (2001) also propose a one-standard-error rule (method = "Tibs2001SEmax"): choose the smallest k such that Gap(k) ≥ Gap(k+1) − SE(k+1).

1.4 NbClust: Multiple Indices Simultaneously

NbClust provides 30 different indices for determining the optimal number of clusters in a single function call.

# Run NbClust with multiple indices
# Note: This may take some time as it computes 30 different indices
set.seed(123)
nb_result <- NbClust(data = iris_scaled, 
                     distance = "euclidean",
                     min.nc = 2, 
                     max.nc = 10, 
                     method = "kmeans", 
                     index = "all")
# View the optimal number suggested by different indices
cat("Optimal number of clusters based on majority rule:\n")
## Optimal number of clusters based on majority rule:
print(nb_result$Best.nc[1,])
##         KL         CH   Hartigan        CCC      Scott    Marriot     TrCovW 
##          3          2          3          3          3          3          4 
##     TraceW   Friedman      Rubin     Cindex         DB Silhouette       Duda 
##          3          6          3          3          2          2          2 
##   PseudoT2      Beale  Ratkowsky       Ball PtBiserial       Frey    McClain 
##          2          2          2          3          2          3          2 
##       Dunn     Hubert    SDindex     Dindex       SDbw 
##          2          0          2          0         10
# View how many indices proposed each k
table(nb_result$Best.nc[1,])
## 
##  0  2  3  4  6 10 
##  2 11 10  1  1  1

Interpretation: NbClust applies up to 30 indices and suggests the optimal k by majority rule, providing a comprehensive assessment across multiple validation criteria. Here k = 2 and k = 3 receive almost equal support (11 and 10 votes, respectively). Hubert and Dindex are graphical methods that do not vote for a specific k, which is why they appear as 0 in the table above.
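
A quick way to see the majority vote is a barplot of the table above (a small convenience plot, not part of NbClust itself):

# How many indices voted for each candidate k
barplot(table(nb_result$Best.nc[1, ]),
        xlab = "Number of clusters (k)",
        ylab = "Number of indices",
        main = "NbClust: votes per candidate k")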

Part 2: Detailed Silhouette Analysis

After determining optimal k, we perform detailed silhouette analysis.

# Perform k-means clustering with k=3
set.seed(123)
km_result <- kmeans(iris_scaled, centers = 3, nstart = 25)

# Compute silhouette information
sil <- silhouette(km_result$cluster, dist(iris_scaled))

# Summary of silhouette
summary(sil)
## Silhouette of 150 units in 3 clusters from silhouette.default(x = km_result$cluster, dist = dist(iris_scaled)) :
##  Cluster sizes and average silhouette widths:
##        50        53        47 
## 0.6363162 0.3933772 0.3473922 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.02489  0.35914  0.47113  0.45995  0.58883  0.73419
# Plot silhouette
fviz_silhouette(sil) +
  labs(title = "Silhouette Plot for k=3 Clusters",
       subtitle = "Width indicates clustering quality for each observation") +
  theme_minimal()
##   cluster size ave.sil.width
## 1       1   50          0.64
## 2       2   53          0.39
## 3       3   47          0.35
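
To make the silhouette formula concrete, here is a minimal sketch that recomputes the silhouette width of a single observation by hand; the result should agree with sil[1, "sil_width"] from silhouette() above.

# Recompute s(1) = (b(1) - a(1)) / max(a(1), b(1)) by hand
d_mat <- as.matrix(dist(iris_scaled))
i <- 1
own <- km_result$cluster == km_result$cluster[i]
a_i <- mean(d_mat[i, own & seq_along(own) != i])                   # mean distance within own cluster
b_i <- min(tapply(d_mat[i, !own], km_result$cluster[!own], mean))  # mean distance to nearest other cluster
(b_i - a_i) / max(a_i, b_i)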

Identify Poorly Clustered Observations

# Find observations with negative silhouette values
neg_sil_index <- which(sil[, "sil_width"] < 0)

if(length(neg_sil_index) > 0) {
  cat("Observations with negative silhouette (poorly clustered):\n")
  print(sil[neg_sil_index, , drop = FALSE])
  cat("\nThese", length(neg_sil_index), 
      "observations may be assigned to the wrong cluster.\n")
} else {
  cat("No observations with negative silhouette values.\n")
}
## Observations with negative silhouette (poorly clustered):
##      cluster neighbor   sil_width
## [1,]       3        2 -0.01058434
## [2,]       3        2 -0.02489394
## 
## These 2 observations may be assigned to the wrong cluster.
# Find observations with low silhouette (< 0.25)
low_sil_index <- which(sil[, "sil_width"] < 0.25 & sil[, "sil_width"] >= 0)
if(length(low_sil_index) > 0) {
  cat("\nObservations with low silhouette (< 0.25):\n")
  print(head(sil[low_sil_index, , drop = FALSE]))
  cat("\nThese observations are borderline between clusters.\n")
}
## 
## Observations with low silhouette (< 0.25):
##      cluster neighbor  sil_width
## [1,]       1        2 0.07766666
## [2,]       3        2 0.16627565
## [3,]       2        3 0.13270610
## [4,]       3        2 0.23206569
## [5,]       2        3 0.03768897
## [6,]       3        2 0.18906699
## 
## These observations are borderline between clusters.

Compare Different Numbers of Clusters

# Compare silhouettes for k = 2 to 6
par(mfrow = c(2, 3))
for(k in 2:6) {
  set.seed(123)
  km_temp <- kmeans(iris_scaled, centers = k, nstart = 25)
  sil_temp <- silhouette(km_temp$cluster, dist(iris_scaled))
  plot(sil_temp, main = paste("k =", k), 
       col = 2:(k+1), border = NA)
}
par(mfrow = c(1, 1))

Part 3: Comprehensive Validation with clValid

The clValid package provides internal, stability, and biological validation measures.

3.1 Internal Validation

Internal measures include Connectivity, Dunn Index, and Silhouette Width.

# Perform internal validation
# Compare multiple clustering methods and numbers of clusters
clmethods <- c("hierarchical", "kmeans", "pam")

intern <- clValid(iris_scaled, 
                  nClust = 2:6,
                  clMethods = clmethods,
                  validation = "internal")

# Summary of results
summary(intern)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                  2       3       4       5       6
##                                                                   
## hierarchical Connectivity   0.9762  5.5964  7.5492 18.0508 24.7306
##              Dunn           0.2674  0.1874  0.2060  0.0700  0.0762
##              Silhouette     0.5818  0.4803  0.4067  0.3746  0.3248
## kmeans       Connectivity   0.9762 23.8151 25.9183 40.3198 40.1524
##              Dunn           0.2674  0.0265  0.0700  0.0808  0.0808
##              Silhouette     0.5818  0.4599  0.4189  0.3455  0.3441
## pam          Connectivity   0.9762 23.0726 31.8067 35.7964 44.5413
##              Dunn           0.2674  0.0571  0.0566  0.0642  0.0361
##              Silhouette     0.5818  0.4566  0.4091  0.3574  0.3400
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 0.9762 hierarchical 2       
## Dunn         0.2674 hierarchical 2       
## Silhouette   0.5818 hierarchical 2
# Visualize internal measures
par(mfrow = c(3, 1))
plot(intern, legend = FALSE)

par(mfrow = c(1, 1))

# Get optimal scores
optimalScores(intern)
##                  Score       Method Clusters
## Connectivity 0.9761905 hierarchical        2
## Dunn         0.2674261 hierarchical        2
## Silhouette   0.5817500 hierarchical        2

Interpretation of Internal Measures:

  • Connectivity: Measures the degree to which observations are placed in the same cluster as their nearest neighbors; a sketch of the computation follows this list. Minimize this value (range: 0 to ∞).
  • Dunn Index: Ratio of smallest inter-cluster distance to largest intra-cluster distance. Maximize this value (range: 0 to ∞).
  • Silhouette Width: Average silhouette value across all observations. Maximize this value (range: -1 to 1).
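
To make the Connectivity definition concrete, here is a minimal sketch of the computation, assuming clValid's default neighborhood size of 10; clValid::connectivity() is the canonical implementation.

# Connectivity: penalize each observation whose j-th nearest neighbor
# (j = 1..L) lies in a different cluster, with weight 1/j
connectivity_manual <- function(dist_obj, clusters, L = 10) {
  d <- as.matrix(dist_obj)
  sum(sapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[2:(L + 1)]    # L nearest neighbors, excluding self
    sum(ifelse(clusters[nn] != clusters[i], 1 / seq_len(L), 0))
  }))
}
connectivity_manual(dist(iris_scaled), km_result$cluster)  # lower is better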

3.2 Stability Validation

Stability measures evaluate how consistent a clustering is when the data are perturbed; clValid removes one column (variable) at a time and compares the resulting clusters with those obtained from the full data.

# Perform stability validation
# This takes longer as it requires resampling
stab <- clValid(iris_scaled, 
                nClust = 2:6,
                clMethods = clmethods,
                validation = "stability")

# Summary
summary(stab)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                        2      3      4      5      6
##                                                     
## hierarchical APN  0.0033 0.0370 0.0859 0.1088 0.1342
##              AD   1.4924 1.4601 1.4015 1.3269 1.2251
##              ADM  0.0161 0.1451 0.1448 0.3919 0.3176
##              FOM  0.5957 0.5809 0.5571 0.5334 0.5101
## kmeans       APN  0.0128 0.1034 0.1290 0.2130 0.2950
##              AD   1.5060 1.2685 1.1568 1.1269 1.0949
##              ADM  0.0555 0.2152 0.2147 0.4499 0.5147
##              FOM  0.6049 0.5206 0.4888 0.4805 0.4910
## pam          APN  0.0128 0.1162 0.1420 0.1655 0.1886
##              AD   1.5060 1.2721 1.1665 1.0726 1.0043
##              ADM  0.0555 0.2266 0.2500 0.2834 0.2814
##              FOM  0.6049 0.5032 0.4828 0.4789 0.4558
## 
## Optimal Scores:
## 
##     Score  Method       Clusters
## APN 0.0033 hierarchical 2       
## AD  1.0043 pam          6       
## ADM 0.0161 hierarchical 2       
## FOM 0.4558 pam          6
# Optimal scores
optimalScores(stab)
##           Score       Method Clusters
## APN 0.003266667 hierarchical        2
## AD  1.004288856          pam        6
## ADM 0.016087089 hierarchical        2
## FOM 0.455750052          pam        6

Interpretation of Stability Measures:

  • APN (Average Proportion of Non-overlap): Proportion of observations placed in a different cluster when clustering on the data with one column removed. Minimize (range: 0 to 1).
  • AD (Average Distance): Average distance between observations placed in the same cluster under both the full data and the column-removed data. Minimize (range: 0 to ∞).
  • ADM (Average Distance between Means): Average distance between cluster centers for observations placed in the same cluster under both cases. Minimize (range: 0 to ∞).
  • FOM (Figure of Merit): Average intra-cluster variance of the removed column, with clustering based on the remaining columns. Minimize (range: 0 to ∞).

Part 4: Dunn Index Calculation

The Dunn Index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.

# Calculate Dunn Index for different k values
dunn_values <- sapply(2:10, function(k) {
  set.seed(123)
  km <- kmeans(iris_scaled, centers = k, nstart = 25)
  dunn(distance = dist(iris_scaled), clusters = km$cluster)
})

# Plot Dunn Index
dunn_df <- data.frame(k = 2:10, dunn = dunn_values)

ggplot(dunn_df, aes(x = k, y = dunn)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 3) +
  geom_vline(xintercept = which.max(dunn_values) + 1, 
             linetype = "dashed", color = "red") +
  labs(title = "Dunn Index for Different Number of Clusters",
       subtitle = "Higher values indicate better clustering",
       x = "Number of clusters (k)",
       y = "Dunn Index") +
  theme_minimal()

cat("Optimal k based on Dunn Index:", which.max(dunn_values) + 1, "\n")
## Optimal k based on Dunn Index: 2
cat("Maximum Dunn Index value:", max(dunn_values), "\n")
## Maximum Dunn Index value: 0.2674261
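
As a sanity check, the Dunn index for k = 2 can be recomputed directly from its definition; the result should match dunn_values[1] above.

# Dunn by hand: min inter-cluster distance / max intra-cluster distance
set.seed(123)
km2 <- kmeans(iris_scaled, centers = 2, nstart = 25)
d_mat <- as.matrix(dist(iris_scaled))
diff_cl <- outer(km2$cluster, km2$cluster, "!=")
min(d_mat[diff_cl]) / max(d_mat[!diff_cl])  # diagonal zeros do not affect the max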

Part 5: PAM Clustering with Validation

Partitioning Around Medoids (PAM) uses actual observations (medoids) as cluster centers and minimizes the sum of dissimilarities rather than the sum of squared distances, making it more robust to outliers than k-means.

# Perform PAM clustering
set.seed(123)
pam_result <- pam(iris_scaled, k = 3)

# Summary
summary(pam_result)
## Medoids:
##       ID Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   8   -1.0184372   0.7861738   -1.2791040  -1.3110521
## [2,] 113    1.1553023  -0.1315388    0.9868021   1.1816087
## [3,]  56   -0.1730941  -0.5903951    0.4203256   0.1320673
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3
##  [75] 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
## [112] 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 3 2
## [149] 2 3
## Objective function:
##     build      swap 
## 0.9205107 0.8757051 
## 
## Numerical information per cluster:
##      size max_diss   av_diss diameter separation
## [1,]   50 2.600722 0.8137338 5.034198  1.5533592
## [2,]   45 2.326061 0.9215472 3.537354  0.2875493
## [3,]   55 2.135226 0.8945355 3.565921  0.2875493
## 
## Isolated clusters:
##  L-clusters: character(0)
##  L*-clusters: character(0)
## 
## Silhouette plot information:
##     cluster neighbor    sil_width
## 1         1        3  0.732331909
## 41        1        3  0.731502867
## 8         1        3  0.729250202
## 18        1        3  0.726758490
## 5         1        3  0.726426711
## 40        1        3  0.723018364
## 38        1        3  0.722510100
## 12        1        3  0.720312003
## 28        1        3  0.719430309
## 29        1        3  0.712728182
## 50        1        3  0.709480595
## 27        1        3  0.702764618
## 25        1        3  0.699621242
## 7         1        3  0.696959631
## 23        1        3  0.694761433
## 22        1        3  0.687304888
## 49        1        3  0.686257475
## 3         1        3  0.676816831
## 47        1        3  0.675831828
## 36        1        3  0.675208799
## 20        1        3  0.674402267
## 11        1        3  0.671093249
## 30        1        3  0.666943215
## 48        1        3  0.665359771
## 37        1        3  0.662779322
## 44        1        3  0.660886620
## 21        1        3  0.657931899
## 32        1        3  0.645080275
## 45        1        3  0.644919666
## 43        1        3  0.636667019
## 24        1        3  0.635162616
## 10        1        3  0.631083664
## 35        1        3  0.627938856
## 31        1        3  0.625187352
## 4         1        3  0.620223527
## 17        1        3  0.617108039
## 6         1        3  0.605451110
## 33        1        3  0.585673488
## 19        1        3  0.581181517
## 13        1        3  0.578692634
## 2         1        3  0.568432481
## 46        1        3  0.559577438
## 39        1        3  0.553081401
## 14        1        3  0.549935134
## 15        1        3  0.548227538
## 26        1        3  0.545021794
## 34        1        3  0.536066950
## 9         1        3  0.489473466
## 16        1        3  0.452020027
## 42        1        3  0.087105075
## 121       2        3  0.549474840
## 144       2        3  0.542034392
## 140       2        3  0.536632666
## 103       2        3  0.528619213
## 125       2        3  0.514255167
## 126       2        3  0.512021709
## 142       2        3  0.511091107
## 141       2        3  0.509607781
## 113       2        3  0.502430899
## 145       2        3  0.500307464
## 106       2        3  0.480989126
## 136       2        3  0.478096351
## 146       2        3  0.463122794
## 110       2        3  0.461807801
## 108       2        3  0.451364289
## 105       2        3  0.437332298
## 130       2        3  0.436193424
## 116       2        3  0.430590190
## 131       2        3  0.424017001
## 111       2        3  0.420687967
## 101       2        3  0.414905109
## 123       2        3  0.412445224
## 118       2        3  0.406484273
## 137       2        3  0.404906392
## 132       2        3  0.393424798
## 148       2        3  0.366261248
## 149       2        3  0.362084673
## 119       2        3  0.345893722
## 53        2        3  0.339282258
## 117       2        3  0.332716128
## 138       2        3  0.330590766
## 78        2        3  0.323775785
## 51        2        3  0.319841375
## 133       2        3  0.250925584
## 87        2        3  0.248118057
## 129       2        3  0.227977193
## 57        2        3  0.170109727
## 66        2        3  0.153483839
## 104       2        3  0.131831019
## 52        2        3  0.107493689
## 109       2        3  0.037797063
## 77        2        3  0.031894097
## 76        2        3  0.013559254
## 112       2        3 -0.002669112
## 115       2        3 -0.080312684
## 95        3        2  0.582655152
## 91        3        2  0.578701704
## 90        3        2  0.577222969
## 93        3        2  0.574603266
## 70        3        2  0.574473478
## 83        3        2  0.568099242
## 100       3        2  0.558550244
## 81        3        2  0.556796287
## 80        3        2  0.552319144
## 68        3        2  0.549501434
## 82        3        2  0.547972120
## 56        3        2  0.545207379
## 60        3        2  0.543076360
## 54        3        2  0.531061917
## 97        3        2  0.514917805
## 65        3        2  0.507409811
## 72        3        2  0.467849831
## 89        3        2  0.462954353
## 107       3        2  0.462901473
## 63        3        2  0.456962461
## 96        3        2  0.454046401
## 74        3        2  0.436426281
## 85        3        2  0.427653666
## 94        3        1  0.425567639
## 99        3        1  0.421239774
## 67        3        2  0.420353214
## 61        3        1  0.411444151
## 88        3        2  0.403764529
## 79        3        2  0.399151916
## 120       3        2  0.398594873
## 58        3        1  0.397039293
## 84        3        2  0.390664770
## 69        3        2  0.385994051
## 62        3        2  0.375643074
## 114       3        2  0.364265434
## 98        3        2  0.362032693
## 64        3        2  0.357409637
## 73        3        2  0.354258583
## 135       3        2  0.339487300
## 102       3        2  0.339189836
## 143       3        2  0.339189836
## 122       3        2  0.320877439
## 92        3        2  0.282532101
## 75        3        2  0.243451149
## 147       3        2  0.211740024
## 134       3        2  0.209794644
## 127       3        2  0.187703672
## 124       3        2  0.175575066
## 55        3        2  0.146397276
## 150       3        2  0.145989432
## 139       3        2  0.140457286
## 59        3        2  0.062723857
## 128       3        2  0.047655569
## 71        3        2  0.007296921
## 86        3        2 -0.067854819
## Average silhouette width per cluster:
## [1] 0.6346397 0.3496332 0.3823817
## Average silhouette width of total data set:
## [1] 0.4566432
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
# Visualize PAM clustering
fviz_cluster(pam_result, 
             data = iris_scaled,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "PAM Clustering (k=3)")

# Silhouette plot for PAM
fviz_silhouette(pam_result) +
  labs(title = "Silhouette Plot for PAM Clustering (k=3)") +
  theme_minimal()
##   cluster size ave.sil.width
## 1       1   50          0.63
## 2       2   45          0.35
## 3       3   55          0.38

Part 6: Hierarchical Clustering with Validation

# Perform hierarchical clustering
set.seed(123)
hc_result <- hcut(iris_scaled, k = 3, hc_method = "ward.D2")

# Visualize dendrogram
fviz_dend(hc_result, 
          rect = TRUE,
          cex = 0.5,
          main = "Hierarchical Clustering Dendrogram (k=3)",
          k_colors = c("#E7B800", "#00AFBB", "#FC4E07"))

# Silhouette for hierarchical clustering
fviz_silhouette(hc_result) +
  labs(title = "Silhouette Plot for Hierarchical Clustering (k=3)") +
  theme_minimal()
##   cluster size ave.sil.width
## 1       1   49          0.63
## 2       2   30          0.44
## 3       3   71          0.32

Part 7: Cluster Visualization and Comparison

# Compare three clustering methods side by side
set.seed(123)

# K-means
km3 <- kmeans(iris_scaled, centers = 3, nstart = 25)
p1 <- fviz_cluster(list(data = iris_scaled, cluster = km3$cluster),
                   geom = "point",
                   ellipse.type = "convex",
                   palette = "jco",
                   main = "K-means Clustering") +
  theme_minimal()

# PAM
pam3 <- pam(iris_scaled, k = 3)
p2 <- fviz_cluster(pam3,
                   geom = "point",
                   ellipse.type = "convex",
                   palette = "jco",
                   main = "PAM Clustering") +
  theme_minimal()

# Hierarchical
hc3 <- hcut(iris_scaled, k = 3, hc_method = "ward.D2")
p3 <- fviz_cluster(list(data = iris_scaled, cluster = hc3$cluster),
                   geom = "point",
                   ellipse.type = "convex",
                   palette = "jco",
                   main = "Hierarchical Clustering") +
  theme_minimal()

# Arrange plots
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)

Part 8: Comparing Results to True Labels

Since iris has known species labels, we can compare clustering results.

# Compare k-means results to true species
table(Predicted = km3$cluster, Actual = iris$Species)
##          Actual
## Predicted setosa versicolor virginica
##         1     50          0         0
##         2      0         39        14
##         3      0         11        36
# Calculate adjusted Rand index
library(mclust)
adjustedRandIndex(km3$cluster, iris$Species)
## [1] 0.6201352
# For PAM
cat("\nAdjusted Rand Index for PAM:", 
    adjustedRandIndex(pam3$clustering, iris$Species), "\n")
## 
## Adjusted Rand Index for PAM: 0.6416027
# For Hierarchical
cat("Adjusted Rand Index for Hierarchical:", 
    adjustedRandIndex(hc3$cluster, iris$Species), "\n")
## Adjusted Rand Index for Hierarchical: 0.615323

Interpretation: The Adjusted Rand Index (ARI) measures agreement between two partitions, corrected for chance:

  • ARI = 1: Perfect agreement
  • ARI ≈ 0: Agreement no better than chance
  • ARI < 0: Worse than chance
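
The same function can also quantify agreement between two clustering solutions, which is useful when no ground-truth labels exist:

# Agreement between the k-means and PAM partitions (no true labels needed)
adjustedRandIndex(km3$cluster, pam3$clustering)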

Summary and Best Practices

Key Findings for Iris Dataset

# Create summary of all methods
summary_df <- data.frame(
  Method = c("Elbow", "Silhouette", "Gap Statistic", 
             "NbClust Majority", "Dunn Index", "clValid Internal"),
  Optimal_k = c(3, 2, 3, 3, which.max(dunn_values) + 1, 2),
  Notes = c("Visual inspection of elbow",
            "Maximum average silhouette",
            "First maximum of gap statistic",
            "Majority vote across 30 indices",
            "Maximum Dunn index",
            "Best connectivity & silhouette")
)

knitr::kable(summary_df, 
             caption = "Summary of Optimal k by Different Methods")
Summary of Optimal k by Different Methods

Method             Optimal_k   Notes
Elbow              3           Visual inspection of elbow
Silhouette         2           Maximum average silhouette
Gap Statistic      3           First maximum of gap statistic
NbClust Majority   3           Majority vote across 30 indices
Dunn Index         2           Maximum Dunn index
clValid Internal   2           Best connectivity & silhouette

Best Practices for Clustering Validation

  1. Use Multiple Validation Measures: Different measures may suggest different optimal k. Consider the consensus.

  2. Visual Inspection: Always visualize your clusters and silhouette plots to understand the clustering structure.

  3. Domain Knowledge: Statistical measures should be combined with domain expertise. The “optimal” k should make sense in your application context.

  4. Data Scaling: Always scale/standardize your data before clustering, especially when variables have different units.

  5. Stability Check: Use stability measures to ensure your clustering is robust to small perturbations.

  6. Interpretation Guidelines:

    • Average silhouette 0.71 – 1.00: Strong structure
    • Average silhouette 0.51 – 0.70: Reasonable structure
    • Average silhouette 0.26 – 0.50: Weak structure, possibly artificial
    • Average silhouette ≤ 0.25: No substantial structure
  7. Computational Considerations: For large datasets, use sampling or more scalable algorithms (e.g., CLARA instead of PAM; see the sketch below).
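
As a brief illustration of the CLARA approach from point 7: cluster::clara() applies PAM to repeated subsamples and scales to much larger datasets (iris is small enough that this is purely illustrative).

# CLARA: PAM on repeated subsamples; scales to large n
set.seed(123)
clara_result <- clara(iris_scaled, k = 3, samples = 50, pamLike = TRUE)
clara_result$silinfo$avg.width  # average silhouette width on the best subsample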

Session Information

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mclust_6.1.2     NbClust_3.0.1    clValid_0.7      factoextra_1.0.7
## [5] ggplot2_4.0.1    cluster_2.1.8.1 
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.5      sass_0.4.10        generics_0.1.4     tidyr_1.3.1       
##  [5] rstatix_0.7.2      class_7.3-23       digest_0.6.39      magrittr_2.0.4    
##  [9] evaluate_1.0.5     grid_4.5.1         RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] jsonlite_2.0.0     ggrepel_0.9.6      backports_1.5.0    Formula_1.2-5     
## [17] gridExtra_2.3      purrr_1.2.0        viridisLite_0.4.2  scales_1.4.0      
## [21] jquerylib_0.1.4    abind_1.4-8        cli_3.6.5          rlang_1.1.6       
## [25] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        tools_4.5.1       
## [29] ggsignif_0.6.4     dplyr_1.1.4        ggpubr_0.6.1       broom_1.0.9       
## [33] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4    car_3.1-3         
## [37] dendextend_1.19.1  pkgconfig_2.0.3    pillar_1.11.1      bslib_0.9.0       
## [41] gtable_0.3.6       glue_1.8.0         Rcpp_1.1.0         xfun_0.54         
## [45] tibble_3.3.0       tidyselect_1.2.1   rstudioapi_0.17.1  knitr_1.50        
## [49] dichromat_2.0-0.1  farver_2.1.2       htmltools_0.5.8.1  ggsci_3.2.0       
## [53] labeling_0.4.3     rmarkdown_2.30     carData_3.0-5      compiler_4.5.1    
## [57] S7_0.2.1

References

  1. Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.

  2. Dunn, J.C. (1974). Well separated clusters and fuzzy partitions. Journal on Cybernetics, 4, 95-104.

  3. Brock, G., Pihur, V., Datta, S., & Datta, S. (2008). clValid: An R package for cluster validation. Journal of Statistical Software, 25(4), 1-22.

  4. Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1-36.

  5. Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411-423.