Clustering validation is essential for evaluating the quality of
clustering results. This tutorial demonstrates various clustering
validation metrics in R using multiple packages and the famous
iris dataset.
Key concepts covered:
- Determining the optimal number of clusters (elbow, silhouette, and gap statistic methods)
- Computing 30 validation indices at once with NbClust
- Internal and stability validation with clValid
- Detailed silhouette analysis and the Dunn index
- Comparing k-means, PAM, and hierarchical clustering
- External validation against known labels with the adjusted Rand index
# Install packages if needed
# install.packages(c("cluster", "factoextra", "clValid", "NbClust", "ggplot2"))
library(cluster) # For silhouette, pam, and clustering functions
library(factoextra) # For visualization functions
library(clValid) # For comprehensive validation
library(NbClust) # For multiple indices at once
library(ggplot2) # For additional plotting
We’ll use the iris dataset, removing the species column for unsupervised clustering.
# Load and prepare data
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Remove species column and scale the data
iris_data <- iris[, -5]
iris_scaled <- scale(iris_data)
# Check the scaled data
head(iris_scaled)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] -0.8976739 1.01560199 -1.335752 -1.311052
## [2,] -1.1392005 -0.13153881 -1.335752 -1.311052
## [3,] -1.3807271 0.32731751 -1.392399 -1.311052
## [4,] -1.5014904 0.09788935 -1.279104 -1.311052
## [5,] -1.0184372 1.24503015 -1.335752 -1.311052
## [6,] -0.5353840 1.93331463 -1.165809 -1.048667
The elbow method plots the within-cluster sum of squares (WSS) against the number of clusters. The “elbow” point indicates a good number of clusters.
# Elbow method using factoextra
fviz_nbclust(iris_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2, color = "red") +
  labs(title = "Elbow Method for Optimal k",
       subtitle = "Look for the 'elbow' in the curve",
       x = "Number of clusters (k)",
       y = "Total within-cluster sum of squares") +
  theme_minimal()
Interpretation: The total WSS decreases as k increases. Look for the point where adding another cluster no longer produces a substantial drop (the “elbow”). For the iris data, this occurs around k = 3.
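If you prefer to see the numbers directly, here is a minimal base-R sketch of what fviz_nbclust(..., method = "wss") computes: the tot.withinss returned by kmeans() for each candidate k.
# Total within-cluster sum of squares for k = 1..10 (sketch)
wss <- sapply(1:10, function(k) {
  set.seed(123)
  kmeans(iris_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")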
The silhouette method measures how similar an object is to its own cluster compared to other clusters. Higher average silhouette width indicates better clustering.
# Silhouette method
fviz_nbclust(iris_scaled, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for Optimal k",
       subtitle = "Higher average silhouette width = better clustering",
       x = "Number of clusters (k)",
       y = "Average Silhouette Width") +
  theme_minimal()
Interpretation: The peak of the curve indicates the optimal number of clusters. Silhouette values range from -1 to 1:
- Values near 1: the observation is well matched to its own cluster
- Values near 0: the observation lies on the boundary between two clusters
- Negative values: the observation may be assigned to the wrong cluster
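The quantity being maximized can also be computed directly with cluster::silhouette(); this sketch reproduces the average silhouette width for each k.
# Average silhouette width for k = 2..10 (sketch)
avg_sil <- sapply(2:10, function(k) {
  set.seed(123)
  km <- kmeans(iris_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(iris_scaled))[, "sil_width"])
})
round(setNames(avg_sil, paste0("k=", 2:10)), 3)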
The gap statistic compares the within-cluster variation to a null reference distribution of the data.
# Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris_scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
# Print results
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = iris_scaled, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 3
## logW E.logW gap SE.sim
## [1,] 4.534565 4.755428 0.2208634 0.02534324
## [2,] 4.021316 4.490212 0.4688953 0.02670070
## [3,] 3.806577 4.293793 0.4872159 0.02124741
## [4,] 3.699263 4.140237 0.4409736 0.02177507
## [5,] 3.589284 4.051459 0.4621749 0.01882154
## [6,] 3.522810 3.975009 0.4521993 0.01753073
## [7,] 3.448288 3.908834 0.4605460 0.01774025
## [8,] 3.379870 3.852475 0.4726054 0.01727207
## [9,] 3.310088 3.803931 0.4938436 0.01649671
## [10,] 3.278659 3.759003 0.4803440 0.01576050
# Visualize gap statistic
fviz_gap_stat(gap_stat) +
  labs(title = "Gap Statistic Method for Optimal k",
       subtitle = "Choose k where gap statistic is largest") +
  theme_minimal()
Interpretation: The optimal number of clusters is where the gap statistic is maximized. The method compares the observed within-cluster variation to that expected under a null reference distribution.
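Besides reading the maximum off the plot, cluster::maxSE() applies formal selection rules to the columns of gap_stat$Tab, including the one-standard-error rule of Tibshirani et al.; note that different rules ("firstmax", "Tibs2001SEmax", "globalmax", ...) can pick different k.
# Apply the "first k with gap(k) >= gap(k+1) - SE(k+1)" rule
with(as.data.frame(gap_stat$Tab),
     maxSE(f = gap, SE.f = SE.sim, method = "Tibs2001SEmax"))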
NbClust computes 30 different indices for determining the optimal number of clusters in a single function call.
# Run NbClust with multiple indices
# Note: This may take some time as it computes 30 different indices
set.seed(123)
nb_result <- NbClust(data = iris_scaled,
                     distance = "euclidean",
                     min.nc = 2,
                     max.nc = 10,
                     method = "kmeans",
                     index = "all")
# View the optimal number suggested by different indices
cat("Optimal number of clusters based on majority rule:\n")
## Optimal number of clusters based on majority rule:
print(nb_result$Best.nc[1,])
## KL CH Hartigan CCC Scott Marriot TrCovW
## 3 2 3 3 3 3 4
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## 3 6 3 3 2 2 2
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## 2 2 2 3 2 3 2
## Dunn Hubert SDindex Dindex SDbw
## 2 0 2 0 10
# View how many indices proposed each k
table(nb_result$Best.nc[1,])
##
## 0 2 3 4 6 10
## 2 11 10 1 1 1
Interpretation: NbClust applies 30 different indices and recommends k by majority rule. Note that the vote here is nearly split: 11 indices favor k = 2 and 10 favor k = 3 (Hubert and Dindex are graphical methods, so they report 0 rather than a vote). A near-tie like this is a reminder to weigh the majority vote against the other validation methods rather than treating it as definitive.
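A quick bar chart of the vote table makes the near-tie easy to see:
# Visualize how many indices voted for each k
barplot(table(nb_result$Best.nc[1, ]),
        xlab = "Number of clusters (k)",
        ylab = "Number of indices",
        main = "NbClust votes per k")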
After determining optimal k, we perform detailed silhouette analysis.
# Perform k-means clustering with k=3
set.seed(123)
km_result <- kmeans(iris_scaled, centers = 3, nstart = 25)
# Compute silhouette information
sil <- silhouette(km_result$cluster, dist(iris_scaled))
# Summary of silhouette
summary(sil)
## Silhouette of 150 units in 3 clusters from silhouette.default(x = km_result$cluster, dist = dist(iris_scaled)) :
## Cluster sizes and average silhouette widths:
## 50 53 47
## 0.6363162 0.3933772 0.3473922
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.02489 0.35914 0.47113 0.45995 0.58883 0.73419
# Plot silhouette
fviz_silhouette(sil) +
  labs(title = "Silhouette Plot for k=3 Clusters",
       subtitle = "Width indicates clustering quality for each observation") +
  theme_minimal()
## cluster size ave.sil.width
## 1 1 50 0.64
## 2 2 53 0.39
## 3 3 47 0.35
# Find observations with negative silhouette values
neg_sil_index <- which(sil[, "sil_width"] < 0)
if (length(neg_sil_index) > 0) {
  cat("Observations with negative silhouette (poorly clustered):\n")
  print(sil[neg_sil_index, , drop = FALSE])
  cat("\nThese", length(neg_sil_index),
      "observations may be assigned to the wrong cluster.\n")
} else {
  cat("No observations with negative silhouette values.\n")
}
## Observations with negative silhouette (poorly clustered):
## cluster neighbor sil_width
## [1,] 3 2 -0.01058434
## [2,] 3 2 -0.02489394
##
## These 2 observations may be assigned to the wrong cluster.
# Find observations with low silhouette (< 0.25)
low_sil_index <- which(sil[, "sil_width"] < 0.25 & sil[, "sil_width"] >= 0)
if (length(low_sil_index) > 0) {
  cat("\nObservations with low silhouette (< 0.25):\n")
  print(head(sil[low_sil_index, , drop = FALSE]))
  cat("\nThese observations are borderline between clusters.\n")
}
##
## Observations with low silhouette (< 0.25):
## cluster neighbor sil_width
## [1,] 1 2 0.07766666
## [2,] 3 2 0.16627565
## [3,] 2 3 0.13270610
## [4,] 3 2 0.23206569
## [5,] 2 3 0.03768897
## [6,] 3 2 0.18906699
##
## These observations are borderline between clusters.
# Compare silhouettes for k = 2 to 6
par(mfrow = c(2, 3))
for (k in 2:6) {
  set.seed(123)
  km_temp <- kmeans(iris_scaled, centers = k, nstart = 25)
  sil_temp <- silhouette(km_temp$cluster, dist(iris_scaled))
  plot(sil_temp, main = paste("k =", k),
       col = 2:(k + 1), border = NA)
}
par(mfrow = c(1, 1))
The clValid package provides internal, stability, and biological validation measures.
Internal measures include Connectivity, Dunn Index, and Silhouette Width.
# Perform internal validation
# Compare multiple clustering methods and numbers of clusters
clmethods <- c("hierarchical", "kmeans", "pam")
intern <- clValid(iris_scaled,
                  nClust = 2:6,
                  clMethods = clmethods,
                  validation = "internal")
# Summary of results
summary(intern)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 0.9762 5.5964 7.5492 18.0508 24.7306
## Dunn 0.2674 0.1874 0.2060 0.0700 0.0762
## Silhouette 0.5818 0.4803 0.4067 0.3746 0.3248
## kmeans Connectivity 0.9762 23.8151 25.9183 40.3198 40.1524
## Dunn 0.2674 0.0265 0.0700 0.0808 0.0808
## Silhouette 0.5818 0.4599 0.4189 0.3455 0.3441
## pam Connectivity 0.9762 23.0726 31.8067 35.7964 44.5413
## Dunn 0.2674 0.0571 0.0566 0.0642 0.0361
## Silhouette 0.5818 0.4566 0.4091 0.3574 0.3400
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 0.9762 hierarchical 2
## Dunn 0.2674 hierarchical 2
## Silhouette 0.5818 hierarchical 2
# Visualize internal measures
par(mfrow = c(3, 1))
plot(intern, legend = FALSE)
par(mfrow = c(1, 1))
# Get optimal scores
optimalScores(intern)
## Score Method Clusters
## Connectivity 0.9761905 hierarchical 2
## Dunn 0.2674261 hierarchical 2
## Silhouette 0.5817500 hierarchical 2
Interpretation of Internal Measures:
- Connectivity: measures the extent to which observations are placed in the same cluster as their nearest neighbors; ranges from 0 to infinity and should be minimized
- Dunn Index: ratio of the smallest between-cluster distance to the largest within-cluster diameter; ranges from 0 to infinity and should be maximized
- Silhouette Width: average silhouette value across observations; ranges from -1 to 1 and should be maximized
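These measures can also be computed one at a time; clValid exports connectivity() and dunn() directly. As a quick check on the earlier k-means solution:
# Connectivity for the k = 3 k-means solution (lower is better)
connectivity(distance = dist(iris_scaled),
             clusters = km_result$cluster,
             neighbSize = 10)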
Stability measures evaluate how consistent the clustering is when the data are perturbed by removing one column (variable) at a time.
# Perform stability validation
# This takes longer because the data are reclustered with each column removed
stab <- clValid(iris_scaled,
                nClust = 2:6,
                clMethods = clmethods,
                validation = "stability")
# Summary
summary(stab)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical APN 0.0033 0.0370 0.0859 0.1088 0.1342
## AD 1.4924 1.4601 1.4015 1.3269 1.2251
## ADM 0.0161 0.1451 0.1448 0.3919 0.3176
## FOM 0.5957 0.5809 0.5571 0.5334 0.5101
## kmeans APN 0.0128 0.1034 0.1290 0.2130 0.2950
## AD 1.5060 1.2685 1.1568 1.1269 1.0949
## ADM 0.0555 0.2152 0.2147 0.4499 0.5147
## FOM 0.6049 0.5206 0.4888 0.4805 0.4910
## pam APN 0.0128 0.1162 0.1420 0.1655 0.1886
## AD 1.5060 1.2721 1.1665 1.0726 1.0043
## ADM 0.0555 0.2266 0.2500 0.2834 0.2814
## FOM 0.6049 0.5032 0.4828 0.4789 0.4558
##
## Optimal Scores:
##
## Score Method Clusters
## APN 0.0033 hierarchical 2
## AD 1.0043 pam 6
## ADM 0.0161 hierarchical 2
## FOM 0.4558 pam 6
# Optimal scores
optimalScores(stab)
## Score Method Clusters
## APN 0.003266667 hierarchical 2
## AD 1.004288856 pam 6
## ADM 0.016087089 hierarchical 2
## FOM 0.455750052 pam 6
Interpretation of Stability Measures (all should be minimized):
- APN (Average Proportion of Non-overlap): proportion of observations placed in a different cluster when a column is removed; ranges from 0 to 1
- AD (Average Distance): average distance between observations placed in the same cluster by the full-data and the column-removed clusterings
- ADM (Average Distance between Means): average distance between the cluster centers of observations placed in the same cluster under both clusterings
- FOM (Figure of Merit): average intra-cluster variance of the removed column, with clustering based on the remaining columns
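The following hand-rolled sketch (a simplified illustration, not clValid's exact computation) shows the idea behind these measures: drop one variable, recluster, and cross-tabulate the labels.
# Illustrative column-removal stability check (simplified sketch)
set.seed(123)
full_cl <- kmeans(iris_scaled, centers = 3, nstart = 25)$cluster
set.seed(123)
drop1_cl <- kmeans(iris_scaled[, -1], centers = 3, nstart = 25)$cluster
# A stable clustering concentrates most of each row's mass in one cell
table(full = full_cl, without_col1 = drop1_cl)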
The Dunn Index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
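To make the definition concrete, here is a from-scratch version (dunn_manual is a helper written for this tutorial, not a package function); clValid's dunn(), used in the next chunk, computes the same quantity.
# Dunn index from the definition:
# min between-cluster distance / max within-cluster diameter (sketch)
dunn_manual <- function(d, cl) {
  d <- as.matrix(d)
  ks <- unique(cl)
  sep <- min(sapply(ks, function(i)
    sapply(setdiff(ks, i), function(j) min(d[cl == i, cl == j]))))
  diam <- max(sapply(ks, function(i) max(d[cl == i, cl == i])))
  sep / diam
}
dunn_manual(dist(iris_scaled), km_result$cluster)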
# Calculate Dunn Index for different k values
dunn_values <- sapply(2:10, function(k) {
  set.seed(123)
  km <- kmeans(iris_scaled, centers = k, nstart = 25)
  dunn(distance = dist(iris_scaled), clusters = km$cluster)
})
# Plot Dunn Index
dunn_df <- data.frame(k = 2:10, dunn = dunn_values)
ggplot(dunn_df, aes(x = k, y = dunn)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 3) +
  geom_vline(xintercept = which.max(dunn_values) + 1,
             linetype = "dashed", color = "red") +
  labs(title = "Dunn Index for Different Number of Clusters",
       subtitle = "Higher values indicate better clustering",
       x = "Number of clusters (k)",
       y = "Dunn Index") +
  theme_minimal()
cat("Optimal k based on Dunn Index:", which.max(dunn_values) + 1, "\n")
## Optimal k based on Dunn Index: 2
cat("Maximum Dunn Index value:", max(dunn_values), "\n")
## Maximum Dunn Index value: 0.2674261
Partitioning Around Medoids (PAM) is more robust to outliers than k-means.
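A small experiment (a sketch, not part of the original analysis) shows why: PAM's medoids must be actual observations, so a single extreme point rarely becomes one, while k-means centroids are free to move toward it.
# Add one extreme point and compare how the two methods react (sketch)
set.seed(123)
with_outlier <- rbind(iris_scaled, outlier = rep(10, 4))
kmeans(with_outlier, centers = 3, nstart = 25)$centers # centroids react
pam(with_outlier, k = 3)$medoids # medoids remain actual observations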
# Perform PAM clustering
set.seed(123)
pam_result <- pam(iris_scaled, k = 3)
# Summary
summary(pam_result)
## Medoids:
## ID Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 8 -1.0184372 0.7861738 -1.2791040 -1.3110521
## [2,] 113 1.1553023 -0.1315388 0.9868021 1.1816087
## [3,] 56 -0.1730941 -0.5903951 0.4203256 0.1320673
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3
## [75] 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
## [112] 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 3 2
## [149] 2 3
## Objective function:
## build swap
## 0.9205107 0.8757051
##
## Numerical information per cluster:
## size max_diss av_diss diameter separation
## [1,] 50 2.600722 0.8137338 5.034198 1.5533592
## [2,] 45 2.326061 0.9215472 3.537354 0.2875493
## [3,] 55 2.135226 0.8945355 3.565921 0.2875493
##
## Isolated clusters:
## L-clusters: character(0)
## L*-clusters: character(0)
##
## Silhouette plot information:
## cluster neighbor sil_width
## 1 1 3 0.732331909
## 41 1 3 0.731502867
## 8 1 3 0.729250202
## 18 1 3 0.726758490
## 5 1 3 0.726426711
## 40 1 3 0.723018364
## 38 1 3 0.722510100
## 12 1 3 0.720312003
## 28 1 3 0.719430309
## 29 1 3 0.712728182
## 50 1 3 0.709480595
## 27 1 3 0.702764618
## 25 1 3 0.699621242
## 7 1 3 0.696959631
## 23 1 3 0.694761433
## 22 1 3 0.687304888
## 49 1 3 0.686257475
## 3 1 3 0.676816831
## 47 1 3 0.675831828
## 36 1 3 0.675208799
## 20 1 3 0.674402267
## 11 1 3 0.671093249
## 30 1 3 0.666943215
## 48 1 3 0.665359771
## 37 1 3 0.662779322
## 44 1 3 0.660886620
## 21 1 3 0.657931899
## 32 1 3 0.645080275
## 45 1 3 0.644919666
## 43 1 3 0.636667019
## 24 1 3 0.635162616
## 10 1 3 0.631083664
## 35 1 3 0.627938856
## 31 1 3 0.625187352
## 4 1 3 0.620223527
## 17 1 3 0.617108039
## 6 1 3 0.605451110
## 33 1 3 0.585673488
## 19 1 3 0.581181517
## 13 1 3 0.578692634
## 2 1 3 0.568432481
## 46 1 3 0.559577438
## 39 1 3 0.553081401
## 14 1 3 0.549935134
## 15 1 3 0.548227538
## 26 1 3 0.545021794
## 34 1 3 0.536066950
## 9 1 3 0.489473466
## 16 1 3 0.452020027
## 42 1 3 0.087105075
## 121 2 3 0.549474840
## 144 2 3 0.542034392
## 140 2 3 0.536632666
## 103 2 3 0.528619213
## 125 2 3 0.514255167
## 126 2 3 0.512021709
## 142 2 3 0.511091107
## 141 2 3 0.509607781
## 113 2 3 0.502430899
## 145 2 3 0.500307464
## 106 2 3 0.480989126
## 136 2 3 0.478096351
## 146 2 3 0.463122794
## 110 2 3 0.461807801
## 108 2 3 0.451364289
## 105 2 3 0.437332298
## 130 2 3 0.436193424
## 116 2 3 0.430590190
## 131 2 3 0.424017001
## 111 2 3 0.420687967
## 101 2 3 0.414905109
## 123 2 3 0.412445224
## 118 2 3 0.406484273
## 137 2 3 0.404906392
## 132 2 3 0.393424798
## 148 2 3 0.366261248
## 149 2 3 0.362084673
## 119 2 3 0.345893722
## 53 2 3 0.339282258
## 117 2 3 0.332716128
## 138 2 3 0.330590766
## 78 2 3 0.323775785
## 51 2 3 0.319841375
## 133 2 3 0.250925584
## 87 2 3 0.248118057
## 129 2 3 0.227977193
## 57 2 3 0.170109727
## 66 2 3 0.153483839
## 104 2 3 0.131831019
## 52 2 3 0.107493689
## 109 2 3 0.037797063
## 77 2 3 0.031894097
## 76 2 3 0.013559254
## 112 2 3 -0.002669112
## 115 2 3 -0.080312684
## 95 3 2 0.582655152
## 91 3 2 0.578701704
## 90 3 2 0.577222969
## 93 3 2 0.574603266
## 70 3 2 0.574473478
## 83 3 2 0.568099242
## 100 3 2 0.558550244
## 81 3 2 0.556796287
## 80 3 2 0.552319144
## 68 3 2 0.549501434
## 82 3 2 0.547972120
## 56 3 2 0.545207379
## 60 3 2 0.543076360
## 54 3 2 0.531061917
## 97 3 2 0.514917805
## 65 3 2 0.507409811
## 72 3 2 0.467849831
## 89 3 2 0.462954353
## 107 3 2 0.462901473
## 63 3 2 0.456962461
## 96 3 2 0.454046401
## 74 3 2 0.436426281
## 85 3 2 0.427653666
## 94 3 1 0.425567639
## 99 3 1 0.421239774
## 67 3 2 0.420353214
## 61 3 1 0.411444151
## 88 3 2 0.403764529
## 79 3 2 0.399151916
## 120 3 2 0.398594873
## 58 3 1 0.397039293
## 84 3 2 0.390664770
## 69 3 2 0.385994051
## 62 3 2 0.375643074
## 114 3 2 0.364265434
## 98 3 2 0.362032693
## 64 3 2 0.357409637
## 73 3 2 0.354258583
## 135 3 2 0.339487300
## 102 3 2 0.339189836
## 143 3 2 0.339189836
## 122 3 2 0.320877439
## 92 3 2 0.282532101
## 75 3 2 0.243451149
## 147 3 2 0.211740024
## 134 3 2 0.209794644
## 127 3 2 0.187703672
## 124 3 2 0.175575066
## 55 3 2 0.146397276
## 150 3 2 0.145989432
## 139 3 2 0.140457286
## 59 3 2 0.062723857
## 128 3 2 0.047655569
## 71 3 2 0.007296921
## 86 3 2 -0.067854819
## Average silhouette width per cluster:
## [1] 0.6346397 0.3496332 0.3823817
## Average silhouette width of total data set:
## [1] 0.4566432
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
# Visualize PAM clustering
fviz_cluster(pam_result,
             data = iris_scaled,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "PAM Clustering (k=3)")
# Silhouette plot for PAM
fviz_silhouette(pam_result) +
  labs(title = "Silhouette Plot for PAM Clustering (k=3)") +
  theme_minimal()
## cluster size ave.sil.width
## 1 1 50 0.63
## 2 2 45 0.35
## 3 3 55 0.38
# Perform hierarchical clustering
set.seed(123)
hc_result <- hcut(iris_scaled, k = 3, hc_method = "ward.D2")
# Visualize dendrogram
fviz_dend(hc_result,
          rect = TRUE,
          cex = 0.5,
          main = "Hierarchical Clustering Dendrogram (k=3)",
          k_colors = c("#E7B800", "#00AFBB", "#FC4E07"))
# Silhouette for hierarchical clustering
fviz_silhouette(hc_result) +
  labs(title = "Silhouette Plot for Hierarchical Clustering (k=3)") +
  theme_minimal()
## cluster size ave.sil.width
## 1 1 49 0.63
## 2 2 30 0.44
## 3 3 71 0.32
# Compare three clustering methods side by side
set.seed(123)
# K-means
km3 <- kmeans(iris_scaled, centers = 3, nstart = 25)
p1 <- fviz_cluster(list(data = iris_scaled, cluster = km3$cluster),
                   geom = "point",
                   ellipse.type = "convex",
                   palette = "jco",
                   main = "K-means Clustering") +
  theme_minimal()
# PAM
pam3 <- pam(iris_scaled, k = 3)
p2 <- fviz_cluster(pam3,
                   geom = "point",
                   ellipse.type = "convex",
                   palette = "jco",
                   main = "PAM Clustering") +
  theme_minimal()
# Hierarchical
hc3 <- hcut(iris_scaled, k = 3, hc_method = "ward.D2")
p3 <- fviz_cluster(list(data = iris_scaled, cluster = hc3$cluster),
                   geom = "point",
                   ellipse.type = "convex",
                   palette = "jco",
                   main = "Hierarchical Clustering") +
  theme_minimal()
# Arrange plots
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
Since iris includes known species labels, we can externally validate the clustering results against them.
# Compare k-means results to true species
table(Predicted = km3$cluster, Actual = iris$Species)
## Actual
## Predicted setosa versicolor virginica
## 1 50 0 0
## 2 0 39 14
## 3 0 11 36
# Calculate adjusted Rand index
library(mclust)
adjustedRandIndex(km3$cluster, iris$Species)
## [1] 0.6201352
# For PAM
cat("\nAdjusted Rand Index for PAM:",
adjustedRandIndex(pam3$clustering, iris$Species), "\n")
##
## Adjusted Rand Index for PAM: 0.6416027
# For Hierarchical
cat("Adjusted Rand Index for Hierarchical:",
adjustedRandIndex(hc3$cluster, iris$Species), "\n")
## Adjusted Rand Index for Hierarchical: 0.615323
Interpretation: The Adjusted Rand Index (ARI) measures agreement between two clusterings, corrected for chance:
- ARI = 1: perfect agreement
- ARI ≈ 0: agreement no better than expected by chance
- ARI < 0: agreement worse than chance
The values of roughly 0.62-0.64 here indicate substantial, though imperfect, recovery of the species structure; as the confusion table shows, essentially all of the disagreement is between versicolor and virginica.
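As a complementary external measure, cluster purity (the fraction of observations falling in the majority class of their cluster) is easy to compute by hand; purity() here is a small helper written for this tutorial.
# Cluster purity: fraction of points in the majority species of their cluster
purity <- function(cluster, truth) {
  sum(apply(table(cluster, truth), 1, max)) / length(truth)
}
purity(km3$cluster, iris$Species)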
# Create summary of all methods
summary_df <- data.frame(
  Method = c("Elbow", "Silhouette", "Gap Statistic",
             "NbClust Majority", "Dunn Index", "clValid Internal"),
  Optimal_k = c(3, 2, 3, 2, which.max(dunn_values) + 1, 2),
  Notes = c("Visual inspection of elbow",
            "Maximum average silhouette",
            "First maximum of gap statistic",
            "Narrow majority (11 votes vs 10 for k = 3)",
            "Maximum Dunn index",
            "Connectivity, Dunn & silhouette all best at k = 2")
)
knitr::kable(summary_df,
             caption = "Summary of Optimal k by Different Methods")
| Method | Optimal_k | Notes |
|---|---|---|
| Elbow | 3 | Visual inspection of elbow |
| Silhouette | 2 | Maximum average silhouette |
| Gap Statistic | 3 | First maximum of gap statistic |
| NbClust Majority | 2 | Narrow majority (11 votes vs 10 for k = 3) |
| Dunn Index | 2 | Maximum Dunn index |
| clValid Internal | 2 | Connectivity, Dunn & silhouette all best at k = 2 |
Use Multiple Validation Measures: Different measures may suggest different optimal k. Consider the consensus.
Visual Inspection: Always visualize your clusters and silhouette plots to understand the clustering structure.
Domain Knowledge: Statistical measures should be combined with domain expertise. The “optimal” k should make sense in your application context.
Data Scaling: Always scale/standardize your data before clustering, especially when variables have different units.
Stability Check: Use stability measures to ensure your clustering is robust to small perturbations.
Interpretation Guidelines: For average silhouette width, a common rule of thumb is that values above 0.70 indicate strong structure, 0.50-0.70 reasonable structure, 0.25-0.50 weak structure that may be artificial, and below 0.25 no substantial structure. The Dunn index and connectivity have no absolute thresholds; compare them across candidate solutions.
Computational Considerations: For large datasets, use sampling or efficient algorithms (CLARA for PAM, sampling for validation).
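As a sketch of that last point, cluster::clara() runs PAM on repeated subsamples and scales to datasets far larger than iris:
# CLARA: PAM on subsamples (practical for large n)
set.seed(123)
clara_result <- clara(iris_scaled, k = 3, samples = 50, pamLike = TRUE)
clara_result$silinfo$avg.width # average silhouette width on the best sample
table(CLARA = clara_result$clustering, kmeans = km3$cluster)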
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mclust_6.1.2 NbClust_3.0.1 clValid_0.7 factoextra_1.0.7
## [5] ggplot2_4.0.1 cluster_2.1.8.1
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.5 sass_0.4.10 generics_0.1.4 tidyr_1.3.1
## [5] rstatix_0.7.2 class_7.3-23 digest_0.6.39 magrittr_2.0.4
## [9] evaluate_1.0.5 grid_4.5.1 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] jsonlite_2.0.0 ggrepel_0.9.6 backports_1.5.0 Formula_1.2-5
## [17] gridExtra_2.3 purrr_1.2.0 viridisLite_0.4.2 scales_1.4.0
## [21] jquerylib_0.1.4 abind_1.4-8 cli_3.6.5 rlang_1.1.6
## [25] withr_3.0.2 cachem_1.1.0 yaml_2.3.10 tools_4.5.1
## [29] ggsignif_0.6.4 dplyr_1.1.4 ggpubr_0.6.1 broom_1.0.9
## [33] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4 car_3.1-3
## [37] dendextend_1.19.1 pkgconfig_2.0.3 pillar_1.11.1 bslib_0.9.0
## [41] gtable_0.3.6 glue_1.8.0 Rcpp_1.1.0 xfun_0.54
## [45] tibble_3.3.0 tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.50
## [49] dichromat_2.0-0.1 farver_2.1.2 htmltools_0.5.8.1 ggsci_3.2.0
## [53] labeling_0.4.3 rmarkdown_2.30 carData_3.0-5 compiler_4.5.1
## [57] S7_0.2.1
Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.
Dunn, J.C. (1974). Well separated clusters and fuzzy partitions. Journal on Cybernetics, 4, 95-104.
Brock, G., Pihur, V., Datta, S., & Datta, S. (2008). clValid: An R package for cluster validation. Journal of Statistical Software, 25(4), 1-22.
Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1-36.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411-423.