class: center, middle, inverse, title-slide

.title[
# Clustering Validation
]
.subtitle[
## Quality Control and Evaluation Methods
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-12-03
]

---

<!-- HTML style block -->
<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Why Clustering Validation Matters

- **Critical problem**: Clustering algorithms will return clusters even if the data contains NO meaningful cluster structure
- Clustering validation is necessary to:
  - Assess clustering tendency before analysis
  - Validate the quality of results after clustering
  - Avoid finding patterns in random data
  - Compare different clustering algorithms objectively
- Without validation, we risk:
  - Interpreting noise as meaningful patterns
  - Making incorrect biological or scientific conclusions
  - Selecting inappropriate clustering methods

---

## Clustering Tendency Assessment

**Before clustering**, assess whether the data actually contains meaningful cluster structure:

**Hopkins Statistic**
- Tests the spatial randomness of the data
- Values close to 0.5 indicate random data (no clusters)
- Values close to 1 suggest strongly clusterable data; values close to 0 indicate regularly spaced data

**Visual Assessment of cluster Tendency (VAT)**
- Reorders the dissimilarity matrix to visualize cluster structure
- Dark blocks along the diagonal indicate potential clusters
- Uniform shading suggests no
cluster structure

.small[ https://en.wikipedia.org/wiki/Hopkins_statistic ]

---

## Clustering Validation Criteria

As the goal of clustering is to make objects within the same cluster similar and objects in different clusters distinct, internal validation measures are often based on the following two criteria:

- **Compactness** - measures how closely related the objects in a cluster are
  - Example metrics: variance, average pairwise distance within a cluster
- **Separation** - measures how distinct or well-separated a cluster is from other clusters
  - Example metrics: average pairwise distances between cluster centers

---

## Additional Validation Criteria

**Connectivity**
- Measures to what extent items are placed in the same cluster as their nearest neighbors in the data space
- Values range from 0 to infinity
- Should be minimized for good clustering

**Combined Indices**
- Most internal validation indices combine multiple criteria: `\(Index = \frac{\alpha \times Separation}{\beta \times Compactness}\)`
- Balance between separation and compactness
- Different indices weight these criteria differently

---

## General Clustering Validation Procedure

1. Initialize a list of clustering algorithms which will be applied to the data set
2. For each clustering algorithm, use different combinations of parameters to get different clustering results
3. Compute the corresponding **internal validation index** of each partition obtained in step 2
4. Choose the best partition and the optimal cluster number according to the criteria

---

## Assess Cluster Fit and Stability

- Most often ignored - cluster structure is treated as reliable and precise
- BUT! Clustering is generally VERY sensitive to noise and to outliers
- Measure cluster quality based on how "tight" the clusters are
- Do genes in a cluster appear more similar to each other than genes in other clusters?
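---

## Hopkins Statistic: A Sketch

The tendency assessment above can be illustrated with a minimal, from-scratch sketch (function names here are illustrative, not from any package; in practice, implementations such as `factoextra::get_clust_tendency()` in R can be used). It compares nearest-neighbor distances of uniformly random points against those of sampled data points:

```python
import math, random

def hopkins(data, m, rng):
    """Hopkins statistic: ~0.5 for spatially random data,
    near 1 for strongly clustered data."""
    dims = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dims)]
    hi = [max(p[d] for p in data) for d in range(dims)]

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def nn(p, points):
        return min(dist(p, q) for q in points)

    # u: distances from m uniform random points to their nearest data point
    u = [nn([rng.uniform(lo[d], hi[d]) for d in range(dims)], data)
         for _ in range(m)]
    # w: distances from m sampled data points to their nearest other data point
    sample = rng.sample(data, m)
    w = [nn(p, [q for q in data if q is not p]) for p in sample]
    return sum(u) / (sum(u) + sum(w))

rng = random.Random(1)
# two tight, well-separated blobs -> strong cluster structure
blobs = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)] + \
        [[rng.gauss(10, 0.1), rng.gauss(10, 0.1)] for _ in range(50)]
print(hopkins(blobs, 20, rng))  # close to 1
```

For uniformly scattered points the same statistic hovers around 0.5, which is the "no cluster structure" signal described above.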
---

## Clustering Evaluation Methods

- Sum of squares
- Homogeneity and Separation
- Cluster Silhouettes and the Silhouette coefficient: how similar each gene is to genes in its own cluster compared to genes in other clusters
- Rand index
- Gap statistics
- Cross-validation

---

## Determining Optimal Number of Clusters

A fundamental question in clustering: **How many clusters should we use?**

**Common approaches:**

1. **Elbow method** - plot within-cluster variation vs. number of clusters, look for an "elbow"
2. **Gap statistic** - compare within-cluster variation to a null reference distribution
3. **Silhouette analysis** - examine silhouette scores across different cluster numbers
4. **Multiple indices** - use several validation measures and look for consensus

---

## Determining Optimal Number of Clusters

Different validation measures may suggest different "optimal" numbers. The final decision should consider:

- Statistical evidence from multiple measures
- Domain knowledge and interpretability
- Practical constraints

---

## Sum of Squares

- A good clustering yields clusters where genes have a small within-cluster sum-of-squares (and a high between-cluster sum-of-squares)

---

## Homogeneity

- **Homogeneity** is calculated as the average distance between each gene expression profile and the center of the cluster it belongs to

`$$H_{k}=\frac{1}{N_g} \sum_{i \in k} d(X_i,C(X_i))$$`

where `\(N_g\)` is the total number of genes in the cluster

---

## Separation

- **Separation** is calculated as the weighted average distance between cluster centers

`$$S_{ave}=\frac{1}{\sum_{k \neq l}{N_kN_l}} \sum_{k \neq l}{N_kN_ld(C_k,C_l)}$$`

---

## Homogeneity and Separation

- Homogeneity reflects the compactness of the clusters, while Separation reflects the overall distance between clusters
- Decreasing Homogeneity or increasing Separation suggests an improvement in the clustering results

---

## Variance Ratio Criterion (VRC, the Caliński-Harabasz index)

`$$VRC_k=(SS_B/(K-1))/(SS_W/(N-K))$$`

- `\(SS_B\)` –
between-cluster variation
- `\(SS_W\)` – within-cluster variation

The goal is to select the number of clusters `\(K\)` that maximizes `\(VRC_k\)`

<!-- `$$\kappa_k=(VRC_{k+1} - VRC_k) - (VRC_k - VRC_{k-1})$$` - Select `\(K\)` to minimize the value of `\(\kappa_k\)`-->

.small[ Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1), 1–27. https://doi.org/10.1080/03610927408827101 ]

---

## Silhouette

- Good clusters are those where the genes are close to each other compared to their next closest cluster

`$$s(i)=\frac{b(i)-a(i)}{max(a(i),b(i))}$$`

- `\(b(i) = min(AVGD_{BETWEEN}(i,k))\)`
- `\(a(i) = AVGD_{WITHIN}(i)\)`
- How well observation `\(i\)` matches its cluster assignment. Ranges `\(-1 < s(i) < 1\)`
- Overall silhouette: `\(SC=\frac{1}{N_g}\sum_{i=1}^{N_g}{s(i)}\)`

.small[ Rousseeuw, Peter J. "**Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.**" Journal of Computational and Applied Mathematics 1987 https://doi.org/10.1016/0377-0427(87)90125-7 ]

---

## Silhouette Plot

.pull-left[
- Displays how close each point in one cluster is to points in the neighboring clusters
- Silhouette width near +1 indicates points that are very distant from neighboring clusters
- Silhouette width near 0 indicates points that are not distinctly in one cluster or another
- Negative width indicates points that are probably assigned to the wrong cluster
]
.pull-right[
<img src="img/silhouette.png" width="500px" style="display: block; margin: auto;" />
]

---

## Rand Index

Cluster multiple times

- Clustering A: 1, 2, 2, 1, 1
- Clustering B: 2, 1, 2, 1, 1

Compare pairs

- `\(a: \; = \; and \; =\)`, the number of pairs assigned to the same cluster in A and in B
- `\(b: \; \neq \; and \; \neq\)`, ... different clusters in A and in B
- `\(c: \; \neq \; and \; =\)`, ... different in A, same in B
- `\(d: \; = \; and \; \neq\)`, ...
same in A, different in B

---

## Rand Index Formula

`$$R=\frac{a+b}{a+b+c+d}$$`

- Adjust the Rand index to make it vary between -1 and 1 (negative if less than expected)
- `\(AdjRand = (Rand - expect(Rand)) / (max(Rand) - expect(Rand))\)`

---

## Adjusted Rand Index

The Adjusted Rand Index (ARI) is the corrected-for-chance version of the Rand Index. The ARI takes values from -1 to 1, with the ARI expected to be 0 for a random subtyping.

Rand index and adjusted Rand index:
- https://davetang.org/muse/2017/09/21/the-rand-index/
- https://davetang.org/muse/2017/09/21/adjusted-rand-index/

---

## Gap Statistics

- Cluster the observed data, varying the total number of clusters `\(k=1, 2, ...
K\)`
- For each cluster, calculate the sum of the pairwise distances for all points

`$$D_r=\sum_{i,i' \in C_r}{d_{ii'}}$$`

- Calculate the within-cluster dispersion measure

`$$W_k=\sum_{r=1}^k{\frac{1}{2n_r}D_r}$$`

---

## Gap Statistics Visualization

<img src="img/gap.png" width="900px" style="display: block; margin: auto;" />

---

## Cross-Validation Approaches

- Cluster while leaving out `\(k\)` experiments (or genes)
- Measure how well cluster groups are preserved in the left-out experiment(s)
- Or, measure agreement between test and training set

---

## Clustering Validity

- Hypothesis: if the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the distance vector

<img src="img/validity.png" width="800px" style="display: block; margin: auto;" />

---

## Cophenetic Distance

- The cophenetic distance between two objects in a cluster tree (dendrogram) is the intercluster distance at which the objects are first joined into the same cluster
  - It represents the height of the link that first joins the two objects
- Metric: the Cophenetic Correlation Coefficient (CPCC) is the correlation between the original pairwise distances of the data points and their corresponding cophenetic distances derived from the dendrogram
- Purpose: the `\(\text{CPCC}\)` measures how faithfully the dendrogram preserves the original pairwise distances between the data points
- Interpretation:
  - Values close to +1 indicate that the hierarchical clustering structure is a very good representation of the original data's distance matrix
  - A low value suggests the hierarchical method poorly reflects the actual similarities/dissimilarities in the data, indicating a poor fit

---

## WADP - Robustness of Clustering

- WADP = Weighted Average Discrepancy Rate
- If the input data deviate slightly from their current value, will we get the same clustering?
- Important in gene expression data analysis because of ever-present noise

.small[ Bittner M. et al. "**Molecular classification of cutaneous malignant melanoma by gene expression profiling**" Nature 2000 https://doi.org/10.1038/35020115 ]

---

## WADP - Robustness of Clustering (cont.)

- Perturb each original gene expression profile by `\(N(0, 0.01)\)` noise
- Re-normalize the data, cluster
- Cluster-specific discrepancy rate: `\(D/M\)`. That is, for the `\(M\)` pairs of genes in an original cluster, count the number of gene pairs, `\(D\)`, that do not remain together in the clustering of the perturbed data, and take their ratio
- The overall discrepancy rate is the weighted average of the cluster-specific discrepancy rates

---

## WADP - Calculation

- If there were originally `\(m_j\)` genes in cluster `\(j\)`, then there are `\(M_j=m_j(m_j-1)/2\)` pairs of genes
- In the new clustering, count how many of these pairs (`\(D_j\)`) do **not** remain together in the same cluster
- Calculate the discrepancy rate `\(D_j/M_j\)`

`$$WADP=\frac{\sum_{j=1}^k{m_jD_j/M_j}}{\sum_{j=1}^k{m_j}}$$`

---

## Other Internal Clustering Validation Measures

<img src="img/internal_clustering_validation_metrics.png" width="1000px" style="display: block; margin: auto;" />

.small[ Liu, Yanchi, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. "Understanding of Internal Clustering Validation Measures," 911–16. IEEE, 2010. https://doi.org/10.1109/ICDM.2010.35 ]

---

## Comparative Performance of Validation Measures

Different internal validation measures have strengths and limitations:

**Considerations for measure selection:**

- **Monotonicity**: Does the measure behave consistently as cluster quality changes?
- **Noise sensitivity**: How does the measure perform with noisy data?
- **Density variations**: Can it handle clusters with different densities?
- **Subclusters**: Does it correctly identify nested or close clusters?
- **Skewed distributions**: Performance with non-spherical cluster shapes

**Key finding**: No single measure performs optimally in all scenarios. Consider using multiple measures for robust validation.

---

## Clustering Pitfalls

- Any data – even noise – can be clustered
- It is quite possible for there to be several different classifications of the same set of objects
- It should be clear that any clustering produced should be related to the features in which the investigator is interested

---

## Best Practices for Clustering Validation

1. **Always validate**: Never trust clustering results without validation
2. **Use multiple measures**: Different indices may give different insights
3. **Assess clustering tendency first**: Verify that clusters actually exist before clustering
4. **Consider stability**: Robust clusters should be stable under small perturbations
5. **Evaluate biological/domain relevance**: Statistical validity doesn't guarantee scientific meaningfulness
6. **Document your process**: Report which validation measures were used and why
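---

## Example: Rand Index by Hand

The pair definitions on the Rand index slide can be checked numerically. A minimal sketch (illustrative, not part of the lecture materials) computing the Rand index and ARI for the two example clusterings:

```python
from itertools import combinations

def rand_indices(A, B):
    """Plain Rand index and adjusted Rand index from pair counts."""
    pairs = list(combinations(range(len(A)), 2))
    a = sum(A[i] == A[j] and B[i] == B[j] for i, j in pairs)  # same, same
    b = sum(A[i] != A[j] and B[i] != B[j] for i, j in pairs)  # diff, diff
    rand = (a + b) / len(pairs)

    # chance correction: ARI = (a - expect(a)) / (max(a) - expect(a))
    same_A = sum(A[i] == A[j] for i, j in pairs)  # pairs together in A
    same_B = sum(B[i] == B[j] for i, j in pairs)  # pairs together in B
    expected = same_A * same_B / len(pairs)
    max_index = (same_A + same_B) / 2
    ari = (a - expected) / (max_index - expected)
    return rand, ari

# The clusterings from the Rand index slide
A = [1, 2, 2, 1, 1]
B = [2, 1, 2, 1, 1]
rand, ari = rand_indices(A, B)
print(rand, round(ari, 4))  # 0.4 -0.25
```

Here `a` equals the contingency-table term `\(\sum_{ij}\binom{n_{ij}}{2}\)`, so the standard ARI formula reduces to the chance-corrected expression above; note the ARI is negative, i.e., these two partitions agree less than expected by chance.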