class: center, middle, inverse, title-slide

.title[
# Hierarchical Clustering
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-11-12
]

---

<!-- HTML style block -->
<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

---
## Machine learning

- **Supervised Learning**: given examples of inputs and corresponding desired outputs, predict outputs on future inputs.
    - Ex: classification, regression, time series prediction
- **Unsupervised Learning**: given only inputs, automatically discover representations, features, structure, etc.
    - Ex: clustering, outlier detection, compression
<!-- - **Rule Learning**: given multiple measurements, discover very common joint settings of subsets of measurements. -->
- **Reinforcement Learning**: given sequences of inputs, actions from a fixed set, and scalar rewards/punishments, learn to select action sequences that maximize expected reward.

---
## What is clustering

- Partitioning of genes or experiments into groups with similar behavior, under the assumption that similarity reflects functional relatedness
- A cluster is a group of relatively homogeneous cases or observations

<center><img src="img/clustering.png" height="370px" /></center>

---
## What is clustering

Given `\(n\)` objects, assign them to `\(k\)` groups (clusters) based on their similarity

- Clustering does not explicitly model the underlying biology
- Clusters are mutually exclusive, while in reality a gene may be part of multiple biological processes

---
## Clustering algorithm

<center><img src="img/clustering_algorithm.png" height="500px" /></center>

---
## Clustering example

<center><img src="img/clustering_example2.png" height="400px" /></center>

- The cluster dendrogram describes the step-by-step merging process.
- We can also evaluate how close the groups are to each other.

---
## Clustering impossible

- **Scale-invariance** - meters vs. inches
- **Richness** - all partitions as possible solutions
- **Consistency** - increasing distances between clusters and decreasing distances within clusters should yield the same solution

**No function exists that satisfies all three.**

.small[ Kleinberg, J. "An Impossibility Theorem for Clustering." Advances in Neural Information Processing Systems (2003). https://www.cs.cornell.edu/home/kleinber/nips15.pdf ]

---
## Clustering utopia

<center><img src="img/clustering_utopia.png" height="550px" /></center>

---
## Clustering reality

<center><img src="img/clustering_reality.png" height="550px" /></center>

---
## Conducting Cluster Analysis

<center><img src="img/clustering_procedure.png" height="500px" /></center>

---
## Types of clustering algorithms

Two categories:

- Heuristic algorithms
    - No probabilistic model
    - Optimize a target function, or iteratively agglomerate (divide) nodes to form bottom-up (top-down) trees
    - Examples: K-means, hierarchical clustering
- Model-based analyses
    - Probabilistic assumptions about the data
    - Fit with the Expectation-Maximization (EM) algorithm

---
## Types of clustering algorithms

- Partitioning methods
- Hierarchical methods
- Model-based methods
- Density-based methods
- Grid-based methods

---
class: center, middle

# Clustering gene expression

---
## Gene expression matrix

<center><img src="img/gene_matrix.png" height="500px" /></center>

---
## Formulating the Problem

- Most important is **selecting the variables** on which the clustering is based.
- Inclusion of even one or two irrelevant variables may distort the clustering solution.
- The selected variables should describe the similarity between objects in terms that are relevant to the research problem.
- They should be selected based on past research, theory, or a consideration of the hypotheses being tested.

---
## Filtering

- Non-informative genes contribute random terms to the calculation of distances
- As a result, they hide the useful information provided by other genes
- Therefore, assign non-informative genes zero weight, i.e., exclude them from the cluster analysis

---
## Filtering examples

- **% Present >= X** - remove all genes that have missing values in greater than (100-X) percent of the columns
- **SD (Gene Vector) >= X** - remove all genes that have standard deviations of observed values less than X
- **At least X Observations with abs(Val) >= Y** - remove all genes that do not have at least X observations with absolute values greater than Y
- **MaxVal-MinVal >= X** - remove all genes whose maximum minus minimum values are less than X

---
## Clustering noise

<center><img src="img/noise.png" height="500px" /></center>

---
## Cluster the right data

Clustering works as expected when the data to be clustered is processed correctly

- **Log Transform Data** - replace all data values x by log2(x). Why?
- **Center genes [mean or median]** - subtract the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0.
- **Center samples [mean or median]** - subtract the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0.

---
## Cluster the right data

Clustering works as expected when the data to be clustered is processed correctly

- **Normalize genes** - multiply all values in each row of data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row).
- **Normalize samples** - multiply all values in each column of data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column).

---
## Standardization

In many cases, we are not interested in the absolute amplitudes of gene expression, but in the relative changes. Then, we standardize:

`$$g_s=(g-\hat{g})/\sigma(g)$$`

Standardized gene expression vectors have a mean value of zero and a standard deviation of one.

<center><img src="img/standatrization.png" height="270px" /></center>

---
## Cluster the right data

- Preprocessing (centering, normalization, scaling, `\(log_2\)`-transformation) is critical for interpretable clustering
- Preprocessing operations do not commute, so the order in which they are applied is very important
- Log-transforming centered genes is not the same as centering log-transformed genes

A short R sketch of these steps follows on the next slide.
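---
## Preprocessing in R (sketch)

A minimal sketch of the preprocessing steps above on a toy matrix. The object names (`expr`, `expr_std`) and the simulated values are made up for illustration; the general workflow is log-transform first, then center/scale the genes (rows).

```r
set.seed(42)
# Toy expression matrix: 5 genes (rows) x 4 samples (columns), positive values
expr <- matrix(2^rnorm(20, mean = 5), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

expr_log <- log2(expr)              # log-transform the data
expr_std <- t(scale(t(expr_log)))   # standardize each gene (row): mean 0, sd 1

rowMeans(expr_std)                  # ~0 for every gene
apply(expr_std, 1, sd)              # 1 for every gene
```

`scale()` works on columns, so the matrix is transposed before and after the call to standardize rows (genes).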
---
class: center,middle

# How to define (dis)similarity among objects

---
## Distance

- Clustering organizes things that are close into groups
- What does it mean for two genes to be close?
- What does it mean for two samples to be close?
- Once we know this, how do we define groups?

---
## Distance

- We need a mathematical definition of distance between two points
- What are points?
- If each gene is a point, what is the mathematical definition of a point?

---
## Points

`\(Gene_1= (E_{11}, E_{12}, ..., E_{1N})\)`

`\(Gene_2= (E_{21}, E_{22}, ..., E_{2N})\)`

`\(Sample_1= (E_{11}, E_{21}, ..., E_{G1})\)`

`\(Sample_2= (E_{12}, E_{22}, ..., E_{G2})\)`

`\(E_{gi}\)` - expression of gene `\(g\)` in sample `\(i\)`

---
## Distance definition

For all objects `\(i\)`, `\(j\)`, and `\(h\)`

<center><img src="img/distance.png" height="300px" /></center>

---
## Most famous distance

**Euclidean distance**

- Example distance between genes 1 and 2: `\(\sqrt{\sum_{i=1}^{N}{(E_{1i} - E_{2i})^2}}\)`
- When N is 2, this is distance as we know it:

<center><img src="img/euclidean.png" height="200px" /></center>

- When N is 20,000 you have to think abstractly

---
## Distance measures

<center><img src="img/distances.png" height="300px" /></center>

- Disadvantages: not scale-invariant, does not capture negative correlations

---
## Distance measures

- When deciding on an appropriate value of `\(q\)`, the investigator must decide whether emphasis should be placed on large differences.
- Larger values of `\(q\)` give relatively more emphasis to larger differences.

---
## Distance measures

- Canberra distance

<center><img src="img/canberra.png" height="100px" /></center>

- A weighted version of the Manhattan distance
- Maximum distance between two vectors `\(p=(p_1,p_2,...,p_n)\)` and `\(q=(q_1,q_2,...,q_n)\)`

---
## Similarity definition

- For all objects `\(i\)`, `\(j\)`

<center><img src="img/similarity.png" height="150px" /></center>

---
## Similarity measures

- Gene expression profiles represent comparative expression measures
- Euclidean distance may not be meaningful
- We need a measure that scores objects based on their similarity
- The more objects `\(i\)` and `\(j\)` are alike (or close), the larger `\(s(i,j)\)` becomes

---
## Cosine similarity

- Measures similarity between two _non-zero_ vectors
- Works best when the outcome is within the `\([0,1]\)` interval
- Frequently used in text mining
- Does not conform to the definition of a distance - it lacks the triangle inequality property

---
## Cosine similarity

From the Euclidean dot product

`$$a \cdot b = \left\lVert a \right\rVert_2 \left\lVert b \right\rVert_2 \cos\theta$$`

`$$similarity = \cos\theta = \frac{a \cdot b}{\left\lVert a \right\rVert_2 \left\lVert b \right\rVert_2} = \frac{\sum_{i=1}^n{a_ib_i}}{\sqrt{\sum_{i=1}^n{a_i^2}} \sqrt{\sum_{i=1}^n{b_i^2}}}$$`

- `\(\cos\theta\)` ranges from -1 to 1; 0 indicates orthogonality (decorrelation)
- `\(\cos(0^{\circ}) = 1\)` - perfect similarity. Measures _orientation_
- Other options - angular distance

---
## Similarity measures

Pearson correlation coefficient, in `\([-1, 1]\)`. Vectors are centered by their means

<center><img src="img/pearson.png" height="150px" /></center>

How to convert to a dissimilarity in `\([0, 1]\)`:

`$$d(i,j)=(1-s(i,j))/2$$`

---
## Distances between gene expression profiles

<center><img src="img/distance-corelation.png" height="550px" /></center>

<!-- ## Convert correlation to dissimilarity `$$d(X_i,X_j)=\frac{1-Cor(X_i, X_j)}{2}$$` -->

---
## Association versus correlation

<center><img src="img/correlation_association.png" height="250px" /></center>

- Correlation is one type of association; it measures increasing or decreasing trends, quantified using correlation coefficients.

<!-- - (a) Scatter plots of associated (but not correlated), non-associated and correlated variables. In the lower association example, variance in y is increasing with x.
- (b) The Pearson correlation coefficient (r, black) measures linear trends, and the Spearman correlation coefficient (s, red) measures increasing or decreasing trends.
- (c) Very different data sets may have similar r values. Descriptors such as curvature or the presence of outliers can be more specific. -->

.small[ Altman, Naomi, and Martin Krzywinski. “Points of Significance: Association, Correlation and Causation.” Nature Methods 12, no. 10 (September 29, 2015): 899–900. https://doi.org/10.1038/nmeth.3587 ]

---
## The (dis-)similarity matrices

<center><img src="img/dissimilarity1.png" height="550px" /></center>

---
## The (dis-)similarity matrices

<center><img src="img/dissimilarity2.png" height="550px" /></center>

---
## Distance measures between binary 0/1 vectors

- Jaccard distance

<center><img src="img/jaccard.png" height="100px" /></center>

- Overlap coefficient

`$$C = \frac{ \lvert A \cap B \rvert}{\min(\lvert A \rvert, \lvert B \rvert)}$$`

Both lie in the range `\([0, 1]\)`. `\(J = C = 0\)` indicates no overlap between the binary vectors. A Jaccard index `\(J = 1\)` indicates two identical vectors, whereas the overlap coefficient `\(C = 1\)` when one vector is a complete subset of the other.

---
## Clustering binary data

- Two columns with binary data, encoded `\(0\)` and `\(1\)`
    - `\(a\)` - number of rows where both columns are 1
    - `\(b\)` - number of rows where this and not the other column is 1
    - `\(c\)` - number of rows where the other and not this column is 1
    - `\(d\)` - number of rows where both columns are 0

**Jaccard coefficient** (a similarity; the corresponding Jaccard distance is one minus it)

`$$\frac{a}{a+b+c}$$`

---
## Clustering binary data

- Two columns with binary data, encoded `\(0\)` and `\(1\)`
    - `\(a\)` - number of rows where both columns are 1
    - `\(b\)` - number of rows where this and not the other column is 1
    - `\(c\)` - number of rows where the other and not this column is 1
    - `\(d\)` - number of rows where both columns are 0

**Tanimoto coefficient** (a similarity; subtract from one for a distance)

`$$\frac{a+d}{a+d+2(b+c)}$$`

---
## Clustering categorical data

<center><img src="img/cramersv.png" height="550px" /></center>

---
## Clustering mixed data

**Gower distance**

- Idea: use a distance measure between 0 and 1 for each pair of variables: `\(d_{ij}^{(f)}\)`
- Aggregate: `\(d(i,j)=\frac{1}{p}\sum_{f=1}^p{d_{ij}^{(f)}}\)`

.small[ Gower, J. C. "**A General Coefficient of Similarity and Some of Its Properties.**" Biometrics (1971). https://doi.org/10.2307/2528823 ]

---
## Gower distance

How to calculate the distance measure for each pair of variables:

- **Quantitative**: interval-scaled distance `\(d_{ij}^{(f)}=\frac{|x_{if} - x_{jf}|}{R_f}\)`, where `\(x_{if}\)` is the value for object `\(i\)` in variable `\(f\)`, and `\(R_f\)` is the range of variable `\(f\)` over all objects
- **Categorical**: use "0" when `\(x_{if}\)` and `\(x_{jf}\)` agree, and "1" otherwise
- **Ordinal**: use normalized ranks, then treat as interval-scaled based on the range

---
## Choose (dis-)similarity metric

- Think hard about this step!
- Remember: garbage in - garbage out
- The metric you pick should be a valid measure of the distance/similarity between genes.

**Examples**

- Applying correlation to highly skewed data will provide misleading results.
- Applying Euclidean distance to data measured on a categorical scale is invalid.

---
## Distances in R

| Function                       | Package          | Distances                                    |
|--------------------------------|------------------|----------------------------------------------|
| dist                           | stats            | Euclidean, Manhattan, Canberra, max, binary  |
| daisy                          | cluster, bioDist | Euclidean, Manhattan                         |
| distancematrix, distancevector | hopach           | Euclidean, cor, cosine-angle (abs versions)  |
| vegdist                        | vegan            | Jaccard, Gower, many others                  |

Other packages: `cclust`, `e1071`, `flexmix`, `fpc`, `mclust`, `Mfuzz`, `class`
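---
## Distances in R: a quick sketch

A hedged sketch of several measures from the table above, using base R and the `cluster` package. The toy matrices and object names are invented for illustration.

```r
set.seed(1)
# Toy data: 4 genes (rows) x 6 samples (columns)
x <- matrix(rnorm(24), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))

dist(x, method = "euclidean")          # Euclidean distances between genes (rows)
dist(x, method = "manhattan")          # Manhattan distances

# Correlation-based dissimilarity between genes: d = (1 - r) / 2
as.dist((1 - cor(t(x))) / 2)

# Binary vectors: asymmetric binary ("Jaccard-type") distance
b <- matrix(rbinom(24, 1, 0.5), nrow = 4)
dist(b, method = "binary")

# Mixed data: Gower distance via cluster::daisy()
library(cluster)
df <- data.frame(height = c(1.6, 1.8, 1.7),
                 group  = factor(c("A", "B", "A")))
daisy(df, metric = "gower")
```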
---
class: middle,center

# Assembling objects into clusters

---
## Assembling objects into clusters

- The number of ways to partition a set of `\(n\)` objects into `\(k\)` non-empty classes

<center><img src="img/numclusters.png" height="100px" /></center>

- `\(S(n,1)=1\)` - there is one way to partition `\(n\)` objects into 1 group (or into `\(n\)` disjoint groups)
- `\(S(n,2)=2^{n-1}-1\)` - the number of ways to partition `\(n\)` objects into two non-empty groups

---
## Classification of Clustering Procedures

<center><img src="img/hierarchy.png" height="550px" /></center>

---
## Hierarchical Clustering

- Organizes the data into a tree (dendrogram)
- **Agglomerative** (bottom-up): each observation starts as its own cluster. Clusters are merged based on similarities
- **Divisive** (top-down): all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy

---
## Agglomerative clustering (bottom-up)

- Idea: ensure nearby points end up in the same cluster
- Starts with each gene in its own cluster
- Joins the two most similar clusters
- Then joins the next two most similar clusters
- Continues until all genes are in one cluster

---
## Agglomerative clustering (bottom-up)

<img src="img/clustering.gif" width="800px" style="display: block; margin: auto;" />

---
## Divisive clustering (top-down)

- Starts with all genes in one cluster
- Chooses the split so that genes within each of the two clusters are most similar (maximize the “distance” between clusters)
- Finds the next split in the same manner
- Continues until all genes are in single-gene clusters

---
## Dendrograms

- We can then make dendrograms showing the divisions
- The y-axis represents the distance between the groups divided at that point

<center><img src="img/dendrogram.png" height="450px" /></center>

---
## Dendrograms

- Note: left and right are assigned arbitrarily; the vertical distance is what matters
- Look at the height of a division to find the distance. For example, S5 and S16 are very far apart

<center><img src="img/dendrogram.png" height="450px" /></center>

---
## Which to use?

- Both agglomerative and divisive clustering are only ‘step-wise’ optimal: at each step the optimal split or merge is performed
- Outliers will irreversibly change the clustering structure

---
## Which to use?

**Agglomerative/Bottom-Up**

- Computationally simpler, and more widely available
- More "precision" at the bottom of the tree
- When looking for small clusters and/or many clusters, use agglomerative

---
## Which to use?

**Divisive/Top-Down**

- More "precision" at the top of the tree
- When looking for large and/or few clusters, use divisive

**Results ARE sensitive to choice!**
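---
## Hierarchical clustering in R (sketch)

A minimal sketch of agglomerative clustering with `hclust()`. The simulated matrix and labels are invented for illustration; `hclust()` uses complete linkage by default.

```r
set.seed(7)
# Simulated expression matrix: 20 genes x 6 samples
x <- matrix(rnorm(120), nrow = 20,
            dimnames = list(paste0("g", 1:20), paste0("s", 1:6)))

d  <- dist(x)      # Euclidean distances between genes (rows)
hc <- hclust(d)    # agglomerative clustering, complete linkage by default

plot(hc, main = "Agglomerative clustering of genes")  # dendrogram
groups <- cutree(hc, k = 3)                           # cut the tree into 3 clusters
table(groups)
```

Divisive clustering is available, for example, as `diana()` in the `cluster` package.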
<center><img src="img/aggdiv.png" height="550px" /></center> # Linking objects based on the distance between them --- ## Linkage between clusters - **Single Linkage** - join clusters whose distance between closest genes is smallest (elliptical) - **Complete Linkage** - join clusters whose distance between furthest genes is smallest (spherical) - **Average Linkage** - join clusters whose average distance is the smallest. --- ## Linkage between clusters <center><img src="img/linkage.png" height="550px" /></center> --- ## Single linkage <center><img src="img/single.png" height="550px" /></center> --- ## Complete linkage <center><img src="img/complete.png" height="550px" /></center> --- ## Average linkage <center><img src="img/average.png" height="550px" /></center> --- ## UPGMC (Unweighted Pair Group Method using Centroids) `$$centroid = \hat{Y} = \frac{1}{N_Y}\sum_{i \in Y}{X_{i,j}}$$` Define distance between clusters as distance between centroids `\(\delta_{Y,Z} = d_{\hat(Y), \hat{Z}}\)` --- ## UPGMA (Unweighted Pair Group Method with Arithmetic Mean) `$$\delta_{Y,Z} = \frac{1}{N_Y * N_Z}\sum_{i \in Y}\sum_{j \in Z}{d_{i,j}}$$` Average of pairwise distances between objects --- ## Ward’s method - **Ward's procedure** is commonly used. For each cluster, the sum of squares is calculated. The two clusters with the smallest increase in the overall sum of squares within cluster distances are combined. <center><img src="img/ward.png" height="150px" /></center> - `\(\Delta\)` - Merging cost of combining the clusters `\(A\)` and `\(B\)`. `\(m_j\)` is the center of cluster `\(j\)`, and `\(n_j\)` is the number of points in it. - The sum of squares starts at 0 (each point is in its own cluster), and grows as clusters are merged. Ward’s method keep this growth to minimum. --- ## Ward’s method - The distance `\(d\)` between two clusters `\(C_i\)` and `\(C_j\)` is defined as the loss of information (or: the increase in error) in merging two clusters. - The error of a cluster `\(C\)` is measured as the sum of distances between the objects in the cluster and the cluster centroid `\(cenC\)`. --- ## Ward’s method - When merging two clusters, the error of the merged cluster is larger than the sum or errors of the two individual clusters, and therefore represents a loss of information. - The merging is performed on those clusters which are most homogeneous, to unify clusters such that the variation inside the merged clusters increases as little as possible. --- ## Ward’s method - Ward’s method tends to create compact clusters of small size. It is a least squares method, so implicitly assumes a Gaussian model. .small[ Ward, J. H., Jr. (1963), "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association https://doi.org/10.1080/01621459.1963.10500845 ] --- ## Ward’s Method In R’s `hclust` function, Ward’s method can be specified in two ways: * **`method = "ward.D"`** - Implements the algorithm previously labeled simply `"ward"` in R versions ≤ 3.0.3. - Does **not** correspond exactly to Ward’s (1963) original criterion. * **`method = "ward.D2"`** - Implements Ward’s (1963) clustering criterion as described by Murtagh and Legendre (2014). - With this option, the dissimilarities are squared internally before cluster updating. - For an input data matrix `x`, the typical call is: `hclust(dist(x), method = "ward.D2")` (No need to manually square the distances) .small[ Murtagh, F., & Legendre, P. (2014). 
---
## How many clusters?

- A rule of thumb by Han et al. assumes that each cluster contains about `\(\sqrt{2n}\)` points for a dataset of `\(n\)` points, so the number of clusters can be estimated as `\(K = \sqrt{\frac{n}{2}}\)`
- The elbow method chooses the number of clusters `\(K\)` at the point where adding another cluster no longer gives a substantial decrease in the within-cluster variance (see the sketch on the next slide)

<center><img src="img/elbow.png" height="250px" /></center>

.small[ Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. ISBN 978-0-12-381479-1 ]
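---
## Choosing K with the elbow method (sketch)

A minimal sketch of the elbow heuristic on top of hierarchical clustering: cut the tree at increasing `\(K\)` and track the total within-cluster sum of squares. The simulated data and object names are invented for illustration; with three well-separated groups the elbow should appear near `\(K = 3\)`.

```r
set.seed(5)
# Toy data: three loose groups in two dimensions
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

hc <- hclust(dist(x), method = "ward.D2")

# Total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) {
  groups <- cutree(hc, k = k)
  sum(sapply(split(as.data.frame(x), groups), function(g)
    sum(scale(g, scale = FALSE)^2)))   # squared deviations from each cluster centroid
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster SS")
```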