class: center, middle, inverse, title-slide

.title[
# K-means Clustering
]
.subtitle[
## Non-hierarchical Clustering
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-11-17
]

---

<!-- HTML style block -->
<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## What is K-means Clustering?

**K-means clustering** is a method of cluster analysis that:

- Partitions `\(n\)` observations into `\(k\)` clusters
- Assigns each observation to the cluster with the nearest mean (centroid)
- Is similar to expectation-maximization for Gaussian mixtures
  - Both attempt to find centers of natural clusters in the data

---

## Properties of K-means Clustering

- **Non-hierarchical** - creates flat partitions
- **Unsupervised** - no labels needed
- **Distance-based** - uses Euclidean distance
- **Iterative** - alternates between assignment and update steps

**Goal:** Find natural groupings in the data by minimizing within-cluster variation

---

## K-means Objective

**We want to partition data into `\(K\)` clusters `\(C_1, ..., C_K\)` where:**

1. Each observation belongs to **at least one** cluster
2. Clusters are **non-overlapping** (no observation in multiple clusters)
3. Within-cluster variation is **minimized**

**Mathematical Objective:**

`$$\min_{C_1, ..., C_K}\left\{\sum_{k=1}^K\frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p(x_{ij}-x_{i'j})^2\right\}$$`

This minimizes, within each cluster, the sum of all pairwise squared Euclidean distances between observations, scaled by the cluster size `\(|C_k|\)`.

---

## K-means Objective

**We want to partition data into `\(K\)` clusters `\(C_1, ..., C_K\)` where:**

1. Each observation belongs to **at least one** cluster
2. Clusters are **non-overlapping** (no observation in multiple clusters)
3. Within-cluster variation is **minimized**

**Equivalent formulation:**

`$$\arg\min_C \sum_{k=1}^K \sum_{i \in C_k} ||x_i - \mu_k||^2$$`

where `\(\mu_k\)` is the centroid (mean) of cluster `\(k\)`

---

## K-means Algorithm

**Steps:**

1. **Initialize:** Choose `\(k\)` points as initial cluster means (centroids)
2. **Repeat until convergence:**
   - **Assignment step:** Assign each point `\(x_i\)` to the cluster with the closest centroid
   - **Update step:** Recalculate the centroid for each cluster

**Centroid calculation:**

`$$\mu_k = \frac{1}{|C_k|}\sum_{i \in C_k}x_i$$`

---

## K-means Algorithm

**Convergence:**

- K-means **always converges**
- Assignment and update steps either reduce the objective function or leave it unchanged
- Typically converges in relatively few iterations

---

## K-means clustering algorithm

<img src="img/kmeans.png" width="700px" style="display: block; margin: auto;" />

.small[ J. B. MacQueen "**Some Methods for Classification and Analysis of Multivariate Observations**" 1967 https://projecteuclid.org/euclid.bsmsp/1200512992 ]
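
---

## K-means From Scratch (Illustration)

To make the assignment and update steps above concrete, here is a minimal base-R sketch of the loop. The function name `my_kmeans` and the purely random initialization are illustrative only (they are not part of the original examples); empty clusters are not handled, and in practice you would use `kmeans()`.

``` r
# Minimal K-means: alternate assignment and update until labels stop changing
my_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centroids
  clusters <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to each centroid
    d2 <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
    new_clusters <- max.col(-d2)                  # index of the nearest centroid
    if (identical(new_clusters, clusters)) break  # assignments stable = converged
    clusters <- new_clusters
    # Update step: recompute each centroid as the mean of its assigned points
    centroids <- t(sapply(seq_len(k), function(j)
      colMeans(X[clusters == j, , drop = FALSE])))
  }
  list(cluster = clusters, centers = centroids)
}

set.seed(123)
fit <- my_kmeans(iris[, 1:4], k = 3)
table(fit$cluster, iris$Species)
```

The built-in `kmeans()` adds multiple random restarts (`nstart`) and more careful handling of edge cases such as empty clusters; this sketch only illustrates the two alternating steps.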

---

## K-means: Visual Example (Step 1)

.pull-left[
**Starting point:**

- Expression of two genes for 14 samples
- Some structure is visible in the data

**Next step:** Choose initial centroids
]
.pull-right[
<img src="img/kmeans0.png" width="400px" />
]

---

## K-means: Visual Example (Step 2)

.pull-left[
**Choose `\(K\)` centroids:**

- User picks starting values (e.g., `\(K=3\)`)
- Can be chosen randomly or using data-driven methods
- Different initializations may lead to different results

**Important:** Initial centroid placement affects final clustering
]
.pull-right[
<img src="img/kmeans1.png" width="400px" />
]

---

## K-means: Visual Example (Step 3)

.pull-left[
**Assignment step:**

- Find the closest centroid for each point
- Distances are calculated (typically Euclidean)
- Creates a "first partition" into `\(K\)` clusters

**Result:** Each point is now assigned to one of the `\(K\)` clusters
]
.pull-right[
<img src="img/kmeans2.png" width="400px" />
]

---

## K-means: Visual Example (Step 4)

.pull-left[
**Update step:**

- Calculate the center (mean) of each cluster
- Recompute centroids based on assigned points
- New centroids will be used in the next iteration

**Iteration:** Repeat assignment and update until convergence
]
.pull-right[
<img src="img/kmeans3.png" width="400px" />
]

---

## K-means: Visual Example (Step 5)

.pull-left[
**Convergence:**

- Algorithm repeats until centroids stop moving
- Final clusters are determined
- Within-cluster variation is minimized

.small[MacQueen, J.B. "Some Methods for Classification and Analysis of Multivariate Observations" (1967) https://projecteuclid.org/euclid.bsmsp/1200512992]
]
.pull-right[
<img src="img/kmeans4.png" width="400px" />
]

---

## K-means in R: Basic Example

``` r
# Load iris data
data(iris)
iris_data <- iris[, 1:4]

# Perform k-means clustering with K=3
set.seed(123)
kmeans_result <- kmeans(iris_data, centers = 3, nstart = 25)

# View cluster assignments
table(kmeans_result$cluster, iris$Species)
```

```
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2      0          2        36
##   3     50          0         0
```

**Key parameters:**

- `centers`: Number of clusters (K)
- `nstart`: Number of random initializations (use 25+ for stability)

---

## K-means in R: Basic Example

``` r
# View cluster centers
kmeans_result$centers
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.901613    2.748387     4.393548    1.433871
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.006000    3.428000     1.462000    0.246000
```

``` r
# Within-cluster sum of squares
kmeans_result$tot.withinss
```

```
## [1] 78.85144
```

---

## K-means: Visualization

``` r
library(ggplot2)

# Create data frame for plotting
iris_clustered <- data.frame(
  iris_data,
  Cluster = as.factor(kmeans_result$cluster),
  Species = iris$Species
)

# Plot first two dimensions
ggplot(iris_clustered, aes(x = Sepal.Length, y = Sepal.Width,
                           color = Cluster, shape = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "K-means Clustering of Iris Data (K=3)") +
  theme_minimal()
```

**Interpretation:** Compare cluster assignments with known species labels

---

## K-means: Visualization

<img src="Kmeans_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

**Interpretation:** Compare cluster assignments with known species labels

---

## Choosing K: The Elbow Method

**Problem:** How many clusters should we use?

**Elbow method:**

- Plot within-cluster sum of squares (WSS) vs. number of clusters
- Look for an "elbow" where adding more clusters gives diminishing returns

``` r
# Calculate WSS for different K values
wss <- sapply(1:10, function(k) {
  kmeans(iris_data, centers = k, nstart = 25)$tot.withinss
})

# Plot elbow curve
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of Clusters (K)",
     ylab = "Within-cluster Sum of Squares",
     main = "Elbow Method for Optimal K")
```

**Look for:** The point where the curve "bends" (the elbow)

---

## Choosing K: The Elbow Method

**Problem:** How many clusters should we use?

**Elbow method:**

- Plot within-cluster sum of squares (WSS) vs. number of clusters
- Look for an "elbow" where adding more clusters gives diminishing returns

<img src="Kmeans_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

**Look for:** The point where the curve "bends" (the elbow)

---

## Choosing K: Silhouette Method

**Silhouette coefficient** measures how well each point fits in its cluster:

`$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$`

- `\(a(i)\)`: Average distance from point `\(i\)` to points in the same cluster
- `\(b(i)\)`: Average distance from point `\(i\)` to points in the nearest neighboring cluster
- Range: -1 to 1 (higher is better)

---

## Choosing K: Silhouette Method

- `\(s(i) \approx 1\)`: point is well matched to its cluster
- `\(s(i) \approx 0\)`: point lies between clusters
- `\(s(i) < 0\)`: point may be misclassified

``` r
library(cluster)

# Calculate silhouette for different K
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(iris_data, centers = k, nstart = 25)
  ss <- silhouette(km$cluster, dist(iris_data))
  mean(ss[, 3])
})

# Plot
plot(2:10, avg_sil, type = "b", pch = 19,
     xlab = "Number of Clusters (K)",
     ylab = "Average Silhouette Width")
```

---

## Choosing K: Silhouette Method

1. Compute clustering for a range of `\(k\)` values (e.g., 2–10).
2. For each `\(k\)`, calculate the average silhouette coefficient across all points.
3. Plot the average silhouette width vs. `\(k\)`, typically as a line or bar plot.
4. Choose the optimal number of clusters where samples are most coherently grouped and well-separated from other clusters (higher average silhouette).

<img src="Kmeans_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## K-means Advantages

- **Simple and fast** - efficient for large datasets
- **Scales well** - linear per-iteration complexity O(nkd), where n = samples, k = clusters, d = dimensions
- **Sharp partitions** - each point belongs to exactly one cluster
- **Easy to implement** - straightforward algorithm
- **Works well** when clusters are:
  - Spherical (roughly circular/globular)
  - Similar in size
  - Similar in density
  - Well-separated

---

## K-means Limitations

- **Must specify K** - number of clusters not determined by algorithm
- **Sensitive to initialization** - different starting points → different results
  - Solution: Use the `nstart` parameter for multiple random starts
- **Sensitive to outliers** - outliers can distort cluster centroids
- **Assumes spherical clusters** - struggles with elongated or irregular shapes
- **Euclidean distance only** - gives most weight to largest differences
  - Cannot be used with qualitative data
- **Equal-sized clusters** - tends to create clusters of similar sizes
- **Local optima** - may not find the global optimum

---

## PAM: K-medoids Clustering

**Alternative to K-means:** Partitioning Around Medoids (PAM)

**Key difference:**

- **K-means:** Uses **centroids** (average of points in the cluster)
- **PAM:** Uses **medoids** (an actual data point that represents the cluster)

**When to use PAM:**

- Presence of outliers
- Need non-Euclidean distance
- Want representative objects from the data

---

## PAM: K-medoids Clustering

**Advantages of PAM:**

- More robust to outliers
- Works with any distance metric (not just Euclidean)
- Medoid is an actual observation (more interpretable)

**Disadvantages:**

- Computationally more expensive than K-means
- Slower for large datasets

---

## PAM in R

``` r
library(cluster)

# Perform PAM clustering
pam_result <- pam(iris_data, k = 3)

# View medoids (actual data points)
pam_result$medoids

# Compare with K-means
table(PAM = pam_result$clustering, Kmeans = kmeans_result$cluster)

# Silhouette plot
plot(silhouette(pam_result), main = "Silhouette Plot for PAM Clustering")
```

**Medoids:** The "most representative" points within each cluster

---

## PAM in R

```
##    Kmeans
## PAM  1  2  3
##   1  0  0 50
##   2 62  0  0
##   3  0 38  0
```

<img src="Kmeans_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

**Medoids:** The "most representative" points within each cluster

---

## Fuzzy K-means

**Extension of K-means with soft assignments:**

**Standard K-means:**

- Each point belongs to exactly one cluster (hard assignment)

**Fuzzy K-means:**

- Each point has a **probability** of belonging to each cluster (soft assignment)
- Membership values `\(\mu_{ij}\)` indicate degree of belonging to cluster `\(j\)`

---

## Fuzzy K-means

**Algorithm:**

1. **Initialize:** Choose `\(k\)` cluster centers
2. **Repeat until convergence:**
   - **Assignment:** Calculate the probability of each point belonging to each cluster
   - **Update:** Recalculate cluster centers using these probabilities

**Membership constraint:** `\(\sum_{j=1}^k \mu_{ij} = 1\)` for each point `\(i\)`

---

## Fuzzy K-means: Mathematical Detail

**Objective function:**

`$$\min \sum_{i=1}^n \sum_{j=1}^k \mu_{ij}^m ||x_i - c_j||^2$$`

where:

- `\(\mu_{ij}\)` is the degree of membership of `\(x_i\)` in cluster `\(j\)`
- `\(m\)` is the fuzziness parameter (typically `\(m = 2\)`)
- Larger values of `\(m\)` make clusters more fuzzy

---

## Fuzzy K-means

**Membership update:**

`$$\mu_{ij} = \frac{1}{\sum_{l=1}^k \left(\frac{||x_i - c_j||}{||x_i - c_l||}\right)^{2/(m-1)}}$$`

**Relationship:** Similar to the EM algorithm and Gaussian mixture models

**When to use:**

- Overlapping clusters
- Uncertain cluster boundaries
- Need probability of cluster membership

.small[https://www.kaggle.com/code/ysthehurricane/fuzzy-c-means-clustering-tutorial-r/code ]

---

## Fuzzy K-means in R

``` r
library(e1071)

# Perform fuzzy clustering
fuzzy_result <- cmeans(iris_data, centers = 3, m = 2)

# View membership probabilities (first 3 samples)
head(fuzzy_result$membership, 3)
```

```
##                1           2         3
## [1,] 0.001072020 0.002304388 0.9966236
## [2,] 0.007498397 0.016650860 0.9758507
## [3,] 0.006414869 0.013760382 0.9798247
```

``` r
# Assign to cluster with highest membership
fuzzy_clusters <- apply(fuzzy_result$membership, 1, which.max)

# Compare with hard K-means
table(Fuzzy = fuzzy_clusters, Kmeans = kmeans_result$cluster)
```

```
##      Kmeans
## Fuzzy  1  2  3
##     1  2 38  0
##     2 60  0  0
##     3  0  0 50
```

---

## K-means vs Fuzzy K-means: Visual Comparison

``` r
library(e1071)
par(mfrow = c(1, 2))

# K-means (hard clustering)
plot(iris_data[, 1:2], col = kmeans_result$cluster, pch = 19,
     main = "K-means (Hard Clustering)")
points(kmeans_result$centers[, 1:2], col = 1:3, pch = 8, cex = 2)

# Fuzzy K-means (soft clustering - colored by max membership)
fuzzy_result <- cmeans(iris_data, centers = 3, m = 2)
fuzzy_clusters <- apply(fuzzy_result$membership, 1, which.max)
plot(iris_data[, 1:2], col = fuzzy_clusters, pch = 19,
     main = "Fuzzy K-means (Soft Clustering)")
points(fuzzy_result$centers[, 1:2], col = 1:3, pch = 8, cex = 2)
```

**Difference:** Fuzzy allows for uncertainty at cluster boundaries

---

## K-means vs Fuzzy K-means: Visual Comparison

<img src="Kmeans_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

**Difference:** Fuzzy allows for uncertainty at cluster boundaries

---

## Practical Considerations

**Best practices for K-means:**

1. **Standardize data** - especially when features have different scales
2. **Use multiple initializations** - set `nstart = 25` or higher
3. **Try different K values** - use the elbow or silhouette method
4. **Check cluster quality** - examine within/between cluster variation
5. **Validate results** - use biological knowledge or cross-validation
6. **Consider alternatives** if:
   - Clusters are non-spherical → DBSCAN
   - Hierarchical structure exists → Hierarchical clustering
   - Overlapping clusters → Fuzzy K-means or GMM

<!--
## K-means for Genomics Data

**Gene expression clustering:**

``` r
# Assume expr_matrix is gene expression data (genes × samples)
# Transpose to cluster samples
expr_transposed <- t(expr_matrix)

# Standardize
expr_scaled <- scale(expr_transposed)

# K-means clustering of samples
set.seed(42)
sample_clusters <- kmeans(expr_scaled, centers = 3, nstart = 25)

# Visualize with PCA
pca_result <- prcomp(expr_scaled)
plot(pca_result$x[, 1:2], col = sample_clusters$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Sample Clusters (K-means on PCA)")
```

**Applications:** Sample classification, identifying patient subtypes, QC
-->

---

## Summary

**K-means clustering:**

- Partitions data into `\(K\)` non-overlapping clusters
- Minimizes within-cluster variation
- Iterative algorithm: assignment → update → repeat
- Always converges, but to a local optimum

**Key considerations:**

- Choose appropriate `\(K\)` (elbow, silhouette methods)
- Use multiple random starts (`nstart`)
- Standardize data when needed (see the example on the next slide)
- Check if assumptions are met (spherical clusters, similar sizes)

<!--**Variants:**

- **PAM (K-medoids):** More robust, uses actual data points
- **Fuzzy K-means:** Soft assignments with membership probabilities

**Use when:** Need simple, fast clustering with well-separated, spherical groups-->
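
---

## Standardization: A Quick Check

To make the "standardize data when needed" advice concrete, here is a brief illustrative sketch (object names such as `km_raw` and `km_scaled` are not part of the examples above): fit K-means on the raw and on the standardized iris measurements, then cross-tabulate the two partitions to see whether scaling changes any assignments.

``` r
# Compare K-means on raw vs. standardized features
set.seed(123)
km_raw    <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km_scaled <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Rows split across several columns indicate observations that moved clusters after scaling
table(Raw = km_raw$cluster, Scaled = km_scaled$cluster)
```

The effect is largest when features are on very different scales, because variables with large numeric ranges otherwise dominate the Euclidean distance.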