Multidimensional scaling (MDS)

class: center, middle, inverse, title-slide

.title[
# Multidimensional scaling (MDS)
]
.subtitle[
## Applications in Genomics
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-12-01
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## What is Multidimensional Scaling (MDS)?

- MDS aims to identify **latent variables** underlying inter-object similarity measures

- **Reduce dimensionality** of data in a *non-linear* fashion

- Reproduce **high-dimensional relationships** on a *lower-dimensional* display

---
## ## What is Multidimensional Scaling (MDS)?

**Key Idea:**

- Start with a **distance matrix** (not raw data)

- Find coordinates in lower dimensions that best reproduce these distances

- Similar objects appear close together, dissimilar objects far apart

---
## Multidimensional Scaling Types

- **Classical MDS (cMDS):** Also called Principal Coordinates Analysis (PCoA)

- **Metric MDS:** Preserves actual distance values

- **Non-metric MDS:** Preserves only the rank order of distances

---
## MDS vs PCA vs SOM

**Comparison:**

| Method | Preserves | Interpretation |
|:-------|:----------|:---------------|
| **PCA** | Covariance of original data | Linear projection maximizing variance |
| **SOM** | Topology (local neighborhood) | Nearby points represent similar items |
| **MDS** | Metric (pairwise distances) | Maintains rank order of dissimilarities |

---
## MDS vs PCA vs SOM

**Similarities:**

- Both PCA and MDS are dimensionality reduction techniques

- Both can be used for visualization

- Classical MDS and PCA give identical results on Euclidean distances

**Key difference:** MDS is **non-linear** dimensionality reduction, while PCA is linear

---
## When to Use MDS?

**Use MDS when:**
- You have a distance/dissimilarity matrix but not raw data
- Working with non-Euclidean distances (e.g., genetic distances, sequence alignments)
- Distance relationships are more meaningful than variance
- Need to visualize phylogenetic relationships
- Comparing complex objects (text, images, proteins)

**Genomics applications:**
- Population structure analysis with genetic distances
- Comparing gene expression profiles with correlation distances
- Microbiome community composition
- Protein structure comparisons

---
## Classical MDS: The Math

Given a **distance matrix** `$D$` with elements `$d_{ij}$` (distance between objects `$i$` and `$j$`):

1. **Create squared distance matrix:** `$D^{(2)}$` with elements `$d_{ij}^2$`

2. **Double centering:** Apply centering transformation
`$B = -\frac{1}{2}HD^{(2)}H$`
where `$H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$` (centering matrix)

3. **Eigendecomposition:** Solve `$B = V\Lambda V^T$`

4. **Coordinates:** First `$k$` dimensions given by `$X = V_k\Lambda_k^{1/2}$`

**Result:** Classical MDS with Euclidean distances = PCA on original data

---
## MDS Basic Algorithm

**Iterative procedure for metric/non-metric MDS:**

1. Obtain and order the `$M$` pairs of similarities

2. Choose a configuration in `$q$` dimensions
   - Compute inter-item distances and reference numbers
   - Minimize **Kruskal's stress**

3. Move points to improve configuration

4. Repeat until **minimum stress** is achieved

This is the general algorithm used by functions like `isoMDS()`

---
## Example 1: US Cities

Classic MDS example - reconstruct city locations from distances

``` r
# Built-in dataset: road distances between US cities
data(UScitiesD)
UScitiesD

# Perform classical MDS (2 dimensions)
mds_cities <- cmdscale(UScitiesD, k = 2)

# Plot the result
plot(mds_cities[, 1], mds_cities[, 2], 
     type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2",
     main = "MDS of US Cities")
text(mds_cities[, 1], mds_cities[, 2], 
     labels = rownames(mds_cities), cex = 0.8)
```

**Note:** Cities arranged approximately in their geographic locations!

---
## Example 1: Output

MDS reconstructs approximate geographic positions from distances alone!

---
## Example 2: Iris Data

Compare MDS with PCA on the classic iris dataset

``` r
# Load iris data
data(iris)
iris_data <- iris[, 1:4]  # Just the measurements

# Compute Euclidean distance matrix
iris_dist <- dist(iris_data, method = "euclidean")

# Perform classical MDS
mds_iris <- cmdscale(iris_dist, k = 2)

# For comparison: PCA
pca_iris <- prcomp(iris_data, center = TRUE, scale. = FALSE)
```

---
## Example 2: MDS vs PCA Plots

``` r
par(mfrow = c(1, 2))

# MDS plot
plot(mds_iris[, 1], mds_iris[, 2], col = iris$Species, pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2", main = "Classical MDS")
legend("topright", legend = levels(iris$Species), 
       col = 1:3, pch = 19, cex = 0.7)

# PCA plot
plot(pca_iris$x[, 1], pca_iris$x[, 2], col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA")
```

**Result:** Nearly identical! (Classical MDS = PCA for Euclidean distances)

---
## Example 3: Gene Expression Data

Using correlation-based distance for gene expression

``` r
# Simulate gene expression data
set.seed(123)
n_samples <- 30
n_genes <- 100

# Create data with 3 groups
expr_data <- matrix(rnorm(n_samples * n_genes), 
                    nrow = n_samples, ncol = n_genes)

# Add group structure
expr_data[1:10, 1:30] <- expr_data[1:10, 1:30] + 2
expr_data[11:20, 31:60] <- expr_data[11:20, 31:60] + 2
expr_data[21:30, 61:90] <- expr_data[21:30, 61:90] + 2

groups <- rep(c("Group1", "Group2", "Group3"), each = 10)
```

---
## Example 3: Computing Distance

``` r
# Correlation-based distance
# Distance = 1 - correlation
cor_matrix <- cor(t(expr_data))
cor_dist <- as.dist(1 - cor_matrix)

# Perform MDS
mds_expr <- cmdscale(cor_dist, k = 2)

# Plot
plot(mds_expr[, 1], mds_expr[, 2], 
     col = as.factor(groups), pch = 19, cex = 1.5,
     xlab = "Dimension 1", ylab = "Dimension 2",
     main = "MDS of Gene Expression (Correlation Distance)")
legend("topright", legend = unique(groups), 
       col = 1:3, pch = 19)
```

**Why correlation distance?** Gene expression patterns often show correlation structure rather than Euclidean distance structure

---
## Example 3: Visualization

![](MDS_files/figure-html/unnamed-chunk-7-1.png)

Clear separation of three expression groups!

---
## Assessing MDS Quality: GOF

**Goodness-of-Fit (GOF):** How well do low-dimensional distances match original distances?

``` r
# cmdscale with eig=TRUE returns eigenvalues
mds_result <- cmdscale(iris_dist, k = 2, eig = TRUE)

# Proportion of variance explained by first k dimensions
eigenvalues <- mds_result$eig[mds_result$eig > 0]
variance_explained <- eigenvalues / sum(eigenvalues)

# First two dimensions
cat("Variance explained by first 2 dimensions:", 
    round(sum(variance_explained[1:2]) * 100, 2), "%\n")
```

```
## Variance explained by first 2 dimensions: 97.77 %
```

**Interpretation:** Similar to PCA's variance explained - higher is better

---
## Kruskal's Stress

**Stress:** Measures how well distances on the MDS map correspond to input distances

`$\text{Stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}}$`

**Procedure:**
1. Start with original distances `$d_{ij}$`
2. Compute fitted distances `$\hat{d}_{ij}$` in low-dimensional space
3. Subtract, square, and sum the differences
4. Take the square root
5. Divide by a scaling factor (sum of squared original distances)

**Lower stress → better fit**

---

## Stress: Another Quality Metric

**Stress:** Measures discrepancy between original and reproduced distances

``` r
library(MASS)

# Non-metric MDS with stress calculation
nmds_iris <- isoMDS(iris_dist, k = 2)

# Print stress value
cat("Stress:", round(nmds_iris$stress, 2), "\n")
```

**Stress interpretation:**
- < 5%: Excellent
- 5-10%: Good
- 10-20%: Fair
- \> 20%: Poor

Lower stress = better fit

---

## Non-metric MDS Example

Non-metric MDS preserves **rank order** of distances, not exact values

``` r
library(MASS)

# Compute non-metric MDS
set.seed(42)
nmds_iris <- isoMDS(iris_dist, k = 2)

# Plot
plot(nmds_iris$points[, 1], nmds_iris$points[, 2], 
     col = iris$Species, pch = 19,
     xlab = "NMDS1", ylab = "NMDS2",
     main = paste("Non-metric MDS (Stress =", 
                  round(nmds_iris$stress, 2), ")"))
legend("topright", legend = levels(iris$Species), 
       col = 1:3, pch = 19, cex = 0.8)
```

**Use case:** When exact distance values are unreliable but ordering is meaningful

---

## Different Distance Metrics

MDS works with any distance metric!

``` r
# Different distance metrics
euclidean_dist <- dist(iris_data, method = "euclidean")
manhattan_dist <- dist(iris_data, method = "manhattan")
canberra_dist <- dist(iris_data, method = "canberra")

# MDS on each
mds_euclidean <- cmdscale(euclidean_dist, k = 2)
mds_manhattan <- cmdscale(manhattan_dist, k = 2)
mds_canberra <- cmdscale(canberra_dist, k = 2)

# Plot all three for comparison
par(mfrow = c(1, 3))
plot(mds_euclidean, col = iris$Species, main = "Euclidean")
plot(mds_manhattan, col = iris$Species, main = "Manhattan")
plot(mds_canberra, col = iris$Species, main = "Canberra")
```

Choice of distance metric can significantly affect results!

---

## MDS in Practice: Key Functions

**Classical MDS:**

``` r
cmdscale(d, k = 2, eig = TRUE)
```
- `d`: distance matrix
- `k`: number of dimensions
- `eig`: return eigenvalues for GOF

**Non-metric MDS:**

``` r
library(MASS)
isoMDS(d, k = 2)
```

**Creating distance matrices:**

``` r
dist(x, method = "euclidean")  # Euclidean, manhattan, etc.
as.dist(1 - cor(t(x)))         # Correlation distance
```

---

## MDS Advantages

- Works with **any distance/dissimilarity measure**

- No assumption about data distribution

- Intuitive visualization of complex relationships

- Useful when you only have distances (no raw data)

- Classical MDS is computationally efficient

- Can handle missing distances (with appropriate methods)

- Widely used in ecology, psychology, marketing, genomics

---

## MDS Limitations

- Requires pre-computed distance matrix

- Can be sensitive to outliers

- No inherent variable loadings (unlike PCA)

- Interpretation of axes can be difficult

- Non-metric MDS can be slow for large datasets

- Different distance metrics can give very different results

- May need many dimensions to adequately represent data

---

## MDS vs Other Methods

**When to use:**
- **MDS:** Non-Euclidean distances, focus on between-object relationships
- **PCA:** Raw data, focus on variance, need interpretable loadings
- **t-SNE:** Local structure, clusters, non-linear relationships
- **UMAP:** Similar to t-SNE but faster, preserves global structure better

**Genomics context:**
- Genetic distances → MDS
- Gene expression variance → PCA
- Single-cell clustering → t-SNE/UMAP
- Population structure → MDS or PCA

---

## Summary

**Key Takeaways:**

1. MDS visualizes objects based on their pairwise distances

2. Classical MDS = PCA when using Euclidean distances

3. Works with any distance metric (correlation, genetic distance, etc.)

4. Quality assessed by variance explained or stress

5. Non-metric MDS useful when only distance ranks are reliable

6. Widely applicable in genomics for population structure and expression analysis

**R Functions:** `cmdscale()`, `isoMDS()`, `dist()`

---

## References and Resources

- **R Documentation:** `?cmdscale`, `?isoMDS`
- **Book:** Modern Multidimensional Scaling (Borg & Groenen)
- **Packages:** 
  - `stats::cmdscale` (classical MDS)
  - `MASS::isoMDS` (non-metric MDS)
  - `vegan::metaMDS` (ecological data)
  - `smacof` (advanced MDS methods)

**Try it yourself:**

``` r
# Quick example
data(eurodist)
mds <- cmdscale(eurodist, k = 2)
plot(mds, type = "n")
text(mds, labels = rownames(mds))
```