Supervised Learning: Methods for Analyzing Gene Expression Data

class: center, middle, inverse, title-slide

.title[
# Supervised Learning: Methods for Analyzing Gene Expression Data
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-10-08
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Linear regression overview

We measure tumor grade and expression of gene 1. We are interested in:

1. How useful is the expression of gene 1 for predicting tumor grade? Measured by `$R^2$`
2. Could that relationship be due to chance? Measured by the `$p$`-value
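
As a minimal sketch of where these two numbers come from in R, the block below fits a simple linear regression and extracts `$R^2$` and the `$p$`-value of the overall F-test. The data are simulated, and the names `gene1`, `grade`, and `fit` are illustrative assumptions rather than part of any real dataset.

```r
# Simulated data (assumption): tumor grade loosely driven by gene 1 expression
set.seed(1)
gene1 <- rnorm(30, mean = 8, sd = 1)
grade <- 0.5 * gene1 + rnorm(30, sd = 0.5)

fit <- lm(grade ~ gene1)
fit_summary <- summary(fit)

# R^2: proportion of the variance in grade explained by gene1
r_squared <- fit_summary$r.squared

# p-value of the overall F-test: F value, numerator df, denominator df
f_stat <- fit_summary$fstatistic
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))
```

With a single predictor, this F-test `$p$`-value is the same as the `$p$`-value of the t-test on the slope reported by `summary(fit)`.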

---
## T-test

The goal of a t-test is to compare the means of two groups and determine whether they differ significantly.
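
A minimal sketch of a two-sample t-test in R, using simulated expression values. The variable names and the choice of `var.equal = TRUE` (the classic pooled-variance test, which is the version that maps directly onto a linear model) are assumptions made for illustration.

```r
# Simulated data (assumption): 4 measurements per group
set.seed(2)
expr <- c(rnorm(4, mean = 5), rnorm(4, mean = 7))
group <- factor(rep(c("group1", "group2"), each = 4))

# Pooled-variance two-sample t-test
tt <- t.test(expr ~ group, var.equal = TRUE)

tt$statistic  # t statistic
tt$parameter  # degrees of freedom: n - 2
tt$p.value    # two-sided p-value
```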

---
## T-test in terms of linear regression

- Calculate the overall mean
- Calculate the sum of squared residuals around the mean `$SS_{mean}$`

---
## T-test in terms of linear regression

- Fit a horizontal line to each group separately (the group mean is the least-squares fit to that group's data)
- Sum the squared residuals around each group's own mean, across both groups, to get `$SS_{fit}$`
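
The two sums of squares can be sketched in a few lines of R. The expression values below are made up for illustration; `ave()` replaces each measurement with the mean of its own group.

```r
# Made-up expression values (assumption): 4 measurements per group
expr <- c(4.8, 5.1, 5.3, 4.9, 6.9, 7.2, 7.0, 7.4)
group <- rep(c(1, 2), each = 4)

# Sum of squared residuals around the overall mean
ss_mean <- sum((expr - mean(expr))^2)

# Sum of squared residuals around each group's own mean
group_means <- ave(expr, group)
ss_fit <- sum((expr - group_means)^2)
```

When the group means differ, `ss_fit` is smaller than `ss_mean`, because each group is described by its own mean rather than by a single shared one.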

---
## T-test in terms of linear regression

- Combine the two lines into a single equation.

- This makes the steps for computing the F-statistic exactly the same as for regression

``` r
# Combined equation applied to the measurements in the first group
y_11 = 1 * mu_1 + 0 * mu_2 + residual_11
y_12 = 1 * mu_1 + 0 * mu_2 + residual_12
y_13 = 1 * mu_1 + 0 * mu_2 + residual_13
y_14 = 1 * mu_1 + 0 * mu_2 + residual_14
# Combined equation applied to the measurements in the second group
y_21 = 0 * mu_1 + 1 * mu_2 + residual_21
y_22 = 0 * mu_1 + 1 * mu_2 + residual_22
y_23 = 0 * mu_1 + 1 * mu_2 + residual_23
y_24 = 0 * mu_1 + 1 * mu_2 + residual_24
```

---
## Design matrix

- 1's and 0's serve as "switches" for each group.

- This is our design matrix `$X$`, one column per group.

```
1  0
1  0
1  0
1  0
0  1
0  1
0  1
0  1
```
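
In R, this matrix does not have to be typed by hand. As a small sketch (the factor name `group` and its level labels are assumptions), `model.matrix()` with the intercept removed gives one indicator column per group:

```r
# Group labels (assumption): 4 measurements per group
group <- factor(rep(c("group1", "group2"), each = 4))

# "~ 0 + group" drops the intercept, giving one 0/1 column per group
X <- model.matrix(~ 0 + group)
X
```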

---
## Connecting Design Matrices and F-statistics

We can express the model as

`$$Y = X \mu + \varepsilon$$`

Then calculate

`$$F = \frac{(SS_{\text{mean}} - SS_{\text{fit}})/(p_{\text{fit}} - p_{\text{mean}})}{SS_{\text{fit}}/(n - p_{\text{fit}})}$$`

where:

* `$p_{\text{mean}} = 1$` (overall mean)
* `$p_{\text{fit}} = 2$` (two group means)

This generalizes naturally to **ANOVA** when there are `$k > 2$` groups.
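
The following sketch computes this F-statistic by hand for two groups and compares it with the pooled-variance t-test. The expression values are made up, and all variable names are illustrative assumptions.

```r
# Made-up expression values (assumption): 4 measurements per group
expr <- c(4.8, 5.1, 5.3, 4.9, 6.9, 7.2, 7.0, 7.4)
group <- factor(rep(c("group1", "group2"), each = 4))
n <- length(expr)

# Residual sums of squares for the overall-mean model and the two-mean model
ss_mean <- sum((expr - mean(expr))^2)
ss_fit <- sum(residuals(lm(expr ~ group))^2)

p_mean <- 1  # one parameter: the overall mean
p_fit <- 2   # two parameters: one mean per group

f_stat <- ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
p_value <- pf(f_stat, p_fit - p_mean, n - p_fit, lower.tail = FALSE)

# The pooled-variance t-test on the same data
tt <- t.test(expr ~ group, var.equal = TRUE)
```

For two groups, `f_stat` equals the square of the t statistic and the two `$p$`-values coincide, which is why the t-test can be treated as a special case of the linear model.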

---
## A more common design matrix

```
1  0
1  0
1  0
1  0
1  1
1  1
1  1
1  1
```

- In this setup, all measurements contribute to the mean for the first group

- But only the measurements from the second group contribute to the _difference_ between the first and the second group

- So the second column serves as a switch for the offset from the mean for the second group
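
This is also the parameterization R uses by default for a factor. In the sketch below (made-up data, illustrative names), `model.matrix(~ group)` produces the matrix above, and the fitted coefficients are the mean of the first group and the difference between the two group means.

```r
# Made-up expression values (assumption): 4 measurements per group
expr <- c(4.8, 5.1, 5.3, 4.9, 6.9, 7.2, 7.0, 7.4)
group <- factor(rep(c("group1", "group2"), each = 4))

# Column 1: intercept (all 1's); column 2: switch for the group 2 offset
X <- model.matrix(~ group)

fit <- lm(expr ~ group)
coef(fit)  # first: mean of group 1; second: mean of group 2 minus mean of group 1
```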

---
## A more common design matrix

``` r
# Combined equation applied to the measurements in the first group
y_11 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_11
y_12 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_12
y_13 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_13
y_14 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_14
# Combined equation applied to the measurements in the second group
y_21 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_21
y_22 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_22
y_23 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_23
y_24 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_24
```

- Same way to calculate `$SS_{mean}$` and `$SS_{fit}$`

- Same number of equations

- Same number of parameters
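
A quick way to check this equivalence is to fit both parameterizations and compare their residual sums of squares. The sketch below uses made-up data, and the object names are assumptions.

```r
# Made-up expression values (assumption): 4 measurements per group
expr <- c(4.8, 5.1, 5.3, 4.9, 6.9, 7.2, 7.0, 7.4)
group <- factor(rep(c("group1", "group2"), each = 4))

fit_means <- lm(expr ~ 0 + group)  # one mean per group
fit_offset <- lm(expr ~ group)     # group 1 mean plus an offset for group 2

ss_fit_means <- sum(residuals(fit_means)^2)
ss_fit_offset <- sum(residuals(fit_offset)^2)

# Both models have two parameters and identical fitted values
length(coef(fit_means))
length(coef(fit_offset))
```

Only the meaning of the coefficients changes: the offset in `fit_offset` is the difference between the two means estimated by `fit_means`.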

---
## Power of design matrices

- Say that, in addition to the group 1 and group 2 labels, you also have the age of each sample.

---
## Power of design matrices

- We need to expand our model to `$y = group1\_intercept + group2\_offset + slope \times age$`. This is the full model

- In our design matrix, the first column of 1's means that both lines have a y-intercept; its coefficient is the intercept for group 1

- The second column switches on the intercept offset for the group 2 measurements

- The third column holds the age of each sample; its coefficient is the slope shared by both groups

```
1  0  1
1  0  2
1  0  3
1  0  4
1  1  1
1  1  2
1  1  3
1  1  4
```
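
As a sketch, the same matrix can be built in R by adding the covariate to the model formula. The variable names `group` and `age` are assumptions; the ages 1 to 4 simply mirror the third column above.

```r
# Group labels and ages (assumption): same layout as the matrix above
group <- factor(rep(c("group1", "group2"), each = 4))
age <- rep(1:4, times = 2)

# Columns: intercept, group 2 offset, age
X <- model.matrix(~ group + age)
X
```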

---
## Power of design matrices

- We need to expand our model to `$y = group1\_intercept + group2\_offset + slope \times age$`. This is the full model

- In our design matrix, the first column of 1's means that both lines have a y-intercept; its coefficient is the intercept for group 1

- The second column switches on the intercept offset for the group 2 measurements

- The third column holds the age of each sample; its coefficient is the slope shared by both groups

- Compare with the simple model `$y = overall\_mean$`

- Calculate how much better the full model fits: `$F = \frac{ (SS_{simple} - SS_{full}) / (p_{full} - p_{simple}) }{SS_{full} / (n - p_{full})}$`
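
The comparison can be sketched in R on simulated data. The effect sizes, noise level, and object names below are assumptions chosen for illustration. The denominator divides the residual sum of squares of the full model by its residual degrees of freedom, `$n - p_{full}$`.

```r
# Simulated data (assumption): group offset 1.5, age slope 0.8
set.seed(3)
group <- factor(rep(c("group1", "group2"), each = 4))
age <- rep(1:4, times = 2)
y <- 2 + 1.5 * (group == "group2") + 0.8 * age + rnorm(8, sd = 0.3)

simple <- lm(y ~ 1)          # overall mean only
full <- lm(y ~ group + age)  # group 1 intercept, group 2 offset, age slope

ss_simple <- sum(residuals(simple)^2)
ss_full <- sum(residuals(full)^2)
n <- length(y)
p_simple <- 1
p_full <- 3

f_stat <- ((ss_simple - ss_full) / (p_full - p_simple)) /
  (ss_full / (n - p_full))

# The same comparison using R's built-in nested-model test
cmp <- anova(simple, full)
```

The hand-computed `f_stat` matches the F value in the second row of `cmp`.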

---
## Batch effect

- Suppose you have measurements from two labs

---
## Batch effect

- First, add a term for the mean of the normal group measured in the first lab

---
## Batch effect

- Second, add a term for the offset in measurements by the second lab

---
## Batch effect

- Third, add a term for the offset of the tumor measurements

---
## Batch effect

- The final model:
`$$y = lab1\_normal\_mean + lab2\_offset + difference_{tumor - normal}$$`
and the design matrix, with one column per term in that order:

```
1  0  0
1  0  0
1  0  0
1  0  0
1  0  1
1  0  1
1  0  1
1  0  1
1  1  0
1  1  0
1  1  0
1  1  0
1  1  1
1  1  1
1  1  1
1  1  1
```
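
A sketch of how this matrix could be generated in R. The factor names `lab` and `status` and their level labels are assumptions; the row order (lab 1 normal, lab 1 tumor, lab 2 normal, lab 2 tumor, four samples each) follows the matrix above.

```r
# Sample annotations (assumption): 4 samples per lab-by-status combination
lab <- factor(rep(c("lab1", "lab2"), each = 8))
status <- factor(rep(rep(c("normal", "tumor"), each = 4), times = 2),
                 levels = c("normal", "tumor"))

# Columns: lab 1 normal mean (intercept), lab 2 offset, tumor - normal difference
X <- model.matrix(~ lab + status)
X
```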

---
## Batch effect

- The final model:
`$$y = lab1\_normal\_mean + lab2\_offset + difference_{tumor - normal}$$`

- Does the lab effect matter?

- Compare the final model with a simpler one that drops the lab term, `$y = lab1\_normal\_mean + difference_{tumor - normal}$`, using the F-statistic
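
One way to sketch this comparison in R is a nested-model F-test with `anova()`. The data are simulated, and the lab offset, tumor effect, noise level, and object names are all assumptions.

```r
# Simulated data (assumption): lab 2 offset of 1, tumor - normal difference of 2
set.seed(4)
lab <- factor(rep(c("lab1", "lab2"), each = 8))
status <- factor(rep(rep(c("normal", "tumor"), each = 4), times = 2),
                 levels = c("normal", "tumor"))
y <- 5 + 1 * (lab == "lab2") + 2 * (status == "tumor") + rnorm(16, sd = 0.5)

full <- lm(y ~ lab + status)  # with the lab (batch) term
reduced <- lm(y ~ status)     # without the lab term

# F-test for whether adding the lab term reduces the residual sum of squares
cmp <- anova(reduced, full)
cmp
```

Because the two models differ by a single parameter, the F value in `cmp` equals the squared t statistic of the lab coefficient in `summary(full)`.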