DESeq2: Statistical Framework for RNA-seq Analysis

class: center, middle, inverse, title-slide

.title[
# DESeq2: Statistical Framework for RNA-seq Analysis
]
.subtitle[
## Normalization, Generalized Linear Models, and Hypothesis Testing
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-10-22
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## DESeq2: Differential expression for count data

Differential expression analysis of RNA-seq data in DESeq2 follows the pipeline:

1. Normalization for library size, 
2. Estimating dispersion per gene, 
3. Fitting a Negative Binomial GLM, 
4. Hypothesis testing (Wald or LRT), 
5. Shrinkage of log-fold changes.

DESeq2 shares information across genes via Empirical Bayes shrinkage to stabilize dispersion and fold-change.

---
## Library size normalization

* **Size factors:** DESeq2 computes one size factor `$s_j$` per sample to normalize for sequencing depth and RNA composition.
  Counts are modeled as Negative Binomial (NB)
  `$K_{ij} \sim \text{NB}(\mu_{ij}, \alpha_i)$` with `$\mu_{ij} = s_j , q_{ij}$`

- We don’t know `$s_j$` or `$q_{ij}$` — both are unknown.
But we can estimate `$s_j$` **up to a constant multiplier** because *ratios* between samples are meaningful.

- The size factors `$s_j$` are estimated by the **median-of-ratios** method: for each sample, take the median of gene-wise ratios of counts to the geometric mean across samples.

---
## The median-of-ratios method

- **Compute a “pseudo-reference” expression** for each gene
   * For each gene `$i$`, compute its **geometric mean** across all samples: `$g_i = \left(\prod_{j=1}^m K_{ij} \right)^{1/m}$`.
     Genes with zero counts  are ignored.

- **Compute ratios** for each sample
   * For each sample `$j$`, compute the ratio of that gene’s count to the pseudo-reference:
     `$r_{ij} = \frac{K_{ij}}{g_i}$`.
     Each ratio tells how much higher/lower gene `$i$` is in sample `$j$` compared to the overall geometric mean.

- **Take the median ratio across genes** for each sample
   * The size factor `$s_j$` is defined as:
     `$s_j = \text{median}_i \left(r_{ij} \right) = \text{median}_i \left(\frac{K_{ij}}{g_i} \right)$`.
     This median summarizes how sample `$j$`’s expression levels compare globally to the pseudo-reference.

---
## Effect of normalization

* The **geometric mean** represents a typical value across samples and handles multiplicative differences well (counts scale multiplicatively with sequencing depth).

* The **median** provides robustness to a few genes with very large or very small fold-changes (outliers).

* The normalized counts are computed as: `$K_{ij}^{\text{norm}} = \frac{K_{ij}}{s_j}$`.

After this, differences between samples reflect **biological variation** rather than sequencing depth or compositional bias.

``` r
# Example: estimate size factors in DESeq2
dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~ condition)
dds <- estimateSizeFactors(dds)
sizeFactors(dds)  # shows the median-ratio size factors
```

---
## Dispersion estimation (Empirical Bayes shrinkage)

- **Gene-wise dispersion (`$\alpha_i$`):** For each gene `$i$`, DESeq2 computes a maximum-likelihood estimate (MLE) of the dispersion `$\alpha_i$` using that gene’s counts. These estimates are noisy with few replicates

Remember DESeq2 models each gene’s counts as: `$K_{ij} \sim \text{NB}(\mu_{ij}, \alpha_i)$`, with `$\mathbb{E}[K_{ij}] = \mu_{ij}$`, `$\quad \mathrm{Var}(K_{ij}) = \mu_{ij} + \alpha_i \mu_{ij}^2$`.

* `$\mu_{ij}$` represents **Poisson noise** (sampling/technical variation).
* `$\alpha_i \mu_{ij}^2$` captures **biological variability** (extra-Poisson variance).

So `$\alpha_i$` controls how much the variance exceeds the mean. For each gene `$i$`:

1. Compute the fitted means `$\hat{\mu}_{ij}$` using the GLM (after normalization).
2. Find the dispersion `$\alpha_i$` that **maximizes the NB likelihood**: `$L(\alpha_i) = \prod_j f_{\text{NB}}(K_{ij} \mid \mu_{ij}, \alpha_i)$`.
3. This yields a **maximum-likelihood estimate (MLE)** `$\hat{\alpha}_i$`.

Formally, it’s the dispersion that makes the observed variability most probable under the NB assumption — not just the raw variance of the counts.

---
## Dispersion estimation (Empirical Bayes shrinkage)

- **Dispersion-mean trend:** DESeq2 then fits a smooth curve of dispersion versus mean count (the “trend”) across all genes. This captures the typical dependence of dispersion on expression strength.

- **Shrinkage (EB):** The gene-wise dispersions are shrunk toward the trend via an empirical Bayes approach. Intuitively, genes with little information (low count or few replicates) borrow strength from the overall trend. The final dispersion is the maximum a posteriori (MAP) estimate. The amount of shrinkage is data-driven: with more samples, shrinkage is weaker. (Outlier genes with very high dispersion may use the gene-wise estimate instead of being shrunk.)

``` r
# Example: estimate dispersions in DESeq2 (called internally by DESeq)
dds <- estimateDispersions(dds)  # fits gene-wise dispersions, trend, and MAP shrinkage
plotDispEsts(dds)               # visualize gene-wise (black), fitted (red), and MAP (blue) dispersions
```

---
## Negative binomial GLM model

- DESeq2 assumes counts `$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i)$` for gene `$i$` and sample `$j$`, with a log link function:
`$$\log(\mu_{ij}) = \log(s_j) + X_j \beta_i$$`

where

* `$s_j$` — size factor for sample `$j$`,
* `$X_j$` — design matrix row (e.g., indicators for condition or batch),
* `$\beta_i$` — gene-specific coefficient vector.

---
## Negative binomial GLM model

- DESeq2 assumes counts `$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i)$` for gene `$i$` and sample `$j$`, with a log link function:
`$$\log(\mu_{ij}) = \log(s_j) + X_j \beta_i$$`

- **Coefficients:** Typically, `$\beta_i$` includes an **intercept** (log mean expression), and one or more **condition effects** (`$\log_2$`-fold change between treatment and control). That is what gets estimated for each gene.

- **Flexibility:** More complex designs (covariates, interactions, blocking factors) are handled by adding terms to the **design formula** in `$X$`.

``` r
# Example: run DESeq (size factor + dispersion estimation + GLM fitting)
dds <- DESeq(dds)  # performs size factor normalization, dispersion estimation, and GLM fitting
```

---
## Hypothesis testing (Wald)

<!-- **Wald test (default):** DESeq2 tests significance of coefficients (e.g. log fold change) using Wald t-test.

- The estimate `$\hat{\beta}$` is divided by its standard error to form a z-statistic, compared to Normal(0,1).

- This tests `$H_0$`: `$\beta=0$` (no log fold change) unless a log-FC threshold is specified. P-values (for each gene) are adjusted by Benjamini-Hochberg FDR. -->

- **Wald test (default):** DESeq2 tests the significance of individual coefficients (e.g., log2 fold changes) in the fitted negative binomial GLM.

* For each gene and coefficient, DESeq2 estimates `$\hat{\beta}$` and `$SE(\hat{\beta})$`.

* The **Wald statistic** is computed as `$z = \frac{\hat{\beta} }{SE(\hat{\beta})}$` which is approximately `$N(0,1)$` under the null hypothesis `$H_0: \beta = 0$` (no log2 fold change between conditions).

* **Inference:** Two-sided p-values are derived from the normal distribution and adjusted for multiple testing (Benjamini–Hochberg FDR).

* **Use case:** The Wald test is suitable for simple two-group or contrast-based comparisons; for multi-factor designs, the likelihood ratio test (LRT) is preferred.

<!-- **Likelihood ratio test:**  Alternatively, DESeq2 can perform a likelihood ratio test ## Hypothesis testing (LRT)

(LRT) by comparing the full model to a reduced model (e.g. omitting a factor). Use `DESeq(dds, test="LRT", reduced=~)` for this. LRT is useful for multi-factor designs or testing any change in nested coefficients.

- **Independent filtering:** By default, DESeq2 filters out genes with very low mean counts before multiple testing to improve power (independent of the test statistic).

``` r
res <- results(dds)                # Wald test results (e.g. "condition_trt_vs_untrt")
resLRT <- results(dds, test="LRT", reduced=~ condition)  # LRT dropping the condition effect
```
-->

---
## Hypothesis testing (LRT)

* DESeq2 models may include multiple covariates. `$\text{full model: }~ \mu_{g,i} = f(\text{batch},,\text{condition},,\text{time},\ldots)$`. Consider `$\text{reduced model: }~ \mu_{g,i} = f(\text{batch},\text{time},\ldots) \quad (\text{omitting “condition”})$`

* The statistic is `$D = 2 \big( \log L_{\rm full} - \log L_{\rm reduced} \big)$` where `$L_{\rm full}$` is the likelihood of the full model and `$L_{\rm reduced}$` is the likelihood of the reduced model.

* Under the null hypothesis (that the extra terms in the full model do *not* improve fit), the statistic *D* is asymptotically `$\chi^2$`-distributed with degrees of freedom equal to the difference in the number of parameters between full and reduced models.

* Thus for each gene we get a p-value indicating: *“Does including the additional term(s) in the full model significantly improve the model fit?”*

``` r
dds <- DESeq(dds, test="LRT", reduced = ~ …)
```

---
## Shrinkage of log2 fold changes (moderated LFC)

- **Empirical Bayes shrinkage:** DESeq2 applies a zero-centered Normal prior on `$\log_2$` fold changes `$\beta$` to shrink noisy estimates toward zero. After initial GLM fits (MLEs), DESeq2 fits a Normal prior to the distribution of MLE `$\log_2$` fold changes, then computes maximum a posteriori (MAP) estimates. Genes with low information (high dispersion or low count) undergo stronger shrinkage.

- **Benefit:** Shrunk LFCs have lower variance, making gene ranking more reproducible. Wald tests use the shrunken `$\hat{\beta}$` and corresponding posterior standard error.

- **Adaptive shrinkage:** DESeq2 offers advanced shrinkage via the `lfcShrink()` function with methods “ashr” or “apeglm”. These use adaptive priors (Ashr by Stephens (2016) or the apeglm approach for more accurate shrinkage.

---
## Shrinkage of log2 fold changes (moderated LFC)

- **Usage:** For visualization or ranking, call `lfcShrink(dds, coef=..., type="ashr")` after running DESeq2; for results with threshold, you can directly use `results()` with `lfcThreshold`.

``` r
res <- results(dds) 
res_shrunk <- lfcShrink(dds, coef="condition_trt_vs_untrt", type="ashr")
```

---
## Independent filtering
  
- DESeq2 automatically removes genes with very low mean normalized counts **before multiple testing**, as these genes provide little power to detect differential expression.

- The filtering criterion is **independent of the test statistic**, ensuring that type I error is not inflated.

- By default, DESeq2 uses the mean of normalized counts as the filter statistic and chooses an optimal cutoff that **maximizes the number of rejections (discoveries)** at a given FDR threshold.