RNA-seq differential expression

class: center, middle, inverse, title-slide

.title[
# RNA-seq differential expression
]
.subtitle[
## Overview
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-10-16
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Data generative model for replicated RNA-seq

- Assume data are properly normalized.

- For a sample with M replicates, the counts for gene `$i$` replicate `$j$` is often modeled by following hierarchical model: `$Y_{i,j}|\lambda_i \sim Poisson(\lambda_i), \lambda_i \sim Gamma(\alpha,\beta$`)

- Marginally, the Gamma-Poisson compound distribution is Negative binomial. So the counts for a gene from multiple replicates is often modeled as Negative binomial: `$Y_{i,j} \sim NB(r,p)$`.

---
## Negative Binomial distribution

The Negative Binomial Distribution is a discrete probability distribution that models the number of events (often called "failures") that occur in a sequence of independent and identically distributed Bernoulli trials before a specified, fixed number of "successes" is reached.

It is a more general form of the **Geometric distribution**, which is the special case where the fixed number of successes is `$r=1$`.

The negative binomial distribution is defined by two parameters:

1.  `$r$`: The predetermined number of **successes** until the experiment stops (`$r > 0$`).
2.  `$p$`: The probability of **success** on each individual trial (`$0 \le p \le 1$`).

---
## NB Characteristics

* **Discrete:** The random variable (the number of failures, `$k$`) can only take on integer values (`$\{0, 1, 2, 3, \dots\}$`).

* **Independent Trials:** The outcome of one trial does not affect the outcome of any other trial.
  - Think of each read as an independent observation — each read is mapped to a gene without being influenced by where the previous read landed.

* **Fixed Success Probability:** The probability of success, `$p$`, is the same for every trial.
  - Each gene has its own underlying probability of generating a read (its expression level). This probability is fixed for a given condition or sample but differs across genes.

* **Fixed Successes:** The experiment continues until a fixed number of successes (`$r$`) is achieved. The total number of trials is therefore a random variable.

---
## Probability Mass Function (PMF)

The probability of having exactly `$k$` failures before the `$r$`-th success:

$$
P(X=k) = \binom{k+r-1}{r-1} p^r (1-p)^k \quad \text{for } k=0, 1, 2, \dots
$$

* The term `$\binom{k+r-1}{r-1}$` (the binomial coefficient) counts the number of ways to arrange the `$k$` failures (extra variability) and `$r-1$` successes (dispersion parameter) in the first `$k+r-1$` trials.

* The last trial (the `$(k+r)$`-th trial) must be the `$r$`-th success, which has probability `$p$` (the underlying probability of observing a read from that gene).

* The overall probability of the specific sequence is `$p^r (1-p)^k$`.
.small[ https://www.youtube.com/watch?v=BPlmjp2ymxw ]

---
## A little more about the NB distribution

- NB is over-dispersed Poisson  
    - Poisson: `$var=\mu$`
    - NB: `$var=\mu + \mu^2\phi$`
-
Dispersion parameter `$\phi$` approximates the squared coefficient of variation: `$\phi=\frac{var - \mu}{\mu^2} \approx \frac{var}{\mu^2}$`

- Dispersion `$\phi$` represents the biological variance, so shrinkage should be done for `$\phi$`

- NB distribution can be parameterized by mean and dispersion, but there's no conjugate prior for `$\phi$`

---
## Warnings about the use of negative binomial

Some transcript data does NOT fit the negative binomial model

- Transcripts which have zero counts in some samples and moderate counts in others from the SAME treatment

- Transcripts that are particularly prevalent in some tissues

- These that have high dispersion - possibly should be analyzed separately but not filtered

**RPKM/FPKM and TPM are not supported statistically for differential expression analysis - use counts only!**

---
## Differential expression analysis

- `limma` - Linear Models for Microarray Data

- `voom` - variance modeling at the observational level transformation. Uses the variance of genes to create weights for use in linear models

- After `voom` transformation, the RNA-seq data can be analyzed using `limma`

.small[ https://bioconductor.org/packages/limma/

Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts.” Genome Biology 15, no. 2 (2014): R29. https://doi.org/10.1186/gb-2014-15-2-r29. ]

---
## edgeR

- From a series of papers by Robinson et al. (the same group developed limma): 2007 Bioinformatics, 2008 Biostatistics, 2010 Bioinformatics.

- Uses Empirical Bayes methods to "shrink" gene-specific over-dispersion (`$\phi$`), which controls the within-group variances.

- There is no conjugate prior so shrinkage is not straightforward.

- Employs a conditional weighted likelihood approach to establish an approximate Empirical Bayes estimator for `$\phi$`.

.small[Chen, Yunshun, Lizhong Chen, Aaron T L Lun, Pedro L Baldoni, and Gordon K Smyth. “edgeR v4: Powerful Differential Analysis of Sequencing Data with Expanded Functionality and Improved Support for Small Counts and Larger Datasets.” Nucleic Acids Research 53, no. 2 (2025): gkaf018. https://doi.org/10.1093/nar/gkaf018.
]

---
## DESeq2

- Counts are assumed to follow Negative Binomial distribution, parameterized by mean and variance `$K_{ij} \sim NB(\mu_{ij},\sigma_{ij}^2)$`

- The variance is the sum of shot noise and raw variance `$\sigma_{ij}^2=\mu_{ij} + s_j^2\nu_{i,p(i)}$`

- The raw variance is a smooth function of the mean: assumes that genes with same means will have the same variances

- "This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation"

- DESeq pulls the low dispersions towards the common value, but leaves the high dispersions, which is more conservative

.small[ Anders, S., Huber, W. Differential expression analysis for sequence count data. Genome Biol 11, R106 (2010). https://doi.org/10.1186/gb-2010-11-10-r106 ]