Pathway and Functional Enrichment Analysis Methods

class: center, middle, inverse, title-slide

.title[
# Pathway and Functional Enrichment Analysis Methods
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-10-27
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Overview

- Why enrichment analysis?

- What is enrichment analysis?

- Gene ontology and pathways enrichment

- Tools and references

---
## Why enrichment analysis?

- The human genome contains ~20,000-25,000 genes

- Each gene has multiple functions

- If 1,000 genes have changed in an experimental condition, it may be difficult to understand what they do

---
## Birds of a feather flock together

- Genes with similar expression patterns tend to share similar functions

- Similar (common) functions characterize a group of genes

.small[ https://genefriends.org/ ]

---
## Birds of a feather flock together

- Genes with similar expression patterns tend to share similar functions

- Similar (common) functions characterize a group of genes

&nbsp;

- By analogy, friends tend to be more genetically similar than strangers

.small[ N.A. Christakis, & J.H. Fowler,  Friendship and natural selection, Proc. Natl. Acad. Sci. U.S.A. 111 (supplement_3) 10796-10801, https://doi.org/10.1073/pnas.1400825111 (2014). ]

---
## Why enrichment analysis?

- Translating changes of **hundreds/thousands of differentially expressed genes** into a few biological processes (reducing dimensionality)

- High level understanding of the biology behind gene expression – **Interpretation!**

---
## What is enrichment analysis?

- **Enrichment analysis** - summarizing common functions associated with a group of objects

---
## What is enrichment analysis? – statistical definition

**Enrichment analysis** – testing whether a group of objects has certain properties more (or less) frequently than expected by chance

---
## Classification of genes

**Gene sets** - _a priori_ classification of genes into biologically relevant groups

- Members of the same biochemical pathways

- Genes annotated with the same molecular function

- Transcripts expressed in the same cellular compartments

- Co-regulated/co-expressed genes

- Genes located on the same cytogenetic band

- ...

---
## Annotation databases and ontologies

- An annotation database annotates genes with functions or properties - sets of genes with shared functions

- Structured prior knowledge about genes

---
## Gene ontology

- An ontology is a formal (hierarchical) representation of concepts and the relationships between them.

- The objective of GO is to provide controlled vocabularies of terms for the description of gene products.

- These terms are to be used as attributes of gene products, facilitating uniform queries across them.

---
## Gene ontology structure

Gene ontology describes multiple levels of detail of gene function.

- **Molecular Function** - the tasks performed by individual gene products; examples are _transcription factor_ and _DNA helicase_

- **Biological Process** - broad biological goals, such as _mitosis_ or _purine metabolism_, that are accomplished by ordered assemblies of molecular functions

- **Cellular Component** - subcellular structures, locations, and macromolecular complexes; examples include _nucleus_, _telomere_, and _origin recognition complex_

---
## Gene ontology hierarchy

- Terms are related within a hierarchy using "is-a", "part-of" and other connectors
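
The hierarchy can be pictured as a directed graph in which each term points to its parents. The sketch below is a minimal illustration, assuming a hypothetical toy ontology (the term names and relations are chosen for illustration and are not taken from the GO database). It collects all ancestors of a term, which is how an annotation to a specific term is commonly propagated to its more general parents.

```python
# Hypothetical toy ontology: child term -> list of (relation, parent term).
# Term names and edges are illustrative only, not copied from GO.
parents = {
    "spindle assembly": [("part-of", "mitotic cell cycle")],
    "mitotic cell cycle": [("is-a", "cell cycle")],
    "cell cycle": [("is-a", "cellular process")],
    "cellular process": [("is-a", "biological process")],
}


def ancestors(term):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# A gene annotated with "spindle assembly" is implicitly annotated
# with each of these more general terms as well.
print(sorted(ancestors("spindle assembly")))
```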

---
## Gene ontology database

.small[ http://geneontology.org/

https://www.ebi.ac.uk/QuickGO/ ]

---
## Gene ontologies are not created equal

- Different levels of evidence: 
    - Experimental
    - Computational analysis
    - Author Statement
    - Curator Statement
    - Inferred from electronic annotation

.small[ https://geneontology.org/docs/guide-go-evidence-codes/ ]

---
## Gene ontologies are not created equal

.small[ http://amigo.geneontology.org/amigo/base_statistics ]

---
## User-friendly Gene Ontology annotations

.small[ http://git.dhimmel.com/gene-ontology/ ]

---
## Gene ontologies for model organisms

.small[
- **Mouse Genome Database** (MGD) and Gene Expression Database (GXD) (Mus musculus) http://www.informatics.jax.org/

- **Rat Genome Database** (RGD) (Rattus norvegicus) http://rgd.mcw.edu/

- **FlyBase** (Drosophila melanogaster) http://flybase.org/

- **Berkeley Drosophila Genome Project** (BDGP) http://www.fruitfly.org/

- **WormBase** (Caenorhabditis elegans) http://www.wormbase.org/

- **Zebrafish Information Network** (ZFIN) (Danio rerio)  http://zfin.org/

- **Saccharomyces Genome Database** (SGD) (Saccharomyces cerevisiae) http://www.yeastgenome.org/

- **The Arabidopsis Information Resource** (TAIR) (Arabidopsis thaliana) https://www.arabidopsis.org/

- **Gramene** (grains, including rice, Oryza) http://www.gramene.org/
]

---
## MSigDB - Molecular Signatures Database

.small[ http://software.broadinstitute.org/gsea/msigdb/ ]

---
## MSigDB - Molecular Signatures Database

.small[
- **H – Hallmark gene sets**: Coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.

- **C1 – Positional gene sets**: Correspond to human chromosome cytogenetic bands.

- **C2 – Curated gene sets**: From online pathway databases, publications in PubMed, and knowledge of domain experts.

- **C3 – Regulatory target gene sets**: Based on gene target predictions for microRNA seed sequences and predicted transcription factor binding sites.

- **C4 – Computational gene sets**: Defined by mining large collections of cancer-oriented expression data.

- **C5 – Ontology gene sets**: Consist of genes annotated by the same ontology term.

- **C6 – Oncogenic signature gene sets**: Defined directly from microarray gene expression data from cancer gene perturbations.

- **C7 – Immunologic signature gene sets**: Represent cell states and perturbations within the immune system.

- **C8 – Cell type signature gene sets**: Curated from cluster markers identified in single-cell sequencing studies of human tissue.

https://github.com/stephenturner/msigdf ]

---
## Pathways

- An ordered series of molecular events that leads to the creation of a new molecular product, or a change in a cellular state or process.

- Genes often participate in multiple pathways – think about genes having multiple functions

.small[ http://biochemical-pathways.com/#/map/1 ]

---
## KEGG pathway database

- **KEGG: Kyoto Encyclopedia of Genes and Genomes** is a collection of biological information compiled from published material = curated database.

- Includes information on genes, proteins, metabolic pathways, molecular interactions, and biochemical reactions associated with specific organisms

- Provides a relationship (map) for how these components are organized in a cellular structure or reaction pathway.

.small[ http://www.genome.jp/kegg/ ]

---
## KEGG pathway diagram

---
## Reactome

- Curated human pathways encompassing metabolism, signaling, and other biological processes.

- Every pathway is traceable to primary literature.

.small[ http://www.reactome.org/ ]

---
## Reactome pathway diagram

---
## Other pathway databases

- **pathDIP** version 5 is an annotated database of signalling cascades in human and 16 non-human organisms, comprising 6,535 pathways, and covering 195,148 genes and 5,783 metabolites. https://ophid.utoronto.ca/pathDIP/

- **PathGuide**, lists over 700 pathway related databases, http://www.pathguide.org/

- **WikiPathways**, community-curated pathways, http://wikipathways.org/

---
## Gene annotation databases in R

- **annotables** (https://github.com/stephenturner/annotables) - R data package for annotating/converting Gene IDs

- **msigdf** (https://github.com/stephenturner/msigdf) - Molecular Signatures Database (MSigDB) in a data frame

- **pathview** (https://bioconductor.org/packages/pathview/) - a tool set for pathway based data integration and visualization

---
## Genes to networks

- **GeneMania**, networks based on different properties, http://genemania.org

- **STRING**, protein-protein interaction networks, http://string-db.org

- **Genes2Networks**, protein-protein interaction networks, https://maayanlab.cloud/X2K/#g2n

- **IntAct**, protein-protein interaction data and networks, https://www.ebi.ac.uk/intact/

- **HPRD**, protein-protein interaction database, http://www.hprd.org/

---
class: center,middle

# Enrichment analysis

---
## Types of enrichment analyses

- **First generation** - traditional overrepresentation analyses: hypergeometric distribution-based tests of whether genes of interest (e.g., differentially expressed genes) are overrepresented in functional gene sets.

- **Second generation** - tests the tendency of gene set members to cluster at the top or bottom of the ranked list of all measured genes.

- **Third generation** - network- or topology-based tests, consider relationships among genes.

---
## First generation enrichment analysis: Null hypothesis

- **Self-contained `$H_0$`**: genes in the gene set do not have any association with the phenotype

- Problem: restrictive, uses information only from the gene set
                  
<img src="img/self_vs_competitive.png" width="500px" style="display: block; margin: auto;" />

---
## First generation enrichment analysis: Null hypothesis

- **Competitive `$H_0$`**: genes in the gene set have the same level of association with a given phenotype as genes in the complement gene set

- Problem: wrong assumption of independent gene sampling

---
## Hypergeometric test

- `$m$` is the total number of genes

- `$j$` is the number of genes in the functional category

- `$n$` is the number of differentially expressed genes

- `$k$` is the number of differentially expressed genes in the category

---
## Hypergeometric test

- `$m$` is the total number of genes

- `$j$` is the number of genes in the functional category

- `$n$` is the number of differentially expressed genes

- `$k$` is the number of differentially expressed genes in the category

The expected value of `$k$` would be `$k_e=(n/m)*j$`.

If `$k > k_e$`, the functional category is said to be enriched, with an enrichment ratio `$r=k/k_e$`
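
As a quick numeric illustration of the expected count and the enrichment ratio, the sketch below plugs in hypothetical counts (chosen for illustration, not taken from any dataset). With these numbers, 10 genes are expected by chance and 25 are observed, so `$r = 2.5$`.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic
m = 20000  # total number of genes
j = 200    # genes in the functional category
n = 1000   # differentially expressed genes
k = 25     # differentially expressed genes in the category

k_e = (n / m) * j  # expected number of category genes among the DEGs
r = k / k_e        # enrichment ratio (observed / expected)

print(f"expected k_e = {k_e:.1f}, observed k = {k}, ratio r = {r:.2f}")
```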

---
## Hypergeometric test

- `$m$` is the total number of genes

- `$j$` is the number of genes in the functional category

- `$n$` is the number of differentially expressed genes

- `$k$` is the number of differentially expressed genes in the category

|                    | Diff. exp. genes | Not Diff. exp. genes | Total |
|--------------------|:----------------:|:--------------------:|:------|
| In gene set        |        k         |           j-k        | j     |
| Not in gene set    |       n-k        |         m-n-j+k      | m-j   |
| Total              |       n          |           m-n        |  m    |
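
The table can be filled in directly from the four counts. The sketch below is a minimal illustration with a hypothetical helper name (`contingency_table`) and toy numbers. It also checks that the row and column totals reproduce `$j$`, `$m-j$`, `$n$`, and `$m-n$`.

```python
def contingency_table(m, j, n, k):
    """Rows: in gene set / not in gene set. Columns: DE / not DE."""
    return [
        [k, j - k],
        [n - k, m - n - j + k],
    ]


# Toy counts for illustration only
m, j, n, k = 20, 5, 6, 3
table = contingency_table(m, j, n, k)

row_totals = [sum(row) for row in table]        # should equal [j, m - j]
col_totals = [sum(col) for col in zip(*table)]  # should equal [n, m - n]

print(table, row_totals, col_totals)
```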

---
## Hypergeometric test

- `$m$` is the total number of genes

- `$j$` is the number of genes in the functional category

- `$n$` is the number of differentially expressed genes

- `$k$` is the number of differentially expressed genes in the category

What is the probability of having `$k$` or more genes from the category in the selected `$n$` genes?

`$$P = \sum_{i=k}^{n} \frac{\binom{j}{i}\binom{m-j}{n-i}}{\binom{m}{n}}$$`
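
The sum can be evaluated directly with binomial coefficients. The sketch below is a minimal illustration using only the Python standard library. The function name `hypergeom_upper_tail` and the toy counts are assumptions made for this example. In R, the same tail probability is usually obtained with `phyper(k - 1, j, m - j, n, lower.tail = FALSE)`.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(m, j, n, k):
    """P(X >= k): k or more category genes among the n selected genes."""
    sample_space = comb(m, n)
    # math.comb returns 0 when i > j, so impossible terms drop out
    favorable = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, n + 1))
    return Fraction(favorable, sample_space)


# Toy counts: 20 genes, 5 in the category, 6 DEGs, 3 of them in the category
p_over = hypergeom_upper_tail(m=20, j=5, n=6, k=3)
print(p_over, float(p_over))
```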

---
## Hypergeometric test

- `$m$` is the total number of genes

- `$j$` is the number of genes in the functional category

- `$n$` is the number of differentially expressed genes

- `$k$` is the number of differentially expressed genes in the category

`$k < (n/m)*j$` - underrepresentation. What is the probability of having `$k$` or fewer genes from the category in the selected `$n$` genes?

`$$P = \sum_{i=0}^{k} \frac{\binom{j}{i}\binom{m-j}{n-i}}{\binom{m}{n}}$$`
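
A minimal sketch of the underrepresentation tail, again with hypothetical function names and toy counts. It also checks that the lower tail up to `$k$` and the upper tail from `$k+1$` add up to 1, which is a convenient sanity check on any implementation.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(m, j, n, i):
    """Probability of exactly i category genes among the n selected genes."""
    return Fraction(comb(j, i) * comb(m - j, n - i), comb(m, n))


def hypergeom_lower_tail(m, j, n, k):
    """P(X <= k): k or fewer category genes among the n selected genes."""
    return sum(hypergeom_pmf(m, j, n, i) for i in range(0, k + 1))


# Toy counts: 20 genes, 5 in the category, 6 DEGs, only 1 in the category
m, j, n, k = 20, 5, 6, 1
p_under = hypergeom_lower_tail(m, j, n, k)
p_rest = sum(hypergeom_pmf(m, j, n, i) for i in range(k + 1, n + 1))

print(p_under, float(p_under), p_under + p_rest)
```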

---
## Interpretation in the Hypergeometric Test

Each term in the sum is the probability of selecting exactly `$i$` genes from the category in a selection of `$n$` genes.

1.  **Denominator: `$\binom{m}{n}$`** - The total number of ways to choose `$n$` **differentially expressed genes** from the **total `$m$` genes**. This is the sample space size.

2.  **Numerator: `$\binom{j}{i}$`** - The number of ways to choose `$i$` **genes** from the `$j$` **genes in the functional category**.

3.  **Numerator: `$\binom{m-j}{n-i}$`** - The number of ways to choose the **remaining `$n-i$` genes** from the `$m-j$` **genes that are *not* in the functional category**.

The summation `$\sum_{i=k}^n$` calculates the probability for `$k$` **or more** genes (`$i=k, k+1, \ldots, n$`) and adds these individual probabilities together to get the final `$P$`-value.
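
To make the three pieces concrete, the sketch below prints them term by term for toy counts (hypothetical numbers, for illustration only). Each row shows the ways to pick `$i$` genes inside the category, the ways to pick the remaining `$n-i$` genes outside it, and the resulting probability.

```python
from math import comb

# Toy counts for illustration only
m, j, n, k = 20, 5, 6, 3

sample_space = comb(m, n)  # denominator: ways to choose n genes out of m

terms = []
for i in range(k, n + 1):
    ways_in = comb(j, i)           # i genes from the j category genes
    ways_out = comb(m - j, n - i)  # n - i genes from the m - j other genes
    prob = ways_in * ways_out / sample_space
    terms.append((i, ways_in, ways_out, prob))
    print(f"i={i}: {ways_in} * {ways_out} / {sample_space} = {prob:.5f}")

p_value = sum(prob for _, _, _, prob in terms)
print(f"P = {p_value:.5f}")
```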

---
## Hypergeometric test

1. Find a set of differentially expressed genes (DEGs)
2. Are _DEGs in a set_ more common than _DEGs not in a set_?

- Fisher test `stats::fisher.test()`

- Conditional hypergeometric test, to account for the directed hierarchy of GO `GOstats::hyperGTest()`

.small[ Falcon, S., and R. Gentleman. “Using GOstats to Test Gene Lists for GO Term Association.” Bioinformatics 23, no. 2 (2007): 257–58. https://doi.org/10.1093/bioinformatics/btl567. ]
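
The two steps can be sketched with plain set operations. This is a minimal illustration, assuming hypothetical gene identifiers (`g1` ... `g20`) and a hypothetical helper `overrepresentation_p`; it is not the implementation used by either R function. The resulting upper-tail hypergeometric probability corresponds to a one-sided Fisher's exact test on the 2x2 table.

```python
from math import comb


def overrepresentation_p(universe, gene_set, degs):
    """Return (k, p) for the overlap of DEGs with one gene set."""
    # Restrict both lists to the measured universe before counting
    gene_set = gene_set & universe
    degs = degs & universe
    m, j, n = len(universe), len(gene_set), len(degs)
    k = len(gene_set & degs)
    # Upper-tail hypergeometric probability of observing k or more
    p = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, n + 1)) / comb(m, n)
    return k, p


# Hypothetical toy data
universe = {f"g{i}" for i in range(1, 21)}        # all measured genes
gene_set = {"g1", "g2", "g3", "g4", "g5"}         # one functional category
degs = {"g1", "g2", "g3", "g10", "g11", "g12"}    # step 1: the DEG list

k, p = overrepresentation_p(universe, gene_set, degs)  # step 2
print(f"overlap k = {k}, p = {p:.4f}")
```

Restricting both lists to the same universe matters: genes that were never measured cannot be differentially expressed, so counting them in `$m$` or `$j$` would distort the test.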

---
## Problems with Hypergeometric test

- The outcome of the overrepresentation test depends on the significance threshold used to declare genes differentially expressed.

- Functional categories in which many genes exhibit small changes may go undetected.

- Genes are not independent, so a key assumption of Fisher's exact test is violated.

- Pathways overlap

---
## Correcting for pathway overlap

.small[ Donato M, Xu Z, Tomoiaga A, Granneman JG, Mackenzie RG, Bao R, Than NG, Westfall PH, Romero R, Draghici S. Analysis and correction of crosstalk effects in pathway analysis. Genome Res. 2013 Nov;23(11):1885-93.  https://www.ncbi.nlm.nih.gov/pubmed/23934932 ]

---
## Second generation: Gene set enrichment analysis (GSEA)

- **Gene set analysis (GSA)**. Mootha et al., 2003; modified by Subramanian, et al. "**Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.**" PNAS 2005 https://doi.org/10.1073/pnas.0506580102

- Main rationale – functionally related genes often display a coordinated expression to accomplish their roles in the cells

- Aims to identify gene sets with "subtle but coordinated" expression changes that would be missed by threshold-based selection of DEGs
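
The idea of scoring a gene set over the whole ranked list, without a DEG threshold, can be sketched as a running sum. The code below is a simplified illustration and not the published GSEA implementation: it assumes the unweighted (Kolmogorov-Smirnov-like) form, where every hit counts equally, and it omits the permutation step used to assess significance. Gene names and the function name are hypothetical.

```python
from fractions import Fraction


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: maximum deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = Fraction(0)
    best = Fraction(0)
    for is_hit in hits:
        # Step up at gene set members, step down at all other genes
        running += Fraction(1, n_hit) if is_hit else -Fraction(1, n_miss)
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list, most up-regulated first
ranked = [f"g{i}" for i in range(1, 11)]

es_top = enrichment_score(ranked, {"g1", "g2", "g4"})      # set near the top
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})  # set near the bottom
print(es_top, es_bottom)
```

A positive score indicates that set members concentrate at the top of the ranking, and a negative score indicates concentration at the bottom.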