class: center, middle, inverse, title-slide

.title[
# Annotations
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-09-17
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Common Gene Identifiers

- **Gene Symbols (HGNC / MGI / FlyBase, etc.)**  
  - Short, human-readable names like **TP53**, **BRCA1**, **Act5C**  
  - Species-specific naming conventions

- **Entrez Gene ID (NCBI)** - Numeric identifier for a gene record in NCBI  
  - Example: **7157** → TP53

- **Ensembl Gene ID**  - Stable alphanumeric identifier for genes in Ensembl  
  - Example: **ENSG00000141510** → TP53 (human)

- **RefSeq ID**  - Accession numbers for curated reference sequences (mRNA, protein, genomic)  
  - Example: **NM_000546** (TP53 mRNA), **NP_000537** (TP53 protein)

---
## Common Gene Identifiers

- **UniProt / Swiss-Prot ID**  - Protein-centric accession for UniProtKB entries (Swiss-Prot entries are manually reviewed)  
  - Example: **P04637** → TP53

- **Other species-specific IDs**  
  - MGI (mouse): **MGI:98729**  
  - FlyBase (Drosophila): **FBgn0000017**  
  - WormBase (C. elegans): **WBGene00006711**

- **Key Notes**  
  - IDs are often cross-referenced using databases or Bioconductor packages (e.g., `biomaRt`, `org.Hs.eg.db`)  
  - Choosing the correct ID type is critical for data integration and reproducibility

---
## Standardized Gene Identifiers

### HUGO Gene Nomenclature Committee (HGNC) Gene Names

HGNC assigns a unique, approved symbol and name to each human gene. Its records also list previous symbols and aliases, which helps when a comprehensive literature search must find articles about a gene that was published under other names.

.small[ https://www.genenames.org/ ]

### GeneCards

Comprehensive **human gene database** providing gene function and descriptions, protein information, expression patterns across tissues, and related diseases, pathways, and drugs. It integrates data from multiple sources into a single searchable platform.

.small[ http://www.genecards.org/ ]

---
## Gene ID Cross-Mapping

- **Many identifiers exist** for the same gene (Entrez, Ensembl, RefSeq, UniProt, species-specific IDs)

- **Software tools** typically support only a subset of ID types for analysis

- **Humans** often find **gene symbols** easier to read and interpret

- **Cross-mapping** is essential to integrate datasets, link annotations, and ensure reproducibility

- **Mapping tools** include Bioconductor packages (`biomaRt`, `AnnotationDbi` with `org.Hs.eg.db`), UniProt ID mapping, and NCBI E-utilities
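
---
## Gene ID Cross-Mapping: `AnnotationDbi` Sketch

A minimal sketch of symbol-to-ID mapping with `AnnotationDbi`, assuming the `org.Hs.eg.db` annotation package is installed. The two gene symbols are only illustrative inputs; results depend on the installed annotation version.

```r
# Assumes Bioconductor packages AnnotationDbi and org.Hs.eg.db are installed
library(AnnotationDbi)
library(org.Hs.eg.db)

symbols <- c("TP53", "BRCA1")  # illustrative input

# mapIds() returns one value per key;
# multiVals = "first" keeps only the first match when several exist
entrez <- mapIds(org.Hs.eg.db,
                 keys      = symbols,
                 column    = "ENTREZID",
                 keytype   = "SYMBOL",
                 multiVals = "first")

# select() returns a data frame with every match,
# so 1-to-many mappings appear as repeated rows
ids <- AnnotationDbi::select(org.Hs.eg.db,
                             keys    = symbols,
                             columns = c("ENTREZID", "ENSEMBL"),
                             keytype = "SYMBOL")
```

`mapIds()` is convenient for a quick lookup, while `select()` makes multiple matches visible instead of hiding them.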

---
## Gene ID Challenges

- **Mapping errors**  
  - Be careful with **1-to-many or many-to-1 mappings**  
  - Cross-referencing across databases may not always be perfect

- **Gene name ambiguity**  
  - Aliases can cause confusion: TP53 also appears as **FLJ92943**, **LFS1**, **TRP53**, and **p53**  
  - Prefer **official gene symbols** (e.g., **TP53**) and pair them with a stable ID

- **Spreadsheet pitfalls**  
  - Excel may auto-convert gene names to dates (e.g., **OCT4 → 4-Oct**, **SEPT2 → 2-Sep**)  
  - Import or paste gene columns as text, or use alternative tools, to avoid data corruption

- **Incomplete cross-mapping**  
  - Some IDs may be missing due to version mismatches or database updates  
  - Combine multiple sources to maximize coverage and reliability
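
---
## Gene ID Challenges: Checking a Mapping Table

The checks above can be sketched in base R. The mapping table below is hypothetical (placeholder IDs, not real database records); in practice it would be the output of an ID conversion step.

```r
# Hypothetical mapping table: placeholder IDs, not real records
mapping <- data.frame(
  input_id  = c("idA", "idA", "idB", "idC", "idD"),
  target_id = c("g1",  "g2",  "g3",  "g3",  NA),
  stringsAsFactors = FALSE
)

# Keep only rows where the input actually mapped to something
mapped <- mapping[!is.na(mapping$target_id), ]

# 1-to-many: one input ID linked to several target IDs
n_targets   <- tapply(mapped$target_id, mapped$input_id,
                      function(x) length(unique(x)))
one_to_many <- names(n_targets)[n_targets > 1]

# Many-to-1: several input IDs collapsing onto one target ID
n_inputs    <- tapply(mapped$input_id, mapped$target_id,
                      function(x) length(unique(x)))
many_to_one <- names(n_inputs)[n_inputs > 1]

# Incomplete cross-mapping: inputs with no target at all
unmapped <- setdiff(unique(mapping$input_id), mapped$input_id)
```

Reporting these three sets before any downstream analysis makes it explicit which genes were duplicated, merged, or lost.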

---
## biomaRt — Accessing Bioinformatics Databases in R

- **What it is**  
  - An R/Bioconductor package for querying BioMart databases, most commonly Ensembl, directly from R  
  - Retrieves gene, transcript, and protein annotations  
  - Provides datasets for human, mouse, and many other species

- **Typical Use Cases**  
  - Gene ID conversion  
  - Annotating gene lists with functional information  
  - Integrating multiple genomic datasets

.small[ https://bioconductor.org/packages/biomaRt/ ]

---
## biomaRt: `getBM()` function

The `getBM()` function takes three main arguments, **filters**, **attributes**, and **values**, plus the `mart` object that identifies the database and dataset to query

- **Filters** → Define the type of input IDs  
  - Tell biomaRt what identifiers you have (e.g., Ensembl IDs, gene symbols)  
  - Use `listFilters()` to see all available filters for your dataset

- **Attributes** → Specify what information to retrieve  
  - Which identifiers or annotations you want to map to (e.g., Entrez ID, gene symbol, UniProt ID)  
  - Use `listAttributes()` to explore available attributes

- **Values** → Provide the actual input IDs  
  - A vector of the gene/protein IDs you want to query or convert
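
---
## biomaRt: `getBM()` Sketch

A minimal sketch of the three arguments in use, assuming internet access to the Ensembl BioMart server. The filter and attribute names shown are those used by recent Ensembl releases and may differ in older ones; confirm them with `listFilters()` and `listAttributes()`.

```r
library(biomaRt)

# Connect to the Ensembl genes mart, human dataset (needs internet access)
mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# Browse valid input types (filters) and output fields (attributes)
head(listFilters(mart))
head(listAttributes(mart))

# Convert gene symbols to Ensembl and Entrez Gene IDs
res <- getBM(
  attributes = c("hgnc_symbol", "ensembl_gene_id", "entrezgene_id"),  # what to retrieve
  filters    = "hgnc_symbol",                                         # type of input IDs
  values     = c("TP53", "BRCA1"),                                    # the input IDs
  mart       = mart
)
```

The result is a data frame with one row per match, so a symbol with several Ensembl or Entrez IDs appears on several rows.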

---
## biomaRt Gotchas

- **Host / Database Version**  
  - Defines which Ensembl version you are querying  
  - For **gene ID conversion**, use the **latest database** to ensure up-to-date mappings, unless your data were annotated with a specific earlier release

- **Genome Assembly Considerations**  
  - For **genomic coordinates**, select the database that matches the **genome assembly version** of interest (e.g., GRCh38 vs GRCh37)  
  - Mismatched assemblies can lead to incorrect coordinates or failed queries

**Always double-check your dataset’s genome assembly version and Ensembl version before retrieving data**
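
---
## biomaRt Gotchas: Choosing a Version

A sketch of how the Ensembl release and genome assembly might be selected with `useEnsembl()`, assuming internet access. Release 110 is an arbitrary example; older archives are retired over time, so check `listEnsemblArchives()` for what is currently available.

```r
library(biomaRt)

# List archived Ensembl releases that are currently hosted
listEnsemblArchives()

# Current release (GRCh38 for human)
mart_current <- useEnsembl(biomart = "genes",
                           dataset = "hsapiens_gene_ensembl")

# GRCh37 (hg19) archive, for data aligned to the older assembly
mart_grch37 <- useEnsembl(biomart = "genes",
                          dataset = "hsapiens_gene_ensembl",
                          GRCh    = 37)

# Pin a specific release for reproducibility (110 is only an example)
mart_v110 <- useEnsembl(biomart = "genes",
                        dataset = "hsapiens_gene_ensembl",
                        version = 110)

# Coordinates come from whichever assembly the chosen mart serves
coords <- getBM(
  attributes = c("hgnc_symbol", "chromosome_name",
                 "start_position", "end_position"),
  filters    = "hgnc_symbol",
  values     = "TP53",
  mart       = mart_grch37
)
```

Recording the release number alongside the results makes the conversion repeatable later.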

---
## AnnotationHub — Centralized Access to Genomic Annotations

- **What it is**  
  - A Bioconductor resource providing **centralized access to a wide variety of genomic annotation data**  
  - Supports multiple species and data types (e.g., genes, transcripts, regulatory regions, epigenomic marks)  
  - Integrates data from sources such as Ensembl, UCSC, and GEO

- **Typical Use Cases**  
  - Retrieve gene/genomic region annotation datasets
  - Access epigenomic datasets for analysis

.small[https://bioconductor.org/packages/AnnotationHub/]
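
---
## AnnotationHub: Query Sketch

A minimal sketch of searching the hub, assuming internet access (the hub metadata is downloaded and cached on first use). The search terms are illustrative; the retrieval step is left commented out because it downloads data and, for `EnsDb` records, also needs the `ensembldb` package.

```r
library(AnnotationHub)

# Create the hub object (downloads and caches metadata on first use)
ah <- AnnotationHub()

# Search record metadata; all terms must match
hits <- query(ah, c("EnsDb", "Homo sapiens"))
hits

# Inspect selected metadata columns of the matching records
head(mcols(hits)[, c("title", "genome", "rdatadateadded")])

# Retrieve one record by its hub identifier (downloads the resource)
# record_id <- names(hits)[1]
# edb <- ah[[record_id]]
```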

---
## ExperimentHub — Access Curated Experimental Data

- **What it is**  
  - A Bioconductor resource providing **centralized access to curated datasets** from experiments, publications, and training materials  
  - Each dataset includes **metadata, tags, and modification dates** for easy filtering and tracking  
  - Interface and usage are **similar to AnnotationHub**, making it easy to learn

- **Typical Use Cases**  
  - Retrieve processed experimental data for reproducible analyses  
  - Access curated training datasets for teaching or benchmarking

.small[https://bioconductor.org/packages/ExperimentHub/]
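
---
## ExperimentHub: Query Sketch

Because the interface mirrors AnnotationHub, the same pattern applies. This is a sketch assuming internet access; the search term is illustrative, and the retrieval step is commented out because it downloads data and may require the package that provides the dataset.

```r
library(ExperimentHub)

# Create the hub object (downloads and caches metadata on first use)
eh <- ExperimentHub()

# Search record metadata
hits <- query(eh, "Homo sapiens")
hits

# Inspect titles, species, and the date each record was added
head(mcols(hits)[, c("title", "species", "rdatadateadded")])

# Retrieve one dataset by its hub identifier (downloads the resource)
# dat <- eh[[names(hits)[1]]]
```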