Gene annotation management in Bioconductor

1 Introduction

In genomic analyses, different databases and software tools use different gene identifiers.
Common types include:

HGNC gene symbols (e.g., TP53, BRCA1)
Entrez Gene IDs (numeric, NCBI)
Ensembl Gene IDs (e.g., ENSG00000141510)

For reproducibility and integration across datasets, we often need to map between these identifiers.

The Bioconductor package org.Hs.eg.db (https://bioconductor.org/packages/org.Hs.eg.db/)
provides genome-wide gene annotation for human, organized around Entrez Gene IDs.
It includes mappings between gene symbols, Entrez IDs, Ensembl IDs, UniProt IDs, and more.

2 Setup

# Install BiocManager if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install annotation package if not already installed
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
    BiocManager::install("org.Hs.eg.db")

library(org.Hs.eg.db)
library(dplyr)  # masks select(); call AnnotationDbi::select() explicitly below
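Before converting anything, it can help to check which identifier types the database actually offers. The sketch below uses keytypes() and columns() from AnnotationDbi; the variable names (db, needed, available_keytypes, available_columns) are illustrative, and the guard simply skips the check if the package is not installed.

```r
# Skip quietly if the annotation package is not installed (see Setup)
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  db <- org.Hs.eg.db::org.Hs.eg.db

  # Identifier types that can be passed as `keytype`
  available_keytypes <- AnnotationDbi::keytypes(db)

  # Fields that can be requested via `column` / `columns`
  available_columns <- AnnotationDbi::columns(db)

  # Confirm that the types used in this document are present
  needed <- c("SYMBOL", "ENTREZID", "ENSEMBL", "GENENAME")
  print(needed %in% available_columns)
}
```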

3 Example: Cancer-related genes

Let’s take a small set of well-known cancer genes:

genes <- c("TP53", "BRCA1", "BRCA2", "PTEN", "KRAS", "EGFR", "MYC")
genes

## [1] "TP53"  "BRCA1" "BRCA2" "PTEN"  "KRAS"  "EGFR"  "MYC"

4 Convert Gene Symbols → Entrez IDs

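The function mapIds() converts a vector of keys to another identifier type and returns a named vector with one value per key. With multiVals = "first", a key that matches several IDs keeps only the first match: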

library(AnnotationDbi)
entrez_ids <- mapIds(org.Hs.eg.db,
                     keys = genes,
                     column = "ENTREZID",
                     keytype = "SYMBOL",
                     multiVals = "first")
entrez_ids

##   TP53  BRCA1  BRCA2   PTEN   KRAS   EGFR    MYC 
## "7157"  "672"  "675" "5728" "3845" "1956" "4609"

5 Convert Gene Symbols → Ensembl IDs

ensembl_ids <- mapIds(org.Hs.eg.db,
                      keys = genes,
                      column = "ENSEMBL",
                      keytype = "SYMBOL",
                      multiVals = "first")
ensembl_ids

##              TP53             BRCA1             BRCA2              PTEN 
## "ENSG00000141510" "ENSG00000012048" "ENSG00000139618" "ENSG00000171862" 
##              KRAS              EGFR               MYC 
## "ENSG00000133703" "ENSG00000146648" "ENSG00000136997"

6 Convert back: Entrez IDs → Gene Symbols

back_to_symbol <- mapIds(org.Hs.eg.db,
                         keys = entrez_ids,
                         column = "SYMBOL",
                         keytype = "ENTREZID",
                         multiVals = "first")
back_to_symbol

##    7157     672     675    5728    3845    1956    4609 
##  "TP53" "BRCA1" "BRCA2"  "PTEN"  "KRAS"  "EGFR"   "MYC"

7 Combine results in a data frame

conversion_table <- data.frame(
  Symbol = genes,
  Entrez = unname(entrez_ids),
  Ensembl = unname(ensembl_ids),
  Symbol_back = unname(back_to_symbol)
)

conversion_table

##   Symbol Entrez         Ensembl Symbol_back
## 1   TP53   7157 ENSG00000141510        TP53
## 2  BRCA1    672 ENSG00000012048       BRCA1
## 3  BRCA2    675 ENSG00000139618       BRCA2
## 4   PTEN   5728 ENSG00000171862        PTEN
## 5   KRAS   3845 ENSG00000133703        KRAS
## 6   EGFR   1956 ENSG00000146648        EGFR
## 7    MYC   4609 ENSG00000136997         MYC

8 Adding gene description

# Convert Gene Symbols → Entrez, Ensembl, Description
# select() returns every match, so a symbol with several Ensembl IDs spans several rows
results <- AnnotationDbi::select(org.Hs.eg.db,
                                 keys = genes,
                                 columns = c("ENTREZID", "ENSEMBL", "GENENAME"),
                                 keytype = "SYMBOL")

results

##   SYMBOL ENTREZID         ENSEMBL                                      GENENAME
## 1   TP53     7157 ENSG00000141510                             tumor protein p53
## 2  BRCA1      672 ENSG00000012048                   BRCA1 DNA repair associated
## 3  BRCA2      675 ENSG00000139618                   BRCA2 DNA repair associated
## 4   PTEN     5728 ENSG00000171862                phosphatase and tensin homolog
## 5   PTEN     5728 ENSG00000284792                phosphatase and tensin homolog
## 6   KRAS     3845 ENSG00000133703                   KRAS proto-oncogene, GTPase
## 7   EGFR     1956 ENSG00000146648              epidermal growth factor receptor
## 8    MYC     4609 ENSG00000136997 MYC proto-oncogene, bHLH transcription factor
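PTEN appears twice in this table because it maps to two Ensembl IDs. If downstream code expects one row per gene, the extra IDs can be collapsed into a single string. The following is one possible approach in base R, shown on a small data frame copied from the output above; the names results_subset and collapsed and the ";" separator are illustrative choices.

```r
# Subset of the select() output shown above (TP53 and PTEN rows)
results_subset <- data.frame(
  SYMBOL  = c("TP53", "PTEN", "PTEN"),
  ENSEMBL = c("ENSG00000141510", "ENSG00000171862", "ENSG00000284792"),
  stringsAsFactors = FALSE
)

# Group Ensembl IDs by symbol and join each group with ";"
collapsed <- vapply(split(results_subset$ENSEMBL, results_subset$SYMBOL),
                    paste, character(1), collapse = ";")
collapsed

# Symbols that had more than one Ensembl ID
names(collapsed)[grepl(";", collapsed, fixed = TRUE)]
```

Note that split() orders the groups alphabetically by symbol, so the result is not in the original gene order.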

9 Summary

org.Hs.eg.db is a key Bioconductor package for gene ID mapping in humans.
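Using mapIds(), we can flexibly convert between symbols, Entrez IDs, Ensembl IDs, UniProt IDs, etc., with one value returned per key.
Using select(), we get every match as a data frame, which exposes one-to-many mappings such as the two Ensembl IDs for PTEN.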
This is essential for integrating data from different sources (e.g., GEO, TCGA, Ensembl, KEGG).