1 Introduction

In genomic analyses, different databases and software tools use different gene identifiers.
Common types include:

  • HGNC gene symbols (e.g., TP53, BRCA1)
  • Entrez Gene IDs (numeric, assigned by NCBI; e.g., 7157 for TP53)
  • Ensembl Gene IDs (e.g., ENSG00000141510)

For reproducibility and integration across datasets, we often need to map between these identifiers.

The Bioconductor package org.Hs.eg.db (https://bioconductor.org/packages/org.Hs.eg.db/)
provides genome-wide annotation for human genes, organized primarily around Entrez Gene IDs.
It includes mappings between gene symbols, Entrez IDs, Ensembl IDs, UniProt accessions, and more.

2 Setup

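# Install BiocManager if it is not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install the annotation package if it is not already installed
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
    BiocManager::install("org.Hs.eg.db")

# Load the annotation database (this also attaches AnnotationDbi)
library(org.Hs.eg.db)
# dplyr masks select(), so AnnotationDbi::select() is called explicitly later
library(dplyr)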
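
Before converting anything, it can help to check which identifier types the installed release of the database exposes. The following is a minimal sketch using columns() and keytypes() from AnnotationDbi; the exact list returned depends on the package version, and the `needed` vector is only an illustrative name for the ID types used in this tutorial.

```r
library(org.Hs.eg.db)

# Identifier types that mapIds()/select() can return
available_columns <- columns(org.Hs.eg.db)

# Identifier types that can be supplied as input keys
available_keytypes <- keytypes(org.Hs.eg.db)

# Check that the ID types used in this tutorial are present
needed <- c("SYMBOL", "ENTREZID", "ENSEMBL", "GENENAME")
needed %in% available_columns
```

If any entry comes back FALSE, the corresponding conversion below will fail with an error about an invalid column or keytype.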

4 Convert Gene Symbols → Entrez IDs

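mapIds() returns a named vector with one value per input key. The multiVals argument controls what happens when a key maps to several IDs; "first" keeps only the first match.

library(AnnotationDbi)

# Gene symbols used throughout (they match the output shown below)
genes <- c("TP53", "BRCA1", "BRCA2", "PTEN", "KRAS", "EGFR", "MYC")

entrez_ids <- mapIds(org.Hs.eg.db,
                     keys = genes,
                     column = "ENTREZID",
                     keytype = "SYMBOL",
                     multiVals = "first")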
entrez_ids
##   TP53  BRCA1  BRCA2   PTEN   KRAS   EGFR    MYC 
## "7157"  "672"  "675" "5728" "3845" "1956" "4609"
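
Real gene lists often contain symbols that the database does not recognize, such as outdated aliases or typos. As a hedged sketch of how to spot them: when at least one key is valid, mapIds() returns NA for the keys it cannot match. The symbol "NOT_A_GENE" below is made up purely for illustration.

```r
library(org.Hs.eg.db)

# "NOT_A_GENE" is a made-up symbol used only to show the NA behaviour
query <- c("TP53", "NOT_A_GENE")

ids <- mapIds(org.Hs.eg.db,
              keys = query,
              column = "ENTREZID",
              keytype = "SYMBOL",
              multiVals = "first")

# Symbols without a match come back as NA
unmapped <- names(ids)[is.na(ids)]
unmapped
```

Reporting the unmapped symbols before continuing avoids silently dropping genes from downstream analyses.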

5 Convert Gene Symbols → Ensembl IDs

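# multiVals = "first" keeps one Ensembl ID per symbol, even when a symbol
# (e.g., PTEN, see the select() output in section 8) maps to more than one
ensembl_ids <- mapIds(org.Hs.eg.db,
                      keys = genes,
                      column = "ENSEMBL",
                      keytype = "SYMBOL",
                      multiVals = "first")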
ensembl_ids
##              TP53             BRCA1             BRCA2              PTEN 
## "ENSG00000141510" "ENSG00000012048" "ENSG00000139618" "ENSG00000171862" 
##              KRAS              EGFR               MYC 
## "ENSG00000133703" "ENSG00000146648" "ENSG00000136997"
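
Because multiVals = "first" returns a single ID per key, it hides one-to-many mappings. One way to inspect them, sketched here under the assumption that the installed release still lists several Ensembl IDs for PTEN, is to request a list instead. The name `all_ensembl` is illustrative.

```r
library(org.Hs.eg.db)

# multiVals = "list" returns every matching Ensembl ID for each symbol
all_ensembl <- mapIds(org.Hs.eg.db,
                      keys = c("TP53", "PTEN"),
                      column = "ENSEMBL",
                      keytype = "SYMBOL",
                      multiVals = "list")

# Number of Ensembl IDs per symbol; values above 1 flag one-to-many mappings
lengths(all_ensembl)
```

The counts can change between annotation releases, so it is worth checking them rather than assuming a one-to-one mapping.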

6 Convert back: Entrez IDs → Gene Symbols

# Round trip: map the Entrez IDs back to symbols as a sanity check
back_to_symbol <- mapIds(org.Hs.eg.db,
                         keys = entrez_ids,
                         column = "SYMBOL",
                         keytype = "ENTREZID",
                         multiVals = "first")
back_to_symbol
##    7157     672     675    5728    3845    1956    4609 
##  "TP53" "BRCA1" "BRCA2"  "PTEN"  "KRAS"  "EGFR"   "MYC"

7 Combine results in a data frame

conversion_table <- data.frame(
  Symbol = genes,
  Entrez = unname(entrez_ids),
  Ensembl = unname(ensembl_ids),
  Symbol_back = unname(back_to_symbol)
)

conversion_table
##   Symbol Entrez         Ensembl Symbol_back
## 1   TP53   7157 ENSG00000141510        TP53
## 2  BRCA1    672 ENSG00000012048       BRCA1
## 3  BRCA2    675 ENSG00000139618       BRCA2
## 4   PTEN   5728 ENSG00000171862        PTEN
## 5   KRAS   3845 ENSG00000133703        KRAS
## 6   EGFR   1956 ENSG00000146648        EGFR
## 7    MYC   4609 ENSG00000136997         MYC
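
The Symbol_back column makes it possible to verify the round trip programmatically instead of by eye. The sketch below rebuilds the first three rows of the table above by hand so that it runs without the annotation package; `mismatch` is an illustrative name.

```r
# First three rows of the conversion table, copied from the output above
conversion_table <- data.frame(
  Symbol      = c("TP53", "BRCA1", "BRCA2"),
  Entrez      = c("7157", "672", "675"),
  Symbol_back = c("TP53", "BRCA1", "BRCA2"),
  stringsAsFactors = FALSE
)

# Flag rows where the round trip failed or returned a different symbol
mismatch <- is.na(conversion_table$Symbol_back) |
  conversion_table$Symbol != conversion_table$Symbol_back

# Rows that need manual review (none for these genes)
conversion_table[mismatch, ]
```

A mismatch usually means the input symbol was an alias or that the gene could not be mapped at all.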

8 Add gene descriptions

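# Convert gene symbols to Entrez ID, Ensembl ID, and description in one call.
# select() returns every match, so a symbol with several Ensembl IDs yields several rows.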
results <- AnnotationDbi::select(org.Hs.eg.db,
                                 keys = genes,
                                 columns = c("ENTREZID", "ENSEMBL", "GENENAME"),
                                 keytype = "SYMBOL")

results
##   SYMBOL ENTREZID         ENSEMBL                                      GENENAME
## 1   TP53     7157 ENSG00000141510                             tumor protein p53
## 2  BRCA1      672 ENSG00000012048                   BRCA1 DNA repair associated
## 3  BRCA2      675 ENSG00000139618                   BRCA2 DNA repair associated
## 4   PTEN     5728 ENSG00000171862                phosphatase and tensin homolog
## 5   PTEN     5728 ENSG00000284792                phosphatase and tensin homolog
## 6   KRAS     3845 ENSG00000133703                   KRAS proto-oncogene, GTPase
## 7   EGFR     1956 ENSG00000146648              epidermal growth factor receptor
## 8    MYC     4609 ENSG00000136997 MYC proto-oncogene, bHLH transcription factor
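
PTEN appears twice in the output above because select() returns one row per mapping, and PTEN is linked to two Ensembl IDs. If one row per gene is needed, the duplicates can be collapsed. This is a minimal base R sketch on a hand-copied subset of the table above; joining the IDs with ";" is one possible choice, not the only one.

```r
# Subset of the select() output above, rebuilt by hand so the example
# runs without the annotation package
results <- data.frame(
  SYMBOL   = c("TP53", "PTEN", "PTEN", "KRAS"),
  ENTREZID = c("7157", "5728", "5728", "3845"),
  ENSEMBL  = c("ENSG00000141510", "ENSG00000171862",
               "ENSG00000284792", "ENSG00000133703"),
  stringsAsFactors = FALSE
)

# Symbols that occur on more than one row have several Ensembl IDs
multi_mapped <- unique(results$SYMBOL[duplicated(results$SYMBOL)])

# Collapse to one row per gene, joining the Ensembl IDs with ";"
collapsed <- aggregate(ENSEMBL ~ SYMBOL + ENTREZID, data = results,
                       FUN = paste, collapse = ";")
collapsed
```

Keeping all IDs in one field preserves the information that multiVals = "first" would discard, at the cost of needing to split the field again later.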

9 Summary

  • org.Hs.eg.db is the standard Bioconductor annotation package for mapping human gene IDs.
  • mapIds() converts between symbols, Entrez IDs, Ensembl IDs, UniProt IDs, etc., returning one value per key; select() returns every match and can fetch several columns at once.
  • Consistent identifiers are essential for integrating data from different sources (e.g., GEO, TCGA, Ensembl, KEGG).