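In genomic analyses, databases and software tools often use
different gene identifiers.
Common types include gene symbols (e.g. TP53), Entrez Gene IDs (e.g. 7157), Ensembl gene IDs (e.g. ENSG00000141510), and UniProt accessions.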
For reproducibility and integration across datasets, we often need to map between these identifiers.
The Bioconductor package org.Hs.eg.db
(https://bioconductor.org/packages/org.Hs.eg.db/)
provides a comprehensive annotation database for the human genome.
It includes mappings between gene symbols, Entrez IDs, Ensembl IDs,
UniProt, and more.
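To see which identifier types a given installation of the package actually offers, the database can be queried directly. The lines below are a minimal sketch, assuming AnnotationDbi and org.Hs.eg.db are installed; the exact set of columns may differ between annotation releases.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Identifier types that can be returned by a query
available_columns <- columns(org.Hs.eg.db)
available_columns

# Identifier types that can be used as lookup keys
available_keytypes <- keytypes(org.Hs.eg.db)
available_keytypes

# A few example keys of one type, here gene symbols
head(keys(org.Hs.eg.db, keytype = "SYMBOL"))
```

The names printed here (for example "SYMBOL", "ENTREZID", "ENSEMBL") are the values passed as column and keytype in the conversions.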
The function mapIds() allows flexible
ID conversion:
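library(AnnotationDbi)
library(org.Hs.eg.db)

# Gene symbols used throughout this example
genes <- c("TP53", "BRCA1", "BRCA2", "PTEN", "KRAS", "EGFR", "MYC")

# Gene symbols → Entrez IDs (keep the first match per symbol)
entrez_ids <- mapIds(org.Hs.eg.db,
                     keys = genes,
                     column = "ENTREZID",
                     keytype = "SYMBOL",
                     multiVals = "first")
entrez_ids
##   TP53  BRCA1  BRCA2   PTEN   KRAS   EGFR    MYC
## "7157"  "672"  "675" "5728" "3845" "1956" "4609"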
ensembl_ids <- mapIds(org.Hs.eg.db,
                      keys = genes,
                      column = "ENSEMBL",
                      keytype = "SYMBOL",
                      multiVals = "first")
ensembl_ids
##              TP53             BRCA1             BRCA2              PTEN
## "ENSG00000141510" "ENSG00000012048" "ENSG00000139618" "ENSG00000171862"
##              KRAS              EGFR               MYC
## "ENSG00000133703" "ENSG00000146648" "ENSG00000136997"
back_to_symbol <- mapIds(org.Hs.eg.db,
                         keys = entrez_ids,
                         column = "SYMBOL",
                         keytype = "ENTREZID",
                         multiVals = "first")
back_to_symbol
##    7157     672     675    5728    3845    1956    4609
##  "TP53" "BRCA1" "BRCA2"  "PTEN"  "KRAS"  "EGFR"   "MYC"
conversion_table <- data.frame(
  Symbol = genes,
  Entrez = unname(entrez_ids),
  Ensembl = unname(ensembl_ids),
  Symbol_back = unname(back_to_symbol)
)
conversion_table
##   Symbol Entrez         Ensembl Symbol_back
## 1   TP53   7157 ENSG00000141510        TP53
## 2  BRCA1    672 ENSG00000012048       BRCA1
## 3  BRCA2    675 ENSG00000139618       BRCA2
## 4   PTEN   5728 ENSG00000171862        PTEN
## 5   KRAS   3845 ENSG00000133703        KRAS
## 6   EGFR   1956 ENSG00000146648        EGFR
## 7    MYC   4609 ENSG00000136997         MYC
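In the table above every symbol maps cleanly and the round trip returns the original symbol, but that is not guaranteed for arbitrary input. The following is a hedged sketch of how unmapped keys and round-trip consistency could be checked. The symbol "NOT_A_GENE" is made up for illustration, and the sketch assumes that mapIds() returns NA for a key it cannot map as long as at least one other key is valid.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# "NOT_A_GENE" is a made-up symbol, included only to show an unmapped key
query <- c("TP53", "BRCA1", "NOT_A_GENE")

entrez <- mapIds(org.Hs.eg.db,
                 keys = query,
                 column = "ENTREZID",
                 keytype = "SYMBOL",
                 multiVals = "first")

# Symbols without an Entrez ID come back as NA
unmapped <- names(entrez)[is.na(entrez)]
unmapped

# Round trip for the symbols that did map
mapped <- entrez[!is.na(entrez)]
symbol_back <- mapIds(org.Hs.eg.db,
                      keys = unname(mapped),
                      column = "SYMBOL",
                      keytype = "ENTREZID",
                      multiVals = "first")

# TRUE if every mapped symbol is recovered unchanged
round_trip_ok <- all(names(mapped) == unname(symbol_back))
round_trip_ok
```

Reporting the unmapped symbols before any downstream join helps avoid silently dropping genes.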
# Convert Gene Symbols → Entrez, Ensembl, Description
results <- AnnotationDbi::select(org.Hs.eg.db,
                                 keys = genes,
                                 columns = c("ENTREZID", "ENSEMBL", "GENENAME"),
                                 keytype = "SYMBOL")
results
##   SYMBOL ENTREZID         ENSEMBL                                      GENENAME
## 1   TP53     7157 ENSG00000141510                             tumor protein p53
## 2  BRCA1      672 ENSG00000012048                   BRCA1 DNA repair associated
## 3  BRCA2      675 ENSG00000139618                   BRCA2 DNA repair associated
## 4   PTEN     5728 ENSG00000171862                phosphatase and tensin homolog
## 5   PTEN     5728 ENSG00000284792                phosphatase and tensin homolog
## 6   KRAS     3845 ENSG00000133703                   KRAS proto-oncogene, GTPase
## 7   EGFR     1956 ENSG00000146648              epidermal growth factor receptor
## 8    MYC     4609 ENSG00000136997 MYC proto-oncogene, bHLH transcription factor
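Note that select() returns one row per mapping, so PTEN appears twice in the output above (two Ensembl IDs), whereas mapIds() with multiVals = "first" keeps only one of them. As a minimal sketch of how such one-to-many cases might be surfaced, mapIds() can be asked for a list instead; which symbols are flagged depends on the annotation release installed.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

genes <- c("TP53", "BRCA1", "BRCA2", "PTEN", "KRAS", "EGFR", "MYC")

# Keep every Ensembl ID per symbol instead of only the first one
ensembl_all <- mapIds(org.Hs.eg.db,
                      keys = genes,
                      column = "ENSEMBL",
                      keytype = "SYMBOL",
                      multiVals = "list")

# Number of Ensembl IDs found for each symbol
n_ids <- lengths(ensembl_all)

# Symbols that map to more than one Ensembl ID
multi_mapped <- names(n_ids)[n_ids > 1]
multi_mapped
```

Other documented values of multiVals, such as "filter" (drop multi-mapped keys) or "asNA" (set them to NA), may be preferable when an ambiguous mapping should not be resolved silently.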
org.Hs.eg.db is a key Bioconductor
package for gene ID mapping in humans.
With mapIds(), we can flexibly
convert between symbols, Entrez IDs, Ensembl IDs, UniProt IDs,
etc.