Bioconductor provides software for the analysis and comprehension of
high-throughput genomic data. It extends R with specialized data
containers such as the ExpressionSet
.
To install Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
To install additional Bioconductor packages, use
BiocManager::install("package_name")
.
Once the base Bioconductor packages have been installed, you can access the vignettes for a specific package as follows:
Please select a vignette:
1: airway - Airway smooth muscle cells
2: Biobase - An introduction to Biobase and ExpressionSets
3: Biobase - esApply Introduction
4: Biobase - Notes for eSet developers
Press “1” to read the first one - it is the foundation of genomics data formats used in R. Or, press “0” to quit.
Object | Origin / Era | Main Use | Structure of Assay Data | Sample Annotations | Feature Annotations | Genomic Coordinates | Metadata | Notes |
---|---|---|---|---|---|---|---|---|
ExpressionSet | Early Bioconductor (microarray era) | Microarray data, early RNA-seq | Matrix (features × samples) | phenoData (AnnotatedDataFrame ) |
featureData (AnnotatedDataFrame ) |
❌ Not built-in | experimentData (MIAME) |
First standardized container; still widely used because many packages expect it. |
SummarizedExperiment (SE) | Modern Bioconductor | RNA-seq, proteomics, general omics | One or more assays (matrix-like, e.g., counts, TPM, logCPM) | colData (DataFrame ) |
rowData (DataFrame ) |
❌ Not required | metadata list |
Designed to be flexible beyond microarrays, supports multiple assays. |
RangedSummarizedExperiment (RSE) | Subclass of SE | Genomics data (RNA-seq, ChIP-seq, ATAC-seq) | Same as SE | Same as SE | rowRanges (GRanges ) |
✅ Yes, built-in via rowRanges |
Same as SE | Extends SE with proper genomic coordinates per feature (chrom, start, end, strand). |
SingleCellExperiment (SCE) | Subclass of SE | Single-cell RNA-seq, multimodal | Multiple assays (counts, logcounts, etc.) | colData for per-cell annotations |
rowData for gene annotations |
Optional (rowRanges ) |
reducedDims , altExps |
Adds dimensionality reduction slots, alternative experiments, useful for scRNA-seq. |
MultiAssayExperiment (MAE) | Higher-level container | Multi-omics (RNA-seq + ATAC-seq + methylation, etc.) | List of multiple SE/RSE objects | Unified sample map across experiments | Experiment-specific | Depends on components | experimentData & metadata |
Organizes several SE-like objects under one umbrella. |
Recall that objects in R can be either a vector
,
factor
, matrix
, array
,
data.frame
, list
, or ts
.
The Biobase package of the Bioconductor project is
fundamental, and established new objects that can be used to store
genomic data.
An ExpressionSet
is a container that organizes:
assayData
— expression matrix (features ×
samples)phenoData
— sample-level annotations (phenotypes,
treatments, conditions)featureData
— feature-level annotations (e.g., genes,
probes)protocolData
— information about the processing
protocolexperimentData
— experiment-level metadataLet’s look at an ExpressionSet
object, built from the
airway dataset.
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples
## element names: exprs, se.exprs
## protocolData: none
## phenoData
## sampleNames: A B ... Z (26 total)
## varLabels: sex type score
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2
For adventurous, let’s peek under the hood to see the slots of the
ExpressionSet object. Access them as
sample.ExpressionSet@experimentData
First, the most important part of the high-throughput genomic
experiment is the matrix of expression values. The underlying structure
of an expression matrix in Bioconductor is that the probes (i.e., genes)
are in rows while the samples are in columns. Let’s read in an example
expression matrix and then store it as an ExpressionSet. Once created,
exprs
is the extractor function that is used to access the
expression values.
Let’s read in a gene expression matrix.
expression <- exprs(sample.ExpressionSet) # This is how to get to the expression matrix !
dim(expression)
## [1] 500 26
## [1] "matrix" "array"
## [1] "AFFX-MurIL2_at" "AFFX-MurIL10_at" "AFFX-MurIL4_at" "AFFX-MurFAS_at"
## [1] "A" "B" "C" "D"
## A B C
## AFFX-MurIL2_at 192.7420 85.75330 176.7570
## AFFX-MurIL10_at 97.1370 126.19600 77.9216
## AFFX-MurIL4_at 45.8192 8.83135 33.0632
## AFFX-MurFAS_at 22.5445 3.60093 14.6883
## AFFX-BioB-5_at 96.7875 30.43800 46.1271
Having just expression values, we can construct minimal expression set.
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples
## element names: exprs
## protocolData: none
## phenoData: none
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
## A B C
## AFFX-MurIL2_at 192.7420 85.75330 176.7570
## AFFX-MurIL10_at 97.1370 126.19600 77.9216
## AFFX-MurIL4_at 45.8192 8.83135 33.0632
## AFFX-MurFAS_at 22.5445 3.60093 14.6883
## AFFX-BioB-5_at 96.7875 30.43800 46.1271
## [1] "AFFX-MurIL2_at" "AFFX-MurIL10_at" "AFFX-MurIL4_at" "AFFX-MurFAS_at"
Phenotypic data provides information about the samples, such as
normal/abnormal, age, gender, etc. The phenotypic data is represented
such that samples appear in rows while the variables appear in columns.
Notice that when including phenotypic data in an ExpressionSet, the
row.names
in the phenoData
must match the
sample names in the expression matrix.
characteristics <- pData(sample.ExpressionSet) # read.csv("lab/data/phenodata.csv", row.names = 1)
head(characteristics)
## sex type score
## A Female Control 0.75
## B Male Case 0.40
## C Male Control 0.73
## D Male Case 0.42
## E Female Case 0.93
## F Male Control 0.22
## sex type score
## Female:11 Case :15 Min. :0.1000
## Male :15 Control:11 1st Qu.:0.3275
## Median :0.4150
## Mean :0.5369
## 3rd Qu.:0.7650
## Max. :0.9800
## [1] TRUE
You will get a warning if there is a mismatch. Before including the
phenoData
into the ExpressionSet
, we may add
some documentation describing information about each covariate (what
does the variable name represent, what units the covariates are measure
in, etc). This is done by creating a metadata table.
# metadata must describe all three columns
metadata <- data.frame(
labelDescription = c("Patient sex (Male or Female)",
"Sample type (Control or Case)",
"Phenotype score (numeric)"),
row.names = c("sex", "type", "score")
)
metadata
## labelDescription
## sex Patient sex (Male or Female)
## type Sample type (Control or Case)
## score Phenotype score (numeric)
# build AnnotatedDataFrame
phenoChar <- new("AnnotatedDataFrame",
data = characteristics,
varMetadata = metadata)
phenoChar
## An object of class 'AnnotatedDataFrame'
## rowNames: A B ... Z (26 total)
## varLabels: sex type score
## varMetadata: labelDescription
## sex type score
## A Female Control 0.75
## B Male Case 0.40
## C Male Control 0.73
## D Male Case 0.42
## E Female Case 0.93
## [1] Female Male Male Male Female
## Levels: Female Male
Once a phenoData set is created, it can be accessed using the pData accessor function. Adding phenoData to samples from your ExpressionSet but ensure the phenotypic characteristics stored with it are properly aligned.
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A B ... Z (26 total)
## varLabels: sex type score
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
## NULL
The following code shows what happens when the phenotypic and expression data do not include matching sample names (output suppressed).
After an analysis, one is usually left with cryptic manufacturer labels of the probes that were significant in your data analysis. To provide meaning to these probes, annotations represent meta data about the probes. The annotation package provides some basic tools for annotation packages.
library(hgu95av2.db)
library(annotate)
# sample.ExpressionSet already includes exprs + phenoData
# just update its annotation slot
withannoSet <- sample.ExpressionSet
annotation(withannoSet) <- "hgu95av2.db"
withannoSet
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples
## element names: exprs, se.exprs
## protocolData: none
## phenoData
## sampleNames: A B ... Z (26 total)
## varLabels: sex type score
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2.db
# Get all probe IDs
all_probes <- featureNames(withannoSet)
# Remove spike-ins (those starting with "AFFX-")
gene_probes <- all_probes[!grepl("^AFFX-", all_probes)]
library(annotate)
# Map symbols/Entrez IDs for actual genes
gene_symbols <- getSYMBOL(gene_probes, "hgu95av2.db")
gene_entrez <- getEG(gene_probes, "hgu95av2.db")
# Inspect first few
head(gene_symbols, n = 20)
## 31307_at 31308_at 31309_r_at 31310_at 31311_at
## NA NA NA "GLRA1" NA
## 31312_at 31313_at 31314_at 31315_at 31316_at
## "KCNB2" "MGAT5" "BMP3" "IGLV2-14" NA
## 31317_r_at 31318_at 31319_at 31320_at 31321_at
## NA "KITLG" "IGKV2OR22-4" "GPR12" NA
## 31322_at 31323_r_at 31324_at 31325_at 31326_at
## "YME1L1" "SLC1A2" NA NA "KPLCE"
## 31307_at 31308_at 31309_r_at 31310_at 31311_at 31312_at
## NA NA NA "2741" NA "9312"
Create featureData
from scratch.
# Assume `expression` is your expression matrix
probes <- rownames(expression)
# Create a data frame with some basic annotations
fdata_df <- data.frame(
ProbeID = probes,
Symbol = getSYMBOL(probes, "hgu95av2.db"),
Entrez = getEG(probes, "hgu95av2.db"),
stringsAsFactors = FALSE
)
# Convert to AnnotatedDataFrame
featureData_adf <- AnnotatedDataFrame(fdata_df)
featureData_adf
## An object of class 'AnnotatedDataFrame'
## rowNames: AFFX-MurIL2_at AFFX-MurIL10_at ... 31739_at (500 total)
## varLabels: ProbeID Symbol Entrez
## varMetadata: labelDescription
Data about the experiment can be stored in the
experimentData
slot.
library(Biobase)
experimentData <- new("MIAME",
name = "Bioconductor Example Author",
lab = "Bioconductor Labs",
contact = "bioc-support@bioconductor.org",
title = "Example Microarray Study using sample.ExpressionSet",
abstract = "Toy dataset included in Biobase demonstrating structure of ExpressionSet with microarray expression data.",
url = "https://bioconductor.org/packages/Biobase",
pubMedIds = "NA",
other = list(notes = "This dataset contains 500 features (probes) across 26 samples, including spike-in controls."))
experimentData
## Experiment data
## Experimenter name: Bioconductor Example Author
## Laboratory: Bioconductor Labs
## Contact information: bioc-support@bioconductor.org
## Title: Example Microarray Study using sample.ExpressionSet
## URL: https://bioconductor.org/packages/Biobase
## PMIDs: NA
##
## Abstract: A 13 word abstract is available. Use 'abstract' method.
## notes:
## notes:
## This dataset contains 500 features (probes) across 26 samples, including s
## pike-in controls.
## [1] "Toy dataset included in Biobase demonstrating structure of ExpressionSet with microarray expression data."
## $notes
## [1] "This dataset contains 500 features (probes) across 26 samples, including spike-in controls."
# Construct the ExpressionSet with featureData
withexpSet <- ExpressionSet(
assayData = as.matrix(expression),
phenoData = phenoChar,
featureData = featureData_adf,
annotation = "hgu95av2.db",
experimentData= experimentData
)
withexpSet
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A B ... Z (26 total)
## varLabels: sex type score
## varMetadata: labelDescription
## featureData
## featureNames: AFFX-MurIL2_at AFFX-MurIL10_at ... 31739_at (500 total)
## fvarLabels: ProbeID Symbol Entrez
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: NA
## Annotation: hgu95av2.db
## Experiment data
## Experimenter name: Bioconductor Example Author
## Laboratory: Bioconductor Labs
## Contact information: bioc-support@bioconductor.org
## Title: Example Microarray Study using sample.ExpressionSet
## URL: https://bioconductor.org/packages/Biobase
## PMIDs: NA
##
## Abstract: A 13 word abstract is available. Use 'abstract' method.
## notes:
## notes:
## This dataset contains 500 features (probes) across 26 samples, including s
## pike-in controls.
## [1] "Toy dataset included in Biobase demonstrating structure of ExpressionSet with microarray expression data."
## A B C D E
## AFFX-MurIL2_at 192.7420 85.75330 176.7570 135.5750 64.49390
## AFFX-MurIL10_at 97.1370 126.19600 77.9216 93.3713 24.39860
## AFFX-MurIL4_at 45.8192 8.83135 33.0632 28.7072 5.94492
## AFFX-MurFAS_at 22.5445 3.60093 14.6883 12.3397 36.86630
## AFFX-BioB-5_at 96.7875 30.43800 46.1271 70.9319 56.17440
## sex type score
## A Female Control 0.75
## B Male Case 0.40
## C Male Control 0.73
## D Male Case 0.42
## E Female Case 0.93
## F Male Control 0.22
## G Male Case 0.96
## H Male Case 0.79
## I Female Case 0.37
## J Male Control 0.63
## K Male Case 0.26
## L Female Control 0.36
## M Male Case 0.41
## N Male Case 0.80
## O Female Case 0.10
## P Female Control 0.41
## Q Female Case 0.16
## R Male Control 0.72
## S Male Case 0.17
## T Female Case 0.74
## U Male Control 0.35
## V Female Control 0.77
## W Male Control 0.27
## X Male Control 0.98
## Y Female Case 0.94
## Z Female Case 0.32
## An object of class 'AnnotatedDataFrame'
## featureNames: AFFX-MurIL2_at AFFX-MurIL10_at ... 31739_at (500 total)
## varLabels: ProbeID Symbol Entrez
## varMetadata: labelDescription
The next generation of an object that can hold annotated ‘omics’ data
is SummarizedExperiment
. It is not limited to genes, but
instead holds information about genomic regions of interest.
We’ll look at an example of the SummarizedExperiment
object in the airway package. The loaded data is a
SummarizedExperiment
, which contains RNA-seq read counts
for human airway smooth muscle cells treated with dexamethasone or
control. The SummarizedExperiment
object has 63,677
rows, which are genes (features), and 8
columns, which are samples. The count data are stored in the
assay called counts
. Each row has metadata in
rowData(se)
, including Ensembl gene IDs, and each column
has sample-level metadata in colData(se)
, including cell
line, treatment, and other phenotypic information. This object structure
allows easy access to expression values, feature annotations, and sample
annotations in a consistent way.
### SummarizedExperiment demo with airway
library(airway)
# Load airway dataset (RNA-seq counts)
data("airway")
airway
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
assay
function can be used get access to the counts of
RNA sequencing reads. colData
function , the column data,
is equivalent to the pData
on the
ExpressionSet
. Each row in this data frame corresponds to a
column in the SummarizedExperiment
. We can see that there
are indeed 27 rows here, which give information about the columns. Each
sample in this case is treated with two treatments or control and we can
see the number of replicates for each, using the as.numeric function
again.
## [1] 63677 8
## SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003 679 448 873
## ENSG00000000005 0 0 0
## ENSG00000000419 467 515 621
# Dimensions of this assay is a matrix, which has the same dimensions as the SummarizedExperiment.
dim(assay(se))
## [1] 63677 8
## DataFrame with 3 rows and 6 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## [1] 8 9
## [1] "SampleName" "cell" "dex" "albut" "Run"
## [6] "avgLength" "Experiment" "Sample" "BioSample"
## [1] untrt trt untrt trt untrt trt untrt trt
## Levels: trt untrt
See https://bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html for the full description.
library(dplyr)
library(pander)
diagnostics <- devtools::session_info()
platform <- data.frame(diagnostics$platform %>% unlist, stringsAsFactors = FALSE)
colnames(platform) <- c("description")
pander(platform)
description | |
---|---|
version | R version 4.5.1 (2025-06-13) |
os | macOS Sequoia 15.6.1 |
system | aarch64, darwin20 |
ui | X11 |
language | (EN) |
collate | en_US.UTF-8 |
ctype | en_US.UTF-8 |
tz | America/New_York |
date | 2025-09-17 |
pandoc | 3.4 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown) |
quarto | 1.6.42 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/quarto |
package | ondiskversion | loadedversion | path |
loadedpath | attached | is_base | date | source |
md5ok | library |