1 Bioconductor basics

Bioconductor provides software for the analysis and comprehension of high-throughput genomic data. It extends R with specialized data containers such as the ExpressionSet.

To install Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

To install additional Bioconductor packages, use BiocManager::install("package_name").

2 Bioconductor basics

Once the base Bioconductor packages have been installed, you can access the vignettes for a specific package as follows:

# BiocManager::install("Biobase")
library("Biobase")
openVignette()

Please select a vignette:  

 1: airway - Airway smooth muscle cells
 2: Biobase - An introduction to Biobase and ExpressionSets
 3: Biobase - esApply Introduction
 4: Biobase - Notes for eSet developers

Press “1” to read the first one - it is the foundation of genomics data formats used in R. Or, press “0” to quit.

3 Most common Bioconductor objects

Object	Origin / Era	Main Use	Structure of Assay Data	Sample Annotations	Feature Annotations	Genomic Coordinates	Metadata	Notes
ExpressionSet	Early Bioconductor (microarray era)	Microarray data, early RNA-seq	Matrix (features × samples)	`phenoData` (`AnnotatedDataFrame`)	`featureData` (`AnnotatedDataFrame`)	❌ Not built-in	`experimentData` (MIAME)	First standardized container; still widely used because many packages expect it.
SummarizedExperiment (SE)	Modern Bioconductor	RNA-seq, proteomics, general omics	One or more assays (matrix-like, e.g., counts, TPM, logCPM)	`colData` (`DataFrame`)	`rowData` (`DataFrame`)	❌ Not required	`metadata` list	Designed to be flexible beyond microarrays, supports multiple assays.
RangedSummarizedExperiment (RSE)	Subclass of SE	Genomics data (RNA-seq, ChIP-seq, ATAC-seq)	Same as SE	Same as SE	`rowRanges` (`GRanges`)	✅ Yes, built-in via `rowRanges`	Same as SE	Extends SE with proper genomic coordinates per feature (chrom, start, end, strand).
SingleCellExperiment (SCE)	Subclass of SE	Single-cell RNA-seq, multimodal	Multiple assays (counts, logcounts, etc.)	`colData` for per-cell annotations	`rowData` for gene annotations	Optional (`rowRanges`)	`reducedDims`, `altExps`	Adds dimensionality reduction slots, alternative experiments, useful for scRNA-seq.
MultiAssayExperiment (MAE)	Higher-level container	Multi-omics (RNA-seq + ATAC-seq + methylation, etc.)	List of multiple SE/RSE objects	Unified sample map across experiments	Experiment-specific	Depends on components	`experimentData` & `metadata`	Organizes several SE-like objects under one umbrella.

4 Key Points

ExpressionSet: Legacy, microarray-focused, standardized assays + sample + feature + experiment metadata.
SummarizedExperiment: General-purpose, flexible, supports multiple assays, modern replacement.
RangedSummarizedExperiment: Adds genomic ranges to rows → best for sequencing data with genomic positions.
SingleCellExperiment: Tailored for single-cell; adds PCA/UMAP/t-SNE slots and multimodal support.
MultiAssayExperiment: Higher-level container that combines multiple SEs/RSEs across omics types.

5 ExpressionSet

Recall that objects in R can be either a vector, factor, matrix, array, data.frame, list, or ts.
The Biobase package of the Bioconductor project is fundamental, and established new objects that can be used to store genomic data.
An ExpressionSet is a container that organizes:

assayData — expression matrix (features × samples)
phenoData — sample-level annotations (phenotypes, treatments, conditions)
featureData — feature-level annotations (e.g., genes, probes)
protocolData — information about the processing protocol
experimentData — experiment-level metadata

Let’s look at an ExpressionSet object, built from the airway dataset.

library(Biobase)

data("sample.ExpressionSet")
sample.ExpressionSet

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples 
##   element names: exprs, se.exprs 
## protocolData: none
## phenoData
##   sampleNames: A B ... Z (26 total)
##   varLabels: sex type score
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2

For adventurous, let’s peek under the hood to see the slots of the ExpressionSet object. Access them as sample.ExpressionSet@experimentData

5.1 assayData (gene expression)

First, the most important part of the high-throughput genomic experiment is the matrix of expression values. The underlying structure of an expression matrix in Bioconductor is that the probes (i.e., genes) are in rows while the samples are in columns. Let’s read in an example expression matrix and then store it as an ExpressionSet. Once created, exprs is the extractor function that is used to access the expression values.

Let’s read in a gene expression matrix.

expression <- exprs(sample.ExpressionSet) # This is how to get to the expression matrix !

dim(expression)

## [1] 500  26

class(expression)

## [1] "matrix" "array"

rownames(expression)[1:4]

## [1] "AFFX-MurIL2_at"  "AFFX-MurIL10_at" "AFFX-MurIL4_at"  "AFFX-MurFAS_at"

colnames(expression)[1:4]

## [1] "A" "B" "C" "D"

expression[1:5, 1:3]

##                        A         B        C
## AFFX-MurIL2_at  192.7420  85.75330 176.7570
## AFFX-MurIL10_at  97.1370 126.19600  77.9216
## AFFX-MurIL4_at   45.8192   8.83135  33.0632
## AFFX-MurFAS_at   22.5445   3.60093  14.6883
## AFFX-BioB-5_at   96.7875  30.43800  46.1271

Having just expression values, we can construct minimal expression set.

minimalSet <- ExpressionSet(assayData = as.matrix(expression))
minimalSet

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

exprs(minimalSet)[1:5, 1:3]

##                        A         B        C
## AFFX-MurIL2_at  192.7420  85.75330 176.7570
## AFFX-MurIL10_at  97.1370 126.19600  77.9216
## AFFX-MurIL4_at   45.8192   8.83135  33.0632
## AFFX-MurFAS_at   22.5445   3.60093  14.6883
## AFFX-BioB-5_at   96.7875  30.43800  46.1271

featureNames(minimalSet)[1:4]

## [1] "AFFX-MurIL2_at"  "AFFX-MurIL10_at" "AFFX-MurIL4_at"  "AFFX-MurFAS_at"

5.2 phenoData (sample annotations)

Phenotypic data provides information about the samples, such as normal/abnormal, age, gender, etc. The phenotypic data is represented such that samples appear in rows while the variables appear in columns. Notice that when including phenotypic data in an ExpressionSet, the row.names in the phenoData must match the sample names in the expression matrix.

characteristics <- pData(sample.ExpressionSet) # read.csv("lab/data/phenodata.csv", row.names = 1)
head(characteristics)

##      sex    type score
## A Female Control  0.75
## B   Male    Case  0.40
## C   Male Control  0.73
## D   Male    Case  0.42
## E Female    Case  0.93
## F   Male Control  0.22

summary(characteristics)

##      sex          type        score       
##  Female:11   Case   :15   Min.   :0.1000  
##  Male  :15   Control:11   1st Qu.:0.3275  
##                           Median :0.4150  
##                           Mean   :0.5369  
##                           3rd Qu.:0.7650  
##                           Max.   :0.9800

all.equal(rownames(characteristics), colnames(expression))

## [1] TRUE

You will get a warning if there is a mismatch. Before including the phenoData into the ExpressionSet, we may add some documentation describing information about each covariate (what does the variable name represent, what units the covariates are measure in, etc). This is done by creating a metadata table.

# metadata must describe all three columns
metadata <- data.frame(
  labelDescription = c("Patient sex (Male or Female)",
                       "Sample type (Control or Case)",
                       "Phenotype score (numeric)"),
  row.names = c("sex", "type", "score")
)
metadata

##                    labelDescription
## sex    Patient sex (Male or Female)
## type  Sample type (Control or Case)
## score     Phenotype score (numeric)

# build AnnotatedDataFrame
phenoChar <- new("AnnotatedDataFrame", 
                 data = characteristics, 
                 varMetadata = metadata)

phenoChar

## An object of class 'AnnotatedDataFrame'
##   rowNames: A B ... Z (26 total)
##   varLabels: sex type score
##   varMetadata: labelDescription

pData(phenoChar)[1:5, ]

##      sex    type score
## A Female Control  0.75
## B   Male    Case  0.40
## C   Male Control  0.73
## D   Male    Case  0.42
## E Female    Case  0.93

pData(phenoChar)$sex[1:5]

## [1] Female Male   Male   Male   Female
## Levels: Female Male

Once a phenoData set is created, it can be accessed using the pData accessor function. Adding phenoData to samples from your ExpressionSet but ensure the phenotypic characteristics stored with it are properly aligned.

anotherSet <- ExpressionSet(assayData = as.matrix(expression), phenoData = phenoChar)
anotherSet

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A B ... Z (26 total)
##   varLabels: sex type score
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

males <- anotherSet[, pData(anotherSet)$Gender == "Male"]
pData(males)$Gender

## NULL

The following code shows what happens when the phenotypic and expression data do not include matching sample names (output suppressed).

phony.pheno <- characteristics
rownames(phony.pheno)[1] <- "wrong.sample.name"
phenoPhony <- new("AnnotatedDataFrame", data = phony.pheno, varMetadata = metadata)
phony.pheno[1:3, ]
pData(phenoPhony)[1:3, ]
errorSet <- ExpressionSet(assayData = as.matrix(expression), phenoData = phenoPhony)

5.3 Annotation (featureData, annotation)

After an analysis, one is usually left with cryptic manufacturer labels of the probes that were significant in your data analysis. To provide meaning to these probes, annotations represent meta data about the probes. The annotation package provides some basic tools for annotation packages.

library(hgu95av2.db)
library(annotate)

# sample.ExpressionSet already includes exprs + phenoData
# just update its annotation slot
withannoSet <- sample.ExpressionSet
annotation(withannoSet) <- "hgu95av2.db"
withannoSet

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples 
##   element names: exprs, se.exprs 
## protocolData: none
## phenoData
##   sampleNames: A B ... Z (26 total)
##   varLabels: sex type score
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2.db

# Get all probe IDs
all_probes <- featureNames(withannoSet)

# Remove spike-ins (those starting with "AFFX-")
gene_probes <- all_probes[!grepl("^AFFX-", all_probes)]

library(annotate)
# Map symbols/Entrez IDs for actual genes
gene_symbols <- getSYMBOL(gene_probes, "hgu95av2.db")
gene_entrez  <- getEG(gene_probes, "hgu95av2.db")

# Inspect first few
head(gene_symbols, n = 20)

##      31307_at      31308_at    31309_r_at      31310_at      31311_at 
##            NA            NA            NA       "GLRA1"            NA 
##      31312_at      31313_at      31314_at      31315_at      31316_at 
##       "KCNB2"       "MGAT5"        "BMP3"    "IGLV2-14"            NA 
##    31317_r_at      31318_at      31319_at      31320_at      31321_at 
##            NA       "KITLG" "IGKV2OR22-4"       "GPR12"            NA 
##      31322_at    31323_r_at      31324_at      31325_at      31326_at 
##      "YME1L1"      "SLC1A2"            NA            NA       "KPLCE"

head(gene_entrez)

##   31307_at   31308_at 31309_r_at   31310_at   31311_at   31312_at 
##         NA         NA         NA     "2741"         NA     "9312"

Create featureData from scratch.

# Assume `expression` is your expression matrix
probes <- rownames(expression)

# Create a data frame with some basic annotations
fdata_df <- data.frame(
  ProbeID = probes,
  Symbol  = getSYMBOL(probes, "hgu95av2.db"),
  Entrez  = getEG(probes, "hgu95av2.db"),
  stringsAsFactors = FALSE
)

# Convert to AnnotatedDataFrame
featureData_adf <- AnnotatedDataFrame(fdata_df)
featureData_adf

## An object of class 'AnnotatedDataFrame'
##   rowNames: AFFX-MurIL2_at AFFX-MurIL10_at ... 31739_at (500 total)
##   varLabels: ProbeID Symbol Entrez
##   varMetadata: labelDescription

5.4 experimentData

Data about the experiment can be stored in the experimentData slot.

library(Biobase)

experimentData <- new("MIAME", 
                      name      = "Bioconductor Example Author", 
                      lab       = "Bioconductor Labs", 
                      contact   = "bioc-support@bioconductor.org", 
                      title     = "Example Microarray Study using sample.ExpressionSet", 
                      abstract  = "Toy dataset included in Biobase demonstrating structure of ExpressionSet with microarray expression data.", 
                      url       = "https://bioconductor.org/packages/Biobase", 
                      pubMedIds = "NA", 
                      other     = list(notes = "This dataset contains 500 features (probes) across 26 samples, including spike-in controls."))

experimentData

## Experiment data
##   Experimenter name: Bioconductor Example Author 
##   Laboratory: Bioconductor Labs 
##   Contact information: bioc-support@bioconductor.org 
##   Title: Example Microarray Study using sample.ExpressionSet 
##   URL: https://bioconductor.org/packages/Biobase 
##   PMIDs: NA 
## 
##   Abstract: A 13 word abstract is available. Use 'abstract' method.
##   notes:
##    notes:     
##       This dataset contains 500 features (probes) across 26 samples, including s
## pike-in controls.

abstract(experimentData)

## [1] "Toy dataset included in Biobase demonstrating structure of ExpressionSet with microarray expression data."

notes(experimentData)

## $notes
## [1] "This dataset contains 500 features (probes) across 26 samples, including spike-in controls."

5.4.1 Putting it all together

# Construct the ExpressionSet with featureData
withexpSet <- ExpressionSet(
  assayData     = as.matrix(expression),
  phenoData     = phenoChar,
  featureData   = featureData_adf,
  annotation    = "hgu95av2.db",
  experimentData= experimentData
)

withexpSet

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: A B ... Z (26 total)
##   varLabels: sex type score
##   varMetadata: labelDescription
## featureData
##   featureNames: AFFX-MurIL2_at AFFX-MurIL10_at ... 31739_at (500 total)
##   fvarLabels: ProbeID Symbol Entrez
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: NA 
## Annotation: hgu95av2.db

experimentData(withexpSet)

## Experiment data
##   Experimenter name: Bioconductor Example Author 
##   Laboratory: Bioconductor Labs 
##   Contact information: bioc-support@bioconductor.org 
##   Title: Example Microarray Study using sample.ExpressionSet 
##   URL: https://bioconductor.org/packages/Biobase 
##   PMIDs: NA 
## 
##   Abstract: A 13 word abstract is available. Use 'abstract' method.
##   notes:
##    notes:     
##       This dataset contains 500 features (probes) across 26 samples, including s
## pike-in controls.

abstract(experimentData(withexpSet))

## [1] "Toy dataset included in Biobase demonstrating structure of ExpressionSet with microarray expression data."

exprs(withexpSet)[1:5, 1:5]

##                        A         B        C        D        E
## AFFX-MurIL2_at  192.7420  85.75330 176.7570 135.5750 64.49390
## AFFX-MurIL10_at  97.1370 126.19600  77.9216  93.3713 24.39860
## AFFX-MurIL4_at   45.8192   8.83135  33.0632  28.7072  5.94492
## AFFX-MurFAS_at   22.5445   3.60093  14.6883  12.3397 36.86630
## AFFX-BioB-5_at   96.7875  30.43800  46.1271  70.9319 56.17440

pData(withexpSet)

##      sex    type score
## A Female Control  0.75
## B   Male    Case  0.40
## C   Male Control  0.73
## D   Male    Case  0.42
## E Female    Case  0.93
## F   Male Control  0.22
## G   Male    Case  0.96
## H   Male    Case  0.79
## I Female    Case  0.37
## J   Male Control  0.63
## K   Male    Case  0.26
## L Female Control  0.36
## M   Male    Case  0.41
## N   Male    Case  0.80
## O Female    Case  0.10
## P Female Control  0.41
## Q Female    Case  0.16
## R   Male Control  0.72
## S   Male    Case  0.17
## T Female    Case  0.74
## U   Male Control  0.35
## V Female Control  0.77
## W   Male Control  0.27
## X   Male Control  0.98
## Y Female    Case  0.94
## Z Female    Case  0.32

featureData(withexpSet)

## An object of class 'AnnotatedDataFrame'
##   featureNames: AFFX-MurIL2_at AFFX-MurIL10_at ... 31739_at (500 total)
##   varLabels: ProbeID Symbol Entrez
##   varMetadata: labelDescription

6 SummarizedExperiment

The next generation of an object that can hold annotated ‘omics’ data is SummarizedExperiment. It is not limited to genes, but instead holds information about genomic regions of interest.

library(SummarizedExperiment)
?SummarizedExperiment

We’ll look at an example of the SummarizedExperiment object in the airway package. The loaded data is a SummarizedExperiment, which contains RNA-seq read counts for human airway smooth muscle cells treated with dexamethasone or control. The SummarizedExperiment object has 63,677 rows, which are genes (features), and 8 columns, which are samples. The count data are stored in the assay called counts. Each row has metadata in rowData(se), including Ensembl gene IDs, and each column has sample-level metadata in colData(se), including cell line, treatment, and other phenotypic information. This object structure allows easy access to expression values, feature annotations, and sample annotations in a consistent way.

### SummarizedExperiment demo with airway
library(airway)

# Load airway dataset (RNA-seq counts)
data("airway")
airway

## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

se <- airway  # Treat as SummarizedExperiment

assay function can be used get access to the counts of RNA sequencing reads. colData function , the column data, is equivalent to the pData on the ExpressionSet. Each row in this data frame corresponds to a column in the SummarizedExperiment. We can see that there are indeed 27 rows here, which give information about the columns. Each sample in this case is treated with two treatments or control and we can see the number of replicates for each, using the as.numeric function again.

# Dimension of the SummarizedExperiment
dim(se)

## [1] 63677     8

# Get access to the counts of RNA sequencing reads, using assay function.
assay(se)[1:3,1:3]

##                 SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003        679        448        873
## ENSG00000000005          0          0          0
## ENSG00000000419        467        515        621

# Dimensions of this assay is a matrix, which has the same dimensions as the SummarizedExperiment.
dim(assay(se))

## [1] 63677     8

# Get information about samples
colData(se)[1:3,1:6]

## DataFrame with 3 rows and 6 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126

# dimension of column data
dim(colData(se))

## [1] 8 9

# characteristics of the samples
names(colData(se))

## [1] "SampleName" "cell"       "dex"        "albut"      "Run"       
## [6] "avgLength"  "Experiment" "Sample"     "BioSample"

# Get access to treatment column of sample characteristics
colData(se)$dex

## [1] untrt trt   untrt trt   untrt trt   untrt trt  
## Levels: trt untrt

See https://bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html for the full description.

7 Diagnostics

library(dplyr)
library(pander)
diagnostics <- devtools::session_info()
platform <- data.frame(diagnostics$platform %>% unlist, stringsAsFactors = FALSE)
colnames(platform) <- c("description")
pander(platform)

	description
version	R version 4.5.1 (2025-06-13)
os	macOS Sequoia 15.6.1
system	aarch64, darwin20
ui	X11
language	(EN)
collate	en_US.UTF-8
ctype	en_US.UTF-8
tz	America/New_York
date	2025-09-17
pandoc	3.4 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
quarto	1.6.42 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/quarto

packages <- as.data.frame(diagnostics$packages)
pander(packages[ packages$`*` == "*", ])

Table continues below
package	ondiskversion	loadedversion	path

Table continues below
loadedpath	attached	is_base	date	source

md5ok

library

ExpressionSet and SummarizedExperiment

Mikhail Dozmorov

2025-09-17