In RNA-seq (and originally microarrays), intensity-dependent
bias can occur, where the log-ratio of expression values
(M) depends on the average intensity (A).
This creates a curvature in the MA plot.
We can remove such bias using parametric regression or more commonly LOWESS normalization.
set.seed(123)
# Simulate average expression (A values, log2 scale ~ 3–16 typical range)
x <- runif(50, 0, 16)
# Simulate M values (log fold-changes), with nonlinear bias + random noise
y <- 1 + 0.5 * exp(-0.5 * x) + rnorm(50, 0, 0.1)
# Organize into data frame
data <- data.frame(A = x, M = y)
data <- data[order(data$A), ]
head(data)
## A M
## 35 0.3938190 1.432227
## 18 0.6729525 1.230602
## 6 0.7289040 1.389935
## 15 1.6467949 1.181422
## 46 2.2208970 1.115602
## 41 2.2848004 1.189879
plot(data$A, data$M,
main = "Simulated MA plot with dye bias",
xlab = "A (average log2 expression)",
ylab = "M (log2 fold-change)")
curve(1 + 0.5*exp(-0.5*x), add = TRUE, col = "blue", lwd = 2)
The curve shows the systematic non-linear bias added to the simulated data.
In real data, we usually don’t know the exact functional form of the bias. Here, we attempt to recover the curve using nonlinear least squares (NLS).
# Fit a parametric exponential decay model
fit <- nls(M ~ a + b * exp(-c * A),
data = data,
start = list(a = 1, b = 1, c = 1))
summary(fit)
##
## Formula: M ~ a + b * exp(-c * A)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 1.01708 0.01757 57.896 < 2e-16 ***
## b 0.48989 0.10449 4.688 2.39e-05 ***
## c 0.60544 0.19126 3.166 0.00272 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09482 on 47 degrees of freedom
##
## Number of iterations to convergence: 4
## Achieved convergence tolerance: 4.028e-07
plot(data$A, data$M,
main = "Simulated MA plot with dye bias",
xlab = "A (average log2 expression)",
ylab = "M (log2 fold-change)")
# Add fitted curve
lines(data$A, predict(fit), col = "red", lwd = 2)
We subtract the fitted curve from the observed M values to remove the intensity-dependent bias.
# Corrected M values
M.norm <- data$M - predict(fit)
plot(data$A, M.norm,
main = "Normalized MA plot (parametric fit)",
xlab = "A (average log2 expression)",
ylab = "M (bias-corrected log2 fold-change)")
abline(h = 0, col = "darkgreen", lwd = 2)
Now, the corrected values are centered around zero, as expected if most genes are not differentially expressed.
limma,
cyclic loess).LOESS (Locally Estimated Scatterplot Smoothing) is a
non-parametric regression method.
It works by:
- Selecting a local neighborhood of points around the
target.
- Assigning weights (closer points get higher
weight).
- Fitting a weighted linear regression.
- Repeating for each point of interest.
Here we illustrate the procedure “by hand” using the
wine dataset.
# install.packages("HDclassif") # if not already installed
library(HDclassif)
# Load wine dataset
data(wine)
# Rename some variables for clarity
colnames(wine)[colnames(wine) == "V1"] <- "Alcohol"
colnames(wine)[colnames(wine) == "V10"] <- "Color"
head(wine[, c("Alcohol", "Color")])
## Alcohol Color
## 1 14.23 5.64
## 2 13.20 4.38
## 3 13.16 5.68
## 4 14.37 7.80
## 5 13.24 4.32
## 6 14.20 6.75
attach(wine)
plot(Color, Alcohol,
main = "Alcohol vs. Color (wine dataset)",
xlab = "Color Intensity",
ylab = "Alcohol")
# Pick observation 50
points(Color[50], Alcohol[50], pch=16, col="red")
Color[50] # x value
## [1] 8.9
Alcohol[50] # y value
## [1] 13.94
We compute distances along the x-axis (Color) to all other points.
distance <- abs(Color - Color[50])
head(distance)
## [1] 3.26 4.52 3.22 1.10 4.58 2.15
Let’s choose the 50 nearest neighbors.
neighbor <- ifelse(rank(distance) <= 50, 1, 0)
plot(Color, Alcohol,
main = "Alcohol vs. Color (wine dataset)",
xlab = "Color Intensity",
ylab = "Alcohol")
# Plot neighbors
points(Color[neighbor == 1], Alcohol[neighbor == 1], pch = 19)
Weights are assigned using the tricube kernel:
\[ w_i = \begin{cases} (1 - u_i^3)^3, & 0 \le u_i < 1 \\ 0, & \text{otherwise} \end{cases} \]
where \(u_i = \frac{\text{distance to target}}{\text{max distance in neighborhood}}\).
delta.50 <- max(distance[neighbor == 1])
u <- ifelse(neighbor == 1, distance / delta.50, NA)
weight <- ifelse(u >= 0 & u < 1, (1 - u^3)^3, 0)
# Combine into table
combined_data <- data.frame(distance, neighbor, u, weight)
combined_data[45:55, ]
## distance neighbor u weight
## 45 3.86 0 NA NA
## 46 3.66 0 NA NA
## 47 4.00 0 NA NA
## 48 2.80 1 0.9271523 0.008366574
## 49 2.70 1 0.8940397 0.023243736
## 50 0.00 1 0.0000000 1.000000000
## 51 1.70 1 0.5629139 0.554659147
## 52 3.30 0 NA NA
## 53 1.85 1 0.6125828 0.456752912
## 54 2.60 1 0.8609272 0.047392581
## 55 3.05 0 NA NA
We fit a local weighted linear regression around observation 50.
plot(Color, Alcohol,
main = "Alcohol vs. Color (wine dataset)",
xlab = "Color Intensity",
ylab = "Alcohol")
local.50 <- lm(Alcohol ~ Color, weights = weight)
abline(local.50, col="red")
Let’s repeat for observation 30.
plot(Color, Alcohol,
main = "Alcohol vs. Color (wine dataset)",
xlab = "Color Intensity",
ylab = "Alcohol")
# Mark observation 30
points(Color[30], Alcohol[30], pch=16, col="green")
# Distances
distance.30 <- abs(Color - Color[30])
neighbor.30 <- ifelse(rank(distance.30) <= 50, 1, 0)
# Plot neighbors
points(Color[neighbor.30 == 1], Alcohol[neighbor.30 == 1], pch = 19, col="blue")
# Compute weights
delta.30 <- max(distance.30[neighbor.30 == 1])
u.30 <- ifelse(neighbor.30 == 1, distance.30 / delta.30, NA)
weight.30 <- ifelse(u.30 >= 0 & u.30 < 1, (1 - u.30^3)^3, 0)
# Local regression
local.30 <- lm(Alcohol ~ Color, weights = weight.30)
abline(local.30, lty=2, col="green")
Of course, we usually use the built-in loess() function
rather than computing by hand.
out <- loess(Alcohol ~ Color, data = wine, span = 0.4, degree = 1)
# Plot original data
plot(Color, Alcohol, main="LOESS fit vs. raw data",
xlab="Color Intensity", ylab="Alcohol")
# Add LOESS smoothed line
lines(sort(out$x), out$fitted[order(out$x)], col="red", lwd=2)
# Try larger span (smoother curve)
out2 <- loess(Alcohol ~ Color, data = wine, span = 0.7, degree = 1)
lines(sort(out2$x), out2$fitted[order(out2$x)], col="blue", lwd=2)
legend("topright", legend=c("Span 0.4","Span 0.7"),
col=c("red","blue"), lwd=2)
loess() function automates this process,
with the span parameter controlling smoothness.