0.1 Introduction

In RNA-seq (and originally in two-color microarrays), intensity-dependent bias can occur: the log-ratio of expression between samples (M) varies systematically with the average log-intensity (A).

This shows up as curvature in the MA plot.

We can remove such bias with parametric regression or, more commonly, with LOWESS normalization.

0.2 Simulate RNA-seq MA data with bias

set.seed(123)

# Simulate average expression (A values on a log2 scale, drawn uniformly over 0–16)
x <- runif(50, 0, 16)

# Simulate M values (log fold-changes), with nonlinear bias + random noise
y <- 1 + 0.5 * exp(-0.5 * x) + rnorm(50, 0, 0.1)

# Organize into data frame
data <- data.frame(A = x, M = y)
data <- data[order(data$A), ]

head(data)
##            A        M
## 35 0.3938190 1.432227
## 18 0.6729525 1.230602
## 6  0.7289040 1.389935
## 15 1.6467949 1.181422
## 46 2.2208970 1.115602
## 41 2.2848004 1.189879

0.3 Visualize the biased MA plot

plot(data$A, data$M,
     main = "Simulated MA plot with dye bias",
     xlab = "A (average log2 expression)",
     ylab = "M (log2 fold-change)")
curve(1 + 0.5*exp(-0.5*x), add = TRUE, col = "blue", lwd = 2)

The curve shows the systematic non-linear bias added to the simulated data.

0.4 Parametric nonlinear fit

In real data, we usually don’t know the exact functional form of the bias. Here, we attempt to recover the curve using nonlinear least squares (NLS).

# Fit a parametric exponential decay model
fit <- nls(M ~ a + b * exp(-c * A),
           data = data,
           start = list(a = 1, b = 1, c = 1))

summary(fit)
## 
## Formula: M ~ a + b * exp(-c * A)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a  1.01708    0.01757  57.896  < 2e-16 ***
## b  0.48989    0.10449   4.688 2.39e-05 ***
## c  0.60544    0.19126   3.166  0.00272 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09482 on 47 degrees of freedom
## 
## Number of iterations to convergence: 4 
## Achieved convergence tolerance: 4.028e-07
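Since we simulated the data ourselves, we can compare the estimates with the true parameter values (a check added here, not part of the original output):

# Compare estimated coefficients with the values used in the simulation
coef(fit)                    # a ≈ 1.02, b ≈ 0.49, c ≈ 0.61 (from the summary above)
c(a = 1, b = 0.5, c = 0.5)   # true values used to generate the data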
plot(data$A, data$M,
     main = "Simulated MA plot with dye bias",
     xlab = "A (average log2 expression)",
     ylab = "M (log2 fold-change)")
# Add fitted curve
lines(data$A, predict(fit), col = "red", lwd = 2)

0.5 Bias correction

We subtract the fitted curve from the observed M values to remove the intensity-dependent bias.

# Corrected M values
M.norm <- data$M - predict(fit)

plot(data$A, M.norm,
     main = "Normalized MA plot (parametric fit)",
     xlab = "A (average log2 expression)",
     ylab = "M (bias-corrected log2 fold-change)")
abline(h = 0, col = "darkgreen", lwd = 2)
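As a quick numerical check (not in the original output), the corrected values can be summarized; they should sit close to zero:

# Summary of the bias-corrected M values
summary(M.norm)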

Now, the corrected values are centered around zero, as expected if most genes are not differentially expressed.

0.6 Why LOWESS instead of a parametric model?

The exponential model worked here because we knew the true form of the bias. In real data we rarely do, and a misspecified parametric model can leave residual curvature. LOWESS (locally weighted regression) estimates the M-vs-A trend directly from the data without assuming any particular functional form, which is why it is the more common choice for this kind of normalization.
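As an illustrative sketch (not part of the original analysis), the same correction can be applied with base R's lowess(); the smoother span f is a tuning choice, and the value below is simply the base-R default.

# Estimate the M-vs-A trend non-parametrically with lowess()
lo <- lowess(data$A, data$M, f = 2/3)   # f = 2/3 is the default span

# lowess() returns the fit at sorted x values; map it back onto data$A
trend <- approx(lo$x, lo$y, xout = data$A)$y

# Subtract the estimated trend, exactly as with the parametric fit
M.norm.lowess <- data$M - trend

plot(data$A, M.norm.lowess,
     main = "Normalized MA plot (LOWESS fit)",
     xlab = "A (average log2 expression)",
     ylab = "M (bias-corrected log2 fold-change)")
abline(h = 0, col = "darkgreen", lwd = 2)

The next section walks through how this kind of local regression (LOESS/LOWESS) works, step by step.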

1 Understanding LOESS Smoothing Step by Step

1.1 Introduction

LOESS (Locally Estimated Scatterplot Smoothing) is a non-parametric regression method.
It works by:
- Selecting a local neighborhood of points around the target.
- Assigning weights (closer points get higher weight).
- Fitting a weighted linear regression.
- Repeating for each point of interest.

Here we illustrate the procedure “by hand” using the wine dataset.

1.2 Load data

# install.packages("HDclassif") # if not already installed
library(HDclassif)

# Load wine dataset
data(wine)

# Rename some variables for clarity
colnames(wine)[colnames(wine) == "V1"] <- "Alcohol"
colnames(wine)[colnames(wine) == "V10"] <- "Color"

head(wine[, c("Alcohol", "Color")])
##   Alcohol Color
## 1   14.23  5.64
## 2   13.20  4.38
## 3   13.16  5.68
## 4   14.37  7.80
## 5   13.24  4.32
## 6   14.20  6.75

1.3 Step 1: Pick a target observation

attach(wine)

plot(Color, Alcohol,
     main = "Alcohol vs. Color (wine dataset)",
     xlab = "Color Intensity",
     ylab = "Alcohol")

# Pick observation 50
points(Color[50], Alcohol[50], pch=16, col="red")

Color[50]   # x value
## [1] 8.9
Alcohol[50] # y value
## [1] 13.94

1.4 Step 2: Compute distances

We compute distances along the x-axis (Color) to all other points.

distance <- abs(Color - Color[50])
head(distance)
## [1] 3.26 4.52 3.22 1.10 4.58 2.15

1.5 Step 3: Define neighborhood

Let’s choose the 50 nearest neighbors. (With 178 wines in the dataset, this corresponds to a span of roughly 50/178 ≈ 0.28.)

neighbor <- ifelse(rank(distance) <= 50, 1, 0)

plot(Color, Alcohol,
     main = "Alcohol vs. Color (wine dataset)",
     xlab = "Color Intensity",
     ylab = "Alcohol")
# Plot neighbors
points(Color[neighbor == 1], Alcohol[neighbor == 1], pch = 19)
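As a quick sanity check (added here), we can count how many points were flagged as neighbors:

# Number of points in the local neighborhood; ties in distance can make
# the count differ slightly from 50
sum(neighbor)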

1.6 Step 4: Scale distances and compute weights

Weights are assigned using the tricube kernel:

\[ w_i = \begin{cases} (1 - u_i^3)^3, & 0 \le u_i < 1 \\ 0, & \text{otherwise} \end{cases} \]

where \(u_i = \frac{\text{distance to target}}{\text{max distance in neighborhood}}\).

delta.50 <- max(distance[neighbor == 1])
u <- ifelse(neighbor == 1, distance / delta.50, NA)

weight <- ifelse(u >= 0 & u < 1, (1 - u^3)^3, 0)
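# Note: u (and hence weight) is NA outside the neighborhood rather than 0;
# lm() drops observations with missing weights, which has the same effect
# as giving them zero weight in the local fit below.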

# Combine into table
combined_data <- data.frame(distance, neighbor, u, weight)
combined_data[45:55, ]
##    distance neighbor         u      weight
## 45     3.86        0        NA          NA
## 46     3.66        0        NA          NA
## 47     4.00        0        NA          NA
## 48     2.80        1 0.9271523 0.008366574
## 49     2.70        1 0.8940397 0.023243736
## 50     0.00        1 0.0000000 1.000000000
## 51     1.70        1 0.5629139 0.554659147
## 52     3.30        0        NA          NA
## 53     1.85        1 0.6125828 0.456752912
## 54     2.60        1 0.8609272 0.047392581
## 55     3.05        0        NA          NA

1.7 Step 5: Weighted regression

We fit a local weighted linear regression around observation 50.

plot(Color, Alcohol,
     main = "Alcohol vs. Color (wine dataset)",
     xlab = "Color Intensity",
     ylab = "Alcohol")

local.50 <- lm(Alcohol ~ Color, weights = weight)
abline(local.50, col="red")
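The LOESS-smoothed value at the target is simply this local line evaluated at the target's own Color value; a small illustrative check (fitted.50 is a name introduced here):

# Smoothed value at observation 50: evaluate the local fit at Color[50]
fitted.50 <- predict(local.50, newdata = data.frame(Color = Color[50]))
fitted.50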

1.8 Step 6: Repeat for another observation

Let’s repeat for observation 30.

plot(Color, Alcohol,
     main = "Alcohol vs. Color (wine dataset)",
     xlab = "Color Intensity",
     ylab = "Alcohol")
# Mark observation 30
points(Color[30], Alcohol[30], pch=16, col="green")

# Distances
distance.30 <- abs(Color - Color[30])
neighbor.30 <- ifelse(rank(distance.30) <= 50, 1, 0)

# Plot neighbors
points(Color[neighbor.30 == 1], Alcohol[neighbor.30 == 1], pch = 19, col="blue")

# Compute weights
delta.30 <- max(distance.30[neighbor.30 == 1])
u.30 <- ifelse(neighbor.30 == 1, distance.30 / delta.30, NA)
weight.30 <- ifelse(u.30 >= 0 & u.30 < 1, (1 - u.30^3)^3, 0)

# Local regression
local.30 <- lm(Alcohol ~ Color, weights = weight.30)
abline(local.30, lty=2, col="green")

1.9 Step 7: Built-in LOESS

Of course, we usually use the built-in loess() function rather than computing by hand.

out <- loess(Alcohol ~ Color, data = wine, span = 0.4, degree = 1)

# Plot original data
plot(Color, Alcohol, main="LOESS fit vs. raw data",
     xlab="Color Intensity", ylab="Alcohol")

# Add LOESS smoothed line
lines(sort(out$x), out$fitted[order(out$x)], col="red", lwd=2)

# Try larger span (smoother curve)
out2 <- loess(Alcohol ~ Color, data = wine, span = 0.7, degree = 1)
lines(sort(out2$x), out2$fitted[order(out2$x)], col="blue", lwd=2)
legend("topright", legend=c("Span 0.4","Span 0.7"),
       col=c("red","blue"), lwd=2)

1.10 Summary

  • LOESS is a non-parametric regression that fits weighted linear regressions in local neighborhoods.
  • We manually computed distances, neighbors, and weights to see how it works.
  • The built-in loess() function automates this process, with the span parameter controlling smoothness.