class: center, middle, inverse, title-slide .title[ # Independent Component Analysis (ICA) ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 2025-12-01 ] --- <!-- HTML style block --> <style> .large { font-size: 130%; } .small { font-size: 70%; } .tiny { font-size: 40%; } </style> ## ICA - Independent Component Analysis - PCA assumes multivariate normally distributed data — gene expression data are usually **super-Gaussian** (heavier tails, not Normal) - ICA models observations as a **linear combination of latent variables (components)** that are **as statistically independent as possible** - Analogy: multiple sound sources mixed together — ICA tries to “unmix” them --- ## ICA - Independent Component Analysis - ICA performs a **linear rotation** of data like PCA, but recovers **independent**, not maximally variable, components - In RNA-seq, ICA components often correspond to **transcriptional modules** representing biological processes .small[ Hyvärinen, Aapo, and Erkki Oja. "Independent component analysis: algorithms and applications." Neural networks 13, no. 4-5 (2000): 411-430. https://doi.org/10.1016/S0893-6080(00)00026-5 ] --- ## ICA vs PCA – Conceptual Difference | Aspect | PCA | ICA | |--------|-----|-----| | Goal | Maximize variance | Maximize statistical independence | | Components | Orthogonal | Independent (not necessarily orthogonal) | | Suitable for | Gaussian data | Non-Gaussian (super-Gaussian) data | | Interpretation | Global variation patterns | Underlying sources/processes | - ICA can reveal **hidden biological signals** not captured by PCA .small[ Engreitz, Jesse M., Bernie J. Daigle Jr, Jonathan J. Marshall, and Russ B. Altman. "Independent component analysis: mining microarray data for fundamental human gene expression modules." Journal of biomedical informatics 43, no. 6 (2010): 932-944. https://doi.org/10.1016/j.jbi.2010.07.001] --- ## ICA - Model Formulation - `\(X\)` — an `\(m \times n\)` matrix of `\(n\)` genes and `\(m\)` experiments - ICA decomposes `\(X\)` as: `$$X = AS$$` - `\(A\)` — an `\(m \times k\)` **mixing matrix** - `\(S\)` — a `\(k \times n\)` **source matrix** - `\(k\)` — user-specified number of independent components (`\(k \leq \min(m, n)\)`) - Preprocessing: filter, center, scale (as for PCA) <!-- ## ICA - Model Formulation <img src="img/ica.png" width="800px" style="display: block; margin: auto;" /> --> --- ## ICA Components and Interpretation - Rows of `\(S\)` (source matrix) are **independent components** - Gene weights in each component reflect **independent random variables** - Genes with strong weights in the same component likely participate in the same biological process - Columns of `\(A\)` show the **expression strength of each component** across samples .small[ fastICA R package: https://CRAN.R-project.org/package=fastICA ] --- ## ICA in Practice - The **source matrix `\(S\)`**: interpreted biologically by studying the genes contributing to each component - The **mixing matrix `\(A\)`**: associate components with sample features (e.g., clinical or molecular characteristics) - Identify correlations between **independent components and phenotypes** .small[ MineICA package: https://bioconductor.org/packages/MineICA/ ] .small[ Engreitz, Jesse M., Bernie J. Daigle Jr, Jonathan J. Marshall, and Russ B. Altman. "Independent component analysis: mining microarray data for fundamental human gene expression modules." Journal of biomedical informatics 43, no. 6 (2010): 932-944. https://doi.org/10.1016/j.jbi.2010.07.001] --- ## Example Workflow in R ```r library(fastICA) # Example data: expression matrix (genes x samples) set.seed(123) expr <- matrix(rnorm(1000), nrow = 100, ncol = 10) # Run ICA with 3 components ica_res <- fastICA(expr, n.comp = 3) # Matrices S <- ica_res$S # Source matrix (independent components) A <- ica_res$A # Mixing matrix ``` - Inspect top genes contributing to each component - Plot sample scores (A) colored by phenotype --- ## ICA Demo Using *iris* Dataset We illustrate Independent Component Analysis (ICA) using the well-known *iris* dataset. We use only the **numeric columns** (4 flower measurements) and treat each **row as a sample**. ``` r library(fastICA) # Use only numeric variables from iris data(iris) X <- as.matrix(iris[, 1:4]) # 150 samples × 4 features dim(X) ``` ``` ## [1] 150 4 ``` --- ## Running ICA We extract **3 independent components**. ``` r set.seed(123) ica_res <- fastICA(X, n.comp = 3) S <- ica_res$S # Independent components (sources) A <- ica_res$A # Mixing matrix dim(S) ``` ``` ## [1] 150 3 ``` ``` r dim(A) ``` ``` ## [1] 3 4 ``` In ICA, rows of **S** describe how each independent signal varies across samples. Rows of **A** tell us how each feature loads onto the independent components. --- ## Plotting ICA Components Color the samples by species to visualize whether ICA captures class structure. ``` r library(ggplot2) df_ica <- data.frame( IC1 = S[,1], IC2 = S[,2], Species = iris$Species ) head(df_ica) ``` ``` ## IC1 IC2 Species ## 1 -1.395258 0.4149757 setosa ## 2 -1.336105 -0.7872977 setosa ## 3 -1.330866 -0.3146663 setosa ## 4 -1.202928 -0.5861406 setosa ## 5 -1.367819 0.6446875 setosa ## 6 -1.246761 1.5260826 setosa ``` --- ## Plotting ICA Components Color the samples by species - ICA captures class structure. ``` r ggplot(df_ica, aes(IC1, IC2, color = Species)) + geom_point(size = 3, alpha = 0.8) + theme_minimal() + labs(title = "ICA on iris: First Two Independent Components") ``` <!-- --> --- ## Inspecting Feature Loadings The columns of **A** correspond to the independent components. We inspect how the original measurements contribute to each component. ``` r colnames(X) ``` ``` ## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ``` ``` r A ``` ``` ## [,1] [,2] [,3] [,4] ## [1,] 0.6585008 -0.203936026 1.73801839 0.74439307 ## [2,] 0.2418689 0.380352924 0.05743742 0.09667760 ## [3,] -0.4320207 -0.005864442 -0.25699093 0.01473842 ``` * High absolute values indicate a strong contribution of that feature to the component. * Since ICA seeks **statistical independence**, these do not correspond to variance directions (as in PCA). --- ## Summary - ICA separates signals into **independent sources** - Useful for **unsupervised discovery** of biological modules - Complements PCA by uncovering **non-Gaussian, hidden structure** - Applications: transcriptomics, epigenomics, imaging, and multi-omics data integration