Clustering
Hierarchical clustering
Non-hierarchical clustering
K-means
Dimensionality reduction
References
Hennig, Christian. “Cluster Validation by Measurement of Clustering Characteristics Relevant to the User.” ArXiv:1703.09282 [Stat], September 8, 2020. http://arxiv.org/abs/1703.09282 - Evaluation of clustering quality via user-relevant cluster characteristics. Methodology/statistics paper. Small within-cluster dissimilarities, between-cluster separation, and other metrics.
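A minimal base-R sketch of two of these characteristics (average within-cluster dissimilarity and between-cluster separation) on a toy k-means partition; defining separation as the minimum between-cluster dissimilarity is an illustrative assumption, not the paper's exact formulation.

    # Toy partition: 3 k-means clusters on the scaled iris measurements
    x  <- scale(iris[, 1:4])
    cl <- kmeans(x, centers = 3, nstart = 25)$cluster
    d  <- as.matrix(dist(x))                  # pairwise Euclidean dissimilarities

    same <- outer(cl, cl, "==")               # TRUE for pairs in the same cluster
    diag(same) <- NA                          # drop self-pairs
    within     <- mean(d[which(same)])        # small within-cluster dissimilarity is good
    between    <- mean(d[which(!same)])       # large between-cluster dissimilarity is good
    separation <- min(d[which(!same)])        # smallest gap between different clusters
    c(within = within, between = between, separation = separation)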
Daxin Jiang, Chun Tang, and Aidong Zhang. “Cluster Analysis for Gene Expression Data: A Survey.” IEEE Transactions on Knowledge and Data Engineering 16, no. 11 (November 2004): 1370–86. https://doi.org/10.1109/TKDE.2004.68. - Clustering overview for gene expression studies. Definitions, proximity measures (Euclidean, Pearson), clustering (K-means, SOM, hierarchical, graph-theoretical, model-based, density, the use of PCA), biclustering. Metrics for clustering QC (homogeneity, separation, Rand, Jaccard, reliability)
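A small base-R illustration of the proximity measures mentioned above: hierarchical clustering of simulated "genes" with Euclidean distance versus correlation (1 - Pearson) distance.

    set.seed(1)
    expr <- matrix(rnorm(100 * 10), nrow = 100,     # 100 "genes" x 10 "samples"
                   dimnames = list(paste0("gene", 1:100), paste0("s", 1:10)))

    hc_eucl <- hclust(dist(expr), method = "average")                  # Euclidean distance
    hc_corr <- hclust(as.dist(1 - cor(t(expr))), method = "average")   # 1 - Pearson correlation

    # Compare the 4-cluster partitions produced by the two distance choices
    table(cutree(hc_eucl, k = 4), cutree(hc_corr, k = 4))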
Patrik D’haeseleer, “How Does Gene Expression Clustering Work?,” Nature Biotechnology 23, no. 12 (December 2005): 1499–1501, https://doi.org/10.1038/nbt1205-1499. - Clustering distances. Recommendations for choosing clustering approaches for gene expression data.
Satagopan, Jaya M., and Katherine S. Panageas. “A Statistical Perspective on Gene Expression Data Analysis.” Statistics in Medicine 22, no. 3 (February 15, 2003): 481–99. doi:10.1002/sim.1350. - Intro into microarray technology, statistical questions. Hierarchical clustering - clustering metrics. MDS algorithm. Class prediction - linear discriminant analysis algorithm and cross-validation. SAS and S examples
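A base-R/MASS sketch of the two techniques highlighted in this entry, classical MDS (cmdscale) and linear discriminant analysis with leave-one-out cross-validation; the simulated matrix and labels are placeholders.

    # Classical (metric) MDS on a sample-to-sample distance matrix
    set.seed(1)
    expr <- matrix(rnorm(50 * 20), nrow = 50)      # 50 genes x 20 samples (simulated)
    d    <- dist(t(expr))                          # distances between samples
    mds  <- cmdscale(d, k = 2)                     # 2-dimensional configuration
    plot(mds, xlab = "MDS 1", ylab = "MDS 2")

    # Class prediction via LDA with leave-one-out cross-validation
    library(MASS)
    grp <- factor(rep(c("A", "B"), each = 10))     # hypothetical sample labels
    fit <- lda(t(expr)[, 1:5], grouping = grp, CV = TRUE)   # CV = TRUE -> leave-one-out
    table(predicted = fit$class, truth = grp)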
Altman, Naomi, and Martin Krzywinski. “Points of Significance: Clustering.” Nature Methods 14, no. 6 (May 30, 2017): 545–46. doi:10.1038/nmeth.4299. - Clustering results depend on gene scaling, the clustering method, and the number of simulations (random restarts) in k-means clustering.
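A base-R sketch of both sensitivities: k-means on unscaled versus scaled data, and nstart controlling how many random initializations are tried.

    set.seed(42)
    x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 3), ncol = 2))
    x[, 2] <- x[, 2] * 100                     # put the two columns on very different scales

    km_raw    <- kmeans(x, centers = 2, nstart = 1)          # dominated by the large-scale column
    km_scaled <- kmeans(scale(x), centers = 2, nstart = 25)  # scaled data, many random starts
    table(raw = km_raw$cluster, scaled = km_scaled$cluster)

    # With nstart = 25, kmeans keeps the best of 25 random initializations
    km_scaled$tot.withinss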
Krzywinski, Martin, and Naomi Altman. “Points of Significance: Importance of Being Uncertain.” Nature Methods 10, no. 9 (September 2013): 809–10.
Altman, Naomi, and Martin Krzywinski. “Points of Significance: Association, Correlation and Causation.” Nature Methods 12, no. 10 (September 29, 2015): 899–900. doi:10.1038/nmeth.3587.
Abdi, Hervé, and Lynne J. Williams. “Principal Component Analysis.” Wiley Interdisciplinary Reviews: Computational Statistics 2, no. 4 (July 2010): 433–59. doi:10.1002/wics.101. - PCA in-depth review. Mathematical formulations, terminology, examples, interpretation. Figures showing PC axes, rotations, projections, circle of correlation. Rules for selecting number of components. Rotation - varimax, promax, illustrated. Correspondence analysis for nominal variables, Multiple Factor Analysis for a set of observations described by several groups (tables) of variables. Appendices - eigenvalues and eigenvectors, positive semidefinite matrices, SVD
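A base-R illustration of eigenvalues and variance explained from a correlation-matrix PCA; the two component-selection rules shown are common illustrations, not the paper's specific prescription.

    pca <- prcomp(USArrests, scale. = TRUE)    # built-in dataset, correlation-matrix PCA
    eig <- pca$sdev^2                          # eigenvalues of the correlation matrix
    data.frame(eigenvalue    = eig,
               prop.variance = eig / sum(eig),
               cumulative    = cumsum(eig) / sum(eig))

    # Two common (illustrative) rules: eigenvalue > 1, or cumulative variance >= 80%
    which(eig > 1)
    which(cumsum(eig) / sum(eig) >= 0.80)[1]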
Wall, Michael. “Singular Value Decomposition and Principal Component Analysis,” n.d. https://link.springer.com/chapter/10.1007/0-306-47815-3_5 - SVD and PCA statistical intro. Relation of SVD to PCA, Fourier transform. Examples of applications, including genomics.
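A quick base-R check of the SVD-PCA relation described here: for a column-centered matrix X = UDV', the principal component scores equal UD and the loadings equal V (up to sign).

    x  <- scale(USArrests, center = TRUE, scale = FALSE)  # column-centered data matrix
    sv <- svd(x)                                          # x = U D V'
    pc <- prcomp(x, center = FALSE)

    max(abs(abs(sv$u %*% diag(sv$d)) - abs(pc$x)))        # scores vs U D, ~ 0
    max(abs(abs(sv$v) - abs(pc$rotation)))                # loadings vs V, ~ 0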
Golub, T. R., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, et al. “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.” Science (New York, N.Y.) 286, no. 5439 (October 15, 1999): 531–37. http://science.sciencemag.org/content/286/5439/531.long - Class discovery and prediction using the classical AML/ALL dataset. Definitions. Ref. 16 - their own definition of correlation. SOM for classification.
Lever, Jake, Martin Krzywinski, and Naomi Altman. “Points of Significance: Principal Component Analysis.” Nature Methods 14, no. 7 (June 29, 2017): 641–42. doi:10.1038/nmeth.4346. - PCA explained, the effect of scaling, limitations.
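A short base-R contrast of covariance-based versus correlation-based PCA, showing how variable scale drives the unscaled solution.

    # Effect of variable scale: PCA on the covariance vs the correlation matrix
    p_cov <- prcomp(USArrests, scale. = FALSE)   # 'Assault' (large variance) dominates PC1
    p_cor <- prcomp(USArrests, scale. = TRUE)    # each variable contributes on equal footing
    round(p_cov$rotation[, 1], 2)
    round(p_cor$rotation[, 1], 2)
    summary(p_cov)$importance[2, ]               # proportion of variance per PC
    summary(p_cor)$importance[2, ]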
Lee, D. D., and H. S. Seung. “Learning the Parts of Objects by Non-Negative Matrix Factorization.” Nature 401, no. 6755 (October 21, 1999): 788–91. https://doi.org/10.1038/44565. - Non-negative matrix factorization (NMF) principles, compared with vector quantization (VQ) and PCA. Intuition behind NMF learning parts and PCA learning the whole.
Lee, Daniel D., and H. Sebastian Seung. “Algorithms for Non-Negative Matrix Factorization.” In Advances in Neural Information Processing Systems 13, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 556–62. MIT Press, 2001. http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf. - Two algorithms for solving NMF - Euclidean distance and Kullback-Leibler divergence, with proofs.
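A minimal base-R sketch of the Euclidean-cost multiplicative updates from this paper; no convergence checks or restarts, purely illustrative.

    # Multiplicative updates for NMF minimizing ||V - WH||^2 (Lee & Seung):
    #   H <- H * (t(W) %*% V) / (t(W) %*% W %*% H)
    #   W <- W * (V %*% t(H)) / (W %*% H %*% t(H))
    nmf_euclidean <- function(V, k, n_iter = 200, eps = 1e-9) {
      W <- matrix(runif(nrow(V) * k), nrow(V), k)
      H <- matrix(runif(k * ncol(V)), k, ncol(V))
      for (i in seq_len(n_iter)) {
        H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)   # eps avoids division by zero
        W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
      }
      list(W = W, H = H, rss = sum((V - W %*% H)^2))
    }

    set.seed(1)
    V   <- matrix(runif(50 * 20), 50, 20)     # non-negative toy data
    fit <- nmf_euclidean(V, k = 3)
    fit$rss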
Meng, Chen, Oana A. Zeleznik, Gerhard G. Thallinger, Bernhard Kuster, Amin M. Gholami, and Aedín C. Culhane. “Dimension Reduction Techniques for the Integrative Analysis of Multi-Omics Data.” Briefings in Bioinformatics 17, no. 4 (July 2016): 628–41. doi:10.1093/bib/bbv108. - A must. Dimensionality reduction techniques - PCA and its derivatives, NMF. Table 1 - Terminology. Table 2 - methods, tools, visualization packages. Methods for integrative data analysis of multi-omics data.
Lee, Su-In, and Serafim Batzoglou. “Application of Independent Component Analysis to Microarrays.” Genome Biology 4, no. 11 (2003): R76. doi:10.1186/gb-2003-4-11-r76. - Independent Components Analysis theory and applications.
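An illustrative ICA decomposition using the CRAN package fastICA (a stand-in, not the paper's own code); the toy sources and mixing matrix are made up.

    library(fastICA)
    set.seed(1)
    S <- cbind(sin(seq(0, 8 * pi, length.out = 500)),   # two independent source signals
               rep(c(-1, 1), length.out = 500))
    A <- matrix(c(0.6, 0.4, 0.4, 0.6), 2, 2)            # mixing matrix
    X <- S %*% A                                        # observed mixtures

    ica <- fastICA(X, n.comp = 2)
    str(ica$S)                                          # estimated independent components
    cor(ica$S, S)                                       # sources recovered up to order/sign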
Stein-O’Brien, Genevieve L., Raman Arora, Aedin C. Culhane, Alexander Favorov, Casey Greene, Loyal A. Goff, Yifeng Li, et al. “Enter the Matrix: Interpreting Unsupervised Feature Learning with Matrix Decomposition to Discover Hidden Knowledge in High-Throughput Omics Data.” BioRxiv, October 2, 2017. doi:10.1101/196915. - Matrix factorization and visualization. References to various types of MF methods. Terminology; Fig. 1 explains MF in terms of gene expression and biological processes. References to biological examples.
Yeung, K. Y., and W. L. Ruzzo. “Principal Component Analysis for Clustering Gene Expression Data.” Bioinformatics 17, no. 9 (September 1, 2001): 763–74. https://doi.org/10.1093/bioinformatics/17.9.763. - PCA is not always good for denoising data before clustering, clustering of PCs often worse than the original data. Simulated and real-life data. Data used for benchmarks: http://faculty.washington.edu/kayee/pca/
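A hedged sketch of the paper's comparison on simulated data: k-means on the original variables versus on the top principal components, scored with the adjusted Rand index (here from the mclust package, an assumption; any ARI implementation would do).

    library(mclust)   # for adjustedRandIndex()
    set.seed(1)
    classes <- rep(1:3, each = 30)
    x <- t(sapply(classes, function(k) rnorm(20, mean = k)))   # 90 observations x 20 variables

    ari_orig <- adjustedRandIndex(kmeans(x, 3, nstart = 25)$cluster, classes)
    pcs      <- prcomp(x)$x[, 1:2]                             # first two PCs only
    ari_pca  <- adjustedRandIndex(kmeans(pcs, 3, nstart = 25)$cluster, classes)
    c(original = ari_orig, top2_PCs = ari_pca)                 # PCs are not always better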
Libbrecht, Maxwell W., and William Stafford Noble. “Machine Learning Applications in Genetics and Genomics.” Nature Reviews. Genetics 16, no. 6 (June 2015): 321–32. doi:10.1038/nrg3920. - Machine learning in genomics. Supervised/unsupervised learning, semi-supervised, Bayesian (incorporating prior knowledge), feature selection, imbalanced class sizes, missing data, networks.
Meng, Chen, Bernhard Kuster, Aedín C. Culhane, and Amin Moghaddas Gholami. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics 15 (May 29, 2014): 162. doi:10.1186/1471-2105-15-162. - MCIA - multiple co-inertia analysis for integrating multiple datasets. Statistics and implementation in the omicade4 Bioconductor package (multiple co-inertia analysis of omics datasets), https://bioconductor.org/packages/release/bioc/html/omicade4.html
Singular Value Decomposition (SVD) Tutorial: Applications, Examples, Exercises. https://blog.statsbot.co/singular-value-decomposition-tutorial-52c695315254
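A hedged usage sketch for the omicade4 package referenced in the MCIA entry above; the mcia() call and the NCI60_4arrays example data follow the package documentation and should be treated as assumptions here.

    library(omicade4)
    data(NCI60_4arrays)            # list of four omics tables sharing the same 60 cell lines
    sapply(NCI60_4arrays, dim)     # features x samples; sample columns must match across tables

    fit <- mcia(NCI60_4arrays, cia.nf = 2)   # multiple co-inertia analysis, 2 factors
    plot(fit)                                # samples projected into the common space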
Liu, Yanchi, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. “Understanding of Internal Clustering Validation Measures,” 911–16. IEEE, 2010. https://doi.org/10.1109/ICDM.2010.35. - Internal clustering validation metrics, table, concise description of each.
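One of the internal measures covered there, the average silhouette width, computed with the standard cluster package (not code from the paper).

    library(cluster)
    x   <- scale(iris[, 1:4])
    cl  <- kmeans(x, centers = 3, nstart = 25)$cluster
    sil <- silhouette(cl, dist(x))
    summary(sil)$avg.width         # average silhouette width; closer to 1 is better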
Guido Kraemer, Markus Reichstein, and Miguel D. Mahecha, “dimRed and coRanking - Unifying Dimensionality Reduction in R,” The R Journal, 2018, https://journal.r-project.org/archive/2018/RJ-2018-039/index.html - R packages implementing 15 methods for dimensionality reduction, from PCA, ICA, and MDS to Laplacian eigenmaps. Brief but very good overview of each method and its complexity. Quality metrics to judge the embedding. https://github.com/gdkrmr/dimRed, https://cran.r-project.org/package=dimRed
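A hedged sketch of the dimRed interface described in the article; the method name, knn parameter, plot type, and quality() call follow the package documentation and are assumptions here.

    library(dimRed)
    x   <- as.matrix(iris[, 1:4])
    emb <- embed(x, "Isomap", ndim = 2, knn = 10)   # the same embed() call dispatches to each method
    plot(emb, type = "2vars")                       # embedded coordinates
    quality(emb, "Q_local")                         # a co-ranking-based embedding quality metric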
Amir, El-ad David, Kara L Davis, Michelle D Tadmor, Erin F Simonds, Jacob H Levine, Sean C Bendall, Daniel K Shenfeld, Smita Krishnaswamy, Garry P Nolan, and Dana Pe’er. “ViSNE Enables Visualization of High Dimensional Single-Cell Data and Reveals Phenotypic Heterogeneity of Leukemia.” Nature Biotechnology 31, no. 6 (June 2013): 545–52. https://doi.org/10.1038/nbt.2594. - viSNE paper - t-SNE (Barnes-Hut) implementation for single-cell data, and the cyt tool for visualization. Supplementary methods - details of the t-SNE algorithm, https://media.nature.com/original/nature-assets/nbt/journal/v31/n6/extref/nbt.2594-S1.pdf. Details of usage: https://www.denovosoftware.com/site/manual/visne.htm
Belacel, Nabil, Qian Wang, and Miroslava Cuperlovic-Culf. “Clustering Methods for Microarray Gene Expression Data.” Omics: A Journal of Integrative Biology 10, no. 4 (2006): 507–31. https://doi.org/10.1089/omi.2006.10.507. - Clustering methods overview. Hierarchical (agglomerative, divisive), partitional clustering (K-means, K-medoids, SOM). DBSCAN and other density-based algorithms. Graph-theoretical clustering. Fuzzy clustering, expectation-maximization methods. Table with software.
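Related to the viSNE entry above: a Barnes-Hut t-SNE stand-in in R via the Rtsne package (not the authors' implementation; package choice is an assumption).

    library(Rtsne)
    set.seed(1)
    x  <- unique(as.matrix(iris[, 1:4]))           # Rtsne requires no duplicate rows
    ts <- Rtsne(x, dims = 2, perplexity = 30, theta = 0.5)   # theta > 0 -> Barnes-Hut approximation
    plot(ts$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")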
Kossenkov, Andrew V., and Michael F. Ochs. “Matrix Factorisation Methods Applied in Microarray Data Analysis.” International Journal of Data Mining and Bioinformatics 4, no. 1 (2010): 72. https://doi.org/10.1504/IJDMB.2010.030968. - Matrix factorization methods for genomics data. SVD, PCA, ICA, NCA, NMF (sparse and least squares NMF), Bayesian decomposition
Chavent, Marie, Vanessa Kuentz-Simonet, Amaury Labenne, and Jérôme Saracco. “Multivariate Analysis of Mixed Data: The R Package PCAmixdata.” ArXiv:1411.4911 [Stat], December 8, 2017. http://arxiv.org/abs/1411.4911. - PCAmixdata - R package for PCA on a mixture of numerical and categorical variables. Other packages - ade4, FactoMineR. Theory, statistics, code examples with interpretation. https://cran.r-project.org/web/packages/PCAmixdata/index.html
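A hedged sketch of the PCAmixdata workflow: splitmix() to separate numeric and categorical columns, PCAmix() for the mixed-data PCA; the toy data frame is made up, and the argument names follow the package documentation.

    library(PCAmixdata)
    df <- data.frame(height = rnorm(50, 170, 10),
                     weight = rnorm(50, 70, 8),
                     group  = factor(sample(c("a", "b", "c"), 50, replace = TRUE)))

    split <- splitmix(df)                      # $X.quanti (numeric) and $X.quali (factors)
    fit   <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali,
                    ndim = 2, graph = FALSE)
    fit$eig                                    # eigenvalues / variance explained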