class: center, middle, inverse, title-slide

.title[
# Supervised Learning: Methods for Analyzing Gene Expression Data
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-10-08
]

---

<!-- HTML style block -->
<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Linear regression overview

We measure tumor grade and the expression of gene 1. We are interested in:

1. How useful is the expression of gene 1 for predicting tumor grade? `\(R^2\)`
2. Could that relationship be due to chance? `\(p\)`-value

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## T-test

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

The goal of a t-test is to compare two group means and test whether they differ significantly.

---

## T-test in terms of linear regression

- Calculate the overall mean
- Calculate the sum of squared residuals around the overall mean, `\(SS_{mean}\)`

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## T-test in terms of linear regression

- Fit a line to the data, separately for each group (the group mean is the least-squares fit to that group's data)
- For each group, calculate the sum of squared residuals around its fitted line; summed over both groups, this gives `\(SS_{fit}\)`

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## T-test in terms of linear regression

- Combine the two lines into a single equation.
- This makes the steps for computing the F-statistic exactly the same as for regression

``` r
# Combines both lines for the first group
y_11 = 1 * mu_1 + 0 * mu_2 + residual_11
y_12 = 1 * mu_1 + 0 * mu_2 + residual_12
y_13 = 1 * mu_1 + 0 * mu_2 + residual_13
y_14 = 1 * mu_1 + 0 * mu_2 + residual_14
# Combines both lines for the second group
y_21 = 0 * mu_1 + 1 * mu_2 + residual_21
y_22 = 0 * mu_1 + 1 * mu_2 + residual_22
y_23 = 0 * mu_1 + 1 * mu_2 + residual_23
y_24 = 0 * mu_1 + 1 * mu_2 + residual_24
```

---

## Design matrix

- The 1's and 0's serve as "switches" for each group.
- This is our design matrix `\(X\)`, with one column per group.

```
1 0
1 0
1 0
1 0
0 1
0 1
0 1
0 1
```

---

## Connecting Design Matrices and F-statistics

We can express the model as

`$$Y = X \mu + \varepsilon$$`

Then calculate

`$$F = \frac{(SS_{\text{mean}} - SS_{\text{fit}})/(p_{\text{fit}} - p_{\text{mean}})}{SS_{\text{fit}}/(n - p_{\text{fit}})}$$`

where:

* `\(p_{\text{mean}} = 1\)` (overall mean)
* `\(p_{\text{fit}} = 2\)` (two group means)
* `\(n\)` is the total number of measurements

This generalizes naturally to **ANOVA** when there are `\(k > 2\)` groups.
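
The F-statistic calculation can be sketched in base R as follows. This is a minimal illustration, not the code behind the figures: the expression values are made up, and names such as `mu_hat`, `ss_mean`, `ss_fit`, and `f_stat` are hypothetical.

```r
# Hypothetical expression values, four samples per group (made up for illustration)
y <- c(2.1, 2.9, 3.4, 2.6, 5.2, 4.8, 6.1, 5.5)
group <- factor(rep(c("g1", "g2"), each = 4))

# Design matrix with one column per group (no intercept column)
X <- model.matrix(~ 0 + group)

# Least-squares estimates of the two group means: solve (X'X) mu = X'y
mu_hat <- solve(crossprod(X), crossprod(X, y))

# Sums of squared residuals around the overall mean and around the group means
ss_mean <- sum((y - mean(y))^2)
ss_fit  <- sum((y - X %*% mu_hat)^2)

# F-statistic and its p-value
n <- length(y)
p_mean <- 1
p_fit <- ncol(X)
f_stat <- ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
p_value <- pf(f_stat, p_fit - p_mean, n - p_fit, lower.tail = FALSE)

# The equal-variance two-sample t-test tests the same hypothesis: t^2 equals F
tt <- t.test(y ~ group, var.equal = TRUE)
print(c(F = f_stat, t_squared = unname(tt$statistic)^2))
print(c(p_from_F = p_value, p_from_t = tt$p.value))
```

With two groups, the squared t-statistic from the equal-variance t-test matches the F-statistic, which is why the t-test can be treated as a special case of linear regression.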
---

## A more common design matrix

```
1 0
1 0
1 0
1 0
1 1
1 1
1 1
1 1
```

- In this setup, all measurements contribute to the mean for the first group
- But only the measurements from the second group contribute to the _difference_ between the first and the second group
- So the second column serves as a switch for the offset of the second group from the first group's mean

---

## A more common design matrix

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## A more common design matrix

``` r
# Combines both lines for the first group
y_11 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_11
y_12 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_12
y_13 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_13
y_14 = 1 * mu_1 + 0 * difference_{mu_2 - mu_1} + residual_14
# Combines both lines for the second group
y_21 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_21
y_22 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_22
y_23 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_23
y_24 = 1 * mu_1 + 1 * difference_{mu_2 - mu_1} + residual_24
```

- Same way to calculate `\(SS_{mean}\)` and `\(SS_{fit}\)`
- Same number of equations
- Same number of parameters

---

## Power of design matrices

- Say, in addition to group 1 and group 2, you have an age variable.
<img src="07b_design_matrices_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Power of design matrices

- We need to expand our model to `\(y = group1\_intercept + group2\_offset + slope\)` (the full model)
- In our design matrix, the first column of 1's means that both lines intercept the Y-axis, and it specifies the intercept for group 1
- The second column switches on the offset for the group 2 measurements
- The third column holds the Age value of each measurement

```
1 0 1
1 0 2
1 0 3
1 0 4
1 1 1
1 1 2
1 1 3
1 1 4
```

---

## Power of design matrices

- We need to expand our model to `\(y = group1\_intercept + group2\_offset + slope\)` (the full model)
- In our design matrix, the first column of 1's means that both lines intercept the Y-axis, and it specifies the intercept for group 1
- The second column switches on the offset for the group 2 measurements
- The third column holds the Age value of each measurement
- Compare with the simple model `\(y = overall\_mean\)`
- Calculate how much better the full model is: `\(F = \frac{ (SS_{simple} - SS_{full}) / (p_{full} - p_{simple}) }{SS_{full} / (n - p_{full})}\)`

---

## Batch effect

- Suppose you have measurements from two labs

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## Batch effect

- First, add a term for the mean of the normal group in the first lab

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Batch effect

- Second, add a term for the offset of the measurements from the second lab

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## Batch effect

- Third, add a term for the offset of the tumor measurements

<img src="07b_design_matrices_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## Batch effect

- The final model:

`$$y = lab1\_normal\_mean + lab2\_offset + difference_{tumor - normal}$$`

and the design matrix:

```
1 0 0
1 0 0
1 0 0
1 0 0
1 0 1
1 0 1
1 0 1
1 0 1
1 1 0
1 1 0
1 1 0
1 1 0
1 1 1
1 1 1
1 1 1
1 1 1
```

---

## Batch effect

- The final model:

`$$y = lab1\_normal\_mean + lab2\_offset + difference_{tumor - normal}$$`

- Does the lab effect matter?
- Compare the final model with a simpler one that drops the lab term: `\(y = lab1\_normal\_mean + difference_{tumor - normal}\)`
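
One way to carry out this comparison is sketched below in base R. It is an illustration under stated assumptions, not the analysis behind the figures: the data are made up (two labs, two conditions, four replicates each, ordered like the rows of the design matrix), and the names `fit_full`, `fit_simple`, and `f_stat` are hypothetical.

```r
# Hypothetical data: 2 labs x 2 conditions x 4 replicates, ordered as
# lab1 normal, lab1 tumor, lab2 normal, lab2 tumor
lab <- factor(rep(c("lab1", "lab2"), each = 8))
condition <- factor(rep(rep(c("normal", "tumor"), each = 4), times = 2),
                    levels = c("normal", "tumor"))
y <- c(5.1, 4.8, 5.3, 4.9,   # lab1, normal (made-up values)
       7.2, 6.9, 7.4, 7.0,   # lab1, tumor
       6.4, 6.7, 6.2, 6.6,   # lab2, normal
       8.5, 8.9, 8.3, 8.6)   # lab2, tumor

# Design matrix of the final model: intercept (lab1 normal mean),
# lab2 offset, tumor - normal difference
X_full <- model.matrix(~ lab + condition)

# Fit the final model and the simpler model without the lab term
fit_full   <- lm(y ~ lab + condition)
fit_simple <- lm(y ~ condition)

# F-statistic computed from the residual sums of squares
ss_full   <- sum(residuals(fit_full)^2)
ss_simple <- sum(residuals(fit_simple)^2)
n <- length(y)
p_full <- ncol(X_full)   # three parameters
p_simple <- 2            # two parameters
f_stat <- ((ss_simple - ss_full) / (p_full - p_simple)) / (ss_full / (n - p_full))

# anova() performs the same comparison and also reports the p-value
comparison <- anova(fit_simple, fit_full)
print(comparison)
```

A small p-value in the `anova()` table would suggest that the lab offset explains a meaningful share of the variation, so the lab term should stay in the model.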