class: center, middle, inverse, title-slide .title[ # Data manipulation with dplyr ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 2025-08-25 ] --- <!-- HTML style block --> <style> .large { font-size: 130%; } .small { font-size: 70%; } .tiny { font-size: 40%; } </style> ## dplyr: Data Manipulation with R 80% of your work will involve data preparation: - **Filtering rows** (to create a subset) – `filter()` - **Selecting columns** (selecting variables) – `select()`, `pull()` - **Adding new variables** – `mutate()` - **Sorting** – `arrange()` - **Aggregating** – `summarise()` (often used with `group_by()`) - **Joining** – `left_join()`, `right_join()`, `inner_join()`, `full_join()` .small[https://dplyr.tidyverse.org/] --- ## The pipe `%>%` operator - Pass the output of one command as the input to another — chain commands together (similar to the `|` operator in Linux) - Read as “then”: take the object, *then* perform the next operation ``` r library(dplyr) # Without the pipe round(sqrt(1000), 3) ``` ``` ## [1] 31.623 ``` ``` r # With the pipe 1000 %>% sqrt() %>% round() ``` ``` ## [1] 32 ``` ``` r # Using . to specify where the piped value goes 1000 %>% sqrt() %>% round(., 3) ``` ``` ## [1] 31.623 ``` <!--- ## The pipe %>% operator - For example, we can view the head of the `diamonds` data.frame using either of the last two lines of code here: ``` r library(dplyr) library(ggplot2) data(diamonds) head(diamonds) diamonds %>% head ``` ``` ## # A tibble: 6 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ``` ## The pipe %>% operator - For example, read the last line of code as: "Take the `price` column of the `diamonds` data.frame and _then_ summarize it" ``` r library(dplyr) data(diamonds) head(diamonds) diamonds %>% head summary(diamonds$price) diamonds$price %>% summary(object = .) ``` --> --- ## dplyr::filter() - `filter(data, condition)` - filter (select) rows based on the condition of a column - For example, keep only the entries with Ideal cut ``` r library(dplyr) library(ggplot2) diamonds %>% filter(., cut == "Ideal") ``` ``` ## # A tibble: 21,551 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68 ## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78 ## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75 ## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76 ## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44 ## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72 ## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63 ## # ℹ 21,541 more rows ``` --- ## dplyr::select() - `select(data, columns)` - Select columns from the dataset by names - **Note:** The `select()` function name exists in multiple R packages (e.g., `MASS`, `caret`), which can cause conflicts. If you enrounter errors, call it explicitly as `dplyr::select()`. ``` r diamonds %>% select(., carat, cut, color, price, clarity) ``` ``` ## # A tibble: 53,940 × 5 ## carat cut color price clarity ## <dbl> <ord> <ord> <int> <ord> ## 1 0.23 Ideal E 326 SI2 ## 2 0.21 Premium E 326 SI1 ## 3 0.23 Good E 327 VS1 ## 4 0.29 Premium I 334 VS2 ## 5 0.31 Good J 335 SI2 ## 6 0.24 Very Good J 336 VVS2 ## 7 0.24 Very Good I 336 VVS1 ## 8 0.26 Very Good H 337 SI1 ## 9 0.22 Fair E 337 VS2 ## 10 0.23 Very Good H 338 VS1 ## # ℹ 53,930 more rows ``` --- ## dplyr::mutate() - `mutate(data, new_column = function(old_columns))` - Add new columns to your dataset that are functions of old columns ``` r diamonds %>% mutate(price_per_carat = price/carat) ``` ``` ## # A tibble: 53,940 × 11 ## carat cut color clarity depth table price x y z price_per_carat ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1417. ## 2 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 1552. ## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1422. ## 4 0.29 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 1152. ## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1081. ## 6 0.24 Very… J VVS2 62.8 57 336 3.94 3.96 2.48 1400 ## 7 0.24 Very… I VVS1 62.3 57 336 3.95 3.98 2.47 1400 ## 8 0.26 Very… H SI1 61.9 55 337 4.07 4.11 2.53 1296. ## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1532. ## 10 0.23 Very… H VS1 59.4 61 338 4 4.05 2.39 1470. ## # ℹ 53,930 more rows ``` --- ## dplyr::arrange() - `arrange(data, column_to_sort_by)` - Sort your data by columns ``` r diamonds %>% arrange(cut, desc(price)) ``` ``` ## # A tibble: 53,940 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 2.01 Fair G SI1 70.6 64 18574 7.43 6.64 4.69 ## 2 2.02 Fair H VS2 64.5 57 18565 8 7.95 5.14 ## 3 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72 ## 4 2 Fair G VS2 67.6 58 18515 7.65 7.61 5.16 ## 5 2.51 Fair H SI2 64.7 57 18308 8.44 8.5 5.48 ## 6 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.9 ## 7 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.9 ## 8 2.32 Fair H SI1 62 62 18026 8.47 8.31 5.2 ## 9 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98 ## 10 1.93 Fair F VS1 58.9 62 17995 8.17 7.97 4.75 ## # ℹ 53,930 more rows ``` --- ## dplyr::summarize() - `summarize(function_of_variables)` - Summarize columns by custom summary statistics ``` r diamonds %>% summarize(length = n(), avg_price = mean(price)) ``` ``` ## # A tibble: 1 × 2 ## length avg_price ## <int> <dbl> ## 1 53940 3933. ``` --- ## dplyr::group_by() - `group_by(data, column_to_group)` - Summarize *subsets of* columns by custom summary statistics ``` r diamonds %>% group_by(cut, color) %>% summarize(mean(price)) ``` ``` ## `summarise()` has grouped output by 'cut'. You can override using the `.groups` ## argument. ``` ``` ## # A tibble: 35 × 3 ## # Groups: cut [5] ## cut color `mean(price)` ## <ord> <ord> <dbl> ## 1 Fair D 4291. ## 2 Fair E 3682. ## 3 Fair F 3827. ## 4 Fair G 4239. ## 5 Fair H 5136. ## 6 Fair I 4685. ## 7 Fair J 4976. ## 8 Good D 3405. ## 9 Good E 3424. ## 10 Good F 3496. ## # ℹ 25 more rows ``` --- ## tidyr::pivot_longer() - `pivot_longer(data, cols, names_to, values_to)` - Reshape data from wide to long format - Useful when values are spread across multiple columns and you want them in a single column ``` r library(tidyr) relig_income ``` ``` ## # A tibble: 18 × 11 ## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Agnostic 27 34 60 81 76 137 122 ## 2 Atheist 12 27 37 52 35 70 73 ## 3 Buddhist 27 21 30 34 33 58 62 ## 4 Catholic 418 617 732 670 638 1116 949 ## 5 Don’t k… 15 14 15 11 10 35 21 ## 6 Evangel… 575 869 1064 982 881 1486 949 ## 7 Hindu 1 9 7 9 11 34 47 ## 8 Histori… 228 244 236 238 197 223 131 ## 9 Jehovah… 20 27 24 24 21 30 15 ## 10 Jewish 19 19 25 25 30 95 69 ## 11 Mainlin… 289 495 619 655 651 1107 939 ## 12 Mormon 29 40 48 51 56 112 85 ## 13 Muslim 6 7 9 10 9 23 16 ## 14 Orthodox 13 17 23 32 32 47 38 ## 15 Other C… 9 7 11 13 13 14 18 ## 16 Other F… 20 33 40 46 49 63 46 ## 17 Other W… 5 2 3 4 2 7 3 ## 18 Unaffil… 217 299 374 365 341 528 407 ## # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, ## # `Don't know/refused` <dbl> ``` ``` r # Example: turn multiple measurement columns into key-value pairs relig_income %>% pivot_longer(cols = !religion, names_to = "income", values_to = "count") ``` ``` ## # A tibble: 180 × 3 ## religion income count ## <chr> <chr> <dbl> ## 1 Agnostic <$10k 27 ## 2 Agnostic $10-20k 34 ## 3 Agnostic $20-30k 60 ## 4 Agnostic $30-40k 81 ## 5 Agnostic $40-50k 76 ## 6 Agnostic $50-75k 137 ## 7 Agnostic $75-100k 122 ## 8 Agnostic $100-150k 109 ## 9 Agnostic >150k 84 ## 10 Agnostic Don't know/refused 96 ## # ℹ 170 more rows ``` --- ## tidyr::pivot_longer() - `pivot_longer(data, cols, names_to, values_to)` - Reshape data from wide to long format - Useful when values are spread across multiple columns and you want them in a single column ``` r library(tidyr) # Turn multiple measurement columns into key-value pairs relig_income %>% pivot_longer(cols = !religion, names_to = "income", values_to = "count") ``` ``` ## # A tibble: 180 × 3 ## religion income count ## <chr> <chr> <dbl> ## 1 Agnostic <$10k 27 ## 2 Agnostic $10-20k 34 ## 3 Agnostic $20-30k 60 ## 4 Agnostic $30-40k 81 ## 5 Agnostic $40-50k 76 ## 6 Agnostic $50-75k 137 ## 7 Agnostic $75-100k 122 ## 8 Agnostic $100-150k 109 ## 9 Agnostic >150k 84 ## 10 Agnostic Don't know/refused 96 ## # ℹ 170 more rows ``` --- ## tidyr::pivot_wider() - `pivot_wider(data, names_from, values_from)` - Reshape data from long to wide format - Useful when observations are in rows but you want them spread across columns ``` r relig_income %>% pivot_longer(cols = !religion, names_to = "income", values_to = "count") %>% pivot_wider(names_from = income, values_from = count) ``` ``` ## # A tibble: 18 × 11 ## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Agnostic 27 34 60 81 76 137 122 ## 2 Atheist 12 27 37 52 35 70 73 ## 3 Buddhist 27 21 30 34 33 58 62 ## 4 Catholic 418 617 732 670 638 1116 949 ## 5 Don’t k… 15 14 15 11 10 35 21 ## 6 Evangel… 575 869 1064 982 881 1486 949 ## 7 Hindu 1 9 7 9 11 34 47 ## 8 Histori… 228 244 236 238 197 223 131 ## 9 Jehovah… 20 27 24 24 21 30 15 ## 10 Jewish 19 19 25 25 30 95 69 ## 11 Mainlin… 289 495 619 655 651 1107 939 ## 12 Mormon 29 40 48 51 56 112 85 ## 13 Muslim 6 7 9 10 9 23 16 ## 14 Orthodox 13 17 23 32 32 47 38 ## 15 Other C… 9 7 11 13 13 14 18 ## 16 Other F… 20 33 40 46 49 63 46 ## 17 Other W… 5 2 3 4 2 7 3 ## 18 Unaffil… 217 299 374 365 341 528 407 ## # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, ## # `Don't know/refused` <dbl> ``` <!--- ## The power of pipe %>% - Summarize *subsets of* columns by custom summary statistics ``` r arrange(mutate(arrange(filter(tbl_df(diamonds), cut == "Ideal"), price), price_per_carat = price/carat), price_per_carat) arrange( mutate( arrange( filter(tbl_df(diamonds), cut == "Ideal"), price), price_per_carat = price/carat), price_per_carat) diamonds %>% filter(cut == "Ideal") %>% arrange(price) %>% mutate(price_per_carat = price/carat) %>% arrange(price_per_carat) ``` -->