Data Manipulation Toolbox: dplyr

 


The dplyr is a package that is arguably the most popular in R programming language. It provides all-in-one functionality for filtering, arranging, summarizing, and transforming data. To illustrate, let’s consider a dataset below:

library(dplyr)
head(starwars)
## # A tibble: 6 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>


The dataset above has 87 rows and 14 variables1.

  • name: Name of the character
  • height: Height in cm
  • mass: Weight in kg
  • hair_color/skin_color/eye_color: Hair, skin, and eye colors
  • birth_year: Year born (BBY = Before Battle of Yavin)
  • sex: The biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids).
  • gender: The gender role or identity of the character as determined by their personality or the way they were programmed (as in the case for Droids).
  • homeworld: Name of homeworld
  • species: Name of species
  • films: List of films the character appeared in
  • vehecles: List of vehicles the character has piloted
  • starships: List of starships the character has piloted


The dplyr package make data manipulation process crystal clear by seamlessly passing the output of one function as the input to the subsequent function. To be more specific, it streamlines the process by chaining each operation with the pipe operator (%>%).


Select Columns

The select function in the dplyr package is used for column selection in R data frames. It allows you to choose specific columns from a data frame, creating a new data frame with only the selected columns. The syntax of the select function is straightforward:

select(data_frame, column1, column2, ...)


Where data_frame is the name of the data frame you want to manipulate and column1, column2 … are the columns you want to select. For example:

# select columns by name
head(starwars %>% select(name, homeworld, species))
## # A tibble: 6 × 3
##   name           homeworld species
##   <chr>          <chr>     <chr>  
## 1 Luke Skywalker Tatooine  Human  
## 2 C-3PO          Tatooine  Droid  
## 3 R2-D2          Naboo     Droid  
## 4 Darth Vader    Tatooine  Human  
## 5 Leia Organa    Alderaan  Human  
## 6 Owen Lars      Tatooine  Human
# select columns by index
head(starwars %>% select(1:3))
## # A tibble: 6 × 3
##   name           height  mass
##   <chr>           <int> <dbl>
## 1 Luke Skywalker    172    77
## 2 C-3PO             167    75
## 3 R2-D2              96    32
## 4 Darth Vader       202   136
## 5 Leia Organa       150    49
## 6 Owen Lars         178   120


Note: selecting columns with colon operator is typically not recommended in many practices, as it hurts maintainability of your R program.

You can also use some helper functions and regex within the select function:

  • starts_with()/ends_with()/contains(): Selects columns based on their names.
  • matches(): Selects columns based on regular expressions.
  • everything(): Includes all columns.

For example:

selected_data <- starwars %>% 
  select(starts_with("f"), ends_with("color"), contains("sh"))

head(selected_data)
## # A tibble: 6 × 5
##   films     hair_color  skin_color  eye_color starships
##   <list>    <chr>       <chr>       <chr>     <list>   
## 1 <chr [5]> blond       fair        blue      <chr [2]>
## 2 <chr [6]> <NA>        gold        yellow    <chr [0]>
## 3 <chr [7]> <NA>        white, blue red       <chr [0]>
## 4 <chr [4]> none        white       yellow    <chr [1]>
## 5 <chr [5]> brown       light       brown     <chr [0]>
## 6 <chr [3]> brown, grey light       blue      <chr [0]>


selected_data <- starwars %>% 
  select(matches("*_color"))

head(selected_data)
## # A tibble: 6 × 3
##   hair_color  skin_color  eye_color
##   <chr>       <chr>       <chr>    
## 1 blond       fair        blue     
## 2 <NA>        gold        yellow   
## 3 <NA>        white, blue red      
## 4 none        white       yellow   
## 5 brown       light       brown    
## 6 brown, grey light       blue


Filter and Sort Rows

The filter function in the dplyr package is used to subset or filter rows from dataframe based on specified conditions. It allows you to extract a subset of data that meets specific criteria. Here’s the basic syntax of the filter function:

filter(data_frame, condition1, condition2, ...)


  • data_frame: The name of the dataframe you want to filter.
  • condition1, condition2, … : Conditions that specify which rows to keep. These conditions are typically logical expressions involving column values.

For example:

# filter rows where height is greater than 180 
filtered <- starwars %>% 
  select(c(height, homeworld, species)) %>%
  filter(height > 170)

head(filtered)
## # A tibble: 6 × 3
##   height homeworld species
##    <int> <chr>     <chr>  
## 1    172 Tatooine  Human  
## 2    202 Tatooine  Human  
## 3    178 Tatooine  Human  
## 4    183 Tatooine  Human  
## 5    182 Stewjon   Human  
## 6    188 Tatooine  Human


In the code chunk above, observe that the %>% operator hands over the selected columns to the filter function. That is, the columns that you want to use in the filter function must be selected in the select function in advance.

Suppose that you want to sort the resulting data frame. In such cases, you can use the arrange function. For example:

# arrange rows by name in descending order
arranged <- starwars %>% 
  select(c(height, homeworld, species)) %>%
  filter(height > 170) %>%
  arrange(species, desc(height))

head(arranged)
## # A tibble: 6 × 3
##   height homeworld species  
##    <int> <chr>     <chr>    
## 1    198 Ojom      Besalisk 
## 2    198 Cerea     Cerean   
## 3    196 Champala  Chagrian 
## 4    200 <NA>      Droid    
## 5    183 Geonosis  Geonosian
## 6    224 Naboo     Gungan


By default, it sorts data in an ascending order. To reverse it, you should use the desc() function.


Group and Aggregate Data

Grouping and aggregating data is a common operation that performs calculations on subsets of data based on one or more grouping variables. The primary dplyr functions involved in this process are group_by() for grouping and summarize() for aggregation. Here’s a basic example:

# calculate the mean of hieght
arranged %>% summarize(mean(height, na.rm = TRUE))
## # A tibble: 1 × 1
##   `mean(height, na.rm = TRUE)`
##                          <dbl>
## 1                         192.


# Calculate the mean of height grouped by species
arranged %>% 
  group_by(species) %>% 
  summarize(mean(height, na.rm = TRUE))
## # A tibble: 27 × 2
##    species   `mean(height, na.rm = TRUE)`
##    <chr>                            <dbl>
##  1 Besalisk                          198 
##  2 Cerean                            198 
##  3 Chagrian                          196 
##  4 Droid                             200 
##  5 Geonosian                         183 
##  6 Gungan                            209.
##  7 Human                             184.
##  8 Hutt                              175 
##  9 Iktotchi                          188 
## 10 Kaleesh                           216 
## # ℹ 17 more rows


In the aggregation functions, na.rm=TRUE specifies how R should handle any NA values in the calculation. When na.rm is set to be true, R will ignore the NA values. If na.rm = FALSE (the default setting), the presence of NA values in the entire aggregation result being NA.

For example, we can observe that the calculations from the entire subset of the data turns out to be NA, when there is any NA values

# calculate the mean of mass and do not ignore NA
starwars %>% 
  group_by(species) %>%
  summarize(mean(mass, na.rm = FALSE))
## # A tibble: 38 × 2
##    species   `mean(mass, na.rm = FALSE)`
##    <chr>                           <dbl>
##  1 Aleena                             15
##  2 Besalisk                          102
##  3 Cerean                             82
##  4 Chagrian                           NA
##  5 Clawdite                           55
##  6 Droid                              NA
##  7 Dug                                40
##  8 Ewok                               20
##  9 Geonosian                          80
## 10 Gungan                             NA
## # ℹ 28 more rows


# Count the number of NA values within each species
starwars %>%
  group_by(species) %>%
  summarize(sum(is.na(mass))) 
## # A tibble: 38 × 2
##    species   `sum(is.na(mass))`
##    <chr>                  <int>
##  1 Aleena                     0
##  2 Besalisk                   0
##  3 Cerean                     0
##  4 Chagrian                   1
##  5 Clawdite                   0
##  6 Droid                      2
##  7 Dug                        0
##  8 Ewok                       0
##  9 Geonosian                  0
## 10 Gungan                     1
## # ℹ 28 more rows


Create New Variables

The mutate function in the dplyr package is used to create new variables in a dataframe or modify existing ones. It allows you to perform calculations on existing columns and add the results as new variables, enhancing the flexibility of data manipulation. Here’s the basic syntax of the mutate function:

mutate(data_frame, new_variable = expression, ...)


  • data_frame: The name of the dataframe you want to modify.
  • new_variable: The name of the new variable you want to create.
  • expression: The calculation or operation that defines the values for the new variable.
  • ...: Additional variables to create or modify.

Here’s a simple example:

# add a new column that is the sum of col1 and col2
bmi <- starwars %>%
  select(name, height, mass) %>%
  mutate(bmi = mass / (height/100)^2)

head(bmi)
## # A tibble: 6 × 4
##   name           height  mass   bmi
##   <chr>           <int> <dbl> <dbl>
## 1 Luke Skywalker    172    77  26.0
## 2 C-3PO             167    75  26.9
## 3 R2-D2              96    32  34.7
## 4 Darth Vader       202   136  33.3
## 5 Leia Organa       150    49  21.8
## 6 Owen Lars         178   120  37.9

  1. The original dataset is sourced from SWAPI, the Star Wars API. The starwars dataset above includes updated information on character genders and sex.↩︎

Post a Comment

0 Comments