The dplyr
is a package that is arguably
the most popular in R programming language. It provides all-in-one
functionality for filtering, arranging, summarizing, and transforming
data. To illustrate, let’s consider a dataset below:
library(dplyr)
head(starwars)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Org… 150 49 brown light brown 19 fema… femin…
## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
The dataset above has 87 rows and 14 variables1.
name
: Name of the characterheight
: Height in cmmass
: Weight in kghair_color
/skin_color
/eye_color
: Hair, skin, and eye colorsbirth_year
: Year born (BBY = Before Battle of Yavin)sex
: The biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids).gender
: The gender role or identity of the character as determined by their personality or the way they were programmed (as in the case for Droids).homeworld
: Name of homeworldspecies
: Name of speciesfilms
: List of films the character appeared invehecles
: List of vehicles the character has pilotedstarships
: List of starships the character has piloted
The dplyr
package make data manipulation process crystal
clear by seamlessly passing the output of one function as
the input to the subsequent function. To be more specific,
it streamlines the process by chaining each operation with the
pipe operator (%>%
).
Select Columns
The select
function in the
dplyr
package is used for column selection in R data
frames. It allows you to choose specific columns from a data frame,
creating a new data frame with only the selected columns. The syntax of
the select
function is straightforward:
select(data_frame, column1, column2, ...)
Where data_frame
is the name of the data frame you want
to manipulate and column1
, column2
… are the
columns you want to select. For example:
# select columns by name
head(starwars %>% select(name, homeworld, species))
## # A tibble: 6 × 3
## name homeworld species
## <chr> <chr> <chr>
## 1 Luke Skywalker Tatooine Human
## 2 C-3PO Tatooine Droid
## 3 R2-D2 Naboo Droid
## 4 Darth Vader Tatooine Human
## 5 Leia Organa Alderaan Human
## 6 Owen Lars Tatooine Human
# select columns by index
head(starwars %>% select(1:3))
## # A tibble: 6 × 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
Note: selecting columns with colon operator is typically not recommended in many practices, as it hurts maintainability of your R program.
You can also use some helper functions and regex within the
select
function:
starts_with()
/ends_with()
/contains()
: Selects columns based on their names.matches()
: Selects columns based on regular expressions.everything()
: Includes all columns.
For example:
selected_data <- starwars %>%
select(starts_with("f"), ends_with("color"), contains("sh"))
head(selected_data)
## # A tibble: 6 × 5
## films hair_color skin_color eye_color starships
## <list> <chr> <chr> <chr> <list>
## 1 <chr [5]> blond fair blue <chr [2]>
## 2 <chr [6]> <NA> gold yellow <chr [0]>
## 3 <chr [7]> <NA> white, blue red <chr [0]>
## 4 <chr [4]> none white yellow <chr [1]>
## 5 <chr [5]> brown light brown <chr [0]>
## 6 <chr [3]> brown, grey light blue <chr [0]>
selected_data <- starwars %>%
select(matches("*_color"))
head(selected_data)
## # A tibble: 6 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
Filter and Sort Rows
The filter
function in the
dplyr
package is used to subset or filter rows from
dataframe based on specified conditions. It allows you to extract a
subset of data that meets specific criteria. Here’s the basic syntax of
the filter
function:
filter(data_frame, condition1, condition2, ...)
data_frame
: The name of the dataframe you want to filter.condition1
,condition2
, … : Conditions that specify which rows to keep. These conditions are typically logical expressions involving column values.
For example:
# filter rows where height is greater than 180
filtered <- starwars %>%
select(c(height, homeworld, species)) %>%
filter(height > 170)
head(filtered)
## # A tibble: 6 × 3
## height homeworld species
## <int> <chr> <chr>
## 1 172 Tatooine Human
## 2 202 Tatooine Human
## 3 178 Tatooine Human
## 4 183 Tatooine Human
## 5 182 Stewjon Human
## 6 188 Tatooine Human
In the code chunk above, observe that the %>%
operator hands over the selected columns to the filter
function. That is, the columns that you want to use in the
filter
function must be selected in the
select
function in advance.
Suppose that you want to sort the resulting data frame. In such
cases, you can use the arrange
function.
For example:
# arrange rows by name in descending order
arranged <- starwars %>%
select(c(height, homeworld, species)) %>%
filter(height > 170) %>%
arrange(species, desc(height))
head(arranged)
## # A tibble: 6 × 3
## height homeworld species
## <int> <chr> <chr>
## 1 198 Ojom Besalisk
## 2 198 Cerea Cerean
## 3 196 Champala Chagrian
## 4 200 <NA> Droid
## 5 183 Geonosis Geonosian
## 6 224 Naboo Gungan
By default, it sorts data in an ascending order. To reverse it, you
should use the desc()
function.
Group and Aggregate Data
Grouping and aggregating data is a common operation that performs
calculations on subsets of data based on one or more grouping variables.
The primary dplyr
functions involved in this process are
group_by()
for grouping and
summarize()
for aggregation. Here’s a
basic example:
# calculate the mean of hieght
arranged %>% summarize(mean(height, na.rm = TRUE))
## # A tibble: 1 × 1
## `mean(height, na.rm = TRUE)`
## <dbl>
## 1 192.
# Calculate the mean of height grouped by species
arranged %>%
group_by(species) %>%
summarize(mean(height, na.rm = TRUE))
## # A tibble: 27 × 2
## species `mean(height, na.rm = TRUE)`
## <chr> <dbl>
## 1 Besalisk 198
## 2 Cerean 198
## 3 Chagrian 196
## 4 Droid 200
## 5 Geonosian 183
## 6 Gungan 209.
## 7 Human 184.
## 8 Hutt 175
## 9 Iktotchi 188
## 10 Kaleesh 216
## # ℹ 17 more rows
In the aggregation functions, na.rm=TRUE
specifies how R
should handle any NA values in the calculation. When na.rm
is set to be true, R will ignore the NA values. If
na.rm = FALSE
(the default setting), the presence of NA
values in the entire aggregation result being NA.
For example, we can observe that the calculations from the entire subset of the data turns out to be NA, when there is any NA values
# calculate the mean of mass and do not ignore NA
starwars %>%
group_by(species) %>%
summarize(mean(mass, na.rm = FALSE))
## # A tibble: 38 × 2
## species `mean(mass, na.rm = FALSE)`
## <chr> <dbl>
## 1 Aleena 15
## 2 Besalisk 102
## 3 Cerean 82
## 4 Chagrian NA
## 5 Clawdite 55
## 6 Droid NA
## 7 Dug 40
## 8 Ewok 20
## 9 Geonosian 80
## 10 Gungan NA
## # ℹ 28 more rows
# Count the number of NA values within each species
starwars %>%
group_by(species) %>%
summarize(sum(is.na(mass)))
## # A tibble: 38 × 2
## species `sum(is.na(mass))`
## <chr> <int>
## 1 Aleena 0
## 2 Besalisk 0
## 3 Cerean 0
## 4 Chagrian 1
## 5 Clawdite 0
## 6 Droid 2
## 7 Dug 0
## 8 Ewok 0
## 9 Geonosian 0
## 10 Gungan 1
## # ℹ 28 more rows
Create New Variables
The mutate
function in the
dplyr
package is used to create new variables in a
dataframe or modify existing ones. It allows you to perform calculations
on existing columns and add the results as new variables, enhancing the
flexibility of data manipulation. Here’s the basic syntax of the
mutate
function:
mutate(data_frame, new_variable = expression, ...)
data_frame
: The name of the dataframe you want to modify.new_variable
: The name of the new variable you want to create.expression
: The calculation or operation that defines the values for the new variable....
: Additional variables to create or modify.
Here’s a simple example:
# add a new column that is the sum of col1 and col2
bmi <- starwars %>%
select(name, height, mass) %>%
mutate(bmi = mass / (height/100)^2)
head(bmi)
## # A tibble: 6 × 4
## name height mass bmi
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 26.0
## 2 C-3PO 167 75 26.9
## 3 R2-D2 96 32 34.7
## 4 Darth Vader 202 136 33.3
## 5 Leia Organa 150 49 21.8
## 6 Owen Lars 178 120 37.9
The original dataset is sourced from SWAPI, the Star Wars API. The starwars dataset above includes updated information on character genders and sex.↩︎