In data analysis, categorical variables are those that take one of a finite set of discrete values. R provides a dedicated data structure, the factor, to represent categorical variables. A factor behaves like a character vector whose values are restricted to a predefined set of valid values, called levels.
To construct a factor variable, use the factor() function. For example, state.name in the code below is a built-in character vector containing the 50 U.S. state names. Passing it to factor() converts it to a factor.
print(state.name)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
str(state.name)
## chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" ...
state.name.factor <- factor(state.name)
str(state.name.factor)
## Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
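By default, factor() uses the sorted unique values of its input as the levels, which is why the state names above come out in alphabetical order. A minimal sketch of how the levels argument overrides that default; the months vector and the variable names are made up for illustration:

```r
# Hypothetical input: month abbreviations in arbitrary order
months <- c("Mar", "Jan", "Feb", "Jan")

# Default: levels are the sorted unique values
months.default <- factor(months)
levels(months.default)
## [1] "Feb" "Jan" "Mar"

# Explicit levels: we choose the order ourselves
months.calendar <- factor(months, levels = c("Jan", "Feb", "Mar"))
levels(months.calendar)
## [1] "Jan" "Feb" "Mar"
```

The values themselves are the same in both factors; only the order of the levels differs.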
Internally, factors are represented differently from strings. A factor is stored as an integer vector in which each element is an index into the factor’s levels attribute, a character vector that holds the name of each level.
attributes(state.name.factor)
## $levels
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
##
## $class
## [1] "factor"
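To make the integer representation concrete, here is a small sketch on a toy factor (the vector f is invented for this example). It uses as.integer() to expose the stored codes and then looks the labels up again through levels():

```r
# Toy factor; levels default to the sorted unique values "a", "b", "c"
f <- factor(c("b", "a", "c", "a"))

# The underlying integer codes: each one is an index into levels(f)
codes <- as.integer(f)
codes
## [1] 2 1 3 1

# Indexing the levels with the codes recovers the original strings
labels <- levels(f)[codes]
labels
## [1] "b" "a" "c" "a"
```

This is also what as.character() does for a factor, so as.character(f) returns the same labels.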
How to Modify Factor Levels
We often need to modify factor levels to reflect changes in the data. For example, suppose that Texas gained independence from the United States and Puerto Rico became a state. To replace the value Texas with Puerto Rico in the factor:
levels(state.name.factor) <- c(levels(state.name.factor), "Puerto Rico")
state.name.factor[state.name.factor == "Texas"] <- "Puerto Rico"
state.name.factor
## [1] Alabama Alaska Arizona Arkansas California
## [6] Colorado Connecticut Delaware Florida Georgia
## [11] Hawaii Idaho Illinois Indiana Iowa
## [16] Kansas Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi Missouri
## [26] Montana Nebraska Nevada New Hampshire New Jersey
## [31] New Mexico New York North Carolina North Dakota Ohio
## [36] Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Puerto Rico Utah Vermont
## [46] Virginia Washington West Virginia Wisconsin Wyoming
## 51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Puerto Rico
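In the code chunk above, I first appended a new level, "Puerto Rico", to the factor's levels using the levels() replacement function. Because "Puerto Rico" is now a valid level, the assignment on the second line could replace the element "Texas" with "Puerto Rico". Without that first step, R would issue a warning and store NA instead, because a factor accepts only values found in its levels.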
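The reason the new level has to be added first can be seen on a smaller factor. The sketch below uses a made-up sizes factor; suppressWarnings() is there only to keep the output quiet, since R would otherwise warn about an invalid factor level:

```r
# Hypothetical factor with two levels: "large" and "small"
sizes <- factor(c("small", "large", "small"))

# Assigning a value that is not a level yields NA (normally with a warning)
sizes.bad <- sizes
suppressWarnings(sizes.bad[2] <- "medium")
is.na(sizes.bad[2])
## [1] TRUE

# Adding the level first makes the same assignment succeed
sizes.ok <- sizes
levels(sizes.ok) <- c(levels(sizes.ok), "medium")
sizes.ok[2] <- "medium"
as.character(sizes.ok)
## [1] "small"  "medium" "small"
```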
However, observe that "Texas" is still included in the factor levels: the output reports 51 levels. This is because adding a level with levels() does not remove existing ones, and a level stays in the levels attribute even when no element uses it. To remove "Texas" from the factor levels, call the droplevels() function after updating the values. For example:
# Remove all the unused levels, including "Texas"
state.name.factor <- droplevels(state.name.factor)
print(state.name.factor)
## [1] Alabama Alaska Arizona Arkansas California
## [6] Colorado Connecticut Delaware Florida Georgia
## [11] Hawaii Idaho Illinois Indiana Iowa
## [16] Kansas Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi Missouri
## [26] Montana Nebraska Nevada New Hampshire New Jersey
## [31] New Mexico New York North Carolina North Dakota Ohio
## [36] Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee Puerto Rico Utah Vermont
## [46] Virginia Washington West Virginia Wisconsin Wyoming
## 50 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Puerto Rico
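Unused levels matter in practice because functions such as table() report every level, including those with no observations. A short sketch with an invented shade factor shows the effect before and after droplevels():

```r
# Hypothetical factor where the level "blue" is declared but never used
shade <- factor(c("red", "green", "red"), levels = c("red", "green", "blue"))

# table() counts every level, so "blue" appears with a count of 0
table(shade)
nlevels(shade)
## [1] 3

# droplevels() keeps only the levels that actually occur
shade.trimmed <- droplevels(shade)
levels(shade.trimmed)
## [1] "red"   "green"
```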
Similarly, to rename an existing factor level, assign a new name to the corresponding element of levels().
pr.index <- which(levels(state.name.factor) == "Puerto Rico")
# The People's Republic of Puerto Rico
levels(state.name.factor)[pr.index] <- "The PR of PR"
print(state.name.factor)
## [1] Alabama Alaska Arizona Arkansas California
## [6] Colorado Connecticut Delaware Florida Georgia
## [11] Hawaii Idaho Illinois Indiana Iowa
## [16] Kansas Kentucky Louisiana Maine Maryland
## [21] Massachusetts Michigan Minnesota Mississippi Missouri
## [26] Montana Nebraska Nevada New Hampshire New Jersey
## [31] New Mexico New York North Carolina North Dakota Ohio
## [36] Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## [41] South Dakota Tennessee The PR of PR Utah Vermont
## [46] Virginia Washington West Virginia Wisconsin Wyoming
## 50 Levels: Alabama Alaska Arizona Arkansas California Colorado ... The PR of PR
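The same renaming technique can also merge levels: if a level is renamed to a name that another level already has, R combines the two. The sketch below assumes a made-up answers vector with inconsistent spellings:

```r
# Hypothetical survey answers with inconsistent spellings
answers <- factor(c("yes", "y", "no", "yes", "n"))
levels(answers)
## [1] "n"   "no"  "y"   "yes"

# Renaming "y" to "yes" and "n" to "no" collapses the duplicates
levels(answers)[levels(answers) == "y"] <- "yes"
levels(answers)[levels(answers) == "n"] <- "no"

levels(answers)
## [1] "no"  "yes"
as.character(answers)
## [1] "yes" "yes" "no"  "yes" "no"
```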
Ordered Factors
Factors can also be ordered, so that their levels represent a natural ranking or hierarchy. To create an ordered factor, pass the argument ordered = TRUE to factor(); the order of the levels argument then defines the ranking. For example:
survey <- c("strongly agree", "agree", "disagree", "strongly disagree")
survey.factor <- factor(survey, levels = survey, ordered = TRUE)
survey.factor
## [1] strongly agree agree disagree strongly disagree
## Levels: strongly agree < agree < disagree < strongly disagree
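What an ordered factor buys you is that comparison and sorting follow the level order rather than the alphabet. A minimal sketch, using an invented sizes factor rather than the survey data:

```r
# Hypothetical ordered factor: small < medium < large
sizes <- factor(c("medium", "small", "large"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# Comparison operators respect the level order
sizes[1] > sizes[2]
## [1] TRUE

# Elements can be compared against a level name
at.least.medium <- sizes >= "medium"
at.least.medium
## [1]  TRUE FALSE  TRUE

# sort() and max() also use the level order
sizes.sorted <- sort(sizes)
largest <- max(sizes)
```

With an unordered factor, the < and > comparisons are not meaningful and return NA with a warning.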
To reverse the order of the factor levels, pass the levels through the rev() function. For example:
survey.rev.factor <- factor(survey, levels = rev(survey), ordered = TRUE)
survey.rev.factor
## [1] strongly agree agree disagree strongly disagree
## Levels: strongly disagree < disagree < agree < strongly agree
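If only the factor is at hand and not the original character vector, the levels can be reversed from the factor itself. The following is a sketch under that assumption, with a made-up ratings factor:

```r
# Hypothetical ordered factor: low < medium < high
ratings <- factor(c("low", "high", "medium"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)

# Rebuild the factor with its own levels reversed
ratings.rev <- factor(ratings, levels = rev(levels(ratings)), ordered = TRUE)
levels(ratings.rev)
## [1] "high"   "medium" "low"

# The values are unchanged; only the ranking flips
as.character(ratings.rev)
## [1] "low"    "high"   "medium"
ratings[1] < ratings[2]
## [1] TRUE
ratings.rev[1] < ratings.rev[2]
## [1] FALSE
```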