In data analysis, categorical variables are those that take one of a finite set of discrete values. R provides a dedicated data structure, the factor, to represent them. A factor looks like a character vector, but its values are restricted to a pre-defined set of valid levels.

To construct a factor, use the factor() function. For example, state.name in the code below is a built-in character vector containing the 50 U.S. state names, and passing it to factor() produces a factor with one level per state.

state.name
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
str(state.name)
##  chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" ...
state.name.factor <- factor(state.name)
str(state.name.factor)
##  Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
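
As a small illustration of what "pre-defined valid levels" means, the sketch below builds a factor with an explicit levels argument. The vector sizes and its values are made up for this example. A value that is not among the declared levels becomes NA.

```r
# Hypothetical data: "huge" is not one of the declared levels
sizes <- c("small", "large", "medium", "huge")
size.factor <- factor(sizes, levels = c("small", "medium", "large"))

levels(size.factor)   # the declared levels, in the order given
is.na(size.factor)    # TRUE only for the value outside the levels
```

Declaring the levels up front is one way to catch typos or unexpected categories early, since they show up as missing values instead of silently becoming new categories.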

Internally, factors are not stored as strings. A factor is an integer vector in which each element is an index into the factor's levels attribute, a character vector holding the name of each level. You can inspect these attributes with attributes().

attributes(state.name.factor)
## $levels
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"       
## $class
## [1] "factor"
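
The integer representation can be made visible with a few base R calls. This is a minimal sketch on a made-up vector (ratings is a hypothetical name, not part of the example above):

```r
# Hypothetical toy data
ratings <- factor(c("low", "high", "low", "medium"))

typeof(ratings)       # "integer": the underlying storage type
levels(ratings)       # "high" "low" "medium" (sorted by default)
as.integer(ratings)   # 2 1 2 3: each element's position in levels

# Indexing the levels by the integer codes recovers the original strings
levels(ratings)[as.integer(ratings)]
```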

How to Modify Factor Levels

Factor levels often need to be modified as the data change. For example, suppose that Texas gained independence from the United States and Puerto Rico became a state. To replace the value Texas with Puerto Rico:

levels(state.name.factor) <- c(levels(state.name.factor), "Puerto Rico")
state.name.factor[state.name.factor == "Texas"] <- "Puerto Rico"
state.name.factor
##  [1] Alabama        Alaska         Arizona        Arkansas       California    
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Puerto Rico    Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 51 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Puerto Rico

In the code chunk above, I first appended a new level, "Puerto Rico", to the factor with the levels() replacement function. Only once "Puerto Rico" is present in the levels attribute can the element holding "Texas" be reassigned to "Puerto Rico".
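
To see why the level has to be added first, here is a short sketch on a made-up factor (letters.factor, bad, and good are hypothetical names). Assigning a value that is not among the levels makes R emit a warning about an invalid factor level and store NA instead.

```r
letters.factor <- factor(c("a", "b", "c"))   # hypothetical toy factor

# Assigning a value that is not a level triggers a warning and stores NA
bad <- letters.factor
bad[2] <- "z"
is.na(bad[2])

# Appending the level first makes the same assignment valid
good <- letters.factor
levels(good) <- c(levels(good), "z")
good[2] <- "z"
as.character(good)
```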

Note, however, that "Texas" is still among the factor levels: the output reports 51 levels. Levels are fixed when the factor is created, and neither reassigning values nor appending levels with levels() removes a level, even one that no element uses any longer. To remove the unused level "Texas", call droplevels() after the update. For example:

# Remove all the unused levels, including "Texas"
state.name.factor <- droplevels(state.name.factor)
state.name.factor
##  [1] Alabama        Alaska         Arizona        Arkansas       California    
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      Puerto Rico    Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 50 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Puerto Rico
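
The same behavior shows up whenever a factor is subset. The sketch below uses a made-up factor (colors.factor is a hypothetical name) to show that an unused level survives subsetting, and that either droplevels() or a fresh call to factor() removes it.

```r
colors.factor <- factor(c("red", "green", "blue"))   # hypothetical toy factor

# Subsetting keeps every level, even ones with no remaining elements
subset.factor <- colors.factor[colors.factor != "blue"]
levels(subset.factor)               # still includes "blue"
table(subset.factor)                # "blue" appears with a count of 0

# Two ways to discard unused levels
levels(droplevels(subset.factor))   # "green" "red"
levels(factor(subset.factor))       # "green" "red"
```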

Similarly, to rename an existing factor level, assign the new name through the levels() function.

pr.index <- which(levels(state.name.factor) == "Puerto Rico")

# The People's Republic of Puerto Rico
levels(state.name.factor)[pr.index] <- "The PR of PR"
state.name.factor
##  [1] Alabama        Alaska         Arizona        Arkansas       California    
##  [6] Colorado       Connecticut    Delaware       Florida        Georgia       
## [11] Hawaii         Idaho          Illinois       Indiana        Iowa          
## [16] Kansas         Kentucky       Louisiana      Maine          Maryland      
## [21] Massachusetts  Michigan       Minnesota      Mississippi    Missouri      
## [26] Montana        Nebraska       Nevada         New Hampshire  New Jersey    
## [31] New Mexico     New York       North Carolina North Dakota   Ohio          
## [36] Oklahoma       Oregon         Pennsylvania   Rhode Island   South Carolina
## [41] South Dakota   Tennessee      The PR of PR   Utah           Vermont       
## [46] Virginia       Washington     West Virginia  Wisconsin      Wyoming       
## 50 Levels: Alabama Alaska Arizona Arkansas California Colorado ... The PR of PR
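
Renaming a level only changes the label stored in the levels attribute. The underlying integer codes stay the same, so every element that used the old label picks up the new one at once. A minimal sketch on made-up data (grades is a hypothetical name):

```r
grades <- factor(c("pass", "fail", "pass", "incomplete"))   # hypothetical toy factor
codes.before <- as.integer(grades)

# Rename the level "incomplete" to "pending"
levels(grades)[levels(grades) == "incomplete"] <- "pending"

as.character(grades)                          # "pass" "fail" "pass" "pending"
identical(codes.before, as.integer(grades))   # TRUE: the codes did not change
```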

Ordered Factors

Factors can also be ordered, so that their levels represent a natural ranking or hierarchy. To create an ordered factor, pass the argument ordered = TRUE to factor(). For example:

survey <- c("strongly agree", "agree", "disagree", "strongly disagree")
survey.factor <- factor(survey, levels = survey, ordered = TRUE)
survey.factor
## [1] strongly agree    agree             disagree          strongly disagree
## Levels: strongly agree < agree < disagree < strongly disagree
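
What the ordering buys you is comparison. The sketch below, on a made-up factor (shirt.sizes is a hypothetical name), shows that an ordered factor supports relational operators, max(), and sort(), all of which follow the level order instead of alphabetical order.

```r
# Hypothetical toy data with an explicit level order
shirt.sizes <- factor(c("medium", "small", "large"),
                      levels = c("small", "medium", "large"),
                      ordered = TRUE)

is.ordered(shirt.sizes)           # TRUE
shirt.sizes < "large"             # TRUE TRUE FALSE
as.character(max(shirt.sizes))    # "large"
as.character(sort(shirt.sizes))   # "small" "medium" "large"
```

On an unordered factor, the < comparison is not meaningful and R returns NA with a warning.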

To reverse the order of the factor levels, pass the levels through the rev() function. For example:

survey.rev.factor <- factor(survey, levels = rev(survey), ordered = TRUE)
survey.rev.factor
## [1] strongly agree    agree             disagree          strongly disagree
## Levels: strongly disagree < disagree < agree < strongly agree
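
If only the factor is at hand and not the original character vector, the same reversal can be sketched by reusing the factor's own levels. The names responses and responses.rev are hypothetical. The values themselves are unchanged, but comparisons flip.

```r
survey <- c("strongly agree", "agree", "disagree", "strongly disagree")
responses <- factor(survey, levels = survey, ordered = TRUE)

# Rebuild the factor with its levels reversed
responses.rev <- factor(responses, levels = rev(levels(responses)), ordered = TRUE)

responses[1] < responses[2]           # TRUE:  strongly agree < agree
responses.rev[1] < responses.rev[2]   # FALSE: the order is now reversed
```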

