Vectorization is a way to avoiding explicit for loops in your code. In fact, for and while loops can be slower than optimized vectorizations in may interpreted languages, including R. This is particularly true when you have complex and extensive iterations.
One powerful feature of R as a scientific programming language is that it natively supports vectorization. That is you can perform operations or apply functions on entire vectors, matrices, lists, and data frames simultaneously, instead of iterating through them element by element. In fact, many built-in functions in R natively supports vecorization for better performance. For example, to create a sequence vector:
x1 <- seq(1, 11, by = 2)
x1
## [1] 1 3 5 7 9 11
Time for this code chunk to run: 0.00712203979492188
The code line above is much efficient than the code lines with the for loops below:
x2 <- numeric(6)
for(i in 1:length(x2)) {
x2[i] = 1 + 2 * (i - 1)
}
x2
## [1] 1 3 5 7 9 11
Time for this code chunk to run: 0.0305929183959961
However, not every function in R is vectorized, and we often want to
go beyond the built-in functions. In such cases, *apply
family of functions can help us avoiding iterations.
Applying to One-dimensional Structures
The lapply
and sapply
will apply a function
to a one-dimensional structure. The basic syntax for lapply
and sapply
is:
lapply(x, fun, ...)
sapply(x, fun, ...)
Where x
is a vector or list that the function should be
applied to, and fun
is the function that you want to apply
on each element in x
. When there are arguments that you
want to pass in to the fun
, you can specify them in the
...
part either by name or position. For example, let
list1
be a list of length 3 whose elements are vectors with
5 numbers, and get_weighted_mean
be a function that
calculates a mean on pre-defined weights. If you want to calculate the
mean of each vector in the list, you can use lapply
as
follows:
get_weighted_mean <- function(numbers, weights) {
if (length(numbers) == 0 || length(weights) == 0 || length(numbers) != length(weights)) {
stop("Input vectors are empty or of different lengths. Weighted mean is undefined.")
}
weighted_mean <- sum(numbers * weights) / sum(weights)
return(weighted_mean)
}
list1 <- list(1:5, 6:10, 11:15)
lapply(list1, get_weighted_mean, c(0.4, 0.3, 0.2, 0.1, 0))
## [[1]]
## [1] 2
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] 12
The difference between lapply
and sapply
is
how the returning output is formatted. The lapply
returns a list containing each calculated results from x
as
its elements. On the other hand, sapply
tries to simplify the output to a vector or matrix. It only
creates a list if necessary, making it potentially more memory efficient
for compatible results. When each element of the function output is of
length 1, the sapply
returns a vector. For example:
sapply(list1, get_weighted_mean, c(0.4, 0.3, 0.2, 0.1, 0))
## [1] 2 7 12
If each element has the same length of two or more, it returns a
matrix. Otherwise, it returns a list, just as lapply
.
sapply(list1, abs)
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
Applying to Two-diemnsional Structures
For two-dimensional structures, such as matrices or data frames, you
can use apply
to avoid iterations. The basic syntax of the
apply
is:
apply(mat, dim, fun, ... )
Where mat
is the matrix or data frame you want to apply
the function to, dim
is the direction the function applies,
fun
is the function you want to apply, and ...
are extra arguments to be passed to the function. The dim
can take either 1 or 2: 1 means apply
the
function to each row, while 2
means apply
the
function to each column in mat
. For
example:
x <- matrix(sample(-5:5, size = 20, replace = TRUE), nrow = 4, ncol = 5)
x
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 3 -3 -4 2
## [2,] 1 -4 4 -4 -4
## [3,] 5 -3 -1 -3 -1
## [4,] 1 5 2 3 5
apply(x, 1, mean)
## [1] 0.6 -1.4 -0.6 3.2
Time for this code chunk to run: 0.00559306144714355
apply(x, 2, mean)
## [1] 3.00 0.25 0.50 -2.00 0.50
Time for this code chunk to run: 0.00534391403198242
This is equivalent to:
result = vector()
for(i in 1:nrow(x)) {
result[i] = mean(x[i,])
}
print(result)
## [1] 0.6 -1.4 -0.6 3.2
Time for this code chunk to run: 0.0119569301605225
result = vector()
for(j in 1:ncol(x)) {
result[j] = mean(x[,j])
}
print(result)
## [1] 3.00 0.25 0.50 -2.00 0.50
Time for this code chunk to run: 0.0111289024353027
While we could obtain the same results, taking advantage of
apply
function can make the R code much efficient! The
apply
function assumes the rows are vectors, meaning
it will coerce the row! Particularly, in a
data frame row observation, we have different types of data. So, you
should be a bit careful when you use apply
function on a
data frame. For example, in the code lines below, the apply
function will coerce the age
values in each row to a
character. Thus, R will throw an error when it tries to execute
age + 3
for the first row.
triwizard <- data.frame(
character = c("Harry Potter", "Cedric Diggory", "Viktor Krum", "Fleur Delacour", "George Weasley"),
age = c(14, 17, 18, 17, 16)
)
goblet_of_fire <- function(row) {
chr = row[1] # first element of row is character
age = row[2] # second element of row is wand length
if(chr == "Harry Potter") {
return(age + 3)
} else if(age >= 17) {
return("Triwizard Champion")
}
"You should be at least 17!"
}
print(triwizard)
## character age
## 1 Harry Potter 14
## 2 Cedric Diggory 17
## 3 Viktor Krum 18
## 4 Fleur Delacour 17
## 5 George Weasley 16
apply(triwizard, 1, goblet_of_fire)
## Error in age + 3: non-numeric argument to binary operator
If you want to apply a function to a data frame that are of different
types, a better and safer way is to use mapply
. The
mapply
function applies a function to multiple vectors of
different types.
The syntax of the mapply
is:
mapply(fun, vec1, vec2, ... , vecN)
Where fun
is the function you want to execute and
vec1
, vec2
, …, vecN
are column
vectors that the fun
will be applied on. The
mapply
will return an output as a list, and the i-th
element of the output will be
fun(vec1[i], vec2[i], ... , vecN[i])
. For example:
triwizard <- data.frame(
character = c("Harry Potter", "Cedric Diggory", "Viktor Krum", "Fleur Delacour", "George Weasley"),
age = c(14, 17, 18, 17, 16)
)
goblet_of_fire <- function(chr, age) {
if(chr == "Harry Potter") {
return(age + 3)
} else if(age >= 17) {
return("Triwizard Champion")
}
"You should be at least 17!"
}
mapply(goblet_of_fire, triwizard$character, triwizard$age)
## Harry Potter Cedric Diggory
## "17" "Triwizard Champion"
## Viktor Krum Fleur Delacour
## "Triwizard Champion" "Triwizard Champion"
## George Weasley
## "You should be at least 17!"
The split
/apply
/combine
Paradigm
The *apply
functions we’ve seen so far applies a
function element-by-element to a vector or matrix. What if
we need to apply a function not to individual elements/rows/columns, but
to subsets of different sizes? In this case, you cannot
directly use the apply
function. Instead, you can use the
split/apply/combine paradigm.
This programming pattern is pretty clear from its name. The split step involves dividing a dataset into groups based on some criteria. Then you would apply a function to each subset of data in the apply step. Lastly, in the combine step, you’re going to combine the results into a single output data. This programming pattern helps increasing readability and maintainability of your R program.
split/apply/combine for vectors
This is the simplest use case of applying the split/apply/combine
pattern. We have a vector and a grouping factor. We want to apply a
function to groups of elements of the vector. The solution is
tapply(x, fac, fun)
, where x
is the vector,
fun
is the function, and fac
is the factor
indicating how the data should be split up.
The data presented below pertains to police stops conducted in New
York City during the years 1998-1999. It includes information on the
number of stops, categorized according to specific variables. In
particular, we have counts of police stops for all combinations of
precinct
(75 total), eth
(ethnicity of the
person stopped, 1 = Black, 2 = Hispanic, 3 = White), and
crime
(type of crime, 1 = violent, 2 = weapons, 3 =
property, 4 = drug) .
url <- "http://www.stat.columbia.edu/~gelman/arm/examples/police/frisk_with_noise.dat"
frisk <- read.table(url, skip = 6, header = TRUE)
head(frisk)
## stops pop past.arrests precinct eth crime
## 1 75 1720 191 1 1 1
## 2 36 1720 57 1 1 2
## 3 74 1720 599 1 1 3
## 4 17 1720 133 1 1 4
## 5 37 1368 62 1 2 1
## 6 39 1368 27 1 2 2
Suppose we want to get the total number of stops in each value of
precinct
. We can use the tapply
for the
purpose:
tapply(frisk$stops, # values apply function to
frisk$precinct, # grouping factor
sum) # function to apply
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 385 347 1603 1411 1600 1297 649 1572 285 1156 838 884 2215 1165 2373 965
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005 902 2888 2157 1805 1041
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2966 2810 724 1189 997 1808 2464 859 1979 2679 1645 1669 2953 3443 1001 1574
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 1041 3054 1208 2116 1216 995 1140 572 856 2671 1147 2941 1625 1635 1248 1351
## 65 66 67 68 69 70 71 72 73 74 75
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327 322
split/apply/combine for Data Frames
To split a data frame into groups of rows and apply a function to the
groups, you should use by(df, fac, fun)
, where
df
is a data frame, fun
is the function, and
fac
is a factor variable indicating the groups you want to
split df
into. For example:
library(ggplot2)
data(diamonds)
get_diamond_coefficients <- function(data_subset) {
diamond_lm = lm(log(price) ~ carat, data = data_subset)
diamond_coefficients = coef(diamond_lm)
return(diamond_coefficients)
}
# by does the split and apply steps and an ugly version of a combine step
out <- by(diamonds, diamonds$color, get_diamond_coefficients)
out
## diamonds$color: D
## (Intercept) carat
## 6.048811 2.383864
## ------------------------------------------------------------
## diamonds$color: E
## (Intercept) carat
## 6.034513 2.348335
## ------------------------------------------------------------
## diamonds$color: F
## (Intercept) carat
## 6.088442 2.272790
## ------------------------------------------------------------
## diamonds$color: G
## (Intercept) carat
## 6.109554 2.178489
## ------------------------------------------------------------
## diamonds$color: H
## (Intercept) carat
## 6.180284 1.906300
## ------------------------------------------------------------
## diamonds$color: I
## (Intercept) carat
## 6.175315 1.799199
## ------------------------------------------------------------
## diamonds$color: J
## (Intercept) carat
## 6.254074 1.627947
The output is a list with some extra attributes. The class is
by
as we can see below.
typeof(out)
## [1] "list"
str(out)
## List of 7
## $ D: Named num [1:2] 6.05 2.38
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
## $ E: Named num [1:2] 6.03 2.35
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
## $ F: Named num [1:2] 6.09 2.27
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
## $ G: Named num [1:2] 6.11 2.18
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
## $ H: Named num [1:2] 6.18 1.91
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
## $ I: Named num [1:2] 6.18 1.8
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
## $ J: Named num [1:2] 6.25 1.63
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
## - attr(*, "dim")= int 7
## - attr(*, "dimnames")=List of 1
## ..$ diamonds$color: chr [1:7] "D" "E" "F" "G" ...
## - attr(*, "call")= language by.data.frame(data = diamonds, INDICES = diamonds$color, FUN = get_diamond_coefficients)
## - attr(*, "class")= chr "by"
The output format is not ideal on its own, and usually we want to simplify this. You have a couple of options, including examples below:
simplify2array(out)
## D E F G H I J
## (Intercept) 6.048811 6.034513 6.088442 6.109554 6.180284 6.175315 6.254074
## carat 2.383864 2.348335 2.272790 2.178489 1.906300 1.799199 1.627947
# changing the output from a list to a matrix
do.call(rbind, out)
## (Intercept) carat
## D 6.048811 2.383864
## E 6.034513 2.348335
## F 6.088442 2.272790
## G 6.109554 2.178489
## H 6.180284 1.906300
## I 6.175315 1.799199
## J 6.254074 1.627947
Note: the by
function can also be used
with multiple grouping factors. The syntax in the case is
by(df, df[, c(factor_variable1, factor_variable2)], fun)
.
If you want to do split and apply/combine separately
The split
function can perform just split step in the
paradigm, and you can do the rest of the steps separately. The basic
syntax for the split
function is:
split(x, fac)
Where x
is your data1 and fac
is the factor variable
to group it.
The split
function returns a list whose length equals to
the number of levels in fac
. Each element in the list
contains data from x
depending on its factor level of
fac
2. For example:
frisk_split_by_precinct <- split(frisk$stops, frisk$precinct)
head(frisk_split_by_precinct, 4); tail(frisk_split_by_precinct, 4)
## $`1`
## [1] 75 36 74 17 37 39 23 3 26 32 10 13
##
## $`2`
## [1] 73 37 9 13 77 40 16 11 18 14 9 30
##
## $`3`
## [1] 293 245 112 102 131 149 99 62 110 88 116 96
##
## $`4`
## [1] 203 116 24 42 328 324 88 157 26 27 30 46
## $`72`
## [1] 127 332 206 37 444 841 1054 128 63 96 197 35
##
## $`73`
## [1] 858 982 287 365 176 160 112 73 135 136 216 150
##
## $`74`
## [1] 90 59 33 15 55 60 72 47 226 213 288 169
##
## $`75`
## [1] 3 2 2 0 5 6 7 2 111 48 115 21
After splitting, we can then apply sapply
:
sapply(frisk_split_by_precinct, sum)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 385 347 1603 1411 1600 1297 649 1572 285 1156 838 884 2215 1165 2373 965
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005 902 2888 2157 1805 1041
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2966 2810 724 1189 997 1808 2464 859 1979 2679 1645 1669 2953 3443 1001 1574
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 1041 3054 1208 2116 1216 995 1140 572 856 2671 1147 2941 1625 1635 1248 1351
## 65 66 67 68 69 70 71 72 73 74 75
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327 322