Regression Diagnostics

 

Vectorization is a way to avoiding explicit for loops in your code. In fact, for and while loops can be slower than optimized vectorizations in may interpreted languages, including R. This is particularly true when you have complex and extensive iterations.


One powerful feature of R as a scientific programming language is that it natively supports vectorization. That is you can perform operations or apply functions on entire vectors, matrices, lists, and data frames simultaneously, instead of iterating through them element by element. In fact, many built-in functions in R natively supports vecorization for better performance. For example, to create a sequence vector:

x1 <- seq(1, 11, by = 2)
x1
## [1]  1  3  5  7  9 11

Time for this code chunk to run: 0.00712203979492188


The code line above is much efficient than the code lines with the for loops below:

x2 <- numeric(6)
for(i in 1:length(x2)) {
    x2[i] = 1 + 2 * (i - 1)
}
x2
## [1]  1  3  5  7  9 11

Time for this code chunk to run: 0.0305929183959961


However, not every function in R is vectorized, and we often want to go beyond the built-in functions. In such cases, *apply family of functions can help us avoiding iterations.



Applying to One-dimensional Structures

The lapply and sapply will apply a function to a one-dimensional structure. The basic syntax for lapply and sapply is:

lapply(x, fun, ...)
sapply(x, fun, ...)


Where x is a vector or list that the function should be applied to, and fun is the function that you want to apply on each element in x. When there are arguments that you want to pass in to the fun, you can specify them in the ... part either by name or position. For example, let list1 be a list of length 3 whose elements are vectors with 5 numbers, and get_weighted_mean be a function that calculates a mean on pre-defined weights. If you want to calculate the mean of each vector in the list, you can use lapply as follows:

get_weighted_mean <- function(numbers, weights) {
  if (length(numbers) == 0 || length(weights) == 0 || length(numbers) != length(weights)) {
    stop("Input vectors are empty or of different lengths. Weighted mean is undefined.")
  }

  weighted_mean <- sum(numbers * weights) / sum(weights)

  return(weighted_mean)
}

list1 <- list(1:5, 6:10, 11:15)
lapply(list1, get_weighted_mean, c(0.4, 0.3, 0.2, 0.1, 0))  
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] 12


The difference between lapply and sapply is how the returning output is formatted. The lapply returns a list containing each calculated results from x as its elements. On the other hand, sapply tries to simplify the output to a vector or matrix. It only creates a list if necessary, making it potentially more memory efficient for compatible results. When each element of the function output is of length 1, the sapply returns a vector. For example:

sapply(list1, get_weighted_mean, c(0.4, 0.3, 0.2, 0.1, 0))
## [1]  2  7 12


If each element has the same length of two or more, it returns a matrix. Otherwise, it returns a list, just as lapply.

sapply(list1, abs)
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15



Applying to Two-diemnsional Structures

For two-dimensional structures, such as matrices or data frames, you can use apply to avoid iterations. The basic syntax of the apply is:

apply(mat, dim, fun, ... )


Where mat is the matrix or data frame you want to apply the function to, dim is the direction the function applies, fun is the function you want to apply, and ... are extra arguments to be passed to the function. The dim can take either 1 or 2: 1 means apply the function to each row, while 2 means apply the function to each column in mat. For example:

x <- matrix(sample(-5:5, size = 20, replace = TRUE), nrow = 4, ncol = 5)
x
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5    3   -3   -4    2
## [2,]    1   -4    4   -4   -4
## [3,]    5   -3   -1   -3   -1
## [4,]    1    5    2    3    5
apply(x, 1, mean)
## [1]  0.6 -1.4 -0.6  3.2

Time for this code chunk to run: 0.00559306144714355

apply(x, 2, mean)
## [1]  3.00  0.25  0.50 -2.00  0.50

Time for this code chunk to run: 0.00534391403198242


This is equivalent to:

result = vector()
for(i in 1:nrow(x)) {
  result[i] = mean(x[i,])
}
print(result)
## [1]  0.6 -1.4 -0.6  3.2

Time for this code chunk to run: 0.0119569301605225

result = vector()
for(j in 1:ncol(x)) {
  result[j] = mean(x[,j])
}
print(result)
## [1]  3.00  0.25  0.50 -2.00  0.50

Time for this code chunk to run: 0.0111289024353027


While we could obtain the same results, taking advantage of apply function can make the R code much efficient! The apply function assumes the rows are vectors, meaning it will coerce the row! Particularly, in a data frame row observation, we have different types of data. So, you should be a bit careful when you use apply function on a data frame. For example, in the code lines below, the apply function will coerce the age values in each row to a character. Thus, R will throw an error when it tries to execute age + 3 for the first row.

triwizard <- data.frame(
  character = c("Harry Potter", "Cedric Diggory", "Viktor Krum", "Fleur Delacour", "George Weasley"),
  age = c(14, 17, 18, 17, 16)
)

goblet_of_fire <- function(row) {
  chr = row[1] # first element of row is character
  age = row[2] # second element of row is wand length
  
  if(chr == "Harry Potter") {
    return(age + 3)
  } else if(age >= 17) {
    return("Triwizard Champion")
  }
  "You should be at least 17!"
}
print(triwizard)
##        character age
## 1   Harry Potter  14
## 2 Cedric Diggory  17
## 3    Viktor Krum  18
## 4 Fleur Delacour  17
## 5 George Weasley  16
apply(triwizard, 1, goblet_of_fire)
## Error in age + 3: non-numeric argument to binary operator


If you want to apply a function to a data frame that are of different types, a better and safer way is to use mapply. The mapply function applies a function to multiple vectors of different types.

The syntax of the mapply is:

mapply(fun, vec1, vec2, ... , vecN)


Where fun is the function you want to execute and vec1, vec2, …, vecN are column vectors that the fun will be applied on. The mapply will return an output as a list, and the i-th element of the output will be fun(vec1[i], vec2[i], ... , vecN[i]). For example:

triwizard <- data.frame(
  character = c("Harry Potter", "Cedric Diggory", "Viktor Krum", "Fleur Delacour", "George Weasley"),
  age = c(14, 17, 18, 17, 16)
)

goblet_of_fire <- function(chr, age) {
  if(chr == "Harry Potter") {
    return(age + 3)
  } else if(age >= 17) {
    return("Triwizard Champion")
  }
  "You should be at least 17!"
}

mapply(goblet_of_fire, triwizard$character, triwizard$age)
##                 Harry Potter               Cedric Diggory 
##                         "17"         "Triwizard Champion" 
##                  Viktor Krum               Fleur Delacour 
##         "Triwizard Champion"         "Triwizard Champion" 
##               George Weasley 
## "You should be at least 17!"



The split/apply/combine Paradigm

The *apply functions we’ve seen so far applies a function element-by-element to a vector or matrix. What if we need to apply a function not to individual elements/rows/columns, but to subsets of different sizes? In this case, you cannot directly use the apply function. Instead, you can use the split/apply/combine paradigm.

This programming pattern is pretty clear from its name. The split step involves dividing a dataset into groups based on some criteria. Then you would apply a function to each subset of data in the apply step. Lastly, in the combine step, you’re going to combine the results into a single output data. This programming pattern helps increasing readability and maintainability of your R program.


split/apply/combine for vectors

This is the simplest use case of applying the split/apply/combine pattern. We have a vector and a grouping factor. We want to apply a function to groups of elements of the vector. The solution is tapply(x, fac, fun), where x is the vector, fun is the function, and fac is the factor indicating how the data should be split up.

The data presented below pertains to police stops conducted in New York City during the years 1998-1999. It includes information on the number of stops, categorized according to specific variables. In particular, we have counts of police stops for all combinations of precinct (75 total), eth (ethnicity of the person stopped, 1 = Black, 2 = Hispanic, 3 = White), and crime (type of crime, 1 = violent, 2 = weapons, 3 = property, 4 = drug) .

url <- "http://www.stat.columbia.edu/~gelman/arm/examples/police/frisk_with_noise.dat"

frisk <- read.table(url, skip = 6, header = TRUE)
head(frisk)
##   stops  pop past.arrests precinct eth crime
## 1    75 1720          191        1   1     1
## 2    36 1720           57        1   1     2
## 3    74 1720          599        1   1     3
## 4    17 1720          133        1   1     4
## 5    37 1368           62        1   2     1
## 6    39 1368           27        1   2     2


Suppose we want to get the total number of stops in each value of precinct. We can use the tapply for the purpose:

tapply(frisk$stops, # values apply function to
       frisk$precinct, # grouping factor
       sum) # function to apply
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  385  347 1603 1411 1600 1297  649 1572  285 1156  838  884 2215 1165 2373  965 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005  902 2888 2157 1805 1041 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
## 2966 2810  724 1189  997 1808 2464  859 1979 2679 1645 1669 2953 3443 1001 1574 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
## 1041 3054 1208 2116 1216  995 1140  572  856 2671 1147 2941 1625 1635 1248 1351 
##   65   66   67   68   69   70   71   72   73   74   75 
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327  322


split/apply/combine for Data Frames

To split a data frame into groups of rows and apply a function to the groups, you should use by(df, fac, fun), where df is a data frame, fun is the function, and fac is a factor variable indicating the groups you want to split df into. For example:

library(ggplot2)
data(diamonds)

get_diamond_coefficients <- function(data_subset) {
    diamond_lm = lm(log(price) ~ carat, data = data_subset)
    diamond_coefficients = coef(diamond_lm)
    return(diamond_coefficients)
}

# by does the split and apply steps and an ugly version of a combine step
out <- by(diamonds, diamonds$color, get_diamond_coefficients)
out
## diamonds$color: D
## (Intercept)       carat 
##    6.048811    2.383864 
## ------------------------------------------------------------ 
## diamonds$color: E
## (Intercept)       carat 
##    6.034513    2.348335 
## ------------------------------------------------------------ 
## diamonds$color: F
## (Intercept)       carat 
##    6.088442    2.272790 
## ------------------------------------------------------------ 
## diamonds$color: G
## (Intercept)       carat 
##    6.109554    2.178489 
## ------------------------------------------------------------ 
## diamonds$color: H
## (Intercept)       carat 
##    6.180284    1.906300 
## ------------------------------------------------------------ 
## diamonds$color: I
## (Intercept)       carat 
##    6.175315    1.799199 
## ------------------------------------------------------------ 
## diamonds$color: J
## (Intercept)       carat 
##    6.254074    1.627947


The output is a list with some extra attributes. The class is by as we can see below.

typeof(out)
## [1] "list"
str(out)
## List of 7
##  $ D: Named num [1:2] 6.05 2.38
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
##  $ E: Named num [1:2] 6.03 2.35
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
##  $ F: Named num [1:2] 6.09 2.27
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
##  $ G: Named num [1:2] 6.11 2.18
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
##  $ H: Named num [1:2] 6.18 1.91
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
##  $ I: Named num [1:2] 6.18 1.8
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
##  $ J: Named num [1:2] 6.25 1.63
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "carat"
##  - attr(*, "dim")= int 7
##  - attr(*, "dimnames")=List of 1
##   ..$ diamonds$color: chr [1:7] "D" "E" "F" "G" ...
##  - attr(*, "call")= language by.data.frame(data = diamonds, INDICES = diamonds$color, FUN = get_diamond_coefficients)
##  - attr(*, "class")= chr "by"


The output format is not ideal on its own, and usually we want to simplify this. You have a couple of options, including examples below:

simplify2array(out)
##                    D        E        F        G        H        I        J
## (Intercept) 6.048811 6.034513 6.088442 6.109554 6.180284 6.175315 6.254074
## carat       2.383864 2.348335 2.272790 2.178489 1.906300 1.799199 1.627947
# changing the output from a list to a matrix
do.call(rbind, out)
##   (Intercept)    carat
## D    6.048811 2.383864
## E    6.034513 2.348335
## F    6.088442 2.272790
## G    6.109554 2.178489
## H    6.180284 1.906300
## I    6.175315 1.799199
## J    6.254074 1.627947


Note: the by function can also be used with multiple grouping factors. The syntax in the case is by(df, df[, c(factor_variable1, factor_variable2)], fun).


If you want to do split and apply/combine separately

The split function can perform just split step in the paradigm, and you can do the rest of the steps separately. The basic syntax for the split function is:

split(x, fac)


Where x is your data1 and fac is the factor variable to group it.

The split function returns a list whose length equals to the number of levels in fac. Each element in the list contains data from x depending on its factor level of fac2. For example:

frisk_split_by_precinct <- split(frisk$stops, frisk$precinct)
head(frisk_split_by_precinct, 4); tail(frisk_split_by_precinct, 4)
## $`1`
##  [1] 75 36 74 17 37 39 23  3 26 32 10 13
## 
## $`2`
##  [1] 73 37  9 13 77 40 16 11 18 14  9 30
## 
## $`3`
##  [1] 293 245 112 102 131 149  99  62 110  88 116  96
## 
## $`4`
##  [1] 203 116  24  42 328 324  88 157  26  27  30  46
## $`72`
##  [1]  127  332  206   37  444  841 1054  128   63   96  197   35
## 
## $`73`
##  [1] 858 982 287 365 176 160 112  73 135 136 216 150
## 
## $`74`
##  [1]  90  59  33  15  55  60  72  47 226 213 288 169
## 
## $`75`
##  [1]   3   2   2   0   5   6   7   2 111  48 115  21


After splitting, we can then apply sapply:

sapply(frisk_split_by_precinct, sum)
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  385  347 1603 1411 1600 1297  649 1572  285 1156  838  884 2215 1165 2373  965 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005  902 2888 2157 1805 1041 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
## 2966 2810  724 1189  997 1808 2464  859 1979 2679 1645 1669 2953 3443 1001 1574 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
## 1041 3054 1208 2116 1216  995 1140  572  856 2671 1147 2941 1625 1635 1248 1351 
##   65   66   67   68   69   70   71   72   73   74   75 
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327  322

  1. It can be either a vector or a data frame↩︎

  2. Elements can be a vector or a data frame. It depends on what data type x is↩︎

Post a Comment

0 Comments