Learning objectives

  • Understand when and why to iterate code
  • Be able to start with a single use and build up to iteration
  • Use for loops and apply functions to iterate
  • Be able to write functions to cleanly iterate code


Once Twice Thrice in a Lifetime

And you may find yourself
Behind the keys of a large computing machine
And you may find yourself
Copy-pasting tons of code
And you may ask yourself, well
How did I get here?


It’s pretty common that you’ll want to run the same basic bit of code a bunch of times with different inputs. Maybe you want to read in a bunch of data files with different names or calculate something complex on every row of a dataframe. A general rule of thumb is that any code you want to run 3+ times should be iterated instead of copy-pasted. Copy-pasting code and replacing the parts you want to change is generally a bad practice for several reasons:

  • it’s easy to forget to change all the parts that need to be different
  • it’s easy to mistype
  • it is ugly to read
  • it scales very poorly (try copy-pasting 100 times…)
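As a rough sketch of the "bunch of data files" case, the example below writes three small CSV files to a temporary folder and then reads them all back with a single lapply() call instead of three copy-pasted read.csv() lines. The folder name, the file names (site_a.csv and so on), and the columns are made up for illustration; for loops and lapply() are both covered later in this lesson.

```r
# Hypothetical setup: write three small example files to a temporary folder
data_dir <- file.path(tempdir(), "iteration_demo")
dir.create(data_dir, showWarnings = FALSE)
for (site in c("a", "b", "c")) {
  write.csv(data.frame(site = site, count = 1:3),
            file.path(data_dir, paste0("site_", site, ".csv")),
            row.names = FALSE)
}

# Find every CSV file in the folder, then read them all with one call
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
all_data <- lapply(files, read.csv)

length(all_data)  # one data frame per file
## [1] 3
```

If a fourth file shows up in the folder, the reading code does not need to change at all.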

Lots of functions (including many base functions) are vectorized, meaning they already work on vectors of values. Here’s an example:

x <- 1:10
log(x)
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

The log() function already knows we want to take the log of each element in x, and it returns a vector that’s the same length as x. If a vectorized function already exists to do what you want, use it! It’s going to be faster and cleaner than trying to iterate everything yourself.
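Vectorization is not limited to log(). As a small illustration (the variable names here are just for the example), arithmetic, comparisons, and string functions like paste0() all work element by element too:

```r
x <- 1:10

doubled <- x * 2     # arithmetic is applied to every element
is_big  <- x > 5     # so are comparisons
file_names <- paste0("file_", 1:3, ".csv")  # and many string functions

file_names
## [1] "file_1.csv" "file_2.csv" "file_3.csv"
```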

However, we may want to do more complex iterations, which brings us to our first main iterating concept.

For Loops

A for loop will repeat some bit of code, each time with a new input value. Here’s the basic structure:

for(i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

You’ll often see i used in for loops; you can think of it as the iteration value. For each value of i in the vector 1:10, the loop prints that value. You can use i more than once in a loop:

for (i in 1:10) {
  print(i)
  print(i^2)
}
## [1] 1
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 9
## [1] 4
## [1] 16
## [1] 5
## [1] 25
## [1] 6
## [1] 36
## [1] 7
## [1] 49
## [1] 8
## [1] 64
## [1] 9
## [1] 81
## [1] 10
## [1] 100

What’s happening is the value of i gets inserted into the code block, the block gets run, the value of i changes, and the process repeats. For loops can be a way to explicitly lay out fairly complicated procedures, since you can see exactly where your i value is going in the code.
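The loop variable does not have to be called i, and the thing you loop over does not have to be a sequence of numbers. As a small sketch, here the loop walks over a character vector of column names (col_name is just an illustrative name):

```r
# Loop over column names instead of numeric positions
for (col_name in c("mpg", "hp", "wt")) {
  col_mean <- round(mean(mtcars[[col_name]]), 1)
  print(paste(col_name, "has mean", col_mean))
}
## [1] "mpg has mean 20.1"
## [1] "hp has mean 146.7"
## [1] "wt has mean 3.2"
```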

You can also use the i value to index a vector or dataframe, which can be very powerful!

for (i in 1:10) {
  print(letters[i])
  print(mtcars$wt[i])
}
## [1] "a"
## [1] 2.62
## [1] "b"
## [1] 2.875
## [1] "c"
## [1] 2.32
## [1] "d"
## [1] 3.215
## [1] "e"
## [1] 3.44
## [1] "f"
## [1] 3.46
## [1] "g"
## [1] 3.57
## [1] "h"
## [1] 3.19
## [1] "i"
## [1] 3.15
## [1] "j"
## [1] 3.44

Here we printed out the first 10 letters of the alphabet from the letters vector, as well as the first 10 car weights from the mtcars dataframe.

If you want to store your results somewhere, it is important that you create an empty object to hold them before you run the loop. If you grow your results vector one value at a time, it will be much slower. Here’s how to make that empty vector first. We’ll also use the function seq_along to create a sequence that’s the proper length, instead of explicitly writing out something like 1:10.

results <- rep(NA, nrow(mtcars))

for (i in seq_along(mtcars$wt)) {
  results[i] <- mtcars$wt[i] * 1000
}
results
##  [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780 5250
## [16] 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513 3170 2770
## [31] 3570 2780
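One reason to prefer seq_along() over writing 1:length(x) shows up when the vector is empty. The sketch below uses a made-up empty vector to show the difference:

```r
empty <- numeric(0)  # a vector with no elements

1:length(empty)      # counts DOWN from 1 to 0, so a loop would run twice
## [1] 1 0
seq_along(empty)     # correctly gives nothing to loop over
## integer(0)

# Count the iterations to confirm the loop body never runs
n_runs <- 0
for (i in seq_along(empty)) {
  n_runs <- n_runs + 1
}
n_runs
## [1] 0
```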

apply Functions

For loops are very handy and important to understand, but they can take a lot of code and can look messy. Base R includes a family of functions called the apply functions that provide a more concise way to iterate operations across data structures.

The apply family of functions all do the same basic thing: take a data structure and apply a function to parts of it. The different functions in the family are designed for different data structures and different ways of applying functions.

apply()

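The apply() function is designed for matrices. It applies a function over the rows or columns of a matrix. It also accepts a data frame, but it converts the data frame to a matrix first, so it works best when every column has the same type (as in mtcars, where every column is numeric). The basic syntax is: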

apply(X, MARGIN, FUN, ...)

  • X is the data (matrix or data frame)
  • MARGIN tells it whether to apply the function over rows (1) or columns (2)
  • FUN is the function to apply
  • ... allows you to pass additional arguments to the function

Let’s try some examples with the mtcars dataset:

# Apply mean function to each column (MARGIN = 2)
apply(mtcars, 2, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500
# Apply mean function to each row (MARGIN = 1) 
head(apply(mtcars, 1, mean))
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##          29.90727          29.98136          23.59818          38.73955 
## Hornet Sportabout           Valiant 
##          53.66455          35.04909
# Apply a function with additional arguments
apply(mtcars, 2, mean, na.rm = TRUE)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500
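One thing to watch for: when apply() is given a data frame, it converts it to a matrix, and a matrix can only hold one type of value. The sketch below uses a small made-up data frame (pets) with mixed column types to show the effect; sapply(), covered in the next section, is used for comparison.

```r
# Hypothetical data frame with one numeric and one character column
pets <- data.frame(weight = c(4.2, 30.5),
                   species = c("cat", "dog"),
                   stringsAsFactors = FALSE)

# sapply() visits each column as-is, so the types are preserved
sapply(pets, class)
##      weight     species 
##   "numeric" "character"

# apply() converts to a matrix first, so every column becomes character
apply(pets, 2, class)
##      weight     species 
## "character" "character"
```

This is why apply(mtcars, 2, mean) works (every column is numeric) but the same call on a data frame with a text column would not give useful means.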

lapply() and sapply()

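While apply() works on the rows or columns of a matrix, lapply() and sapply() work on the elements of a list or vector. A data frame is a list of columns, so passing one to lapply() or sapply() applies the function to each column.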

  • lapply() always returns a list
  • sapply() tries to simplify the result to a vector or matrix when possible

# lapply returns a list
lapply(mtcars, mean)
## $mpg
## [1] 20.09062
## 
## $cyl
## [1] 6.1875
## 
## $disp
## [1] 230.7219
## 
## $hp
## [1] 146.6875
## 
## $drat
## [1] 3.596563
## 
## $wt
## [1] 3.21725
## 
## $qsec
## [1] 17.84875
## 
## $vs
## [1] 0.4375
## 
## $am
## [1] 0.40625
## 
## $gear
## [1] 3.6875
## 
## $carb
## [1] 2.8125
# sapply simplifies to a named vector
sapply(mtcars, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500
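What sapply() returns depends on what the function gives back for each element. As a small sketch with a made-up list (scores), the same input can produce a vector, a matrix, or a list:

```r
scores <- list(a = 1:3, b = 1:5)  # hypothetical list with unequal lengths

# Every result has length 1, so sapply() returns a named vector
sapply(scores, length)
## a b 
## 3 5

# Every result has length 2, so sapply() returns a matrix
sapply(scores, range)
##      a b
## [1,] 1 1
## [2,] 3 5

# Results have different lengths, so sapply() falls back to a list
big_values <- sapply(scores, function(x) x[x > 2])
is.list(big_values)
## [1] TRUE
```

If you need the same type of output every time, lapply() is the predictable choice, since it always returns a list.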

Handling Missing Values

You can pass additional arguments through an apply function to the function being applied. A common use is na.rm = TRUE, which tells mean() to ignore missing values:

# Create some missing data
mtcars2 <- mtcars
mtcars2[3, c(1,6,8)] <- NA
head(mtcars2)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          NA   4  108  93 3.85    NA 18.61 NA  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# This returns NA for columns with missing values
sapply(mtcars2, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##         NA   6.187500 230.721875 146.687500   3.596563         NA  17.848750 
##         vs         am       gear       carb 
##         NA   0.406250   3.687500   2.812500
# Use na.rm = TRUE to handle missing values
sapply(mtcars2, mean, na.rm = TRUE)
##         mpg         cyl        disp          hp        drat          wt 
##  20.0032258   6.1875000 230.7218750 146.6875000   3.5965625   3.2461935 
##        qsec          vs          am        gear        carb 
##  17.8487500   0.4193548   0.4062500   3.6875000   2.8125000

mapply()

mapply() is the multivariate version of sapply(). It steps through multiple lists or vectors together, calling the function with the first element of each, then the second element of each, and so on:

# Create a sentence using car names and mpg values
car_sentences <- mapply(function(name, mpg) paste(name, "gets", mpg, "miles per gallon"), 
                       rownames(mtcars), mtcars$mpg)
head(car_sentences)
##                                      Mazda RX4 
##           "Mazda RX4 gets 21 miles per gallon" 
##                                  Mazda RX4 Wag 
##       "Mazda RX4 Wag gets 21 miles per gallon" 
##                                     Datsun 710 
##        "Datsun 710 gets 22.8 miles per gallon" 
##                                 Hornet 4 Drive 
##    "Hornet 4 Drive gets 21.4 miles per gallon" 
##                              Hornet Sportabout 
## "Hornet Sportabout gets 18.7 miles per gallon" 
##                                        Valiant 
##           "Valiant gets 18.1 miles per gallon"
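A smaller numeric sketch may make the pairing easier to see. The inputs here are made up for illustration:

```r
# First call uses x = 1, y = 4; second uses x = 2, y = 5; and so on
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums
## [1] 5 7 9

# Inputs can also be matched to argument names, here rep()'s `times` and `x`
repeats <- mapply(rep, times = 1:3, x = 3:1)
repeats[[2]]  # the second call was rep(x = 2, times = 2)
## [1] 2 2
```

The results in the car example above are named because the first input, rownames(mtcars), is a character vector; mapply() uses it to label the output.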

Writing Custom Functions Inside Apply

One of the powerful features of the apply functions is that you can write custom functions directly inside the apply call. This is useful when you need to do something specific that doesn’t have a pre-existing function.

# Write a custom function inside sapply to calculate coefficient of variation
sapply(mtcars, function(x) sd(x) / mean(x))
##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
## 0.2999881 0.2886338 0.5371779 0.4674077 0.1486638 0.3041285 0.1001159 1.1520369 
##        am      gear      carb 
## 1.2282853 0.2000825 0.5742933
# Write a custom function inside apply to find the range of each column
apply(mtcars, 2, function(x) max(x) - min(x))
##     mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear 
##  23.500   4.000 400.900 283.000   2.170   3.911   8.400   1.000   1.000   2.000 
##    carb 
##   7.000
# More complex example: standardize each column (subtract mean, divide by sd)
standardized_data <- apply(mtcars, 2, function(x) (x - mean(x)) / sd(x))
head(standardized_data)
##                          mpg        cyl        disp         hp       drat
## Mazda RX4          0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
## Mazda RX4 Wag      0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
## Datsun 710         0.4495434 -1.2248578 -0.99018209 -0.7830405  0.4739996
## Hornet 4 Drive     0.2172534 -0.1049878  0.22009369 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345  1.0148821  1.04308123  0.4129422 -0.8351978
## Valiant           -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
##                             wt       qsec         vs         am       gear
## Mazda RX4         -0.610399567 -0.7771651 -0.8680278  1.1899014  0.4235542
## Mazda RX4 Wag     -0.349785269 -0.4637808 -0.8680278  1.1899014  0.4235542
## Datsun 710        -0.917004624  0.4260068  1.1160357  1.1899014  0.4235542
## Hornet 4 Drive    -0.002299538  0.8904872  1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout  0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
## Valiant            0.248094592  1.3269868  1.1160357 -0.8141431 -0.9318192
##                         carb
## Mazda RX4          0.7352031
## Mazda RX4 Wag      0.7352031
## Datsun 710        -1.1221521
## Hornet 4 Drive    -1.1221521
## Hornet Sportabout -0.5030337
## Valiant           -1.1221521

You can also write more complex custom functions with multiple steps:

# Custom function that returns summary statistics for each column
summary_stats <- sapply(mtcars, function(x) {
  c(mean = mean(x),
    median = median(x),
    sd = sd(x),
    min = min(x),
    max = max(x))
})

# This returns a matrix where each column is a variable and each row is a statistic
summary_stats[, 1:3]  # Show first 3 columns
##              mpg      cyl     disp
## mean   20.090625 6.187500 230.7219
## median 19.200000 6.000000 196.3000
## sd      6.026948 1.785922 123.9387
## min    10.400000 4.000000  71.1000
## max    33.900000 8.000000 472.0000

When to Use Each Apply Function

  • apply(): For matrices/data frames when you want to apply a function to rows or columns
  • lapply(): For lists/vectors when you want the result as a list
  • sapply(): For lists/vectors when you want simplified output (vector/matrix)
  • mapply(): For applying a function to multiple lists/vectors in parallel

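The apply functions are often more concise than for loops, making them a popular choice for many iteration tasks in R. They are not automatically faster, though: a for loop that fills a preallocated results vector usually performs comparably.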
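To see that the two approaches are interchangeable in terms of results, this sketch computes the column means of mtcars with a for loop and with sapply(), then compares them. The object names are just for illustration.

```r
# for loop version: preallocate, then fill in one column mean at a time
loop_means <- numeric(ncol(mtcars))
names(loop_means) <- names(mtcars)
for (i in seq_along(mtcars)) {
  loop_means[i] <- mean(mtcars[[i]])
}

# sapply() version: the same computation in one line
sapply_means <- sapply(mtcars, mean)

all.equal(loop_means, sapply_means)
## [1] TRUE
```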

Complete Workflow

Let’s try working through a complete example of how you might iterate a more complex operation across a dataset. This will follow 3 basic steps:

  1. Write code that does the thing you want once
  2. Generalize that code into a function that can take different inputs
  3. Apply that function across your data

Starting With a Single Case

The first thing we’ll do is figure out if we can do the right thing once! We want to rescale a vector of values to a 0-1 scale. We’ll try it out on the weights in mtcars. Our heaviest vehicle will have a scaled weight of 1, and our lightest will have a scaled weight of 0. We’ll do this by taking our weight, subtracting the minimum car weight from it, and dividing this by the range of the car weights (max minus min). We’ll have to be careful about our order of operations…

(mtcars$wt[1] - min(mtcars$wt, na.rm = TRUE)) /
  (max(mtcars$wt, na.rm = TRUE) - min(mtcars$wt, na.rm = TRUE))
## [1] 0.2830478

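Great! We got a scaled value out of the deal. Because arithmetic operators like - and / are vectorized, and min() and max() each return a single value, the same code works on a whole vector. This means we can give it the whole weight vector, and we’ll get a whole scaled vector back. We can also simplify the denominator: range() returns the minimum and maximum, and diff() takes the difference between them.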

mtcars$wt_scaled <- (mtcars$wt - min(mtcars$wt, na.rm = TRUE)) /
  diff(range(mtcars$wt, na.rm = TRUE))

mtcars$wt_scaled
##  [1] 0.28304781 0.34824853 0.20634109 0.43518282 0.49271286 0.49782664
##  [7] 0.52595244 0.42879059 0.41856303 0.49271286 0.49271286 0.65379698
## [13] 0.56686269 0.57964715 0.95551010 1.00000000 0.97980056 0.17565840
## [19] 0.02608029 0.08233188 0.24341601 0.51316799 0.49143442 0.59498849
## [25] 0.59626694 0.10790079 0.16031705 0.00000000 0.42367681 0.32140118
## [31] 0.52595244 0.32395807

Generalizing

Now let’s replace our reference to a specific vector of data with something generic: x. This code won’t run on its own, since x doesn’t have a value, but it’s just showing how we would refer to some generic value.

x_scaled <- (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))

Making it a Function

Now that we’ve got a generalized bit of code, we can turn it into a function. All we need is a name, the function keyword, and a list of arguments. In this case, we’ve just got one argument: x.

rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}

rescale_0_1(mtcars$mpg) # it works on one of our columns
##  [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574
##  [8] 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191 0.2936170 0.2042553
## [15] 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404
## [22] 0.2170213 0.2042553 0.1234043 0.3744681 0.7191489 0.6638298 0.8510638
## [29] 0.2297872 0.3957447 0.1957447 0.4680851
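Once code lives in a function, it is worth trying it on an awkward input. One such case is a vector where every value is the same: the range is 0, so the function divides 0 by 0. The sketch below shows this, along with one possible way to handle it; rescale_0_1_strict is a hypothetical variant, not part of the lesson's workflow.

```r
# Same function as above, repeated so this sketch runs on its own
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
}

rescale_0_1(c(5, 5, 5))  # the range is 0, so this is 0 / 0
## [1] NaN NaN NaN

# Hypothetical variant that stops with a clear message instead
rescale_0_1_strict <- function(x) {
  x_range <- diff(range(x, na.rm = TRUE))
  if (x_range == 0) {
    stop("all values are identical, so they cannot be rescaled")
  }
  (x - min(x, na.rm = TRUE)) / x_range
}

rescale_0_1_strict(c(2, 4, 6))
## [1] 0.0 0.5 1.0
```

Whether to stop, warn, or return a constant in this situation is a design choice that depends on how the rescaled values will be used.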

Iterating with Apply Functions!

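Now that we’ve got a function that’ll rescale a vector of values, we can use one of the apply functions to iterate across all the columns in a dataframe, rescaling each one. We’ll use sapply() since we want simplified output: every rescaled column has the same length, so the result is a matrix with one column per variable. Note that the output includes wt_scaled, the column we added to mtcars earlier, and that the matrix does not keep the car names as row names.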

# Apply our rescale function to each column
rescaled_data <- sapply(mtcars, rescale_0_1)
head(rescaled_data)
##            mpg cyl      disp        hp      drat        wt      qsec vs am gear
## [1,] 0.4510638 0.5 0.2217511 0.2049470 0.5253456 0.2830478 0.2333333  0  1  0.5
## [2,] 0.4510638 0.5 0.2217511 0.2049470 0.5253456 0.3482485 0.3000000  0  1  0.5
## [3,] 0.5276596 0.0 0.0920429 0.1448763 0.5023041 0.2063411 0.4892857  1  1  0.5
## [4,] 0.4680851 0.5 0.4662010 0.2049470 0.1474654 0.4351828 0.5880952  1  0  0.0
## [5,] 0.3531915 1.0 0.7206286 0.4346290 0.1797235 0.4927129 0.3000000  0  0  0.0
## [6,] 0.3276596 0.5 0.3838863 0.1872792 0.0000000 0.4978266 0.6809524  1  0  0.0
##           carb wt_scaled
## [1,] 0.4285714 0.2830478
## [2,] 0.4285714 0.3482485
## [3,] 0.0000000 0.2063411
## [4,] 0.0000000 0.4351828
## [5,] 0.1428571 0.4927129
## [6,] 0.0000000 0.4978266
# You can also use lapply if you want the result as a list
rescaled_list <- lapply(mtcars, rescale_0_1)

There you have it! We went from some code that calculated one value to being able to iterate it across any number of columns in a dataframe using base R’s apply functions. It can be tempting to jump straight to your final iteration code, but it’s often better to start simple and work your way up, verifying that things work at each step, especially if you’re trying to do something even moderately complex.
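If you would rather get a dataframe back, with the car names still attached, one common base R idiom is to assign the lapply() result into the dataframe with empty square brackets. This is a sketch of that idiom, working on a copy (mtcars_rescaled is a name made up for this example):

```r
# Same function as above, repeated so this sketch runs on its own
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
}

mtcars_rescaled <- mtcars  # work on a copy so the original is untouched

# The empty [] replaces the columns but keeps the dataframe and its row names
mtcars_rescaled[] <- lapply(mtcars_rescaled, rescale_0_1)

class(mtcars_rescaled)
## [1] "data.frame"
rownames(mtcars_rescaled)[1]
## [1] "Mazda RX4"
```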

Other Iteration Options: purrr from the Tidyverse

While we’ve focused on base R’s apply functions, it’s worth mentioning that there are other approaches to iteration available in R. The tidyverse includes a package called purrr that provides the map family of functions. These functions are very similar to the apply functions, but with a more consistent syntax and some additional features.

The purrr functions include map(), map_dbl(), map_chr(), and others that explicitly specify the output type. For example:

library(purrr)
mtcars %>% map_dbl(mean)  # Returns a numeric vector

If you want to learn more about purrr, check out Jenny Bryan’s tutorial. You might come across purrr functions in tidyverse-focused code, but the base R apply functions we’ve learned will handle most iteration needs effectively.
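Base R has a rough counterpart to map_dbl() in vapply(), which was not covered above. Like sapply(), it applies a function to each element, but you also state what each result should look like, and it stops with an error if a result does not match. A minimal sketch:

```r
# FUN.VALUE = numeric(1) says: each result must be a single number
col_means <- vapply(mtcars, mean, FUN.VALUE = numeric(1))
col_means[["mpg"]]
## [1] 20.09062

# range() returns two numbers per column, so this call fails instead of
# quietly returning a different shape
result <- try(vapply(mtcars, range, FUN.VALUE = numeric(1)), silent = TRUE)
inherits(result, "try-error")
## [1] TRUE
```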

This lesson was contributed by Michael Culshaw-Maurer, with ideas from Mike Koontz and Brandon Hurr’s D-RUG presentation.