And you may find yourself
Behind the keys of a large computing
machine
And you may find yourself
Copy-pasting tons of code
And you may ask yourself, well
How did I get here?

It’s pretty common that you’ll want to run the same basic bit of code a bunch of times with different inputs. Maybe you want to read in a bunch of data files with different names or calculate something complex on every row of a dataframe. A general rule of thumb is that any code you want to run three or more times should be iterated instead of copy-pasted. Copy-pasting code and replacing the parts you want to change is generally bad practice.
Lots of functions (including many base functions) are
vectorized, meaning they already work on vectors of values.
Here’s an example:
x <- 1:10
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
The log() function already knows we want to take the log
of each element in x, and it returns a vector that’s the same length as
x. If a vectorized function already exists to do what you want,
use it! It’s going to be faster and cleaner than trying to iterate
everything yourself.
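As a small sketch of the same idea, arithmetic operators are vectorized too, so two vectors of the same length are combined element by element. The vectors heights_cm and weights_kg below are made-up values for illustration, not data from this lesson.

```r
# Hypothetical example vectors (not part of the lesson's data)
heights_cm <- c(150, 160, 170)
weights_kg <- c(50, 60, 70)

# Each operation runs element by element, with no loop written by hand
heights_m <- heights_cm / 100
bmi <- weights_kg / heights_m^2

# One result per input element
length(bmi)
## [1] 3
```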
However, we may want to do more complex iterations, which brings us to our first main iterating concept.
A for loop will repeat some bit of code, each time with a new input value. Here’s the basic structure:
for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
You’ll often see i used in for loops; you can think of
it as the iteration value. For each i value in the vector
1:10, we’ll print that value. You can use the i value
more than once in a loop:
for (i in 1:10) {
  print(i)
  print(i^2)
}
## [1] 1
## [1] 1
## [1] 2
## [1] 4
## [1] 3
## [1] 9
## [1] 4
## [1] 16
## [1] 5
## [1] 25
## [1] 6
## [1] 36
## [1] 7
## [1] 49
## [1] 8
## [1] 64
## [1] 9
## [1] 81
## [1] 10
## [1] 100
What’s happening is the value of i gets inserted into
the code block, the block gets run, the value of i changes,
and the process repeats. For loops can be a way to explicitly lay out
fairly complicated procedures, since you can see exactly where your
i value is going in the code.
You can also use the i value to index a vector or
dataframe, which can be very powerful!
for (i in 1:10) {
  print(letters[i])
  print(mtcars$wt[i])
}
## [1] "a"
## [1] 2.62
## [1] "b"
## [1] 2.875
## [1] "c"
## [1] 2.32
## [1] "d"
## [1] 3.215
## [1] "e"
## [1] 3.44
## [1] "f"
## [1] 3.46
## [1] "g"
## [1] 3.57
## [1] "h"
## [1] 3.19
## [1] "i"
## [1] 3.15
## [1] "j"
## [1] 3.44
Here we printed out the first 10 letters of the alphabet from the
letters vector, as well as the first 10 car weights from
the mtcars dataframe.
If you want to store your results somewhere, it is important that you
create an empty object to hold them before you run the
loop. If you grow your results vector one value at a time, it will be
much slower. Here’s how to make that empty vector first. We’ll also use
the function seq_along to create a sequence that’s the
proper length, instead of explicitly writing out something like
1:10.
results <- rep(NA, nrow(mtcars))
for (i in seq_along(mtcars$wt)) {
  results[i] <- mtcars$wt[i] * 1000
}
results
## [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780 5250
## [16] 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513 3170 2770
## [31] 3570 2780
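One reason to prefer seq_along() over writing 1:length(x) shows up when the input is empty. The sketch below uses a made-up empty vector called empty and two counters, n_bad and n_good, that exist only for this illustration.

```r
# A hypothetical empty input, e.g. a filter that matched nothing
empty <- numeric(0)

# 1:length(empty) is 1:0, which counts DOWN from 1 to 0
1:length(empty)
## [1] 1 0

# seq_along(empty) is a zero-length sequence
seq_along(empty)
## integer(0)

# Count how many times each loop body actually runs
n_bad <- 0
for (i in 1:length(empty)) n_bad <- n_bad + 1

n_good <- 0
for (i in seq_along(empty)) n_good <- n_good + 1

c(n_bad, n_good)
## [1] 2 0
```

The 1:length() version runs the loop body twice on an empty input, while the seq_along() version skips it entirely.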
apply Functions
For loops are very handy and important to understand, but they can
involve writing a lot of code and can generally look fairly messy. Base
R includes a family of functions called the apply functions
that provide a more concise way to iterate operations across data
structures.
The apply family of functions all do the same basic
thing: take a data structure and apply a function to parts of it. The
different functions in the family are designed for different data
structures and different ways of applying functions.
apply()
The apply() function is designed for matrices and data
frames. It applies a function over the rows or columns of a matrix or
data frame. The basic syntax is:
apply(X, MARGIN, FUN, ...)
- X is the data (matrix or data frame)
- MARGIN tells it whether to apply the function over rows (1) or columns (2)
- FUN is the function to apply
- ... allows you to pass additional arguments to the function
Let’s try some examples with the mtcars dataset:
# Apply mean function to each column (MARGIN = 2)
apply(mtcars, 2, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
# Apply mean function to each row (MARGIN = 1)
head(apply(mtcars, 1, mean))
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 29.90727 29.98136 23.59818 38.73955
## Hornet Sportabout Valiant
## 53.66455 35.04909
# Apply a function with additional arguments
apply(mtcars, 2, mean, na.rm = TRUE)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
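The difference between the two MARGIN values can be easier to see on a tiny matrix. This is a minimal sketch with a made-up matrix m. The names m, row_totals, and col_totals are assumptions for illustration only.

```r
# A small made-up matrix: 2 rows, 3 columns, filled column by column
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# MARGIN = 1: the function sees one row at a time
row_totals <- apply(m, 1, sum)   # r1 = 1 + 3 + 5, r2 = 2 + 4 + 6

# MARGIN = 2: the function sees one column at a time
col_totals <- apply(m, 2, sum)   # a = 1 + 2, b = 3 + 4, c = 5 + 6

# One result per row, and one result per column
length(row_totals)
## [1] 2
length(col_totals)
## [1] 3
```

Note that apply() converts a data frame to a matrix before doing anything else. That works cleanly for mtcars because every column is numeric, but a data frame with mixed column types would be converted to a character matrix first.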
lapply() and sapply()
While apply() works on matrices and data frames,
lapply() and sapply() work on lists and
vectors. A data frame is a list of columns, so they work column by column on a data frame too.
- lapply() always returns a list
- sapply() tries to simplify the result to a vector or matrix when possible
# lapply returns a list
lapply(mtcars, mean)
## $mpg
## [1] 20.09062
##
## $cyl
## [1] 6.1875
##
## $disp
## [1] 230.7219
##
## $hp
## [1] 146.6875
##
## $drat
## [1] 3.596563
##
## $wt
## [1] 3.21725
##
## $qsec
## [1] 17.84875
##
## $vs
## [1] 0.4375
##
## $am
## [1] 0.40625
##
## $gear
## [1] 3.6875
##
## $carb
## [1] 2.8125
# sapply simplifies to a named vector
sapply(mtcars, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
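Because sapply() only simplifies "when possible", its return type depends on the results. The sketch below shows one case where it falls back to a list, and uses base R's vapply() as one possible stricter alternative. vapply() is not covered in this lesson, and the names ragged and col_means are made up for the example.

```r
# When results have different lengths, sapply() cannot simplify
# and quietly returns a list instead of a vector
ragged <- sapply(list(a = 1:3, b = 1:2), function(x) x[x > 1])
is.list(ragged)
## [1] TRUE

# vapply() takes a template for each result (here: one number)
# and raises an error if any result does not match it
col_means <- vapply(mtcars, mean, FUN.VALUE = numeric(1))
is.numeric(col_means)
## [1] TRUE
```

For the column means, vapply() gives the same named numeric vector as sapply(), with the added guarantee about the output type.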
You can pass additional arguments to the function being applied by adding
them after the function name, just as we did with apply():
# Create some missing data
mtcars2 <- mtcars
mtcars2[3, c(1,6,8)] <- NA
head(mtcars2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 NA 4 108 93 3.85 NA 18.61 NA 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# This returns NA for columns with missing values
sapply(mtcars2, mean)
## mpg cyl disp hp drat wt qsec
## NA 6.187500 230.721875 146.687500 3.596563 NA 17.848750
## vs am gear carb
## NA 0.406250 3.687500 2.812500
# Use na.rm = TRUE to handle missing values
sapply(mtcars2, mean, na.rm = TRUE)
## mpg cyl disp hp drat wt
## 20.0032258 6.1875000 230.7218750 146.6875000 3.5965625 3.2461935
## qsec vs am gear carb
## 17.8487500 0.4193548 0.4062500 3.6875000 2.8125000
mapply()
mapply() is the multivariate version of
sapply(). It can apply a function to multiple lists or
vectors in parallel:
# Create a sentence using car names and mpg values
car_sentences <- mapply(function(name, mpg) paste(name, "gets", mpg, "miles per gallon"),
                        rownames(mtcars), mtcars$mpg)
head(car_sentences)
## Mazda RX4
## "Mazda RX4 gets 21 miles per gallon"
## Mazda RX4 Wag
## "Mazda RX4 Wag gets 21 miles per gallon"
## Datsun 710
## "Datsun 710 gets 22.8 miles per gallon"
## Hornet 4 Drive
## "Hornet 4 Drive gets 21.4 miles per gallon"
## Hornet Sportabout
## "Hornet Sportabout gets 18.7 miles per gallon"
## Valiant
## "Valiant gets 18.1 miles per gallon"
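mapply() is most useful when the function being applied is not itself vectorized over pairs of inputs. As a hedged sketch, the from:to operator only builds one sequence at a time, so summing several ranges needs some form of iteration. The vectors starts and ends and the result range_sums are made-up names for this example.

```r
# Hypothetical start and end points for three ranges
starts <- c(1, 5, 10)
ends   <- c(3, 7, 10)

# mapply() walks through both vectors in step:
# (1, 3), then (5, 7), then (10, 10)
range_sums <- mapply(function(from, to) sum(from:to), starts, ends)
range_sums
## [1]  6 18 10
```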
One of the powerful features of the apply functions is that you can write custom functions directly inside the apply call. This is useful when you need to do something specific that doesn’t have a pre-existing function.
# Write a custom function inside sapply to calculate coefficient of variation
sapply(mtcars, function(x) sd(x) / mean(x))
## mpg cyl disp hp drat wt qsec vs
## 0.2999881 0.2886338 0.5371779 0.4674077 0.1486638 0.3041285 0.1001159 1.1520369
## am gear carb
## 1.2282853 0.2000825 0.5742933
# Write a custom function inside apply to find the range of each column
apply(mtcars, 2, function(x) max(x) - min(x))
## mpg cyl disp hp drat wt qsec vs am gear
## 23.500 4.000 400.900 283.000 2.170 3.911 8.400 1.000 1.000 2.000
## carb
## 7.000
# More complex example: standardize each column (subtract mean, divide by sd)
standardized_data <- apply(mtcars, 2, function(x) (x - mean(x)) / sd(x))
head(standardized_data)
## mpg cyl disp hp drat
## Mazda RX4 0.1508848 -0.1049878 -0.57061982 -0.5350928 0.5675137
## Mazda RX4 Wag 0.1508848 -0.1049878 -0.57061982 -0.5350928 0.5675137
## Datsun 710 0.4495434 -1.2248578 -0.99018209 -0.7830405 0.4739996
## Hornet 4 Drive 0.2172534 -0.1049878 0.22009369 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345 1.0148821 1.04308123 0.4129422 -0.8351978
## Valiant -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
## wt qsec vs am gear
## Mazda RX4 -0.610399567 -0.7771651 -0.8680278 1.1899014 0.4235542
## Mazda RX4 Wag -0.349785269 -0.4637808 -0.8680278 1.1899014 0.4235542
## Datsun 710 -0.917004624 0.4260068 1.1160357 1.1899014 0.4235542
## Hornet 4 Drive -0.002299538 0.8904872 1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout 0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
## Valiant 0.248094592 1.3269868 1.1160357 -0.8141431 -0.9318192
## carb
## Mazda RX4 0.7352031
## Mazda RX4 Wag 0.7352031
## Datsun 710 -1.1221521
## Hornet 4 Drive -1.1221521
## Hornet Sportabout -0.5030337
## Valiant -1.1221521
You can also write more complex custom functions with multiple steps:
# Custom function that returns summary statistics for each column
summary_stats <- sapply(mtcars, function(x) {
  c(mean = mean(x),
    median = median(x),
    sd = sd(x),
    min = min(x),
    max = max(x))
})
# This returns a matrix where each column is a variable and each row is a statistic
summary_stats[, 1:3] # Show first 3 columns
## mpg cyl disp
## mean 20.090625 6.187500 230.7219
## median 19.200000 6.000000 196.3000
## sd 6.026948 1.785922 123.9387
## min 10.400000 4.000000 71.1000
## max 33.900000 8.000000 472.0000
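When a custom function gets reused or grows beyond a line or two, it may be clearer to give it a name first and then pass that name to sapply(). This is a small sketch of that variation. The helper name coef_var and the result name cv_by_column are made up for illustration, and the calculation is the same coefficient of variation used above.

```r
# Hypothetical named helper: standard deviation relative to the mean
coef_var <- function(x) {
  sd(x) / mean(x)
}

# Pass the function by name, without parentheses
cv_by_column <- sapply(mtcars, coef_var)

# Same result as calling the helper on a single column
cv_by_column[["mpg"]] == coef_var(mtcars$mpg)
## [1] TRUE
```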
- apply(): For matrices/data frames when you want to apply a function to rows or columns
- lapply(): For lists/vectors when you want the result as a list
- sapply(): For lists/vectors when you want simplified output (vector/matrix)
- mapply(): For applying a function to multiple lists/vectors in parallel
The apply functions are often more concise than for loops, making them a popular choice for many iteration tasks in R. Their speed is usually similar to that of a for loop that fills a preallocated results object, so the main gain is readability.
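Let’s try working through a complete example of how you might iterate a more complex operation across a dataset. This will follow 3 basic steps: write code that does the right thing once, generalize that code into a function, and iterate that function across the dataset.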
The first thing we’ll do is figure out if we can do the right thing
once! We want to rescale a vector of values to a 0-1 scale. We’ll try it
out on the weights in mtcars. Our heaviest vehicle will
have a scaled weight of 1, and our lightest will have a scaled weight of
0. We’ll do this by taking our weight, subtracting the minimum car
weight from it, and dividing this by the range of the car weights (max
minus min). We’ll have to be careful about our order of operations…
(mtcars$wt[1] - min(mtcars$wt, na.rm = TRUE)) /
  (max(mtcars$wt, na.rm = TRUE) - min(mtcars$wt, na.rm = TRUE))
## [1] 0.2830478
Great! We got a scaled value out of the deal. Because we’re working
with base functions like max, min, and
/, we can vectorize. This means we can give it the whole
weight vector, and we’ll get a whole scaled vector back.
mtcars$wt_scaled <- (mtcars$wt - min(mtcars$wt, na.rm = TRUE)) /
  diff(range(mtcars$wt, na.rm = TRUE))
mtcars$wt_scaled
## [1] 0.28304781 0.34824853 0.20634109 0.43518282 0.49271286 0.49782664
## [7] 0.52595244 0.42879059 0.41856303 0.49271286 0.49271286 0.65379698
## [13] 0.56686269 0.57964715 0.95551010 1.00000000 0.97980056 0.17565840
## [19] 0.02608029 0.08233188 0.24341601 0.51316799 0.49143442 0.59498849
## [25] 0.59626694 0.10790079 0.16031705 0.00000000 0.42367681 0.32140118
## [31] 0.52595244 0.32395807
Now let’s replace our reference to a specific vector of data with
something generic: x. This code won’t run on its own, since
x doesn’t have a value, but it’s just showing how we would
refer to some generic value.
x_scaled <- (x - min(x, na.rm = TRUE)) /
  diff(range(x, na.rm = TRUE))
Now that we’ve got a generalized bit of code, we can turn it into a
function. All we need is a name, function, and a list of
arguments. In this case, we’ve just got one argument:
x.
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}
rescale_0_1(mtcars$mpg) # it works on one of our columns
## [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574
## [8] 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191 0.2936170 0.2042553
## [15] 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404
## [22] 0.2170213 0.2042553 0.1234043 0.3744681 0.7191489 0.6638298 0.8510638
## [29] 0.2297872 0.3957447 0.1957447 0.4680851
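One edge case worth checking before iterating over many columns is a vector whose values are all the same, since its range is 0 and the rescaling formula divides by it. The sketch below shows the problem and one possible guard. The name rescale_0_1_safe and the choice to map a constant vector to 0 are assumptions made for illustration, not part of this lesson's function.

```r
# A hypothetical constant vector: max minus min is 0
constant <- c(5, 5, 5)
(constant - min(constant)) / diff(range(constant))
## [1] NaN NaN NaN

# One possible guard (an assumption, not the lesson's function)
rescale_0_1_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # Assumed convention: a constant vector rescales to 0;
    # multiplying by 0 keeps any NA values as NA
    return(x * 0)
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale_0_1_safe(constant)
## [1] 0 0 0
rescale_0_1_safe(c(0, 5, 10))
## [1] 0.0 0.5 1.0
```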
Now that we’ve got a function that’ll rescale a vector of values, we
can use one of the apply functions to iterate across all
the columns in a dataframe, rescaling each one. We’ll use
sapply since we want simplified output, and we’re working
with a dataframe.
# Apply our rescale function to each column
rescaled_data <- sapply(mtcars, rescale_0_1)
head(rescaled_data)
## mpg cyl disp hp drat wt qsec vs am gear
## [1,] 0.4510638 0.5 0.2217511 0.2049470 0.5253456 0.2830478 0.2333333 0 1 0.5
## [2,] 0.4510638 0.5 0.2217511 0.2049470 0.5253456 0.3482485 0.3000000 0 1 0.5
## [3,] 0.5276596 0.0 0.0920429 0.1448763 0.5023041 0.2063411 0.4892857 1 1 0.5
## [4,] 0.4680851 0.5 0.4662010 0.2049470 0.1474654 0.4351828 0.5880952 1 0 0.0
## [5,] 0.3531915 1.0 0.7206286 0.4346290 0.1797235 0.4927129 0.3000000 0 0 0.0
## [6,] 0.3276596 0.5 0.3838863 0.1872792 0.0000000 0.4978266 0.6809524 1 0 0.0
## carb wt_scaled
## [1,] 0.4285714 0.2830478
## [2,] 0.4285714 0.3482485
## [3,] 0.0000000 0.2063411
## [4,] 0.0000000 0.4351828
## [5,] 0.1428571 0.4927129
## [6,] 0.0000000 0.4978266
# You can also use lapply if you want the result as a list
rescaled_list <- lapply(mtcars, rescale_0_1)
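Because sapply() simplifies to a matrix, the output above is no longer a data frame and has lost the car names as row names. One possible way to keep the data frame structure, sketched here as an assumption rather than this lesson's own method, is to assign the lapply() result back into a copy using empty square brackets. The name rescaled_df is made up for the example, and rescale_0_1 is repeated so the block runs on its own.

```r
# Same function as in the lesson, repeated so this block is self-contained
rescale_0_1 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    diff(range(x, na.rm = TRUE))
}

# Work on a copy so the original data stay untouched
rescaled_df <- mtcars

# Empty brackets replace the column contents but keep the
# data frame class, column names, and row names
rescaled_df[] <- lapply(rescaled_df, rescale_0_1)

is.data.frame(rescaled_df)
## [1] TRUE
head(rownames(rescaled_df), 3)
## [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"
```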
There you have it! We went from some code that calculated one value
to being able to iterate it across any number of columns in a dataframe
using base R’s apply functions. It can be tempting to jump
straight to your final iteration code, but it’s often better to start
simple and work your way up, verifying that things work at each step,
especially if you’re trying to do something even moderately complex.
purrr from the Tidyverse
While we’ve focused on base R’s apply functions, it’s
worth mentioning that there are other approaches to iteration available
in R. The tidyverse includes a package called
purrr that provides the map family of
functions. These functions are very similar to the apply
functions, but with a more consistent syntax and some additional
features.
The purrr functions include map(),
map_dbl(), map_chr(), and others that
explicitly specify the output type. For example:
library(purrr)
mtcars %>% map_dbl(mean) # Returns a numeric vector
If you want to learn more about purrr, check out Jenny Bryan’s
tutorial. You might come across purrr functions in
tidyverse-focused code, but the base R apply functions
we’ve learned will handle most iteration needs effectively.
This lesson was contributed by Michael Culshaw-Maurer, with ideas from Mike Koontz and Brandon Hurr’s D-RUG presentation.