Programming loops are a way to repeat a block of code to perform the same task over and over again. R offers several kinds of loop for repetitive tasks, allowing you to write less, and more efficient, code.
This guide introduces you to loops in R and explains how to write them, including for loops and while loops, along with some best practices for writing efficient loops. I'll also take a look at vectorization in R, explaining what it is and why you should use vectorized functions where possible. Together, this covers the basics of programming and automating tasks in R.
Guide Information
Title | Introduction to loops in R |
Author | Benjamin Bell |
Published | 24 Oct 2021 |
Last updated | 05 Dec 2022 |
R version | 4.1.1 |
Packages | base |
This guide has been updated:
- Additional text added regarding the use of apply() and lapply(). Updates were also made to the flow diagrams to correct a minor error, and to the repeat loop section to add an additional keyboard shortcut for stopping the repeat loop on Linux.
This is a 2 part guide:
Part 1: Introduction to loops in R | This guide explains how to write loops in R to automate and repeat code, and gives an overview of vectorized functions. |
Part 2: Nested loops | This guide explains how to write loops within a loop - a nested loop. |
What are loops in R?
In programming, a loop is a block of code that is repeated a specified number of times, or until a condition is met. In R, there are three main types of loops: the for loop, which repeats the code once for each element of an object (in effect, a specified number of times); the while loop, which repeats the code for as long as a logical condition remains true; and the repeat loop, which repeats the code indefinitely. To stop a repeat loop, you must include a break statement and a condition within the loop code.
So why would you want to write a loop?
If you want to perform the same task multiple times, writing a loop will save you time as you can write the code once, rather than writing the same code over and over.
But maybe you heard that using loops in R was a bad idea, and that they are slow?
Loops are not "bad" per se, but it is easy to write bad code whether this involves a loop or not. There are also certain things you can do to avoid writing bad loops. However, in terms of speed, a loop in R will be slower than using a vectorized method or function to perform the same task, but how much slower depends on what it is you are doing (and indeed how fast your system is).
Should you avoid using loops?
You can read hundreds of articles or comments that say you should avoid using loops in R, and you can read hundreds more that say using loops is fine. Whether to use a loop or not depends entirely on your specific task and requirements.
For example, is there a built-in function in R that achieves the same task that your loop does? If yes, then use that built-in function and avoid using a loop. Do you need to perform thousands of operations on massive datasets where speed is critical? If yes, then, if possible, you may want to avoid using an R loop. However, if you need to quickly perform a task with a small dataset where speed is not an issue, or you need to perform a task that is not easily vectorized (for example, where the end result relies on the previous loop iteration), using a loop is fine.
But what about using apply() instead?
Using the apply() family of functions, including lapply() for list objects, is considered the "preferred" method in R for achieving the same result as a for loop.
In fact, these functions are wrappers around loops, and might not actually perform any faster, although this depends on the task that you are doing. However, the apply() functions work in such a way as to eliminate many of the problems associated with badly written loops, and the code is considered more legible. Indeed, you are not required to create container objects when using apply().
Despite this, it is still useful to have an understanding of loops, as they are quite easy to write, and they can help you understand programming in R. And, you might just prefer using loops - why not! Additionally, there are likely to be some cases where it would be difficult to use apply(), and this is where loops really shine.
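To make the comparison concrete, here is a minimal sketch that performs the same small task three ways. The data and the object names (num, res_loop, res_vapply, res_list) are made up for illustration and are not from the original guide. vapply() and lapply() are both part of base R:

```r
# Hypothetical example data
num <- c(4, 9, 16, 25)

# for loop version: needs a pre-allocated container
res_loop <- numeric(length(num))
for (i in seq_along(num)) {
  res_loop[i] <- sqrt(num[i]) + 1
}

# vapply() version: no container needed; FUN.VALUE declares the
# expected type and length of each individual result
res_vapply <- vapply(num, function(x) sqrt(x) + 1, FUN.VALUE = numeric(1))

# lapply() version: always returns a list, one element per input
res_list <- lapply(num, function(x) sqrt(x) + 1)

identical(res_loop, res_vapply)         # TRUE
identical(res_loop, unlist(res_list))   # TRUE
```

All three give the same numbers. Which one reads best is largely a matter of taste for a task this small.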
Vectorization in R
Okay, what about these "vectorized" functions in R?
Many of the functions in R are vectorized. But what exactly does this mean, and why are vectorized functions faster than loops written in R? A vectorized function is a function that operates on a whole vector in a single call, rather than requiring you to process one element at a time. Effectively, the loop has already been implemented for you behind the scenes. Vectorized functions can simplify your code, they are easy to use and they are fast.
To understand why they are faster, it is important to understand how R works. As R is an interpreted language, when you send R code to the console, it is translated (or interpreted) into code that can be understood by the computer. It is then processed, before the output is finally returned to the console. This process happens every time you execute code in R, line by line. This differs from a compiled language, where the entire source code is compiled into machine readable code before the program can be run.
Because R code goes through an interpreter each time it is run, it is inherently slower than using a compiled language due to the additional overhead of the translation. This is why loops written in R code can be slow(er), because the code goes through the interpreter on every iteration of the loop. To keep R speedy, vectorized functions are written using C or Fortran, which are compiled languages and are very fast. A vectorized function does in fact use a loop to perform the task, but it is a highly optimised loop written in C/Fortran that is pre-compiled. Vectorized functions combine the benefits of using a compiled language (speed) with the benefits of an interpreted language (ease of use).
To help illustrate the concept, consider the following example. Let's say you want to sum all the numbers in a vector. If you were to use a loop written in R, it might look something like this:
# Vector of numbers
num <- c(5, 10, 8, 7, 2)
# Initialize "res"
res <- 0
# Loop to sum numbers
for(i in seq_along(num)) {
res <- res + num[i]
}
Compare to using the equivalent vectorized function:
# Vector of numbers
num <- c(5, 10, 8, 7, 2)
# Vectorized function to sum numbers
res <- sum(num)
In this example, the loop requires more code to perform the same task as the vectorized function, and will run slower because every iteration has to pass through the R interpreter. To perform a task like this, you should always use the vectorized function.
However, sometimes you will want to do something that is not easily vectorized, or where speed is not critical (or doesn't make any difference), and this is where loops are useful.
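As an illustration of a task where each result depends on the previous iteration, consider this sketch. The scenario (a water tank with a fixed capacity) and all of the names and numbers are hypothetical, chosen only to show the pattern. A plain cumsum() would not respect the lower and upper limits, so a loop is a natural fit:

```r
# Hypothetical data: daily change in a tank's water level
change   <- c(3, 5, 4, -6, -7, 2)
capacity <- 10   # the tank cannot hold more than this, or less than 0

level    <- numeric(length(change))   # pre-allocated container
previous <- 0                         # the tank starts empty

for (i in seq_along(change)) {
  # Each level is built from the level reached on the previous iteration,
  # then clamped to the range 0..capacity
  level[i] <- min(max(previous + change[i], 0), capacity)
  previous <- level[i]
}

level            # 3 8 10 4 0 2
cumsum(change)   # 3 8 12 6 -1 1 (ignores the limits)
```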
Overall, the issue is quite overblown (especially with fast modern computers) and it is easy to waste a lot of time trying to avoid the use of loops, or rewriting perfectly good existing code that uses a loop, rather than just using one. Remember that "premature optimization is the root of all evil" - Donald Knuth.
If you do want or need to write advanced or high performance loops, you might want to take a look at the Rcpp package which allows you to write efficient C++ code in R. Check out Hadley Wickham's guide to using Rcpp to learn how to use it.
Otherwise, read on to get started with loops!
For loops
The for loop allows you to perform a task a specific number of times. The diagram below helps to illustrate the for loop:
Here's an example of a for loop to print the same statement five times:
for(i in 1:5) {
print("Hello World!")
}
Which would result in the following output in the R console:
[1] "Hello World!"
[1] "Hello World!"
[1] "Hello World!"
[1] "Hello World!"
[1] "Hello World!"
So how does this work? The loop code is made up of several components. i is the iterator variable, while 1:5 is the object that we will iterate through, one element at a time.
But what is i exactly? It is an ordinary variable that the loop creates (or overwrites) in the environment where the loop is run. The value of i changes on each iteration of the loop. Note that, unlike in some other languages, i is not local to the loop in R: it still exists after the loop has finished, holding the last value it was given.
But why i, and what does it mean? Well, depending on who you ask, it could mean "iterator", "integer", "index", "imaginary", or a number of other things. You can actually use anything in place of i, e.g. a, b, x, mysuperdooperiterator, well perhaps not that last one. It doesn't matter what you call it as it will perform the same task. But, i is the convention when writing loops and it is used across most programming languages. This means it will be easily understood by others who are reading your code. For nested loops, you will typically see i and j used as the iterator variables (since j comes after i).
The length of the object (to be iterated through) will dictate the maximum number of times that the for loop will run. In this example, 1:5 creates an integer vector composed of five sequentially numbered elements, which has a length of five. You can confirm this in the R console:
> 1:5
[1] 1 2 3 4 5
> class(1:5)
[1] "integer"
> length(1:5)
[1] 5
On each iteration of the loop, the value of i changes. In this example, on the first iteration of the loop, i becomes 1, on the second iteration, i becomes 2, on the third, i becomes 3, and so on.
Lastly, the code contained inside the curly brackets is the code that is executed on each iteration of the loop. In this example, we only have one line of code, but you can have multiple lines of code in a loop. In fact, since there is only one, you do not actually need to use the curly brackets, and you could also write the code like this:
for(i in 1:5) print("Hello World!")
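To watch the iterator change for yourself, a small sketch like the following may help. The fruits vector and the variable names are made up for illustration. The second loop shows that the iterator does not have to be a number: it can take the elements of a vector directly.

```r
# Print the value of the iterator on each pass through the loop
for (i in 1:3) {
  cat("Iteration", i, "\n")
}

# Hypothetical example data
fruits <- c("apple", "banana", "cherry")

# Here the iterator takes each element in turn, not its position
total_chars <- 0
for (fruit in fruits) {
  total_chars <- total_chars + nchar(fruit)
}
total_chars   # 5 + 6 + 6 = 17
```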
Let's consider another example. If you wanted to print a different statement each time, you could use the following code:
# Vector of names
names <- c("Buffy", "Xander", "Willow", "Rupert", "Cordelia")
# Loop
for(i in 1:5) {
print(paste("Hello", names[i]))
}
Which would result in the following output in the R console:
[1] "Hello Buffy"
[1] "Hello Xander"
[1] "Hello Willow"
[1] "Hello Rupert"
[1] "Hello Cordelia"
As we iterate through the loop now, the code prints a different name, because we have now subsetted the names vector using the iterator variable i. Since this variable changes on each iteration of the loop, it prints a different name (the subset changes). On the first iteration, the code that is executed effectively becomes:
print(paste("Hello", names[1]))
Whilst on the second iteration of the loop, the code effectively becomes:
print(paste("Hello", names[2]))
And so on.
Another way to create the object to be iterated through is to use the seq_along() function. For example:
# Loop
for(i in seq_along(names)) {
print(paste("Hello", names[i]))
}
seq_along() effectively does the same thing as using 1:5, returning an integer vector of sequentially numbered elements with the same length as the object. Take a look:
> seq_along(names)
[1] 1 2 3 4 5
> class(seq_along(names))
[1] "integer"
> length(seq_along(names))
[1] 5
seq_along() is useful in loops when you do not know the length of the object that you are iterating through. Another option is to use the length() function. For example:
# Loop
for(i in 1:length(names)) {
print(paste("Hello", names[i]))
}
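Any of these approaches is valid, and for this vector they all give the same result. Be aware, though, that 1:length(x) counts backwards (1, 0) when x is empty, so seq_along() is the safer choice whenever the object might have no elements.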
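One difference between these approaches only shows up with an empty object. The sketch below uses a made-up empty vector and two counters (count_safe and count_unsafe, names chosen for illustration) to show how many times each style of loop runs:

```r
# A vector with no elements (hypothetical example)
empty <- character(0)

seq_along(empty)   # integer(0)
1:length(empty)    # 1 0

# Loop built with seq_along(): the body never runs
count_safe <- 0
for (i in seq_along(empty)) {
  count_safe <- count_safe + 1
}

# Loop built with 1:length(): the body runs twice (i = 1, then i = 0)
count_unsafe <- 0
for (i in 1:length(empty)) {
  count_unsafe <- count_unsafe + 1
}

count_safe     # 0
count_unsafe   # 2
```

If you only have a length rather than the object itself, seq_len(n) behaves in the same safe way for n = 0.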
Break and next statements
In the previous examples we iterated through every element within our vector. But you don't have to do this: you might only want to iterate through some of them. For example:
# Vector of names
names <- c("Buffy", "Xander", "Willow", "Rupert", "Cordelia")
# Loop
for(i in 2:4) {
print(paste("Hello", names[i]))
}
Would result in the following output:
[1] "Hello Xander"
[1] "Hello Willow"
[1] "Hello Rupert"
The loop only prints three of the names in the names vector because the object that is being iterated through now only has three elements:
> 2:4
[1] 2 3 4
> class(2:4)
[1] "integer"
> length(2:4)
[1] 3
So, on the first iteration of the loop, i becomes 2, on the second iteration, i becomes 3, and on the third iteration, i becomes 4.
A smarter way to decide what to iterate through in a loop can be achieved by using conditional statements. If you want to stop the loop when it meets a certain condition, you can use the break statement after setting a condition. Or, if you want to skip an iteration of the loop (and keep it running afterwards), you can use the next statement after setting a condition.
For example, let's stop running the loop once we reach the name "Rupert":
# Vector of names
names <- c("Buffy", "Xander", "Willow", "Rupert", "Cordelia")
# Stop the loop once the name equals "Rupert"
for(i in 1:5) {
# Condition
if(names[i]=="Rupert")
break
# Code to be executed
print(paste("Hello", names[i]))
}
Which would result in the following output:
[1] "Hello Buffy"
[1] "Hello Xander"
[1] "Hello Willow"
But what if we wanted to stop the loop from running after the name "Rupert" has been printed? We could simply re-order the code within the loop:
# Stop the loop after "Rupert"
for(i in 1:5) {
# Code to be executed
print(paste("Hello", names[i]))
# Condition
if(names[i]=="Rupert")
break
}
[1] "Hello Buffy"
[1] "Hello Xander"
[1] "Hello Willow"
[1] "Hello Rupert"
Now let's take a look at the next statement. For example, let's skip the iteration of the loop when the name is "Rupert":
# Skip "Rupert"
for(i in 1:5) {
# Condition
if(names[i]=="Rupert")
next
# Code to be executed
print(paste("Hello", names[i]))
}
Which would result in the following output:
[1] "Hello Buffy"
[1] "Hello Xander"
[1] "Hello Willow"
[1] "Hello Cordelia"
You can also use both break and next statements and multiple conditions in a loop:
# Multiple conditions
for(i in 1:5) {
# Condition 1
if(names[i]=="Buffy")
next
# Code to be executed
print(paste("Hello", names[i]))
# Condition 2
if(names[i]=="Rupert")
break
}
[1] "Hello Xander"
[1] "Hello Willow"
[1] "Hello Rupert"
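The same pattern is handy with numeric data. In this sketch, the readings vector is made up, and treating a negative value as an "end of data" marker is purely an assumption for the example. next skips missing values, while break stops the loop at the marker:

```r
# Hypothetical readings: NA is a missing value, -1 marks the end of valid data
readings <- c(2.5, NA, 4.0, 3.5, -1, 6.0)

total  <- 0
n_used <- 0

for (i in seq_along(readings)) {
  # Skip missing values, but keep the loop running
  if (is.na(readings[i])) next
  # Stop the loop entirely at the end-of-data marker
  if (readings[i] < 0) break
  total  <- total + readings[i]
  n_used <- n_used + 1
}

total    # 10
n_used   # 3
```

Note that the NA check comes first: comparing NA with 0 returns NA, which would cause an error inside if().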
Storing the output from a loop
The above examples send the output of the loop to the R console. This might be fine if you just want to gather some information, but it is likely that you will want to store the results from your loop. You can store the results in whichever type of object suits them, such as a vector, matrix or list.
Before running the loop, you will first need to create a storage container for the results or output. IMPORTANT! The container should have the same length as the final results output, and it is also a good idea to give the container the same class as the results output (if possible).
For example, let's say you want to calculate the square root of every number in a vector and store the results in another vector. The original data is stored as a numeric vector and has ten elements. Therefore, your results vector should also have a length of ten, and since calculating the square root will generate numeric data, the results vector should also be numeric:
# Random data (10 numbers)
dat <- runif(10)
# Create a storage/container vector with a length of 10
results <- vector(mode="numeric", length=10)
# Run loop (calculate square root of each element)
for(i in 1:10) {
results[i] <- sqrt(dat[i])
}
Here we create a numeric vector named "results" with ten elements, each initialised to 0. Before the loop is run, it will look like this in the R console:
> results
[1] 0 0 0 0 0 0 0 0 0 0
And after you run the loop, it will be filled with the results (your data will vary):
> results
[1] 0.57213147 0.96330053 0.45254744 0.02192411 0.97023437 0.81988137 0.36956970 0.36543316
[9] 0.95555391 0.49002287
Both the "before" and "after" versions of the results vector will have the same memory allocation. You can test this by using the tracemem() function (your results will vary):
> tracemem(results)
[1] "<0000000011FAC248>"
If you didn't create a matching length results vector, this vector would have to "grow" on each iteration of the loop, with R re-allocating memory each time. This is inefficient and will slow down the performance of the loop.
You can see this for yourself by running the following code:
# Bad container
results2 <- vector()
# Bad loop
for(i in 1:10) {
results2[i] <- sqrt(dat[i])
print(tracemem(results2))
}
For a small loop like this, you probably won't notice the speed difference. But what if the "dat" vector contained 100 million elements? Let's compare their performance:
# Large random data
large_dat <- runif(100000000)
# Good loop
results <- vector("numeric", 100000000)
system.time(
for(i in 1:100000000) {
results[i] <- sqrt(large_dat[i])
}
)
# Bad loop
results2 <- vector()
system.time(
for(i in 1:100000000) {
results2[i] <- sqrt(large_dat[i])
}
)
"good loop" speed:
user system elapsed
4.61 0.02 4.63
vs. "bad loop" speed:
user system elapsed
18.50 1.41 19.90
Your results will vary, but as you can see here, the "bad loop" takes roughly four times as long to run as the "good loop".
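For this particular task it is worth remembering that sqrt() is itself vectorized, so no loop is strictly needed. The sketch below uses a smaller made-up vector (dat_small) and a fixed seed, both assumptions for the sake of a quick, repeatable check, to confirm that the pre-allocated loop and the vectorized call give the same answer:

```r
set.seed(1)                 # assumption: fixed seed so the data is repeatable
dat_small <- runif(1000)    # hypothetical smaller data set

# Pre-allocated loop, as in the "good loop" above
res_loop <- numeric(length(dat_small))
for (i in seq_along(dat_small)) {
  res_loop[i] <- sqrt(dat_small[i])
}

# Vectorized equivalent: one call, no container, no iterator
res_vec <- sqrt(dat_small)

all.equal(res_loop, res_vec)   # TRUE
```

You could wrap each version in system.time() with a larger vector to compare the speed on your own system.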
You can also store the results in a matrix. Just follow the practice of preparing your results matrix before the loop:
# Create a storage/container matrix
results_mat <- matrix(0, nrow=10, ncol=10)
# Run loop (calculate square root of each element)
for(i in 1:10) {
# The single value is recycled across all ten columns of row i
results_mat[i,] <- sqrt(dat[i])
}
You can also store results in a list, which is useful when you want to separate out results, or your output has mixed data classes, or you want to store objects themselves. As with vectors, you should first create an empty list of the same length as your results or output:
# Create a storage/container list
results_list <- vector("list", 10)
# Run loop (calculate square root of each element)
for(i in 1:10) {
results_list[[i]] <- sqrt(dat[i])
}
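A list container is most useful when each iteration produces more than one kind of value. In this sketch, the words vector and the field names (word, n_chars, long) are invented for illustration. Each list element stores a character, an integer and a logical value together:

```r
# Hypothetical input
words <- c("loop", "vector", "apply")

# Pre-allocated list with one slot per word
info <- vector("list", length(words))

for (i in seq_along(words)) {
  # Each slot holds a small named list of mixed classes
  info[[i]] <- list(
    word    = words[i],
    n_chars = nchar(words[i]),
    long    = nchar(words[i]) > 4
  )
}

info[[2]]$n_chars   # 6
info[[1]]$long      # FALSE
```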
While loops
The while loop allows you to perform a task for as long as a logical condition remains true. The diagram below helps to illustrate the while loop:
Here's an example of a while loop which will print the same statement five times:
# Container
i <- 1
# While loop
while(i <= 5) {
print("Hello World!")
i <- i + 1
}
The while loop requires a logical condition to work. In this example, the condition tells R to run the loop while the value of i is less than, or equal to, 5.
For this example to work, we have to first create a "container" object with a value of 1 outside the loop. We then increment this value within the loop itself, increasing the value of i by 1 on each iteration. Therefore, by the end of the fifth iteration of the loop, the value of i will have become 6 (1 + 1 + 1 + 1 + 1 + 1). When the loop checks whether it should execute the code again (for a sixth iteration), the condition is no longer met, so the loop exits.
You can also use break and next statements within while loops.
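A while loop is a natural choice when you do not know in advance how many iterations will be needed. As a minimal sketch (the starting value and the variable names are arbitrary choices for illustration), this loop keeps halving a number until it drops below 1 and counts how many steps that takes:

```r
value <- 100   # hypothetical starting value
steps <- 0

# Keep going while the value is still 1 or more
while (value >= 1) {
  value <- value / 2
  steps <- steps + 1
}

steps   # 7
value   # 0.78125
```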
Repeat loops
The repeat loop will perform a task repeatedly. It will only stop when it meets a break statement and condition. Without this condition, the loop will not stop (unless you hit the escape key). The diagram below helps to illustrate the repeat loop:
Here's an example of a repeat loop without a break statement and condition:
# Repeat loop
repeat {
print("Hit escape key (or ctrl + c) to stop this loop!")
flush.console()
Sys.sleep(1)
}
That loop would continue to run over and over again, until you intervened by hitting the escape key, or ctrl + c (on Linux). For a repeat loop to stop automatically, it must include a break statement and condition:
# Container
i <- 1
# Repeat loop
repeat {
# Code to be executed
print("Hello World!")
# Container increment
i <- i + 1
# Condition
if(i > 5) {
break
}
}
Like the while loop in the previous example, we created a container outside the loop which is incremented on each iteration of the loop. The loop now includes a condition to break (exit) the loop once the condition is met. The condition checks the value of i, and exits the loop once the value of i exceeds 5.
You'll notice that even though this performs the same task as the while loop in the previous example, the condition is written in a different way. A while loop checks its condition before each iteration and keeps running while that condition is true, so its code may never run at all. A repeat loop has no condition of its own: it keeps running until a break statement is reached, so with the check placed at the end, as here, the code always runs at least once.
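The difference is easiest to see when the condition is already false before the loop starts. In this sketch, the starting value of 10 and the two counters (while_runs and repeat_runs) are made up for illustration:

```r
# while: the condition is checked first, so the body never runs
i <- 10
while_runs <- 0
while (i <= 5) {
  while_runs <- while_runs + 1
  i <- i + 1
}

# repeat: the body runs once before the break condition is checked
i <- 10
repeat_runs <- 0
repeat {
  repeat_runs <- repeat_runs + 1
  i <- i + 1
  if (i > 5) {
    break
  }
}

while_runs    # 0
repeat_runs   # 1
```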
So, those are the basics of writing loops in R. They can be very useful for automating or repeating tasks, and if there is not a native vectorized function that could perform the same task, don't be afraid to write a loop!
Thanks for reading, and please leave any comments or questions below.
Further reading
Getting started with R - An introduction to R, everything you need to know to get started with R.