5  Programming

This chapter introduces the core programming concepts that allow you to make R do more than simple calculations or data manipulations. You will learn how to write your own functions, use control-flow structures such as loops and conditionals, and organize code so it becomes clearer, more efficient, and easier to reuse. These tools form the foundation of programmatic thinking in R.

Control flow

Conditional Statements

Conditional statements let your code branch into different paths based on logical tests. They are essential when your analysis depends on criteria such as thresholds, missing values, file existence, or classification rules.

x <- -5

if (x >= 0) {
  "positive value or zero"
} else {
  "negative value"
}
[1] "negative value"
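When more than two outcomes are possible, conditions can be chained with else if. The following is a minimal sketch; the variable names reading and label are made up for illustration. The missing-value check comes first because a comparison with NA returns NA, and if() cannot branch on NA.

```r
reading <- 0   # hypothetical measurement

if (is.na(reading)) {
  label <- "missing"          # checked first: NA > 0 would be NA, not TRUE/FALSE
} else if (reading > 0) {
  label <- "positive value"
} else if (reading < 0) {
  label <- "negative value"
} else {
  label <- "zero"
}

label
```

Only the first branch whose condition is TRUE runs, so the order of the tests matters.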

Logical Operators

Logical operators allow you to build more complex conditions:

  • & and && — logical AND
  • | and || — logical OR
  • ! — logical NOT
  • %in% — test membership in a set
  • ==, !=, <, >, <=, >= — comparisons
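As a quick illustration of the operators listed above, the sketch below applies them element-wise to a small example vector (the values are arbitrary). Each expression returns one TRUE or FALSE per element.

```r
x <- c(-2, 0, 3, 7)

both    <- x > 0 & x < 5      # AND: FALSE FALSE  TRUE FALSE
either  <- x < 0 | x > 5      # OR:   TRUE FALSE FALSE  TRUE
negated <- !(x > 0)           # NOT:  TRUE  TRUE FALSE FALSE
member  <- x %in% c(0, 7)     # membership: FALSE  TRUE FALSE  TRUE
nonzero <- x != 0             # comparison:  TRUE FALSE  TRUE  TRUE

x[both]                       # logical vectors can be used to subset: 3
```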

Short-circuiting

In R, control structures such as if() and while() require a single TRUE or FALSE value to determine program flow. The logical operators && and || are designed specifically for this purpose. They differ from the element-wise operators & and | in two important ways:

  • && and || require length-one logical values. Since R 4.3.0, passing a longer vector results in an error.

  • && and || short-circuit: the second operand is evaluated only if the first does not already determine the result. This can prevent errors and avoid unnecessary computation.

Using & evaluates both sides, so sqrt(x) is computed even though the first condition is FALSE. The overall result is still FALSE, because R returns FALSE for expressions like FALSE & NA.

x <- -5

x > 0 & sqrt(x) > 0
Warning in sqrt(x): NaNs produced
[1] FALSE

Using && short-circuits: because the first condition is FALSE, sqrt(x) is never evaluated and no warning is produced.

x > 0 && sqrt(x) > 0
[1] FALSE
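A common use of short-circuiting is to guard a test that would fail on unexpected input. In the sketch below, value is a hypothetical input that may be NULL. Without the guard, if (value > 0) would stop with an error, because value > 0 has length zero.

```r
value <- NULL   # hypothetical input that may be absent

# !is.null(value) is FALSE, so `value > 0` is never evaluated
if (!is.null(value) && value > 0) {
  status <- "positive"
} else {
  status <- "not usable"
}
status

# || stops as soon as the left-hand side is TRUE; stop() is never called
ok <- TRUE || stop("this is never evaluated")
ok
```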

Loops and Iteration

Basic forms:

  • for loops
  • while loops
  • repeat loops
  • Loop controls: break, next

The following for loop uses next to skip the third iteration:

for (i in 1:5) {
  if (i == 3) next   # skip iteration
  print(i)
}
[1] 1
[1] 2
[1] 4
[1] 5
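The other two loop forms can be sketched in the same way. The numbers below are arbitrary and only serve to show when each loop stops: while tests its condition before every iteration, whereas repeat runs until break is called.

```r
# while: keep halving n as long as it is at least 1
n <- 20
steps <- 0
while (n >= 1) {
  n <- n / 2
  steps <- steps + 1
}
steps   # 5 halvings: 20 -> 10 -> 5 -> 2.5 -> 1.25 -> 0.625

# repeat: no condition of its own, so break is required to leave the loop
i <- 0
repeat {
  i <- i + 1
  if (i^2 > 50) break   # stop at the first i whose square exceeds 50
}
i       # 8
```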

Alternatives to Loops

Vectorized operations

Remember, many R functions are vectorized, meaning they operate on entire vectors or arrays at once. This allows you to perform calculations on multiple elements simultaneously without writing explicit loops, making your code faster, more concise, and easier to read. Vectorized operations take advantage of R’s internal optimizations and often run much more efficiently than manually iterating over each element with for or while loops.

x <- 1:10
x^2
 [1]   1   4   9  16  25  36  49  64  81 100
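For comparison, here is a sketch of the same calculation written as an explicit loop. Both versions give the same result; the vectorized form simply needs less code.

```r
x <- 1:10

# explicit loop: pre-allocate the result, then fill it element by element
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# vectorized: one expression for the whole vector
squares_vec <- x^2

all.equal(squares_loop, squares_vec)   # TRUE
```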

Apply

Previously, we learned that with vectorization (Section 2.1.1) mathematical operations are computed on every element of a vector without the need for a loop. Similarly, it is possible to calculate statistics on multiple columns (or rows) at once without using loops. For example, assume you need to calculate the sum of the values in each column (or row) of a data frame. You could do this manually by calling sum() repeatedly on every column (or row), but this would be tedious. The apply() function automates that task for you. With apply(), you can apply sum() (or any other function) across all rows (MARGIN = 1) or columns (MARGIN = 2) in one call.

a <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))
a
  A B C
1 1 4 7
2 2 5 8
3 3 6 9
a$D <- a$C - a$B
a
  A B C D
1 1 4 7 3
2 2 5 8 3
3 3 6 9 3

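The apply() function applies a function (e.g., mean, sd, sum, min, max) over the rows (MARGIN = 1) or columns (MARGIN = 2) of a data frame or matrix. A data frame is converted to a matrix first, so apply() is best suited to data in which all columns are numeric.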

apply(X = a, MARGIN = 1, FUN = mean)
[1] 3.75 4.50 5.25
apply(X = a, MARGIN = 2, FUN = mean)
A B C D 
2 5 8 3

Note

R provides an entire family of apply functions that work with lists and other objects: lapply() applies a function to each element of a list or vector and always returns a list, sapply() does the same but simplifies the result to a vector or matrix when possible, and mapply() applies a function element by element across several vectors or lists at once.
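The differences between these functions are easiest to see side by side. The sketch below uses a small made-up list called scores. vapply() is also part of base R; it behaves like sapply() but requires you to declare the type of each result.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)   # example data

means_list <- lapply(scores, mean)               # a list with elements a, b, c
means_vec  <- sapply(scores, mean)               # simplified to a named numeric vector
means_safe <- vapply(scores, mean, numeric(1))   # like sapply(), but the type is declared

# mapply() walks over several vectors in step: 1 + 4, 2 + 5, 3 + 6
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

means_vec
sums
```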

purrr package (tidyverse)

The purrr package, part of the tidyverse, provides a set of tools for functional programming in R. It offers functions like map(), map_dbl(), and map_lgl() that apply a function to each element of a list or vector while ensuring a consistent output type. Compared to base R’s apply family, purrr functions are often more readable, type-safe, and integrate seamlessly with other tidyverse packages, making them a powerful alternative to explicit loops.
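As a brief sketch of the map functions mentioned above, the example below computes a mean and a logical flag for each element of a made-up list. It assumes the purrr package is installed; the requireNamespace() guard only skips the example when it is not.

```r
if (requireNamespace("purrr", quietly = TRUE)) {
  scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)   # example data

  # map_dbl() always returns a numeric vector, or fails with an error
  means <- purrr::map_dbl(scores, mean)

  # map_lgl() always returns a logical vector
  has_several <- purrr::map_lgl(scores, function(v) length(v) > 1)

  print(means)
  print(has_several)
}
```

Because the suffix fixes the output type, a result of the wrong type raises an error instead of silently changing the shape of the output.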

For a more in-depth treatment of this topic, see Chapter 9 in Wickham (2019).

Functions

Functions are self-contained units with a well-defined purpose. In general, functions take inputs, do calculations (possibly printing intermediate results, drawing graphs, calling other functions, etc.), and produce outputs. Below, I create a function called my_addition_fun. The call function(x, y) creates a function with two arguments named x and y. The function body is enclosed in curly braces {}. Inside the body, you can do something with x and y and return the result with return(). A comprehensive description can be found in Wickham (2019).

my_addition_fun <- function(x, y){
  z <- x + y
  return(z)
}

my_addition_fun(x = 2, y = 4)
[1] 6

You can also specify default values for function arguments, here y=5.

my_addition_fun <- function(x , y = 5){
  z <- x + y
  return(z)
}

my_addition_fun(x = 2)
[1] 7

Default values are overridden when you pass a value for the argument explicitly:

my_addition_fun(x = 2, y = 8)
[1] 10
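Two further details about calling functions are worth a short sketch: arguments can be matched by position or by name, and a function returns the value of its last evaluated expression even without return(). The function my_subtraction_fun below is a hypothetical example, chosen because the order of the arguments affects the result.

```r
my_subtraction_fun <- function(x, y = 1) {
  x - y   # last evaluated expression, so this is the return value
}

my_subtraction_fun(10)              # y falls back to its default: 9
my_subtraction_fun(10, 4)           # matched by position: x = 10, y = 4, gives 6
my_subtraction_fun(y = 10, x = 4)   # matched by name, order is irrelevant: -6
```

Naming the arguments is usually the safer choice when a function has more than one or two of them.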

When the function call completes, all variables created inside the function are gone. For example, the variable z is not known outside the function; it exists only in the function's environment. The opposite is not true, however: if a variable is not defined inside a function, R searches for it in the environment where the function was defined, and then in that environment's parents. This lookup rule is called lexical scoping. Below, a exists in the environment outside my_addition_fun(). Inside the function, a is used before it is assigned locally: the assignment a <- 1 happens only after z has been computed. Hence, z is calculated with the outer value a = 42. The assignment a <- 1 creates a local variable that disappears when the function returns, so the outer a is still 42 afterwards.

a <- 42
my_addition_fun <- function(x , y = 5){
  z <- x + y + a
  a <- 1
  return(z)
}

my_addition_fun(x = 2)
[1] 49
a
[1] 42

Important

If a variable is not declared inside a function, R searches for it in the environments outside the function.
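A practical consequence is that a function can read an outer variable, but a normal assignment with <- inside the function never changes it. The sketch below uses the hypothetical names counter, increment, and add_offset to illustrate this.

```r
counter <- 0

increment <- function() {
  counter <- counter + 1   # reads the outer counter, then creates a local one
  counter
}

increment()   # 1
increment()   # 1 again: the local counter was discarded after the first call
counter       # 0: the outer variable is unchanged

# Passing values in as arguments keeps the function independent of outer state
add_offset <- function(x, offset) {
  x + offset
}
add_offset(2, offset = 42)   # 44
```

Relying on outer variables makes a function's result depend on the state of the session, so passing everything it needs as arguments is usually easier to reason about.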

Error handling