Multiple values in R can be stored in different data structures, depending on whether the elements must be of the same type (homogeneous) or can differ (heterogeneous):
Homogeneous (of the same type): vector (1D), matrix (2D), array (3D+)
Heterogeneous (of mixed types): data frame, list
A vector is an ordered collection of values from a single data type.
Use c() to combine values into a vector:
x <- c(1, 3, 8, 12, 56, 875, 234, 13)
x
[1] 1 3 8 12 56 875 234 13
Use length() to determine the number of values in a vector:
length(x)
[1] 8
You can construct vectors from each data type:
y <- c("a", "b", "c")
typeof(y)
[1] "character"
But you cannot mix data types within a vector. If you do, R converts all elements to the most flexible type among them (coercion):
z <- c(1, 4, "b", 8.5, "abc")
typeof(z)
[1] "character"
The order, from least to most flexible, is: Logical < Integer < Double < Character
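The coercion rule can be checked pair by pair. The following is a minimal sketch; the variable names are made up for illustration and do not appear elsewhere in this tutorial.

```r
# Each call mixes two types; typeof() reports the type R settles on
lgl_int <- c(TRUE, 2L)    # logical + integer
int_dbl <- c(2L, 3.5)     # integer + double
dbl_chr <- c(3.5, "a")    # double + character

typeof(lgl_int)  # "integer"
typeof(int_dbl)  # "double"
typeof(dbl_chr)  # "character"

# Coercion can also be requested explicitly
num_from_chr <- as.numeric(c("1", "2.5"))
num_from_chr     # 1.0 2.5
```

In each pair, the result takes the type that can represent both inputs without losing information.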
To apply an arithmetic operation to each element of a vector, you may be tempted to write a loop such as the one below, which goes through each element i of vector x and multiplies it by 2.
x
[1] 1 3 8 12 56 875 234 13
for (i in x) {
  i * 2
}
However, this is not how arithmetic is usually done in R. Most operations in (base) R are vectorized, which means they are automatically performed element by element, so there is no need to loop over each element. Vectorized code is also shorter and usually faster, because loops in R are relatively slow. The example below applies a multiplication and an addition to each element of x without a loop.
x
[1] 1 3 8 12 56 875 234 13
x * 2
[1] 2 6 16 24 112 1750 468 26
x + 2
[1] 3 5 10 14 58 877 236 15
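To compare the two styles directly, here is a small sketch that stores the loop result and checks it against the vectorized result. The names doubled_loop and doubled_vec are illustrative only.

```r
x <- c(1, 3, 8, 12, 56, 875, 234, 13)

# Loop version: pre-allocate the result, then fill it element by element
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorized version: one expression, no explicit loop
doubled_vec <- x * 2

identical(doubled_loop, doubled_vec)  # TRUE
```

Both versions give the same values; the vectorized form simply leaves the iteration to R.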
Vectorized operations also work between two vectors. In the following example, the first elements are added together, the second elements are added together, and so forth.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(6, 3, 6, 9, 3, 1, 7, 2)
x + y
[1] 7 5 9 13 8 7 14 10
If two vectors have different lengths, the shorter vector is recycled (repeated) as often as needed to match the length of the longer vector. In the example below, the y vector gets recycled four times to c(1, 2, 1, 2, 1, 2, 1, 2).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 2)
x + y
[1] 2 4 4 6 6 8 8 10
Here, y gets recycled twice to c(1, 5, 1, 3, 1, 5, 1, 3).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 5, 1, 3)
x + y
[1] 2 7 4 7 6 11 8 11
Recycling is incomplete if the length of the longer vector is not a multiple of the length of the shorter one. R still returns a result, but it issues a warning:
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 5, 3)
x + y
Warning in x + y: longer object length is not a multiple of shorter object length
[1] 2 7 6 5 10 9 8 13
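Recycling can be made explicit with rep_len(), which repeats a vector up to a given length. The sketch below reproduces what x + y does implicitly and adds a simple length check; y_recycled is an illustrative name.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 5, 1, 3)

# Spell out the recycling that x + y performs implicitly
y_recycled <- rep_len(y, length.out = length(x))
y_recycled                         # 1 5 1 3 1 5 1 3
identical(x + y, x + y_recycled)   # TRUE

# A simple guard before relying on recycling
length(x) %% length(y) == 0        # TRUE: the lengths are compatible
length(x) %% 3 == 0                # FALSE: a length-3 vector would trigger the warning
```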
Several functions calculate a summary statistic from a vector, such as max(), min(), sum(), prod(), and length():
x
[1] 1 2 3 4 5 6 7 8
sum(x)
[1] 36
Recall that arithmetic operators applied to NA return NA. So, by default, statistical functions return NA when the vector contains even a single missing value. This behavior is intentional. However, you can have these functions ignore NA values in their calculation by setting the argument na.rm=TRUE.
y <- c(1, 5, NA, 3)
sum(y)
[1] NA
sum(y, na.rm=TRUE)
[1] 9
x
[1] 1 2 3 4 5 6 7 8
x < 4
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
all(x < 4)
[1] FALSE
any(x < 4)
[1] TRUE
When arithmetic functions are applied to logical vectors, TRUE is treated as the number 1 and FALSE is treated as the number 0. This can be very handy when counting the number of true values.
x
[1] 1 2 3 4 5 6 7 8
sum(x < 4)
[1] 3
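The same idea gives proportions: because TRUE counts as 1 and FALSE as 0, the mean of a logical vector is the share of TRUE values. A short sketch, with illustrative variable names:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

n_small <- sum(x < 4)       # number of elements below 4
share_small <- mean(x < 4)  # proportion of elements below 4

n_small      # 3
share_small  # 0.375
```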
R includes helpful functions for generating sequences:
1:10
 [1] 1 2 3 4 5 6 7 8 9 10
15:5
 [1] 15 14 13 12 11 10 9 8 7 6 5
seq(from = 1, to = 100, by = 10)
 [1] 1 11 21 31 41 51 61 71 81 91
R includes helpful functions for generating repeats:
rep("x", times=10)
 [1] "x" "x" "x" "x" "x" "x" "x" "x" "x" "x"
rep(c("x", "o"), times=5)
 [1] "x" "o" "x" "o" "x" "o" "x" "o" "x" "o"
rep(c("x", "o"), each=5)
 [1] "x" "x" "x" "x" "x" "o" "o" "o" "o" "o"
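A few further variations of these helpers are sketched below. They are all base R; the variable names are only for illustration.

```r
# Fix the number of values instead of the step width
five_steps <- seq(from = 0, to = 1, length.out = 5)
five_steps                 # 0.00 0.25 0.50 0.75 1.00

# seq_len() and seq_along() also behave well for empty input
seq_len(3)                 # 1 2 3
seq_along(c("a", "b"))     # 1 2
length(seq_len(0))         # 0, whereas 1:0 would give c(1, 0)

# times can also be a vector with one count per element
uneven <- rep(c("x", "o"), times = c(2, 3))
uneven                     # "x" "x" "o" "o" "o"
```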
You can access the i’th value in a vector x by using its positional index x[i]:
x <- c(1, 3, 8, 12, 56, 875, 234, 13)
x[1]
[1] 1
x[c(1, 5)]
[1] 1 56
x[c(1:4, 8)]
[1] 1 3 8 12 13
You can remove values from a vector using negative indices:
length(x)
[1] 8
x2 <- x[-3]
length(x2)
[1] 7
You can also overwrite individual values in a vector using indices. Here, x[1] denotes the first element in x:
x[1] <- 5
x
[1] 5 3 8 12 56 875 234 13
Instead of using a numeric index pointing to the ith position of vector x, you can use a logical expression to subset or extract the elements of x that meet a certain condition. For example, the expression below evaluates, for every element i in vector x, whether that element is larger than 100. The result is a logical vector of TRUE and FALSE values that has the same length as x. If such a logical vector is used as an index vector, all elements for which the index vector is TRUE are extracted (or replaced).
x > 100
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
x[x > 100]
[1] 875 234
x[x > 100] <- 100
x
[1] 5 3 8 12 56 100 100 13
Watch out: if the logical vector is shorter than x, the recycling rule applies!
x
[1] 5 3 8 12 56 100 100 13
x[c(TRUE, FALSE)]
[1] 5 8 56 100
x[c(TRUE, FALSE, TRUE)]
[1] 5 8 12 100 100
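Logical conditions can also be combined, and which() converts a logical vector into the positions that are TRUE. The sketch below uses the same values that x holds at this point; the result names are illustrative.

```r
x <- c(5, 3, 8, 12, 56, 100, 100, 13)

# Combine conditions element-wise with & (and) and | (or)
mid_values <- x[x > 5 & x < 100]
mid_values              # 8 12 56 13
x[x < 5 | x == 100]     # 3 100 100

# which() returns the positions where the condition is TRUE
pos_100 <- which(x == 100)
pos_100                 # 6 7
```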
Factors are a special type of vector in R used to represent categorical variables in statistical modeling. Recall that statistical data types can be continuous or categorical, with categorical variables further classified as nominal (unordered) or ordinal (ordered). Factors have a predefined set of possible values, called levels, that represent the categories of the variable.
In this example, we start with a character vector, but we could also use a numeric vector.
treespecies_char <- c("SP", "PI", "FI", "FI", "PI")
treespecies_char
[1] "SP" "PI" "FI" "FI" "PI"
You can create a factor variable using the factor() function. This function converts a vector (usually of characters or integers) into a factor by assigning a set of levels that represent the unique categories. For example:
treespecies <- factor(treespecies_char)
treespecies
[1] SP PI FI FI PI
Levels: FI PI SP
Levels are the category labels of a factor and are stored as a character vector, while the factor itself stores integer codes that reference those levels.
levels(treespecies)
[1] "FI" "PI" "SP"
typeof(levels(treespecies))
[1] "character"
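The integer codes behind a factor can be inspected directly. This is a small sketch on a fresh factor named species (an illustrative name), built from the same values as above.

```r
species <- factor(c("SP", "PI", "FI", "FI", "PI"))

typeof(species)       # "integer": the underlying storage
codes <- as.integer(species)
codes                 # 3 2 1 1 2: positions in levels(species)

# Mapping the codes back through the levels recovers the labels
labels_back <- levels(species)[codes]
labels_back           # "SP" "PI" "FI" "FI" "PI"

# table() counts the observations per level
table(species)
```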
You can change the names of the levels as follows:
levels(treespecies) <- c("Fir", "Pine", "Spruce")
treespecies
[1] Spruce Pine Fir Fir Pine
Levels: Fir Pine Spruce
The factor() function also accepts the levels and labels arguments, which allow you to rename the categories when creating a factor variable. Check out the help page.
treespecies2 <- factor(treespecies_char, levels=c("FI", "PI", "SP"), labels=c("Fir", "Pine", "Spruce"))
treespecies2
[1] Spruce Pine Fir Fir Pine
Levels: Fir Pine Spruce
Note that you cannot simply assign a value to a factor if that value is not one of its levels. Below, I try to change the species of the first tree to "Oak". Since this level does not exist in the factor, R issues a warning and replaces the entry with NA. This is bad.
treespecies[1] <- "Oak"
Warning in `[<-.factor`(`*tmp*`, 1, value = "Oak"): invalid factor level, NA generated
treespecies
[1] <NA> Pine Fir Fir Pine
Levels: Fir Pine Spruce
Instead, you first need to add a level.
levels(treespecies2) <- c(levels(treespecies2), "Oak")
treespecies2
[1] Spruce Pine Fir Fir Pine
Levels: Fir Pine Spruce Oak
Then you can change the species class of the first tree to Oak.
treespecies2[1] <- "Oak"
treespecies2
[1] Oak Pine Fir Fir Pine
Levels: Fir Pine Spruce Oak
The function cut() divides a numeric variable into intervals and codes them as a factor (categorical data):
temperature <- runif(20, min=0, max=30)
temperature
 [1] 11.253822  9.472241 12.421771 22.994277  8.743325 28.343908 16.486085
 [8] 10.331039 15.495551 22.751907  5.275195 15.358569 13.382359 22.070058
[15] 22.189743 28.810041  3.460757 22.427383  2.114313  7.871795
cut(temperature, c(0, 10, 30), labels=c('cold', 'warm'))
 [1] warm cold warm warm cold warm warm warm warm warm cold warm warm warm warm
[16] warm cold warm cold cold
Levels: cold warm
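One detail worth knowing is how cut() treats values that fall exactly on a break. By default the intervals are open on the left and closed on the right. The sketch below uses a small fixed vector, temps (an illustrative name), instead of random numbers so that the result is reproducible.

```r
temps <- c(0, 5, 10, 15, 30)

# Default: intervals are (0, 10] and (10, 30], so 0 falls outside and becomes NA
default_cut <- cut(temps, breaks = c(0, 10, 30), labels = c("cold", "warm"))
as.character(default_cut)   # NA "cold" "cold" "warm" "warm"

# include.lowest = TRUE closes the first interval on the left: [0, 10]
closed_cut <- cut(temps, breaks = c(0, 10, 30), labels = c("cold", "warm"),
                  include.lowest = TRUE)
as.character(closed_cut)    # "cold" "cold" "cold" "warm" "warm"
```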
In R, a matrix is a two-dimensional array with a dim attribute of length 2, specifying the number of rows (nrow) and columns (ncol).
m <- matrix(1:9, nrow = 3, ncol = 3)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
See: dim(m), nrow(m), ncol(m)
Note that by default the columns of the matrix will be filled first. If you want to fill the matrix by row, you can specify this with the byrow argument:
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
n
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
You can apply mathematical operators to matrices the same way as vectors (Attention: Recycling rule!):
m * 2
     [,1] [,2] [,3]
[1,]    2    8   14
[2,]    4   10   16
[3,]    6   12   18
m * n
     [,1] [,2] [,3]
[1,]    1    8   21
[2,]    8   25   48
[3,]   21   48   81
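The recycling rule matters most when a matrix is combined with a shorter vector. Because a matrix is filled column by column, the vector is recycled down the columns. A minimal sketch, where v is an illustrative vector:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
v <- c(10, 20, 30)

# v is recycled down the columns: 10 is added to row 1, 20 to row 2, 30 to row 3
by_row <- m + v
by_row
#      [,1] [,2] [,3]
# [1,]   11   14   17
# [2,]   22   25   28
# [3,]   33   36   39

# To add v across the columns instead (10 to column 1, 20 to column 2, ...),
# transpose, add, and transpose back
by_col <- t(t(m) + v)
by_col[1, ]   # 11 24 37
```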
As with vectors, you can access elements of a matrix using indices, but now you work with two dimensions [i,j] or [row, column].
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
m[1, ]
[1] 1 4 7
m[ , 1]
[1] 1 2 3
m[1, 1]
[1] 1
m[1:2, 3]
[1] 7 8
m[1:2, c(1,3)]
     [,1] [,2]
[1,]    1    7
[2,]    2    8
When you extract elements from a matrix, the result can belong to a different class!
class(m)
[1] "matrix" "array"
class(m[ , 3])
[1] "integer"
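If the result should stay a matrix, the drop argument of the [ operator can be set to FALSE. A short sketch, with illustrative result names:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

col_vec <- m[, 3]                # dimensions dropped: plain integer vector
col_mat <- m[, 3, drop = FALSE]  # still a 3 x 1 matrix

is.matrix(col_vec)   # FALSE
is.matrix(col_mat)   # TRUE
dim(col_mat)         # 3 1
```

This is useful in code that expects a matrix regardless of how many columns are selected.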
The cbind() function combines vectors or matrices by binding them column-wise:
cbind(m,n)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7    1    2    3
[2,]    2    5    8    4    5    6
[3,]    3    6    9    7    8    9
The rbind() function combines vectors or matrices by stacking them row-wise:
rbind(m,n)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
[4,]    1    2    3
[5,]    4    5    6
[6,]    7    8    9
Lists in R are important because they can store different types of data—numbers, strings, vectors, data frames, or even other lists—within a single object. Many R functions return results as lists, which makes it easy to access specific components like coefficients or p-values. Lists also let you return multiple values from your own functions and handle complex or nested data, making them a flexible and essential tool in R programming.
l <- list(c(1, 2, 3), m, "a")
To access the elements of a list with indices you need to use double brackets [[]]:
l[[1]]
[1] 1 2 3
l[[3]]
[1] "a"
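List elements can also be named, which allows access by name. The sketch below uses a made-up list called tree; it also shows the difference between single and double brackets.

```r
tree <- list(id = 1001, species = "Spruce", heights = c(34, 35.5))

tree$species        # "Spruce": access by name
tree[["heights"]]   # 34.0 35.5: the same idea with double brackets

# Single brackets return a (shorter) list, double brackets the element itself
class(tree["id"])     # "list"
class(tree[["id"]])   # "numeric"

names(tree)         # "id" "species" "heights"
```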
A data frame is what you may call a data table. It is similar to a two-dimensional matrix but the columns can contain different data types.
df <- data.frame(TREEID = 1001:1003,
                 SPECIES = factor(c("Spruce", "Fir", "Pine")),
                 LIFE = c(TRUE, FALSE, TRUE),
                 HEIGHT = c(34, 21, 26)
)
df
  TREEID SPECIES  LIFE HEIGHT
1   1001  Spruce  TRUE     34
2   1002     Fir FALSE     21
3   1003    Pine  TRUE     26
The summary() function gives a quick overview. It is helpful for spotting data entry errors and NAs:
summary(df)
     TREEID       SPECIES     LIFE             HEIGHT
 Min.   :1001   Fir   :1   Mode :logical   Min.   :21.0
 1st Qu.:1002   Pine  :1   FALSE:1         1st Qu.:23.5
 Median :1002   Spruce:1   TRUE :2         Median :26.0
 Mean   :1002                              Mean   :27.0
 3rd Qu.:1002                              3rd Qu.:30.0
 Max.   :1003                              Max.   :34.0
You can access (index) columns in three main ways:
df$TREEID
[1] 1001 1002 1003
df[ , 1]
[1] 1001 1002 1003
df[ , "TREEID"]
[1] 1001 1002 1003
Rows are indexed by row number:
df[3, ]
  TREEID SPECIES LIFE HEIGHT
3   1003    Pine TRUE     26
df[1:2, "TREEID"]
[1] 1001 1002
df[1, c("TREEID", "HEIGHT")]
  TREEID HEIGHT
1   1001     34
IMPORTANT: Extracting a row does not change the class, but extracting a single column does!
class(df)
[1] "data.frame"
class(df[ 1, ])
[1] "data.frame"
class(df[ , "TREEID"])
[1] "integer"
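When a single column should remain a data frame, there are two common options in base R: drop = FALSE, or list-style indexing with a single bracket and no comma. A small sketch on a reduced version of the data frame:

```r
df <- data.frame(TREEID = 1001:1003, HEIGHT = c(34, 21, 26))

class(df[, "TREEID"])                # "integer": simplified to a vector
class(df[, "TREEID", drop = FALSE])  # "data.frame"
class(df["TREEID"])                  # "data.frame": list-style indexing
class(df[["TREEID"]])                # "integer": the column itself
```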
Recall from last session that NA is used for missing values in R:
x <- c(1, 5, 3, 6, NA, 9, 21, 4)
x
[1] 1 5 3 6 NA 9 21 4
...and that you must use is.na() to determine which elements of a vector are missing.
is.na(x)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
It is not uncommon to have missing values in datasets, e.g. in data frames and matrices:
df <- data.frame(var1 = c(1, 3, 12, NA, 5),
                 var2 = c(3, 4, 1, 8, 11))
df
  var1 var2
1    1    3
2    3    4
3   12    1
4   NA    8
5    5   11
Use na.omit() to drop the rows of a data frame that contain NAs:
na.omit(df)
  var1 var2
1    1    3
2    3    4
3   12    1
5    5   11
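Before dropping rows, it can help to see where the missing values are. The sketch below counts NAs per column and builds the same subset as na.omit() with complete.cases(); keep and df_complete are illustrative names.

```r
df <- data.frame(var1 = c(1, 3, 12, NA, 5),
                 var2 = c(3, 4, 1, 8, 11))

na_per_column <- colSums(is.na(df))
na_per_column                # var1 = 1, var2 = 0

keep <- complete.cases(df)   # TRUE for rows without any NA
keep                         # TRUE TRUE TRUE FALSE TRUE
df_complete <- df[keep, ]
nrow(df_complete)            # 4
```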
Also recall that arithmetic functions and operations applied to NAs return NA:
NA * 3
[1] NA
Many arithmetic functions allow you to specify whether to ignore or include NAs:
sum(df$var1)
[1] NA
sum(df$var1, na.rm=TRUE)
[1] 21
It is best to extract columns from a data frame using the column names:
names(airquality)
[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
airquality_ozone <- airquality$Ozone
airquality_ozone <- airquality[, "Ozone"]
airquality_ozone_temp <- airquality[, c("Ozone", "Temp")]
Subset rows using logical operations on variables (columns):
airquality_temp_gr_70 <- airquality[airquality$Temp > 70, ]
nrow(airquality_temp_gr_70)
[1] 120
Subset rows based on row indices:
airquality_zeile_10_100 <- airquality[1:100, ]
nrow(airquality_zeile_10_100)
[1] 100
You can combine logical operators to make more complex subsets of rows and columns:
airquality_juni <- airquality[airquality$Month == 6, ]
All measurements from 15 June:
airquality_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, ]
Or return only the Ozone values (column) from 15 June:
ozone_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, "Ozone"]
In previously covered logical operations, single values were compared to vectors or matrices (or single values), e.g. c(1,2,3) < 2.
The %in% operator can be applied to two vectors: x %in% y. For each element in vector x, %in% evaluates if the element is contained in vector y. The operator returns a logical vector of the same length as vector x.
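A minimal sketch of %in% on two plain vectors, before it is applied to a data frame. The vector months is made up for illustration.

```r
months <- c(5, 6, 7, 8, 6)

in_summer <- months %in% c(6, 7)
in_summer                  # FALSE TRUE TRUE FALSE TRUE
months[in_summer]          # 6 7 6

# Equivalent, but longer, with == and |
months == 6 | months == 7  # FALSE TRUE TRUE FALSE TRUE
```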
The following example returns all rows for the months June and July.
airquality_jun_jul <- airquality[airquality$Month %in% c(6, 7), ]
Recall that the $ operator is used to access columns in a data frame by name. You can also use it to create a new column.
The following example creates a new column with the name “NewVariable” and fills it with the Ozone values multiplied by 100.
airquality$NewVariable <- airquality$Ozone * 100
Or we create a log-transformed variable.
airquality$logOzone <- log(airquality$Ozone)
R can read a variety of dataset formats such as
Text (ASCII) files are a popular data storage and exchange format. They can be read on any OS platform without special software. The most common text files separate data columns by commas (csv), semicolons, or tabs. On German (and other) systems, the comma is already reserved as the decimal separator, so semicolon or tab separation is sometimes preferred there. On English systems, the dot (.) is the decimal separator; on other systems it may instead be used to group digits for readability, e.g. 1.000.000.
The read.table() function is the most generic function for reading tabular data from text files. It takes several arguments to accommodate different data formats; see ?read.table. Important formatting options are:
sep=",": Columns are separated by ,
dec=".": Decimal sign is .
header=TRUE: The first row contains the column names
tab <- read.table("data/basic/airquality.txt", sep = ",", dec = ".", header = TRUE)
read.table() returns a data frame.
class(tab)
[1] "data.frame"
Use head() to print the first couple of rows of the data.frame or tail() to print the last rows.
head(tab)
  ID Ozone Solar Wind Temp Month Day
1  1    41   190  7.4   67     5   1
2  2    36   118  8.0   72     5   2
3  3    12   149 12.6   74     5   3
4  4    18   313 11.5   62     5   4
5  5    NA    NA 14.3   56     5   5
6  6    28    NA 14.9   66     5   6
read.csv() is a shortcut to read.table() for text files with comma-separated columns, and read.csv2() is a shortcut for semicolon-separated files that use the comma as the decimal sign.
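The relationship between the shortcut and the generic function can be tried out without any external data. The sketch below writes a tiny semicolon-separated file with decimal commas to a temporary file; the file contents and the names csv_file, tab2, and tab_generic are made up for illustration.

```r
# Write a small semicolon-separated file with decimal commas to a temp file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("ID;Wind;Temp", "1;7,4;67", "2;8,0;72"), csv_file)

# read.csv2() assumes sep = ";" and dec = ","
tab2 <- read.csv2(csv_file)

# The same import spelled out with read.table()
tab_generic <- read.table(csv_file, sep = ";", dec = ",", header = TRUE)

tab2$Wind                              # 7.4 8.0, read as numbers
all.equal(tab2$Wind, tab_generic$Wind) # TRUE
```

If the decimal sign were left at its default in read.table(), the Wind column would be read as text rather than as numbers.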
Exporting data.frames to text files is similarly easy with write.table():
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
This example writes the data frame tab to a semicolon-delimited file.
write.table(tab, "data/basic/airquality_output.txt", sep = ";", dec = ".", row.names = FALSE)
You can also use write.csv() or write.csv2() to export data frames to comma-delimited or semicolon-delimited text files, respectively.
write.csv(tab, "data/basic/airquality_output.csv")
While text files are useful for data exchange, R objects can be efficiently saved and restored using saveRDS() and readRDS(). These functions serialize R objects in R’s native binary format, preserving the exact structure, data types, and attributes of the object. This makes them especially suitable for storing complex objects, such as fitted models or large datasets, that would otherwise take considerable time or computation to recreate.
Save an R object to a file with saveRDS():
lmod <- lm(y ~ x, data = dat)
saveRDS(lmod, "data/basic/my_model.rds")
Load the object back into R with readRDS(). Assign the result to a variable, which allows you to restore the object under a different name if needed:
lmod <- readRDS("data/basic/my_model.rds")
Saving your entire R workspace (.RData) is generally not recommended. A workspace contains all objects in memory and hurts reproducibility, because you won’t know which code created which objects. Instead, save only the specific R objects you need with saveRDS() and restore them with readRDS(), or better yet, keep your analysis code in scripts and regenerate objects as needed. This approach ensures reproducibility and makes it easier to debug and share your work.
A drawback of binary RDS files is reduced long-term compatibility. As R evolves, older RDS files may become difficult or impossible to read in future versions. For long-term data storage or sharing with others, text-based formats like CSV are more robust and future-proof.
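The trade-off between the two formats can be seen in a small round trip. The sketch below uses a made-up data frame and temporary files, and it assumes R 4.0 or later, where read.csv() does not convert strings to factors by default.

```r
df <- data.frame(species = factor(c("Spruce", "Fir")), height = c(34, 21))

# RDS round trip: the object comes back exactly as it was saved
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
df_restored <- readRDS(rds_file)
identical(df, df_restored)   # TRUE: factor levels and types survive

# CSV round trip: the values survive, but not every attribute
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
df_csv <- read.csv(csv_file)
class(df_csv$species)        # "character" under the R >= 4.0 assumption
df_csv$height                # 34 21
```

The CSV file can be opened with any text editor, while the RDS file keeps R-specific details such as factor levels.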