2 Data structures

Multiple values in R can be stored in different data structures, depending on whether the elements must be of the same type (homogeneous) or can differ (heterogeneous):

Homogeneous (of the same type): vector (1D), matrix (2D), array (3D+)
Heterogeneous (of mixed types): data frame, list

Vectors

A vector is an ordered collection of values from a single data type.

Use c() to combine different values into a vector:

x <- c(1, 3, 8, 12, 56, 875, 234, 13)
x

[1]   1   3   8  12  56 875 234  13

Use length() to determine the number of values in a vector:

length(x)

[1] 8

You can construct vectors of any data type:

y <- c("a", "b", "c")
typeof(y)

[1] "character"

But you cannot mix data types within a vector. If you do, R converts all values to the most flexible type present (coercion):

z <- c(1, 4, "b", 8.5, "abc")
typeof(z)

[1] "character"

The coercion order, from least to most flexible, is: logical < integer < double < character
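The coercion rules can be checked by combining two types at a time and inspecting the result with typeof(). This is a minimal sketch using base R only; the values are arbitrary illustrations.

```r
# Combine two types at a time; the more flexible type wins
typeof(c(TRUE, 2L))    # logical + integer  -> "integer"
typeof(c(2L, 3.5))     # integer + double   -> "double"
typeof(c(3.5, "a"))    # double + character -> "character"

# Logical values become 1 and 0 when combined with numbers
mixed <- c(TRUE, FALSE, 2.5)
mixed                  # 1.0 0.0 2.5
```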

Vectorized operations

To apply an arithmetic operation to each element of a vector, you may be tempted to write a loop such as the one below. The example loops through each element i of vector x and multiplies it by 2.

[1]   1   3   8  12  56 875 234  13

for (i in x) {
  i * 2
}

However, this is not the way math is done in R. Most operations in (base) R are vectorized, which means they are automatically performed element by element, so there is no need to loop over each element to do the calculation. Vectorized code is also shorter and usually faster, because explicit loops in R are relatively slow. The examples below apply a multiplication and an addition to each element of x.

[1]   1   3   8  12  56 875 234  13

x * 2

[1]    2    6   16   24  112 1750  468   26

x + 2

[1]   3   5  10  14  58 877 236  15
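To make the comparison concrete, the sketch below stores the loop results in a pre-allocated vector and checks that they match the vectorized result. The names v, doubled_loop and doubled_vec are illustrative only.

```r
v <- c(1, 3, 8, 12, 56, 875, 234, 13)

# Loop version: pre-allocate the result, then fill it element by element
doubled_loop <- numeric(length(v))
for (i in seq_along(v)) {
  doubled_loop[i] <- v[i] * 2
}

# Vectorized version: one expression, no loop
doubled_vec <- v * 2

identical(doubled_loop, doubled_vec)   # TRUE
```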

Vectorized operations also work between two vectors. In the following example, the first elements are added together, the second elements are added together, and so forth.

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(6, 3, 6, 9, 3, 1, 7, 2)
x + y

[1]  7  5  9 13  8  7 14 10

Recycling

If two vectors have different lengths, the shorter vector is recycled (repeated) as often as needed to match the length of the longer vector. In the example below, the y vector gets recycled four times to c(1, 2, 1, 2, 1, 2, 1, 2).

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 2)
x + y

[1]  2  4  4  6  6  8  8 10

Here, y gets recycled two times to c(1, 5, 1, 3, 1, 5, 1, 3).

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 5, 1, 3)
x + y

[1]  2  7  4  7  6 11  8 11

Recycling still takes place if the length of x is not a multiple of the length of y, but the last repetition of y is cut short and R issues a warning.

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 5, 3)
x + y

Warning in x + y: longer object length is not a multiple of shorter object
length

[1]  2  7  6  5 10  9  8 13
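One way to see what recycling does is to spell it out with rep_len(), which repeats a vector until it reaches a given length. This is a sketch with illustrative names (long_vec, short_vec, recycled); R does this expansion implicitly.

```r
long_vec  <- c(1, 2, 3, 4, 5, 6, 7, 8)
short_vec <- c(1, 5, 3)

# 8 is not a multiple of 3, so the last repetition is incomplete
length(long_vec) %% length(short_vec)    # 2

# Explicit recycling: repeat short_vec until it has 8 elements
recycled <- rep_len(short_vec, length(long_vec))
recycled                                 # 1 5 3 1 5 3 1 5

# Same result as the implicit version, but without the warning
long_vec + recycled                      # 2 7 6 5 10 9 8 13
```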

Statistical functions

There are several functions that calculate a statistic from vectors such as:

max()
min()
sum()
prod()
length()

[1] 1 2 3 4 5 6 7 8

sum(x)

[1] 36

Recall that arithmetic operators applied to NA return NA. So, by default, statistical functions return NA when the vector contains even a single missing value. This behavior is intentional. However, you can have these functions ignore NA values in their calculation by setting the argument na.rm = TRUE.

y <- c(1, 5, NA, 3)
sum(y)

[1] NA

sum(y, na.rm=TRUE)

[1] 9
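The same argument works for other summary functions. Before dropping missing values it can also help to count them. A small sketch, with y_demo as an illustrative vector:

```r
y_demo <- c(1, 5, NA, 3)

# Count the missing values first
sum(is.na(y_demo))            # 1

# na.rm = TRUE removes NA before the statistic is computed
mean(y_demo, na.rm = TRUE)    # 3
max(y_demo, na.rm = TRUE)     # 5
```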

Logical vectors

[1] 1 2 3 4 5 6 7 8

x < 4

[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

all(x < 4)

[1] FALSE

any(x < 4)

[1] TRUE

When arithmetic functions are applied to logical vectors, TRUE is treated as the number 1 and FALSE is treated as the number 0. This can be very handy when counting the number of true values.

[1] 1 2 3 4 5 6 7 8

sum(x < 4)

[1] 3
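Because TRUE counts as 1, mean() of a logical vector gives the proportion of TRUE values, and which() returns the positions where the condition holds. A short sketch using an illustrative vector v:

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8)

sum(v < 4)     # number of elements below 4: 3
mean(v < 4)    # proportion of elements below 4: 0.375
which(v < 4)   # positions of those elements: 1 2 3
```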

Generating vectors

Sequences

R includes helpful functions for generating sequences:

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

15:5

 [1] 15 14 13 12 11 10  9  8  7  6  5

seq(from = 1, to = 100, by = 10)

 [1]  1 11 21 31 41 51 61 71 81 91
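A few further variants may be useful; this sketch uses only base R functions. seq() can take a desired length instead of a step size, and seq_len() and seq_along() are convenient for building indices.

```r
# Five equally spaced values between 0 and 1
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# Descending sequence with a negative step
seq(from = 10, to = 1, by = -3)         # 10 7 4 1

# Index sequences
seq_len(5)                              # 1 2 3 4 5
seq_along(c("a", "b", "c"))             # 1 2 3
```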

Repeats

R includes helpful functions for generating repeats:

rep("x", times=10)

 [1] "x" "x" "x" "x" "x" "x" "x" "x" "x" "x"

rep(c("x", "o"), times=5)

 [1] "x" "o" "x" "o" "x" "o" "x" "o" "x" "o"

rep(c("x", "o"), each=5)

 [1] "x" "x" "x" "x" "x" "o" "o" "o" "o" "o"

Numerical indexing

You can access the i’th value in a vector x by using its positional index x[i]:

x <- c(1, 3, 8, 12, 56, 875, 234, 13)
x[1]

[1] 1

x[c(1, 5)]

[1]  1 56

x[c(1:4, 8)]

[1]  1  3  8 12 13

Removing values of a vector

You can remove values from a vector using negative indices:

length(x)

[1] 8

x2 <- x[-3]
length(x2)

[1] 7
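Negative indices also accept a vector of positions. Note that the original vector is not modified unless you assign the result back. A sketch with an illustrative vector v:

```r
v <- c(1, 3, 8, 12, 56, 875, 234, 13)

# Drop the first two elements
v[-c(1, 2)]        # 8 12 56 875 234 13

# Drop the last element
v[-length(v)]      # 1 3 8 12 56 875 234

# v itself is unchanged
length(v)          # 8
```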

Overwriting values of a vector

You can also overwrite individual values in a vector using indices. Here, x[1] denotes the first element in x:

x[1] <- 5
x

[1]   5   3   8  12  56 875 234  13

Logical indexing

Instead of using a numeric index pointing to the i-th position of vector x, you can use a logical expression to subset or extract the elements of x that meet a certain condition. For example, the expression below evaluates, for every element in vector x, whether that element is larger than 100. The result is a logical vector of TRUE and FALSE values that has the same length as x. If such a logical vector is used as an index vector, all elements are extracted (or replaced) where the index vector is TRUE.

x > 100

[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

x[x > 100]

[1] 875 234

x[x > 100] <- 100
x

[1]   5   3   8  12  56 100 100  13

Watch out: if the logical vector is shorter than x, the recycling rule applies!

[1]   5   3   8  12  56 100 100  13

x[c(TRUE, FALSE)]

[1]   5   8  56 100

x[c(TRUE, FALSE, TRUE)]

[1]   5   8  12 100 100
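Conditions can be combined with & (and) and | (or) before they are used as an index. The sketch below uses an illustrative vector v with the same values x has at this point.

```r
v <- c(5, 3, 8, 12, 56, 100, 100, 13)

# Both conditions must hold
v[v > 5 & v < 100]     # 8 12 56 13

# At least one condition must hold
v[v < 5 | v >= 100]    # 3 100 100

# which() converts the logical vector into positions
which(v >= 100)        # 6 7
```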

Factors

Factors are a special type of vector in R used to represent categorical variables in statistical modeling. Recall that statistical data types can be continuous or categorical, with categorical variables further classified as nominal (unordered) or ordinal (ordered). Factors have a predefined set of possible values, called levels, that represent the categories of the variable.

In this example, we start with a character vector, but we could also use a numeric vector.

treespecies_char <- c("SP", "PI", "FI", "FI", "PI")
treespecies_char

[1] "SP" "PI" "FI" "FI" "PI"

You can create a factor variable using the factor() function. This function converts a vector (usually of characters or integers) into a factor by assigning a set of levels that represent the unique categories. For example:

treespecies <- factor(treespecies_char)
treespecies

[1] SP PI FI FI PI
Levels: FI PI SP

Levels are the category labels of a factor and are stored as a character vector, while the factor itself stores integer codes that reference those levels.

levels(treespecies)

[1] "FI" "PI" "SP"

typeof(levels(treespecies))

[1] "character"
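The integer codes can be inspected with as.integer(). This matters in particular for factors built from numbers, where a direct conversion returns the codes rather than the original values. A sketch with illustrative names f and g:

```r
f <- factor(c("SP", "PI", "FI", "FI", "PI"))

typeof(f)        # "integer"
as.integer(f)    # codes referring to levels FI, PI, SP: 3 2 1 1 2
table(f)         # counts per level: FI 2, PI 2, SP 1

# Factor built from numbers: convert via character to get the values back
g <- factor(c(10, 20, 10))
as.numeric(g)                  # codes: 1 2 1
as.numeric(as.character(g))    # values: 10 20 10
```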

You can change the names of the levels as follows:

levels(treespecies) <- c("Fir", "Pine", "Spruce")
treespecies

[1] Spruce Pine   Fir    Fir    Pine  
Levels: Fir Pine Spruce

The factor() function also accepts the levels and labels arguments, which allow you to rename the categories when creating a factor variable. Check out the help page.

treespecies2 <- factor(treespecies_char, levels=c("FI", "PI", "SP"), labels=c("Fir", "Pine", "Spruce"))
treespecies2

[1] Spruce Pine   Fir    Fir    Pine  
Levels: Fir Pine Spruce

Note that you cannot simply assign a value to a factor that is not one of its levels. Below, I try to change the species of the first tree to Oak. Since that level does not exist in the factor, the entry is replaced with NA and the original value is lost.

treespecies[1] <- "Oak"

Warning in `[<-.factor`(`*tmp*`, 1, value = "Oak"): invalid factor level, NA
generated

treespecies

[1] <NA> Pine Fir  Fir  Pine
Levels: Fir Pine Spruce

Instead, you first need to add a level.

levels(treespecies2) <- c(levels(treespecies2), "Oak")
treespecies2

[1] Spruce Pine   Fir    Fir    Pine  
Levels: Fir Pine Spruce Oak

Then you can change the species class of the first tree to Oak.

treespecies2[1] <- "Oak"
treespecies2

[1] Oak  Pine Fir  Fir  Pine
Levels: Fir Pine Spruce Oak

Categorize continuous variables

The function cut() divides a numeric variable into intervals and codes them as a factor (categorical data):

temperature <- runif(20, min=0, max=30)
temperature

 [1] 11.253822  9.472241 12.421771 22.994277  8.743325 28.343908 16.486085
 [8] 10.331039 15.495551 22.751907  5.275195 15.358569 13.382359 22.070058
[15] 22.189743 28.810041  3.460757 22.427383  2.114313  7.871795

cut(temperature, c(0, 10, 30), labels=c('cold', 'warm'))

 [1] warm cold warm warm cold warm warm warm warm warm cold warm warm warm warm
[16] warm cold warm cold cold
Levels: cold warm
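By default, cut() builds intervals that are open on the left and closed on the right, so a value equal to the lowest break falls outside all intervals and becomes NA. The sketch below uses a small fixed vector (temps, an illustrative name) so the boundary cases are visible.

```r
temps <- c(0, 2.5, 10, 10.1, 29.9)

# Intervals are (0, 10] and (10, 30]: 0 is not included, 10 is "cold"
cut(temps, breaks = c(0, 10, 30), labels = c("cold", "warm"))
# NA cold cold warm warm

# include.lowest = TRUE closes the first interval on the left: [0, 10]
cut(temps, breaks = c(0, 10, 30), labels = c("cold", "warm"),
    include.lowest = TRUE)
# cold cold cold warm warm
```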

Matrices

In R, a matrix is a two-dimensional array with a dim attribute of length 2, specifying the number of rows (nrow) and columns (ncol).

m <- matrix(1:9, nrow = 3, ncol = 3)
m

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

See: dim(m), nrow(m), ncol(m)

Note that by default the columns of the matrix will be filled first. If you want to fill the matrix by row, you can specify this with the byrow argument:

n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
n

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

You can apply mathematical operators to matrices the same way as vectors (Attention: Recycling rule!):

m * 2

     [,1] [,2] [,3]
[1,]    2    8   14
[2,]    4   10   16
[3,]    6   12   18

m * n

     [,1] [,2] [,3]
[1,]    1    8   21
[2,]    8   25   48
[3,]   21   48   81
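Note that * multiplies matrices element by element. Matrix multiplication in the linear algebra sense uses the %*% operator, and t() transposes a matrix. A sketch with illustrative names a and b, built the same way as m and n:

```r
a <- matrix(1:9, nrow = 3, ncol = 3)
b <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Element-wise product: first elements multiplied with each other
(a * b)[1, 1]      # 1 * 1 = 1

# Matrix product: first row of a times first column of b
(a %*% b)[1, 1]    # 1*1 + 4*4 + 7*7 = 66

# Filling by row gives the transpose of filling by column
identical(t(a), b) # TRUE
```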

As with vectors, you can access elements of a matrix using indices, but now you work with two dimensions [i,j] or [row, column].

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

m[1, ]

[1] 1 4 7

m[ , 1]

[1] 1 2 3

m[1, 1]

[1] 1

m[1:2, 3]

[1] 7 8

m[1:2, c(1,3)]

     [,1] [,2]
[1,]    1    7
[2,]    2    8

When you extract elements from a matrix, the result can belong to a different class!

class(m)

[1] "matrix" "array"

class(m[ , 3])

[1] "integer"
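If the result should stay a matrix, for example a one-column matrix, the drop = FALSE argument of the bracket operator prevents the simplification to a vector. A sketch with an illustrative matrix m3:

```r
m3 <- matrix(1:9, nrow = 3, ncol = 3)

# Default: a single column is simplified to a vector
col_vec <- m3[, 3]
dim(col_vec)          # NULL, a plain vector has no dimensions

# drop = FALSE keeps the matrix structure
col_mat <- m3[, 3, drop = FALSE]
dim(col_mat)          # 3 1
is.matrix(col_mat)    # TRUE
```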

The cbind() function combines vectors or matrices by binding them column-wise:

cbind(m,n)

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7    1    2    3
[2,]    2    5    8    4    5    6
[3,]    3    6    9    7    8    9

The rbind() function combines vectors or matrices by stacking them row-wise:

rbind(m,n)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
[4,]    1    2    3
[5,]    4    5    6
[6,]    7    8    9

Lists

Lists in R are important because they can store different types of data—numbers, strings, vectors, data frames, or even other lists—within a single object. Many R functions return results as lists, which makes it easy to access specific components like coefficients or p-values. Lists also let you return multiple values from your own functions and handle complex or nested data, making them a flexible and essential tool in R programming.

l <- list(c(1, 2, 3), m, "a")

To access the elements of a list with indices you need to use double brackets [[]]:

l[[1]]

[1] 1 2 3

l[[3]]

[1] "a"
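List elements can also be named, which allows access by name with $ or [[ ]]. Single brackets return a (shorter) list, while double brackets return the element itself. A sketch with an illustrative list lst:

```r
lst <- list(numbers = c(1, 2, 3), label = "a")

names(lst)          # "numbers" "label"

# Access by name
lst$numbers         # 1 2 3
lst[["label"]]      # "a"

# Single vs. double brackets
class(lst[1])       # "list"    (a list of length 1)
class(lst[[1]])     # "numeric" (the element itself)
```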

Data frames

A data frame is what you may call a data table. It is similar to a two-dimensional matrix but the columns can contain different data types.

df <- data.frame(TREEID = 1001:1003, 
                 SPECIES = factor(c("Spruce", "Fir", "Pine")), 
                 LIFE = c(TRUE, FALSE, TRUE),
                 HEIGHT = c(34, 21, 26)
                 )
df

  TREEID SPECIES  LIFE HEIGHT
1   1001  Spruce  TRUE     34
2   1002     Fir FALSE     21
3   1003    Pine  TRUE     26

The summary() function gives a quick overview. Helpful for spotting data entry errors and NA’s:

summary(df)

     TREEID       SPECIES     LIFE             HEIGHT    
 Min.   :1001   Fir   :1   Mode :logical   Min.   :21.0  
 1st Qu.:1002   Pine  :1   FALSE:1         1st Qu.:23.5  
 Median :1002   Spruce:1   TRUE :2         Median :26.0  
 Mean   :1002                              Mean   :27.0  
 3rd Qu.:1002                              3rd Qu.:30.0  
 Max.   :1003                              Max.   :34.0

You can index (access) columns in three main ways:

df$TREEID

[1] 1001 1002 1003

df[ , 1]

[1] 1001 1002 1003

df[ , "TREEID"]

[1] 1001 1002 1003

Rows are indexed by row number:

df[3, ]

  TREEID SPECIES LIFE HEIGHT
3   1003    Pine TRUE     26

df[1:2, "TREEID"]

[1] 1001 1002

df[1, c("TREEID", "HEIGHT")]

  TREEID HEIGHT
1   1001     34

IMPORTANT: Extracting a row does not change the class, but extracting a single column does!

class(df)

[1] "data.frame"

class(df[ 1, ])

[1] "data.frame"

class(df[ , "TREEID"])

[1] "integer"
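If a single column should remain a data frame, there are two common options: drop = FALSE, or list-style indexing with single brackets. A sketch with a small illustrative data frame trees:

```r
trees <- data.frame(TREEID = 1001:1003, HEIGHT = c(34, 21, 26))

# Matrix-style indexing simplifies a single column to a vector
class(trees[, "TREEID"])                  # "integer"

# Two ways to keep a one-column data frame
class(trees[, "TREEID", drop = FALSE])    # "data.frame"
class(trees["TREEID"])                    # "data.frame"

# Double brackets return the column itself, like $
class(trees[["TREEID"]])                  # "integer"
```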

Missing values

Recall from last session that NA is used for missing values in R:

x <- c(1, 5, 3, 6, NA, 9, 21, 4)
x

[1]  1  5  3  6 NA  9 21  4

...and that you must use is.na() to determine whether an element is missing or whether a vector contains missing values.

is.na(x)

[1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

It is not uncommon to have missing values in datasets, e.g. in data frames and matrices:

df <- data.frame(var1 = c(1, 3, 12, NA, 5), 
                 var2 = c(3, 4, 1, 8, 11))
df

  var1 var2
1    1    3
2    3    4
3   12    1
4   NA    8
5    5   11

Use na.omit() to drop the rows of a data frame that contain NAs:

na.omit(df)

  var1 var2
1    1    3
2    3    4
3   12    1
5    5   11
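Before removing rows, it is often worth checking where the missing values are. The sketch below counts NAs per column and uses complete.cases() to flag rows without any NA; d is an illustrative copy of the data frame above.

```r
d <- data.frame(var1 = c(1, 3, 12, NA, 5),
                var2 = c(3, 4, 1, 8, 11))

# Number of missing values per column
colSums(is.na(d))        # var1: 1, var2: 0

# TRUE for rows without any missing value
complete.cases(d)        # TRUE TRUE TRUE FALSE TRUE

# Keeps the same rows as na.omit(d)
d_complete <- d[complete.cases(d), ]
nrow(d_complete)         # 4
```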

Also recall that arithmetic functions and operations applied to NAs return NA:

NA * 3

[1] NA

Many arithmetic functions allow you to specify whether to ignore or include NAs:

sum(df$var1)

[1] NA

sum(df$var1, na.rm=TRUE)

[1] 21

Subset columns

It is best to extract columns from a data frame using the column names:

names(airquality)

[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

airquality_ozone <- airquality$Ozone
airquality_ozone <- airquality[, "Ozone"]
airquality_ozone_temp <- airquality[, c("Ozone", "Temp")]

Subset rows

Subset rows using logical conditions on variables (columns):

airquality_temp_gr_70 <- airquality[airquality$Temp > 70, ]
nrow(airquality_temp_gr_70)

[1] 120
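A caveat applies when the column used in the condition contains missing values, as Ozone does: a comparison with NA yields NA, and an NA index returns a row filled with NA. Two common ways around this are which() and subset(). A sketch with illustrative object names:

```r
# The condition is NA wherever Ozone is NA; those rows come back as NA rows
high_raw <- airquality[airquality$Ozone > 50, ]
anyNA(high_raw$Ozone)       # TRUE

# which() keeps only the positions where the condition is TRUE
high_clean <- airquality[which(airquality$Ozone > 50), ]
anyNA(high_clean$Ozone)     # FALSE

# subset() also drops rows where the condition is NA
high_subset <- subset(airquality, Ozone > 50)
nrow(high_clean) == nrow(high_subset)   # TRUE
```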

Subset rows based on row indices:

airquality_zeile_10_100 <- airquality[1:100, ]
nrow(airquality_zeile_10_100)

[1] 100

You can build more complex subsets of rows and columns by combining conditions with logical operators. Start with all measurements from June:

airquality_juni <- airquality[airquality$Month == 6, ]

All measurements from June 15:

airquality_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, ]

Or return only the Ozone values (column) from June 15:

ozone_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, "Ozone"]

In previously covered logical operations, single values were compared to vectors or matrices (or single values), e.g. c(1,2,3) < 2.

The %in% operator can be applied to two vectors: x %in% y. For each element in vector x, %in% evaluates if the element is contained in vector y. The operator returns a logical vector of the same length as vector x.

The following example returns all rows for June and July.

airquality_jun_jul <- airquality[airquality$Month %in% c(6, 7), ]
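To illustrate what %in% returns, the sketch below applies it to two short vectors and then shows that the subset is equivalent to combining two == comparisons with |. The object names are illustrative.

```r
# One logical value per element of the left-hand vector
c(1, 2, 3) %in% c(2, 3)     # FALSE TRUE TRUE

# Equivalent subsets of June and July
jun_jul_in <- airquality[airquality$Month %in% c(6, 7), ]
jun_jul_or <- airquality[airquality$Month == 6 | airquality$Month == 7, ]
identical(jun_jul_in, jun_jul_or)   # TRUE
```

With more than two or three values, %in% stays readable where a chain of | comparisons does not.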

Create new variable

Recall, the $ operator is used to access columns in a data frame by name. You can also use it to create a new column:

The following example creates a new column with the name “NewVariable” and fills it with Ozone values multiplied by 100.

airquality$NewVariable <- airquality$Ozone * 100

Or we create a log-transformed variable.

airquality$logOzone <- log(airquality$Ozone)
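New columns can also be derived from conditions. The sketch below works on a copy (aq, an illustrative name) so that the built-in dataset stays untouched. It relies on Temp being recorded in degrees Fahrenheit, as documented in ?airquality; the threshold of 80 is an arbitrary choice for illustration.

```r
aq <- airquality    # work on a copy

# Convert temperature from Fahrenheit to Celsius
aq$TempC <- (aq$Temp - 32) * 5 / 9

# Label each day based on a condition (threshold chosen arbitrarily)
aq$Hot <- ifelse(aq$Temp > 80, "hot", "mild")

head(aq[, c("Temp", "TempC", "Hot")])
```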

Read and write

R can read data in a variety of formats and from a variety of sources, such as:

Text files (e.g. CSV, TXT)
Files from spreadsheet and statistical programs (e.g. Excel, SPSS)
DBF files (e.g. ArcGIS)
Databases (e.g. PostgreSQL)
Files on the local file system or on a remote server (e.g. FTP, HTTP)

Read text

Text (ASCII) files are a popular data storage and exchange format. They can be read on any OS platform without special software. The most common text files separate data columns by comma (CSV), semicolon, or tab. On German (and other) systems, the comma is used as the decimal separator, so semicolon or tab separation is often preferred there. On English systems, the dot (.) is the decimal separator and the comma groups digits for readability, e.g. 1,000,000; on other systems the roles are reversed, e.g. 1.000.000.

The read.table() function is the most generic function to read tabular data from text files. It takes several arguments to accommodate different data formats; see ?read.table. Important formatting options are:

sep=",": Columns are separated by ,
dec=".": Decimal sign is .
header=TRUE: The first row contains the column names

tab <- read.table("data/basic/airquality.txt", sep = ",", dec = ".", header = TRUE)

read.table() returns a data frame.

class(tab)

[1] "data.frame"

Use head() to print the first rows of the data frame (six by default) or tail() to print the last rows.

head(tab)

  ID Ozone Solar Wind Temp Month Day
1  1    41   190  7.4   67     5   1
2  2    36   118  8.0   72     5   2
3  3    12   149 12.6   74     5   3
4  4    18   313 11.5   62     5   4
5  5    NA    NA 14.3   56     5   5
6  6    28    NA 14.9   66     5   6

Note

read.csv() is a shortcut to read.table() for text files with comma-separated columns (sep = ",", dec = "."), and read.csv2() is a shortcut for semicolon-separated files that use the comma as decimal sign (sep = ";", dec = ",").

Write text

Exporting data frames to text files is similarly easy with write.table(). Its arguments and their default values are:

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

This example writes the data frame tab to a semi-colon delimited file.

write.table(tab, "data/basic/airquality_output.txt", sep = ";", dec = ".", row.names = FALSE)

You can also use write.csv() or write.csv2() to export data frames to comma-delimited or semi-colon delimited text files, respectively.

write.csv(tab, "data/basic/airquality_output.csv")
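To check that an export can be read back as intended, a round trip through a temporary file is a useful test. The sketch below writes a small illustrative data frame (demo_df) with write.csv2() and reads it back with read.csv2(); tempfile() is used so that no file in the project is touched.

```r
demo_df <- data.frame(id = 1:3, value = c(1.5, 2.25, 3))
path <- tempfile(fileext = ".csv")

# Semicolon as column separator, comma as decimal sign
write.csv2(demo_df, path, row.names = FALSE)
readLines(path)[2]            # "1;1,5"

# Read the file back and compare with the original
back <- read.csv2(path)
all.equal(demo_df, back)      # TRUE
```

Setting row.names = FALSE avoids an extra first column of row numbers in the output file.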

Save and load R objects

While text files are useful for data exchange, R objects can be efficiently saved and restored using saveRDS() and readRDS(). These functions serialize R objects in R’s native binary format, preserving the exact structure, data types, and attributes of the object. This makes them especially suitable for storing complex objects, such as fitted models or large complex datasets, that would otherwise take considerable time or computation to recreate.

Save an R object to a file with saveRDS():

# dat is a placeholder for a data frame with columns x and y
lmod <- lm(y ~ x, data = dat)
saveRDS(lmod, "data/basic/my_model.rds")

Load the object back into R with readRDS(). Assign the result to a variable, which allows you to restore the object with a different name if needed:

lmod <- readRDS("data/basic/my_model.rds")
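A self-contained version of this round trip could look as follows. It is a sketch that fits a model on the built-in airquality data and stores it in a temporary file; the object names fit, fit_restored and rds_path are illustrative.

```r
# Fit a simple model on a built-in dataset
fit <- lm(Ozone ~ Temp, data = airquality)

# Save to a temporary file and restore under a different name
rds_path <- tempfile(fileext = ".rds")
saveRDS(fit, rds_path)
fit_restored <- readRDS(rds_path)

# The restored object has the same class and coefficients
class(fit_restored)                          # "lm"
all.equal(coef(fit), coef(fit_restored))     # TRUE
```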

Note

Saving your entire R workspace (.RData) is generally not recommended. A workspace contains all objects in memory but no record of which code created which object, so it undermines reproducibility. Instead, save only the specific R objects you need with saveRDS() and restore them with readRDS(), or better yet, keep your analysis code in scripts and regenerate objects as needed. This approach ensures reproducibility and makes it easier to debug and share your work.

Warning

A drawback of binary RDS files is reduced long-term compatibility. As R evolves, older RDS files may become difficult or impossible to read in future versions. For long-term data storage or sharing with others, text-based formats like CSV are more robust and future-proof.