R for Data Analysis: A Short Tutorial

Session 1: Introduction to R

Dimiter Toshkov

Institute of Public Administration, Leiden University

last updated: 2025-03-31

Welcome! First things first: introductions!

My name is written Димитър and is pronounced [diˈmitər].

I am not a programmer.

I have done a fair bit of data analysis, visualizations and some programming.

I like pictures.

Here is a picture that I like (done in R)

Not done in R (yes, there's more where that came from)

What is R?

R is the best thing since sliced bread,

only much better, because...

unlike bread, it combines functional with object-oriented programming, and it is not sliced, but modular, so that it can be easily extended with new packages.

Downsides of R

It is called R.

It comes with no customer support.

It regularly leads people into arguments about how good it is.

Error messages are rarely informative.

It is not made for data entry (though if you insist, look here).

What can you do with R? (1)

Fun things, such as:

  • Run (m)any statistical models, including Bayesian models (e.g. with Stan)
  • Do automated text analysis (e.g. with quanteda) and machine learning (e.g. with tensorflow)
  • Interact with Large Language Models (LLMs) (e.g. with chattr and tidychatmodels)

What can you do with R? (2)

Access data:

  • From any type of file (Stata, SPSS, Excel, text, etc.)
  • Directly from the web via APIs (e.g. World Bank)
  • Scrape complex internet sites and databases (e.g. EUR-Lex)

What can you do with R? (3)

Do other important things, such as:

  • Produce well-formatted conference programs from Excel sheets
  • Make presentations like this one (with RMarkdown and xaringan)
  • Practice open, reproducible science with Quarto

Art with R, by Katharina Brunner

How can you learn to use R?

1. Get a good foundation

2. Learn by doing

2.1 with lots of support from the R community on blogs and StackOverflow

2.2 adapting other people's code

2.3 and asking LLMs for help

3. You can also get good old-fashioned books

What can you expect from this tutorial?

Get started and get inspired.

Get a good foundation, hopefully.

Learn enough so you can continue learning on your own.

Organization of the meetings

Session 1: Introduction to R

  • Workflow
  • Fundamentals, objects and functions
  • Conditional evaluation and loops

Session 2: Data wrangling

  • Importing data
  • Restructuring datasets
  • Recoding variables
  • Merging and exporting data

Session 3: Data analysis

  • Data summary and simple linear models
  • Generalized linear models (logistic regression)
  • Multilevel models
  • Generating tables

Session 4: Data visualization

  • with plot
  • with ggplot2

By the end of the tutorial...

You should not think about working with any other software for your data work [1].

[1] Unless you have to work with the uninitiated.

Let's get started (1)

We can work with R directly (from the console/terminal), but it would be nice if we could save our work somehow.

We can use any text editor and copy and paste the code, but this gets boring pretty quickly.

So we use programs such as RStudio that integrate a text editor linked to R and some other nice features.

We write in a text file (script) and send commands to be executed in the console using Ctrl+Enter.

Let's get started (2)

You can customize the appearance of code in RStudio.

Protip: Use an RStudio theme that highlights code (I use Pastel on Dark).

Protip: Turn on rainbow brackets.

Protip: RStudio has useful shortcuts (see them all with Alt+Shift+K). Learn and use some of them (e.g. Ctrl+1/Ctrl+2).

General grammatical features of R as a language

Functions have arguments in parentheses, separated by commas (unlike in Excel).

Capitalization of object and function names matters.

Spaces and indentation in your code do not matter (unlike in Python). But spaces are not allowed inside object and function names.

You can use single or double quotation marks, also nested within each other. But be careful to close like with like.
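
A small sketch of these rules in action. The object names (total, Total, spaced, quote1, quote2) are made up for illustration.

```r
# Function arguments go inside parentheses, separated by commas
total <- sum(1, 2, 3)

# Names are case sensitive: 'total' and 'Total' are two different objects
Total <- 10

# Extra spaces and line breaks between elements do not change the result
spaced <- sum( 1,
               2,    3 )

# Single and double quotes can be nested, as long as each pair closes like with like
quote1 <- "She said 'hello'"
quote2 <- 'She said "hello"'

total == spaced # TRUE
total == Total  # FALSE
```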

Files and projects (1)

We can start a file, write code, execute the code, and - when we are happy - we can save the file (with an .r extension, but it remains a text file that can be edited in any text editor).

This might be all we need for very small, simple and individual projects. (Btw, where is our work?)

### Where are we?
getwd() # oh, here
setwd('C:/my projects/here') # better here

Files and projects (2)

For more complex projects, you would want to start a Project. A project sets up the working environment and organizes things in a nice way. Within the project, you can (should) create separate folders for your code, input data, output data, plots, model results, and tables. The code itself can (should) be separated in smaller files (e.g. one for the libraries and functions that you use, one for data import and manipulation, one for statistical analyses, etc.). To start or continue working on a project, you click on the relevant .Rproj file, which loads the working environment.
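
One possible way to set up such a folder structure from within R. This is only a sketch: the folder names are a suggestion, and a temporary directory stands in for the real project folder so that nothing on your disk gets overwritten.

```r
# Hypothetical project root; a temporary folder is used here for safety
project_root <- file.path(tempdir(), "my_project")

# One folder per type of material (the names are only a suggestion)
folders <- c("code", "data_input", "data_output", "plots", "models", "tables")

for (f in folders) {
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

list.files(project_root) # lists the folders we have just created
```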

Files and projects (3)

Use relative, not absolute paths in your scripts to make collaborative work easier.

### There are good and bad paths
'./data/nl/zh/dh/students.csv' # this is a good path
'C:/data/nl/zh/dh/students.csv' # this is a bad path

Some additional advice on setting up projects is available here. We will say more about workflow (with GitHub and RMarkdown) later in this tutorial.
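
As a sketch, a relative path can also be assembled from its parts with file.path(). The path is then resolved against the working directory, which inside a project is the project folder.

```r
# Build a relative path from its parts; file.path() joins them with "/"
students_path <- file.path("data", "nl", "zh", "dh", "students.csv")
students_path

# The same line works on any computer that has the project folder
file.exists(students_path) # TRUE only if the file is actually there
```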

Some good practices

Take the time to annotate your code (using # to start a segment of a line that is not executed as code).

Think about the names of files and variables that you use. Have a system and be consistent. You can use ., or _, or capital letters, but stick to one.

### How (not) to name your variables
data.nl.denhaag.bezuidenhout # this is fine
data_nl_denhaag_bezuidenhout # this is also fine
DataNlDenhaagBezuidenhout # this is not so fine
data.Nl_DenHaag_.bezuidenhout # this is definitely not fine

Some more good practices

Think about how you name your scripts and other files as well.

Protip: Use 00_libraries.R, 01_firstanalysis.R, etc. to name your scripts in the order that they should be executed, so you can quickly sort them within the folder alphabetically.
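
Why the leading zeros? A quick sketch with made-up file names. The radix method is used so that the sorting does not depend on your computer's language settings.

```r
scripts <- c("10_plots.R", "2_analysis.R", "1_libraries.R") # hypothetical file names
sort(scripts, method = "radix") # "10_plots.R" comes before "2_analysis.R"

padded <- c("10_plots.R", "02_analysis.R", "01_libraries.R")
sort(padded, method = "radix")  # now in the intended order
```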

Modularity

R works with packages.

The default installation comes with basic functionality.

For everything else, you install a package.

There are multiple packages that can achieve the same task.

There is a special universe of packages called tidyverse, developed by Hadley Wickham and company, which offers a convenient way to load, wrangle, analyze and visualize data. We will use these a lot.

Working with packages (1)

Working with packages is easy:

  • First, you have to install the package, from a CRAN repository, or from zip files, or via devtools. You can install with a command or from the RStudio menu. You install a package once on a computer (you might need to update every now and then).

  • Once the package is installed, you will want to load it with the library() function to use its functions. You have to load the package every session (if you need it, of course).

  • You can also directly specify functions from packages for use, e.g. dplyr::recode(). This is necessary because different functions in different packages can have the same name. This leads to confusion, both for R and for us.
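
A sketch of how name clashes play out, using only packages that ship with R. We deliberately create our own function called median (a made-up example) and then call the original one explicitly with stats::median().

```r
# A user-defined function that happens to share its name with stats::median()
median <- function(x) "this is not the median you are looking for"

own <- median(c(1, 2, 10))               # calls our own function
from_stats <- stats::median(c(1, 2, 10)) # calls the function from the stats package: 2

rm(median) # remove our version, so that median() points to stats::median() again
```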

Working with packages (2)

### How to install and load a package
install.packages('dplyr')
library (dplyr)

Protip: If you work with people who would not know how to install a package but would want to run your code, you can start your code with a function that will install and load packages automatically (see here)

Protip: Don't do that with people who know their way around R. They don't like your script installing things without their authorization.
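
Such a function could look roughly like this. It is a sketch, not the function linked above, and the name use_package is made up. The second protip still applies.

```r
# Hypothetical helper: install a package only if it is missing, then load it
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg) # needs an internet connection and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

use_package("stats") # ships with R, so nothing gets installed here
```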

Assignment operators (1)

Perhaps the most fundamental operation in R is to assign a value to a named object: object_name <- value. Be careful, R is sensitive; case sensitive, that is.

You can be old school [1] and assign values to names with <-. Or you can just use =. And if you are that cool, you can also use ->.

[1] "There is a general preference among the R community for using <- for assignment (other than in function signatures) for compatibility with (very) old versions of S-Plus."

Assignment operators (2)

### There are different ways to assign
best.month <- 'August'
best.date = 18
1978 -> best.year
best.date
## [1] 18
best.month
## [1] "August"
best.year
## [1] 1978

Assignment operators (3)

There are some subtle differences among the different assignment operators; if you are interested, read here.

You also have the assignment operator <<-. This is most useful 'in conjunction with closures to maintain state'. Exactly. If you want to know more, read here.
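
For the curious, here is a minimal sketch of what 'closures that maintain state' could mean in practice. The names make_counter and counter are made up for illustration.

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1 # changes 'count' in the enclosing function, not a local copy
    count
  }
}

counter <- make_counter()
counter()
counter()
third <- counter() # the counter remembers its state between calls, so third is 3
```

With <- instead of <<-, every call would start again from 0 and return 1.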

Vectors (1)

Vectors are one-dimensional collections of objects.

### How to make and check a vector
v1 <- seq (1, 50, by=5)
v1
## [1] 1 6 11 16 21 26 31 36 41 46
v2 <- c('R', 'pie', 5, NA)
v2
## [1] "R" "pie" "5" NA
is.vector(v1)
## [1] TRUE
is.vector (v2)
## [1] TRUE
is.vector(c(is.vector(v1), is.vector(v2)))
## [1] TRUE

Vectors (2)

There are several different types of vectors: logical, character, numeric (which can be double or integer), complex and raw. Factors and dates are augmented vectors that have a special attribute, their 'class'.

### What vectors?
typeof(v1)
## [1] "double"
typeof(v2)
## [1] "character"
typeof(is.vector(c(is.vector(v1), is.vector(v2))))
## [1] "logical"
typeof(c("1", "2", "4"))
## [1] "character"
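
A short sketch of what 'augmented' means: the underlying type stays simple, and the class attribute tells R how to treat the values. The objects f and d are made up for illustration.

```r
f <- factor(c("low", "high", "low"))
typeof(f) # "integer": the underlying vector stores the codes of the levels
class(f)  # "factor"

d <- as.Date("1978-08-18")
typeof(d) # "double": the number of days since 1970-01-01
class(d)  # "Date"
```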

More on vectors

Protip: In R, numbers are 'doubles' by default. To make an 'integer', place an L after the number (e.g. 2L). This can save some trouble down the road. Alternatively, use round() when evaluating.

0.3/3 == 0.1 # floating point bizarro
## [1] FALSE
round(0.3/3,1) == 0.1
## [1] TRUE
unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
## [1] 0.3 0.3 0.3

Integers have one special value: NA, while doubles have four: NA, NaN, Inf and -Inf.

c(-1, 0, 1) / 0
## [1] -Inf NaN Inf
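
Besides round(), another common option is to compare doubles with a tolerance. A small sketch with base R's all.equal(), wrapped in isTRUE() so that the result is always a single TRUE or FALSE:

```r
a <- 0.3 / 3
a == 0.1                  # FALSE, as above
isTRUE(all.equal(a, 0.1)) # TRUE: equal within a small numerical tolerance

typeof(2)  # "double"
typeof(2L) # "integer"
```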

Coercion (1)

You can coerce one type of vector to another. But be gentle and beware the consequences.

v1 <- c(1,2,4)
typeof(v1)
## [1] "double"
f1 <- as.factor(v1)
f1
## [1] 1 2 4
## Levels: 1 2 4
n1 <- as.numeric(f1)
is.numeric(n1)
## [1] TRUE
n1 # OMG!!!
## [1] 1 2 3
n2 <- as.numeric(as.character(f1))
n2 # that's better!
## [1] 1 2 4

Coercion (2)

Coercion happens without your help (and perhaps realization) as well, every time you mix vector elements of different types together. The most complex type prevails.

v1 <- 1:999 # the integers from 1 to 999
is.numeric(v1)
## [1] TRUE
length(v1) # vectors have length
## [1] 999
v2 <- c(v1, '1000')
is.numeric(v2)
## [1] FALSE
typeof(v2) # it only takes one
## [1] "character"
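
A sketch of the pecking order when types are mixed: logical gives way to integer, integer to double, and everything to character. The vector answers is made up for illustration.

```r
typeof(c(TRUE, 1L))  # "integer"
typeof(c(1L, 1.5))   # "double"
typeof(c(1.5, "a"))  # "character"
typeof(c(TRUE, "a")) # "character"

# Logical values become 1 and 0, which makes counting easy
answers <- c(TRUE, FALSE, TRUE, TRUE)
sum(answers)  # 3
mean(answers) # 0.75
```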

Character vectors

Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.

Working with strings and character vectors is very common in data analysis. There are a couple of very useful operations with strings that we should learn right away:

v.char <- c("alpha", "beta", "gamma")
substr(v.char, 1, 2) # get the first two letters of every element
## [1] "al" "be" "ga"
nchar(v.char) # count the number of characters in each string
## [1] 5 4 5
toupper(v.char) # capitalize each element
## [1] "ALPHA" "BETA" "GAMMA"

The package stringr has handy functions for more advanced operations.
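
Until we get to stringr, a few more operations that work in base R without any packages (a small sketch):

```r
v.char <- c("alpha", "beta", "gamma")
paste0(v.char, "_", 1:3) # glue strings together: "alpha_1" "beta_2" "gamma_3"
grepl("et", v.char)      # does the string contain "et"? FALSE TRUE FALSE
sub("a", "A", v.char)    # replace the first "a": "Alpha" "betA" "gAmma"
```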

Lists

Lists, also called recursive vectors, can contain all kinds of things, including other lists.

y <- list("a", 1L, 1.5, TRUE)

Data frames are lists of a special class:

typeof(data.frame(NA))
## [1] "list"
class(data.frame(NA))
## [1] "data.frame"

Navigating our objects

There are different ways in which we can navigate to and access elements of our objects. We can do that by position or name:

x <- rnorm(100 ,0, 1) # let's generate some randomness
y <- rnorm(100 ,0, 1) # let's generate more randomness
m <- cbind(x,y) # let's bind randomness together in a ....
class(m)
## [1] "matrix" "array"
dim(m)
## [1] 100 2
length(x)
## [1] 100
df<-data.frame(m)

Navigation examples

x[1]
## [1] 1.114262
m[1,1]
## x
## 1.114262
m[1:5, -2]
## [1] 1.1142618 -1.3311627 0.3256704 -0.3040167 -1.7666893
df[seq(1, 100, 10), "y"]
## [1] 0.92370516 -0.05881295 -0.20430290 -0.73554277 -0.69460705 0.38450683
## [7] 0.37186241 0.31955549 0.34642647 0.46685266

Navigating lists is more complicated. We have to use x[[ n ]] to get the n-th element of list x. That element itself could be anything (e.g. a data frame).
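
A sketch of the difference between the ways to get into a list. The list l and the names of its elements are made up for illustration.

```r
l <- list(chars = c("a", "b"), number = 1.5, table = data.frame(id = 1:3))

l[[2]]          # the second element itself: 1.5
l[2]            # a list of length one that contains the second element
l$number        # access by name with $
l[["table"]]$id # elements can be navigated further: 1 2 3
```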

Some basic functions for summarizing data

First steps are easy

mean(x)
## [1] 0.05794814
sd(y)
## [1] 1.00284
quantile(m)
##          0%         25%         50%         75%        100%
## -3.68630633 -0.71101087 -0.02760762  0.57849092  2.58010017
range(df)
## [1] -3.686306 2.580100
summary(df)
##        x                  y
##  Min.   :-2.01945   Min.   :-3.686306
##  1st Qu.:-0.75151   1st Qu.:-0.672038
##  Median :-0.07559   Median : 0.005507
##  Mean   : 0.05795   Mean   :-0.079793
##  3rd Qu.: 0.76700   3rd Qu.: 0.449224
##  Max.   : 2.58010   Max.   : 2.385434

But it can get more tricky. Note that we can use the dollar sign $ to access columns (variables) of a data frame.

df <- rbind(df, c(NA,NA)) # bind rows together
tail(df)
##               x            y
## 96   2.40618546  0.594138446
## 97   0.06715917  0.276341899
## 98  -0.39137443  1.235099081
## 99  -0.34462524  0.425646241
## 100  0.56221336 -0.003453708
## 101          NA           NA
sum(df$x) # oops
## [1] NA
sum(df$x, na.rm=TRUE) # ok, R is very careful with missing data
## [1] 5.794814

And more tricky

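sum(df) # NA, because of the missing values in row 101
## [1] NA
df$z <- rowSums(df) # row sums of x and y; NA if any element is missing
tail(df, 2)
##             x            y         z
## 100 0.5622134 -0.003453708 0.5587597
## 101        NA           NA        NA
df$z <- rowSums(df, na.rm=TRUE) # careful: df now has a z column, so the old z is added in as well
tail(df, 2)
##             x            y        z
## 100 0.5622134 -0.003453708 1.117519
## 101        NA           NA 0.000000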

LOOPS (1)

Loops are a fundamental programming technique, in which we iterate over a predefined sequence and apply a function to each element.

for (i in 1:5) {
  print(round(df[i,], 2))
}
##      x    y    z
## 1 1.11 0.92 4.08
##       x     y     z
## 2 -1.33 -2.71 -8.08
##      x   y    z
## 3 0.33 0.5 1.65
##      x    y    z
## 4 -0.3 0.37 0.14
##       x     y     z
## 5 -1.77 -1.51 -6.55

Most R functions are vectorized, which means that we do not have to loop over the elements of a vector to apply the function to each element separately. Yet, in some cases loops can be handy.
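
A sketch of what vectorization buys us: the same calculation done at once and with an explicit loop. The object names are made up for illustration.

```r
values <- c(1, 4, 9, 16)

# Vectorized: the function is applied to all elements at once
roots <- sqrt(values)

# The same result with an explicit loop
roots.loop <- numeric(length(values)) # create an empty vector of the right length first
for (i in seq_along(values)) {
  roots.loop[i] <- sqrt(values[i])
}

identical(roots, roots.loop) # TRUE
```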

LOOPS (2)

We can also create new objects in loops:

for (i in 1:dim(df)[1]) {
  df$our.sum[i] <- sum(df[i,1:2], na.rm=TRUE)
}
df[c(1,100:101),]
##             x            y        z   our.sum
## 1   1.1142618  0.923705162 4.075934 2.0379670
## 100 0.5622134 -0.003453708 1.117519 0.5587597
## 101        NA           NA 0.000000 0.0000000

You can read more about loops here.
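
One caveat with sequences like 1:dim(df)[1], as a small sketch: when there are no rows, 1:0 counts down instead of being empty. seq_len() is a safer way to build the sequence. The data frame empty is made up for illustration.

```r
empty <- data.frame(x = numeric(0))

1:nrow(empty)        # 1 0 (a loop over this would run twice!)
seq_len(nrow(empty)) # integer(0) (a loop over this is skipped)
```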

Comparisons (evaluation)

Sooner or later, we all become judgmental:

These are the main comparison operators: >, >=, <, <=, != (not equal), and == (equal).

With logical operators, we can mix things up a bit: & is “and”, | is “or”, and ! is “not”.

Be careful with missing values: almost any operation involving an unknown value will also be unknown.
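
A sketch of how missing values behave in comparisons. The vector ages is made up for illustration.

```r
NA > 1     # NA
NA == NA   # NA: we cannot know whether two unknown values are equal
NA & FALSE # FALSE: the result is FALSE whatever the unknown value is
NA | TRUE  # TRUE: the result is TRUE whatever the unknown value is
is.na(NA)  # TRUE: the way to test for missing values

ages <- c(2, NA, 0)
ages > 1   # TRUE NA FALSE
```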

Comparisons (2)

We can check for missing data: is.na(x) or even better which(is.na(df$x)).

which(df$x > 1)
## [1] 1 6 8 11 13 15 28 30 31 37 39 42 47 48 69 72 75 76 92 96
w1 <- which(df$x > 1)
length(w1)
## [1] 20
w2 <- which(df$y>1)
length(w2)
## [1] 11

Check whether the last row of df has elements greater than 1.

Conditionals (1)

Conditional evaluation is another fundamental programming technique.

if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # do yet another thing
}

Conditionals (2)

For very short evaluations we can also use the ifelse one-liner: ifelse(evaluate, do.this.if.true, do.this.if.false). These simple statements can be nested, but it is better to use the extensive form shown above.

for (i in seq_along(df$x)) {
  # sum x and y only if neither of them is missing
  if (!is.na(df$x[i]) && !is.na(df$y[i])) {
    df$out.sum2[i] <- sum(df[i,1:2])
  } else {
    df$out.sum2[i] <- NA
  }
}
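
For comparison, a sketch of the ifelse() one-liner on two small made-up vectors, a and b. Because ifelse() works on whole vectors, no loop is needed.

```r
a <- c(1.2, NA, -0.5)
b <- c(0.3, 0.4, NA)

# Sum the two vectors, but return NA where either value is missing
sums <- ifelse(is.na(a) | is.na(b), NA, a + b) # 1.5 NA NA

# Nested version: label the values of a
labels <- ifelse(is.na(a), "missing", ifelse(a > 0, "positive", "not positive"))
labels # "positive" "missing" "not positive"
```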

Functions

Objects are stuff with names and values. Functions do things to objects.

In R you can easily write your own functions. Just give them a name and tell them what to do.

sum.na <- function(x) {sum(x, na.rm=TRUE)} # sum that ignores NAs
sum.na(c(3,5,NA))
## [1] 8
sum.allna <- function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm=TRUE)
} # sum that ignores NAs but returns NA if all values are NA
sum.allna(c(NA,NA))
## [1] NA

You can read more about functions here.

Strings and factors

Strings

You can create strings with either single quotes or double quotes. Multiple strings are often stored in a character vector, which you can create with c().

Factors

In R, factors are used to work with categorical variables: variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. You can read more about factors here.

If you ever need to access the set of valid factor levels directly, you can do so with levels(). You can also reassign the levels of a factor with levels().
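
A small sketch of both uses of levels(), with a made-up factor called sizes:

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))
levels(sizes) # "small" "medium" "large": our order, not the alphabetical one
table(sizes)  # counts are reported in the order of the levels

levels(sizes) <- c("S", "M", "L") # rename the levels
as.character(sizes)               # "S" "L" "M" "S"
```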

When things don't woRk as expected

Most often, code breaks because of spelling and syntax errors (misspelled function and object names; parentheses and quotation marks that are not closed or are closed in the wrong place; capitalization errors; stray spaces in function and object names, etc.).

Trying to apply a function to an object of the wrong type is a major source of errors.

Functions with the same name residing in different packages can cause confusion (e.g. recode() in car and recode() in dplyr).

Supplying the correct arguments, but in the wrong order, is another common mistake in function calls.

Some solution strategies

Inspect your code for grammatical errors.

Read the documentation of the function that breaks the code.

Check that objects exist and have the expected type.

Isolate the problem by working step-by-step. Replicate the problem on a small subset of your data.

Google the text of the error message. Ask LLMs for help.
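
A sketch of some of these checks in code. The object names (my.data, values) are made up for illustration.

```r
exists("my.data") # FALSE if no object with this (hypothetical) name has been created

values <- c("1", "2", "4")
class(values) # "character", so functions that expect numbers will complain
str(values)   # compact overview of the structure of any object

# Run the failing step in isolation and capture the message instead of stopping
result <- tryCatch(
  sum(values),
  error = function(e) conditionMessage(e)
)
result # the text of the error message

mean(as.numeric(values)) # works after converting to the expected type
```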

How to get in touch?

demetriodor@gmail.com

http://dimiter.eu

@dtoshkov.bsky.social

@DToshkov

github.com/demetriodor

Dimiter Toshkov
