My name is written Димитър and is pronounced [diˈmitər].
I am not a programmer.
I have done a fair bit of data analysis, visualizations and some programming.
I like pictures.
unlike bread, it combines functional with object-oriented programming, and it is not sliced, but modular, so that it can be easily extended with new packages.
It is called R.
It comes with no customer support.
It regularly leads people into arguments about how good it is.
Error messages are rarely informative.
It is not made for data entry (though if you insist, look here).
2.1 with lots of support from the R community on blogs and Stack Overflow
2.2 adapting other people's code
2.3 and asking LLMs for help
R for Data Science, (link to a free version)
Advanced R, (link to a free version)
R Cookbook, (link to a free version)
The R Book, (link to a free version)
Get started and get inspired.
Get a good foundation, hopefully.
Learn enough so you can continue learning on your own.
plot
ggplot2
[1] Unless you have to work with the uninitiated.
We can work with R directly (from the console/terminal), but it would be nice if we could save our work somehow.
We can use any text editor and copy and paste the code, but this gets boring pretty quickly.
So we use programs such as RStudio, which integrate a text editor linked to R and some other nice features.
We write in a text file (script) and send commands with Ctrl+Enter to be executed in the console.
You can customize the appearance of code in RStudio.
Protip: Use an RStudio theme that highlights code (I use Pastel on Dark).
Protip: Turn on rainbow brackets.
Protip: RStudio has useful shortcuts (see them all with Alt+Shift+K). Learn and use some of them (e.g. Ctrl+1/Ctrl+2).
Functions have arguments in parentheses, separated by commas (unlike in Excel).
Capitalization of object and function names matters.
Spaces and indentation in your code do not matter (unlike in Python). But spaces do matter inside object and function names: a name cannot be split by a space.
You can use single or double quotation marks, also nested within each other. But be careful to close like with like.
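To make these rules concrete, here is a small sketch. The object names (score, Score, q1, q2) are made up for the illustration.

```r
# Names are case sensitive: these are two different objects
score <- 10
Score <- 20
score == Score            # FALSE

# Spaces around operators and arguments are ignored
a <- sum(1,2,3)
b <- sum( 1 , 2 , 3 )
identical(a, b)           # TRUE

# Quotes can be nested, as long as each pair closes like with like
q1 <- "She said 'hello'"
q2 <- 'She said "hello"'
nchar(q1) == nchar(q2)    # TRUE
```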
We can start a file, write code, execute the code, and - when we are happy - we can save the file (with an .R extension, but it remains a text file that can be edited in any text editor).
This might be all we need for very small, simple and individual projects. (Btw, where is our work?)
### Where are we?
getwd() # oh, here
setwd('C:/my projects/here') # better here
For more complex projects, you would want to start a Project. A project sets up the working environment and organizes things in a nice way. Within the project, you can (should) create separate folders for your code, input data, output data, plots, model results, and tables. The code itself can (should) be separated in smaller files (e.g. one for the libraries and functions that you use, one for data import and manipulation, one for statistical analyses, etc.). To start or continue working on a project, you click on the relevant .Rproj file, which loads the working environment.
Use relative, not absolute paths in your scripts to make collaborative work easier.
### There are good and bad paths
'./data/nl/zh/dh/students.csv' # this is a good path
'C:/data/nl/zh/dh/students.csv' # this is a bad path
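One possible way to build relative paths is file.path() from base R, which joins folder names with the right separator. This is only a sketch: the folder and file names are taken from the example above, and the file is not assumed to exist.

```r
# Build a path relative to the working directory (the project root)
students_path <- file.path("data", "nl", "zh", "dh", "students.csv")
students_path

# file.exists() is a cheap check before trying to read the file
file.exists(students_path)
```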
Some additional advice on setting up projects is available here.
We will say more about workflow (with GitHub and RMarkdown) later in this tutorial.
Take the time to annotate your code (using # to start a segment of a line that is not executed as code).
Think about the names of files and variables that you use. Have a system and be consistent. You can use ., or _, or capital letters, but stick to one.
### How (not) to name your variables
data.nl.denhaag.bezuidenhout # this is fine
data_nl_denhaag_bezuidenhout # this is also fine
DataNlDenhaagBezuidenhout # this is not so fine
data.Nl_DenHaag_.bezuidenhout # this is definitely not fine
Think about how you name your scripts and other files as well.
Protip: Use 00_libraries.R, 01_firstanalysis.R, etc. to name your scripts in the order that they should be executed, so you can quickly sort them within the folder alphabetically.
R works with packages.
The default installation comes with basic functionality.
For everything else, you install a package.
There are multiple packages that can achieve the same task.
There is a special universe of packages called tidyverse, developed by Hadley Wickham and company, which provides a convenient way to load, wrangle, analyze and visualize data. We will use these a lot.
Working with packages is easy:
First, you have to install the package: from a CRAN repository, from zip files, or via devtools. You can install with a command or from the RStudio menu. You install a package once on a computer (you might need to update every now and then).
Once the package is installed, you will want to load it with the library() function to use its functions. You have to load the package every session (if you need it, of course).
You can also directly specify functions from packages for use, e.g. dplyr::recode(). This is necessary because different functions in different packages can have the same name. This leads to confusion, both for R and for us.
### How to install and load a package
install.packages('dplyr')
library(dplyr)
Protip: If you work with people who would not know how to install a package but would want to run your code, you can start your code with a function that will install and load packages automatically (see here)
Protip: Don't do that with people who know their way around R. They don't like your script installing things without their authorization.
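A gentler middle ground is to check which packages are missing and only report them. The sketch below is one way to do it, not the function linked above; the package list in needed is hypothetical.

```r
needed <- c("stats", "dplyr", "ggplot2")   # hypothetical list for a project

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Report, but do not install anything without the user's consent
if (length(missing_pkgs) > 0) {
  message("Please install: ", paste(missing_pkgs, collapse = ", "))
}

# The package::function() form works without library()
stats::sd(c(1, 2, 3))
```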
Perhaps the most fundamental operation in R is to assign a value to a named object: object_name <- value. Be careful, R is sensitive; case sensitive, that is.
You can be old school[1] and assign values to names with <-. Or you can just use =. And if you are that cool, you can also use ->.
[1] "There is a general preference among the R community for using <- for assignment (other than in function signatures) for compatibility with (very) old versions of S-Plus."
### There are different ways to assign
best.month <- 'August'
best.date = 18
1978 -> best.year
best.date
## [1] 18
best.month
## [1] "August"
best.year
## [1] 1978
There are some subtle differences among the different assignment operators; if you are interested, read here.
You also have the assignment operator <<-. This is most useful 'in conjunction with closures to maintain state'. Exactly. If you want to know more, read here.
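If that quote sounds cryptic, a counter is the usual small illustration. The sketch below uses a made-up function, make_counter(), that returns another function which remembers how many times it has been called.

```r
make_counter <- function() {
  count <- 0
  function() {
    # <<- changes 'count' in the enclosing function's environment,
    # instead of creating a new local variable
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()   # 1
counter()   # 2
```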
Vectors are one-dimensional collections of objects.
### How to make and check a vector
v1 <- seq(1, 50, by=5)
v1
## [1] 1 6 11 16 21 26 31 36 41 46
v2 <- c('R', 'pie', 5, NA)
v2
## [1] "R" "pie" "5" NA
is.vector(v1)
## [1] TRUE
is.vector(v2)
## [1] TRUE
is.vector(c(is.vector(v1), is.vector(v2)))
## [1] TRUE
There are several different types of vectors: logical, character, numeric (which can be double or integer), complex and raw. Factors and dates are augmented vectors that have a special attribute, their 'class'.
### What vectors?
typeof(v1)
## [1] "double"
typeof(v2)
## [1] "character"
typeof(is.vector(c(is.vector(v1), is.vector(v2))))
## [1] "logical"
typeof(c("1", "2", "4"))
## [1] "character"
Protip: In R, numbers are 'doubles' by default. To make an 'integer', place an L after the number (e.g. 2L). This can save some trouble down the road. Alternatively, use round() when evaluating.
0.3/3 == 0.1 # floating point bizarro
## [1] FALSE
round(0.3/3,1) == 0.1
## [1] TRUE
unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
## [1] 0.3 0.3 0.3
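Besides round(), base R has all.equal(), which compares numbers up to a small tolerance. A minimal sketch of both points from the protip:

```r
typeof(2)     # "double"
typeof(2L)    # "integer"

x <- 0.3 / 3
x == 0.1                     # FALSE: exact comparison of doubles
isTRUE(all.equal(x, 0.1))    # TRUE: comparison with a tolerance
```

Wrapping all.equal() in isTRUE() is the safer habit, because all.equal() returns a description of the difference (not FALSE) when the values differ.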
Integers have one special value: NA, while doubles have four: NA, NaN, Inf and -Inf.
c(-1, 0, 1) / 0
## [1] -Inf NaN Inf
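Base R has helper functions to detect these special values. A short sketch (the vector z is made up for the illustration):

```r
z <- c(-1, 0, 1, NA) / 0
z               # -Inf NaN Inf NA

is.na(z)        # FALSE TRUE FALSE TRUE   (NaN also counts as missing)
is.nan(z)       # FALSE TRUE FALSE FALSE  (but NA is not NaN)
is.finite(z)    # FALSE FALSE FALSE FALSE
```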
You can coerce one type of vector to another. But be gentle and beware the consequences.
v1 <- c(1,2,4)
typeof(v1)
## [1] "double"
f1 <- as.factor(v1)
f1
## [1] 1 2 4
## Levels: 1 2 4
n1 <- as.numeric(f1)
is.numeric(n1)
## [1] TRUE
n1 # OMG!!!
## [1] 1 2 3
n2 <- as.numeric(as.character(f1))
n2 # that's better!
## [1] 1 2 4
Coercion also happens without your help (and perhaps without you noticing), every time you mix elements of different types in one vector. The most complex type prevails.
v1 <- seq(1:999)
is.numeric(v1)
## [1] TRUE
length(v1) # vectors have length
## [1] 999
v2 <- c(v1, '1000')
is.numeric(v2)
## [1] FALSE
typeof(v2) # it only takes one
## [1] "character"
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
Working with strings and character vectors is very common in data analysis. There are a couple of very useful operations with strings that we should learn right away:
v.char <- c("alpha", "beta", "gama")
substr(v.char, 1, 2) # get the first two letters of every element
## [1] "al" "be" "ga"
nchar(v.char) # count the number of characters in each string
## [1] 5 4 4
toupper(v.char) # capitalize each element
## [1] "ALPHA" "BETA" "GAMA"
The package stringr has handy functions for more advanced operations.
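Before reaching for stringr, a few more base R functions cover common needs: pasting, pattern matching, replacing and splitting. The sketch below sticks to base R on purpose, so it runs without installing anything.

```r
v.char <- c("alpha", "beta", "gama")

paste(v.char, collapse = ", ")    # "alpha, beta, gama"
paste0(v.char, "_", 1:3)          # "alpha_1" "beta_2" "gama_3"
grepl("a$", v.char)               # TRUE TRUE TRUE (all end in "a")
sub("gama", "gamma", v.char)      # "alpha" "beta" "gamma"
strsplit("a-b-c", "-")[[1]]       # "a" "b" "c"
```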
Lists, also called recursive vectors, can contain all kinds of things, including other lists.
y <- list("a", 1L, 1.5, TRUE)
Data frames are lists of a special class:
typeof(data.frame(NA))
class(data.frame(NA))
There are different ways in which we can navigate to and access elements of our objects. We can do that by position or name:
x <- rnorm(100, 0, 1) # let's generate some randomness
y <- rnorm(100, 0, 1) # let's generate more randomness
m <- cbind(x,y) # let's bind randomness together in a ....
class(m)
## [1] "matrix" "array"
dim(m)
## [1] 100 2
length(x)
## [1] 100
df <- data.frame(m)
x[1]
## [1] 1.114262
m[1,1]
##        x
## 1.114262
m[1:5, -2]
## [1] 1.1142618 -1.3311627 0.3256704 -0.3040167 -1.7666893
df[seq(1, 100, 10), "y"]
## [1] 0.92370516 -0.05881295 -0.20430290 -0.73554277 -0.69460705 0.38450683
## [7] 0.37186241 0.31955549 0.34642647 0.46685266
Navigating lists is more complicated. We have to use x[[ n ]] to get the n-th element of list x. That element itself could be anything (e.g. a data frame).
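The difference between single and double brackets is easiest to see on a small example. The list lst below is made up for the illustration.

```r
lst <- list(
  name    = "R",
  numbers = 1:3,
  df      = data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
)

class(lst[2])       # "list": single brackets return a smaller list
class(lst[[2]])     # "integer": double brackets return the element itself
lst$numbers[2]      # 2: named elements can also be reached with $
lst[["df"]]$b[1]    # "x": drilling into a data frame stored in the list
```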
First steps are easy
mean(x)
## [1] 0.05794814
sd(y)
## [1] 1.00284
quantile(m)
##          0%         25%         50%         75%        100%
## -3.68630633 -0.71101087 -0.02760762  0.57849092  2.58010017
range(df)
## [1] -3.686306 2.580100
summary(df)
##        x                  y
##  Min.   :-2.01945   Min.   :-3.686306
##  1st Qu.:-0.75151   1st Qu.:-0.672038
##  Median :-0.07559   Median : 0.005507
##  Mean   : 0.05795   Mean   :-0.079793
##  3rd Qu.: 0.76700   3rd Qu.: 0.449224
##  Max.   : 2.58010   Max.   : 2.385434
Use $ to access columns (variables) of a data frame.
df <- rbind(df, c(NA,NA)) # bind rows together
tail(df)
##               x            y
## 96   2.40618546  0.594138446
## 97   0.06715917  0.276341899
## 98  -0.39137443  1.235099081
## 99  -0.34462524  0.425646241
## 100  0.56221336 -0.003453708
## 101          NA           NA
sum(df$x) # oops
## [1] NA
sum(df$x, na.rm=TRUE) # ok, R is very careful with missing data
## [1] 5.794814
sum(df)
## [1] NA
df$z <- rowSums(df)
tail(df, 2)
##             x            y         z
## 100 0.5622134 -0.003453708 0.5587597
## 101        NA           NA        NA
df$z <- rowSums(df, na.rm=TRUE) # careful: df already contains z, so the old z is added to the new sum
tail(df, 2)
##             x            y        z
## 100 0.5622134 -0.003453708 1.117519
## 101        NA           NA 0.000000
Loops are a fundamental programming technique, in which we iterate over a predefined sequence and apply a function to each element.
for (i in 1:5){
  print(round(df[i,], 2))
}
##      x    y    z
## 1 1.11 0.92 4.08
##       x     y     z
## 2 -1.33 -2.71 -8.08
##      x   y    z
## 3 0.33 0.5 1.65
##      x    y    z
## 4 -0.3 0.37 0.14
##       x     y     z
## 5 -1.77 -1.51 -6.55
Most R functions are vectorized, which means that we do not have to loop over the elements of a vector to apply the function to each element separately. Yet, in some cases loops can be handy.
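To see what vectorization buys us, here is the same computation done twice: once with a loop and once with a single vectorized call. The vector and object names are made up for the illustration.

```r
x <- c(1, 4, 9, 16)

# The loop way: fill a pre-allocated vector element by element
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# The vectorized way: sqrt() works on the whole vector at once
roots_vec <- sqrt(x)

identical(roots_loop, roots_vec)   # TRUE
```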
We can also create new objects in loops:
for (i in 1:dim(df)[1]){
  df$our.sum[i] <- sum(df[i,1:2], na.rm=TRUE)
}
df[c(1,100:101),]
##             x            y        z   our.sum
## 1   1.1142618  0.923705162 4.075934 2.0379670
## 100 0.5622134 -0.003453708 1.117519 0.5587597
## 101        NA           NA 0.000000 0.0000000
You can read more about loops here.
Sooner or later, we all become judgmental:
These are the main comparison operators: >, >=, <, <=, != (not equal), and == (equal).
With logical operators, we can mix things up a bit: & is “and”, | is “or”, and ! is “not”.
Be careful with missing values: almost any operation involving an unknown value will also be unknown.
We can check for missing data: is.na(x) or even better which(is.na(df$x)).
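The "almost" above matters: when the answer does not depend on the unknown value, R still gives it. A small sketch (the vector x here is made up and separate from the data above):

```r
NA > 1        # NA: we cannot know
NA & FALSE    # FALSE: false whatever the missing value is
NA | TRUE     # TRUE: true whatever the missing value is

x <- c(2, NA, 0.5)
x > 1                   # TRUE NA FALSE
which(x > 1)            # 1: which() silently drops the NA
x[!is.na(x) & x > 1]    # 2: combine the checks to subset safely
```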
which(df$x > 1)
## [1] 1 6 8 11 13 15 28 30 31 37 39 42 47 48 69 72 75 76 92 96
w1 <- which(df$x > 1)
length(w1)
## [1] 20
w2 <- which(df$y>1)
length(w2)
## [1] 11
Check whether the last row of df has elements greater than 1.
Conditional evaluation is another fundamental programming technique.
if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  #
}
For very short evaluations we can also use the ifelse one-liner: ifelse(evaluate, do.this.if.true, do.this.if.false). These simple statements can be nested, but it is better to use the extensive form shown above.
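Unlike if, ifelse() is vectorized: it evaluates the condition for every element and returns a vector of the same length. A minimal sketch with made-up values:

```r
x <- c(-2, 0, 3, NA)
sign_label <- ifelse(x > 0, "positive", "not positive")
sign_label
# "not positive" "not positive" "positive" NA
```

Note that a missing value in the condition gives a missing value in the result.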
for (i in 1:length(df$x)){
  if (is.na(df$x[i]) == FALSE & is.na(df$y[i]) == FALSE) {
    df$out.sum2[i] <- sum(df[i,1:2])
  } else {
    df$out.sum2[i] <- NA
  }
}
Objects are stuff with names and values. Functions do things to objects.
In R you can easily write your own functions. Just give them a name and tell them what to do.
sum.na <- function (x) {sum(x, na.rm=T)} # sum that avoids NAs
sum.na(c(3,5,NA))
## [1] 8
sum.allna <- function (x) {if (all(is.na(x))) NA else sum(x, na.rm=T)} # sum that avoids NAs but returns NA if all NAs
sum.allna(c(NA,NA))
## [1] NA
You can read more about functions here.
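Functions can also have several arguments with default values. The sketch below uses a made-up function, rescale(), to show defaults and how naming an argument lets you skip the ones in between.

```r
# Rescale a numeric vector to the interval [lower, upper]
rescale <- function(x, lower = 0, upper = 1) {
  rng <- range(x, na.rm = TRUE)
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

rescale(c(1, 2, 3))                # 0.0 0.5 1.0 (defaults used)
rescale(c(1, 2, 3), upper = 100)   # 0 50 100 (argument matched by name)
```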
You can create strings with either single quotes or double quotes.
Multiple strings are often stored in a character vector, which you can create with c().
In R, factors are used to work with categorical variables: variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. You can read more about factors here.
If you ever need to access the set of valid factor levels directly, you can do so with levels(). You can also reassign the levels of a factor with levels().
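A short sketch of both uses of levels(), with made-up clothing sizes. Setting the levels explicitly is what gives the non-alphabetical order.

```r
sizes <- c("small", "large", "medium", "small")

# Levels given in a meaningful (not alphabetical) order
f <- factor(sizes, levels = c("small", "medium", "large"))
levels(f)        # "small" "medium" "large"
as.integer(f)    # 1 3 2 1: the codes follow the level order
sort(f)          # small small medium large

# Reassign the levels: the labels change, the codes stay the same
levels(f) <- c("S", "M", "L")
as.character(f)  # "S" "L" "M" "S"
```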
Most often, code breaks because of spelling and punctuation errors (misspelled function and object names; parentheses and quotation marks that are not closed or are closed at the wrong place; capitalization errors; stray spaces in function and object names, etc.).
Trying to apply a function to an object of the wrong type is a major source of errors.
Functions with the same name residing in different packages can cause confusion (e.g. recode() in car and recode() in dplyr).
Having the correct arguments, but in the wrong place in function calls.
Inspect your code for grammatical errors.
Read the documentation of the function that breaks the code.
Check that objects exist and have the expected type.
Isolate the problem by working step-by-step. Replicate the problem on a small subset of your data.
Google the text of the error message. Ask LLMs for help.
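Some of these checks can be done with a few base R functions. The sketch below is one possible routine, using a made-up vector val that has the wrong type for sum().

```r
val <- c("1", "2", "4")    # numbers stored as text, a classic trap

exists("val")    # TRUE: the object is there
class(val)       # "character": but not the type we expected
str(val)         # compact overview of the object

# tryCatch() lets us capture the error message instead of stopping
result <- tryCatch(sum(val), error = function(e) conditionMessage(e))
result

# Once the type is fixed, the function works
sum(as.numeric(val))    # 7
```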
demetriodor@gmail.com
@dtoshkov.bsky.social
@DToshkov
Dimiter Toshkov