class: top, left, inverse, title-slide .title[ # R for Data Analysis: A Short Tutorial ] .subtitle[ ## Session 2: Data Wrangling ] .author[ ### Dimiter Toshkov ] .institute[ ### Institute of Public Administration, Leiden University ] .date[ ### last updated: 2025-03-31 ] --- <style type="text/css"> .title-slide { background-image: url(https://cran.r-project.org/Rlogo.svg); background-position: 50% 0%; ## just start changing this background-size: 150px; background-color: #fff; padding-left: 100px; /* delete this for 4:3 aspect ratio */ } .remark-slide-content { font-size: 28px; padding: 1em 1em 1em 1em; } .remark-slide-content > h1 { font-size: 32px; margin-top: -85px; } </style> # Last session... -- <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M505 174.8l-39.6-39.6c-9.4-9.4-24.6-9.4-33.9 0L192 374.7 80.6 263.2c-9.4-9.4-24.6-9.4-33.9 0L7 302.9c-9.4 9.4-9.4 24.6 0 34L175 505c9.4 9.4 24.6 9.4 33.9 0l296-296.2c9.4-9.5 9.4-24.7.1-34zm-324.3 106c6.2 6.3 16.4 6.3 22.6 0l208-208.2c6.2-6.3 6.2-16.4 0-22.6L366.1 4.7c-6.2-6.3-16.4-6.3-22.6 0L192 156.2l-55.4-55.5c-6.2-6.3-16.4-6.3-22.6 0L68.7 146c-6.2 6.3-6.2 16.4 0 22.6l112 112.2z"></path></svg> We introduced R as a programming language. -- <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M505 174.8l-39.6-39.6c-9.4-9.4-24.6-9.4-33.9 0L192 374.7 80.6 263.2c-9.4-9.4-24.6-9.4-33.9 0L7 302.9c-9.4 9.4-9.4 24.6 0 34L175 505c9.4 9.4 24.6 9.4 33.9 0l296-296.2c9.4-9.5 9.4-24.7.1-34zm-324.3 106c6.2 6.3 16.4 6.3 22.6 0l208-208.2c6.2-6.3 6.2-16.4 0-22.6L366.1 4.7c-6.2-6.3-16.4-6.3-22.6 0L192 156.2l-55.4-55.5c-6.2-6.3-16.4-6.3-22.6 0L68.7 146c-6.2 6.3-6.2 16.4 0 22.6l112 112.2z"></path></svg> We learned how to set up projects and reviewed some good practices for our workflow. 
-- <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M505 174.8l-39.6-39.6c-9.4-9.4-24.6-9.4-33.9 0L192 374.7 80.6 263.2c-9.4-9.4-24.6-9.4-33.9 0L7 302.9c-9.4 9.4-9.4 24.6 0 34L175 505c9.4 9.4 24.6 9.4 33.9 0l296-296.2c9.4-9.5 9.4-24.7.1-34zm-324.3 106c6.2 6.3 16.4 6.3 22.6 0l208-208.2c6.2-6.3 6.2-16.4 0-22.6L366.1 4.7c-6.2-6.3-16.4-6.3-22.6 0L192 156.2l-55.4-55.5c-6.2-6.3-16.4-6.3-22.6 0L68.7 146c-6.2 6.3-6.2 16.4 0 22.6l112 112.2z"></path></svg> We learned about some of the fundamental features of R: data types (vectors, lists, etc.), assignment operators, indexing, evaluation, loops and functions. -- <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M500.5 231.4l-192-160C287.9 54.3 256 68.6 256 96v320c0 27.4 31.9 41.8 52.5 24.6l192-160c15.3-12.8 15.3-36.4 0-49.2zm-256 0l-192-160C31.9 54.3 0 68.6 0 96v320c0 27.4 31.9 41.8 52.5 24.6l192-160c15.3-12.8 15.3-36.4 0-49.2z"></path></svg> Today we get our hands dirty with wrangling some real data! --- class: center, inverse, top background-image: url("data:image/png;base64,#figs/bigstock-197264191-1.jpg") background-size: contain --- # Importing data <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M448 73.143v45.714C448 159.143 347.667 192 224 192S0 159.143 0 118.857V73.143C0 32.857 100.333 0 224 0s224 32.857 224 73.143zM448 176v102.857C448 319.143 347.667 352 224 352S0 319.143 0 278.857V176c48.125 33.143 136.208 48.572 224 48.572S399.874 209.143 448 176zm0 160v102.857C448 479.143 347.667 512 224 512S0 479.143 0 438.857V336c48.125 33.143 136.208 48.572 224 48.572S399.874 369.143 448 336z"></path></svg> Let's get some data from the European Social Survey! You will have to register and download the data file first. 
``` r library(haven) # adapt the path to the file if needed df <- read_sav("./data/ESS11.sav") # this is a tibble, a fancy type of data frame dt <- as.data.frame(df) # we can make it a plain data frame ``` --- # There are different ways of getting data into R (1) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M16 288c-8.8 0-16 7.2-16 16v32c0 8.8 7.2 16 16 16h112v-64zm489-183L407.1 7c-4.5-4.5-10.6-7-17-7H384v128h128v-6.1c0-6.3-2.5-12.4-7-16.9zm-153 31V0H152c-13.3 0-24 10.7-24 24v264h128v-65.2c0-14.3 17.3-21.4 27.4-11.3L379 308c6.6 6.7 6.6 17.4 0 24l-95.7 96.4c-10.1 10.1-27.4 3-27.4-11.3V352H128v136c0 13.3 10.7 24 24 24h336c13.3 0 24-10.7 24-24V160H376c-13.2 0-24-10.8-24-24z"></path></svg> In `base R`, you can use `read.table('filename.txt', header = TRUE, sep = '\t', dec = ',', as.is = TRUE)` for delimited text files. Its wrappers `read.csv()` and `read.delim()` come with defaults for comma-separated and tab-separated data, respectively. A more general function is `scan()`, which you're unlikely to use for rectangular data files. <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M496 448H16c-8.84 0-16 7.16-16 16v32c0 8.84 7.16 16 16 16h480c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zm-304-64l-64-32 64-32 32-64 32 64 64 32-64 32-16 32h208l-86.41-201.63a63.955 63.955 0 0 1-1.89-45.45L416 0 228.42 107.19a127.989 127.989 0 0 0-53.46 59.15L64 416h144l-16-32zm64-224l16-32 16 32 32 16-32 16-16 32-16-32-32-16 32-16z"></path></svg> **Protip:** If you encounter problems with encoding (quite likely, if you work with different languages), use the `guess_encoding()` function from the `readr` package and then set the encoding option of the data import function (for example, `fileEncoding` in `read.table()`).
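A minimal sketch of this encoding workflow, assuming a small hypothetical Latin-1 file created on the fly (the file, its contents and the object names are made up for illustration):

```r
library(readr)

# Hypothetical example file: a small CSV converted to Latin-1 bytes
txt <- iconv(c("city,country", "Z\u00fcrich,Schweiz", "Malm\u00f6,Sverige"),
             from = "UTF-8", to = "latin1")
tmp <- tempfile(fileext = ".csv")
writeLines(txt, tmp, useBytes = TRUE)

# Returns a tibble of candidate encodings with confidence scores
guess_encoding(tmp)

# Pass the encoding you settle on to the import function;
# in base R readers the relevant argument is fileEncoding
d.enc <- read.csv(tmp, fileEncoding = "latin1")
```

With the `readr` import functions, the corresponding setting is `locale = locale(encoding = "latin1")`. On very short files the guesses can be unreliable, so it is worth checking a few non-ASCII values after import.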
--- # There are different ways of getting data into R (2) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M16 288c-8.8 0-16 7.2-16 16v32c0 8.8 7.2 16 16 16h112v-64zm489-183L407.1 7c-4.5-4.5-10.6-7-17-7H384v128h128v-6.1c0-6.3-2.5-12.4-7-16.9zm-153 31V0H152c-13.3 0-24 10.7-24 24v264h128v-65.2c0-14.3 17.3-21.4 27.4-11.3L379 308c6.6 6.7 6.6 17.4 0 24l-95.7 96.4c-10.1 10.1-27.4 3-27.4-11.3V352H128v136c0 13.3 10.7 24 24 24h336c13.3 0 24-10.7 24-24V160H376c-13.2 0-24-10.8-24-24z"></path></svg> For other file types, you have specialized packages: - for Excel files, you have `readxl` - for Stata, you have `foreign`, `haven` and `readstata13` - for SPSS, you have `haven` and `memisc` - for really big files, use `data.table::fread()` - you can also use the package `rio` that chooses the package for you (yes, that's quite meta) For a longer overview of options for different file types [see here](https://cran.r-project.org/web/packages/rio/vignettes/rio.html) and [read here](https://www.datacamp.com/community/tutorials/r-tutorial-read-excel-into-r) for more details on data import.
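As a rough illustration of the `rio` approach, here is a small round trip through a temporary file. The file and object names are hypothetical, and `mtcars` merely stands in for real data:

```r
library(rio)

# Hypothetical temporary file; rio infers the format from the extension
tmp.csv <- tempfile(fileext = ".csv")
export(head(mtcars), tmp.csv)

# import() picks a reader based on the extension and returns a data frame
d.in <- import(tmp.csv)
dim(d.in)
```

The same `import()` call should work for other supported extensions, such as `.sav`, `.dta` or `.xlsx`, provided the packages that `rio` relies on for those formats are installed.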
--- # Once we have the data imported <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M505 442.7L405.3 343c-4.5-4.5-10.6-7-17-7H372c27.6-35.3 44-79.7 44-128C416 93.1 322.9 0 208 0S0 93.1 0 208s93.1 208 208 208c48.3 0 92.7-16.4 128-44v16.3c0 6.4 2.5 12.5 7 17l99.7 99.7c9.4 9.4 24.6 9.4 33.9 0l28.3-28.3c9.4-9.4 9.4-24.6.1-34zM208 336c-70.7 0-128-57.2-128-128 0-70.7 57.2-128 128-128 70.7 0 128 57.2 128 128 0 70.7-57.2 128-128 128z"></path></svg> First impressions are important: - `dim(dt)` and `class(dt)` - `View(df)` to inspect the whole thing - `head(dt)`, `tail(dt)` or just `dt[sample(5),1:5]` - `summary(dt)` or `tibble::glimpse(df)` - `names(dt)` or `colnames(dt)` <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M496 448H16c-8.84 0-16 7.16-16 16v32c0 8.84 7.16 16 16 16h480c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zm-304-64l-64-32 64-32 32-64 32 64 64 32-64 32-16 32h208l-86.41-201.63a63.955 63.955 0 0 1-1.89-45.45L416 0 228.42 107.19a127.989 127.989 0 0 0-53.46 59.15L64 416h144l-16-32zm64-224l16-32 16 32 32 16-32 16-16 32-16-32-32-16 32-16z"></path></svg> **Protip:** `names()` and `colnames()` are equivalent for data frames, but not for matrices and vectors. --- # Check out the variable labels When we import an SPSS file (`sav`), we can inspect labels and value codes. ``` r library(labelled) attributes(dt$trstun) ## $label ## [1] "Trust in the United Nations" ## ## $format.spss ## [1] "F2.0" ## ## $class ## [1] "haven_labelled" "vctrs_vctr" "double" ## ## $labels ## No trust at all 1 2 3 4 ## 0 1 2 3 4 ## 5 6 7 8 9 ## 5 6 7 8 9 ## Complete trust Refusal Don't know No answer ## 10 77 88 99 ``` --- # Working with the labels We can get the wording of the variables (questions) with *label* and we can get the labels of the answer categories with *labels*. 
They are both **attributes** of the variable. ``` r lapply(dt, function(x) attributes(x)$label)[1:10] lapply(dt, function(x) attributes(x)$labels)[1:5] ``` --- # There is always another way There is an alternative way to achieve the same thing with the `labelled` package. Note that this will only work on data that is imported with `haven` (and has labels already coded). ``` r var_label(dt) # variable labels var_label(dt$gndr) var_label(dt$cntry) <- "country" # can be reassigned look_for(dt, 'internet', details=TRUE) # search in variable and value names val_labels(dt$stfdem) # inspect the value labels ``` --- # Zoom in on some variables <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M304 192v32c0 6.6-5.4 12-12 12h-56v56c0 6.6-5.4 12-12 12h-32c-6.6 0-12-5.4-12-12v-56h-56c-6.6 0-12-5.4-12-12v-32c0-6.6 5.4-12 12-12h56v-56c0-6.6 5.4-12 12-12h32c6.6 0 12 5.4 12 12v56h56c6.6 0 12 5.4 12 12zm201 284.7L476.7 505c-9.4 9.4-24.6 9.4-33.9 0L343 405.3c-4.5-4.5-7-10.6-7-17V372c-35.3 27.6-79.7 44-128 44C93.1 416 0 322.9 0 208S93.1 0 208 0s208 93.1 208 208c0 48.3-16.4 92.7-44 128h16.3c6.4 0 12.5 2.5 17 7l99.7 99.7c9.3 9.4 9.3 24.6 0 34zM344 208c0-75.2-60.8-136-136-136S72 132.8 72 208s60.8 136 136 136 136-60.8 136-136z"></path></svg> We can also inspect individual variables: ``` r table(dt$vote) # this is good for factor-like variables ## ## 1 2 3 ## 29551 7539 2706 prop.table(table(dt$vote)) # turn frequencies into proportions ## ## 1 2 3 ## 0.74256207 0.18944115 0.06799678 summary(dt$polintr) # for continuous variables, use summary ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
NA's ## 1.000 2.000 3.000 2.626 3.000 4.000 56 ``` --- # Transforming variables (1) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M0 168v-16c0-13.255 10.745-24 24-24h360V80c0-21.367 25.899-32.042 40.971-16.971l80 80c9.372 9.373 9.372 24.569 0 33.941l-80 80C409.956 271.982 384 261.456 384 240v-48H24c-13.255 0-24-10.745-24-24zm488 152H128v-48c0-21.314-25.862-32.08-40.971-16.971l-80 80c-9.372 9.373-9.372 24.569 0 33.941l80 80C102.057 463.997 128 453.437 128 432v-48h360c13.255 0 24-10.745 24-24v-16c0-13.255-10.745-24-24-24z"></path></svg> Let's make a real factor ``` r summary(dt$new.vote <- factor(dt$vote)) ## 1 2 3 NA's ## 29551 7539 2706 360 levels(dt$new.vote) <- c('voted', 'no vote', 'dont know') # recode levels summary(dt$new.vote) ## voted no vote dont know NA's ## 29551 7539 2706 360 ``` --- # Transforming variables (2) You can use `cut()` to split a continuous variable into categories ``` r dt$satdem.cat <- cut(dt$stfdem, # which variable breaks = c(-Inf, 3, 7, Inf), # interval break points, incl. 
start and end labels=c("low","medium","high")) # labels of the new categories table(dt$satdem.cat) ## ## low medium high ## 10109 20589 8214 ``` --- # Subsetting (1) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M487.976 0H24.028C2.71 0-8.047 25.866 7.058 40.971L192 225.941V432c0 7.831 3.821 15.17 10.237 19.662l80 55.98C298.02 518.69 320 507.493 320 487.98V225.941l184.947-184.97C520.021 25.896 509.338 0 487.976 0z"></path></svg> When we want to subset from a dataset, remember all the ways in which we can index rows and columns: ``` r dt.subset.1 <- dt [1:10, c('cntry','vote','polintr')] dt.subset.2 <- dt [seq(1, 101, by=10), 1:5] dt.subset.3 <- dt [dt$vote==1, -c(1:5)] dt.subset.nonas <- dt [is.na(dt$vote)==FALSE & is.na(dt$polintr)==FALSE, 1:5] ``` --- # Subsetting (2) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M487.976 0H24.028C2.71 0-8.047 25.866 7.058 40.971L192 225.941V432c0 7.831 3.821 15.17 10.237 19.662l80 55.98C298.02 518.69 320 507.493 320 487.98V225.941l184.947-184.97C520.021 25.896 509.338 0 487.976 0z"></path></svg> You can also use `subset()`, which saves on syntax but is slower and more error-prone ``` r dt.subset.4 <- subset (dt, vote==2 | polintr > median(polintr, na.rm=TRUE)) ``` --- # Getting rid of missing values <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M496 448H16c-8.84 0-16 7.16-16 16v32c0 8.84 7.16 16 16 16h480c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zm-304-64l-64-32 64-32 32-64 32 64 64 32-64 32-16 32h208l-86.41-201.63a63.955 63.955 0 0 1-1.89-45.45L416 0 228.42 107.19a127.989 127.989 0 0 0-53.46 59.15L64 416h144l-16-32zm64-224l16-32 16 32 32 16-32 16-16 32-16-32-32-16 32-16z"></path></svg> **Protip:** Use `complete.cases ()` to remove rows with *any* 
missing value ``` r sum(is.na(dt$vote)) # count NAs of individual columns ## [1] 360 dt.subset.complete <- dt[complete.cases(dt), ] dim(dt.subset.complete) # oops: no row is complete on all 642 variables ## [1] 0 642 ``` --- class: center, inverse, top background-image: url("data:image/png;base64,#figs/wrangle.jpg") background-size: contain --- # Now let's do things the *tidy* way <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M432 416H16a16 16 0 0 0-16 16v32a16 16 0 0 0 16 16h416a16 16 0 0 0 16-16v-32a16 16 0 0 0-16-16zm0-128H16a16 16 0 0 0-16 16v32a16 16 0 0 0 16 16h416a16 16 0 0 0 16-16v-32a16 16 0 0 0-16-16zm0-128H16a16 16 0 0 0-16 16v32a16 16 0 0 0 16 16h416a16 16 0 0 0 16-16v-32a16 16 0 0 0-16-16zm0-128H16A16 16 0 0 0 0 48v32a16 16 0 0 0 16 16h416a16 16 0 0 0 16-16V48a16 16 0 0 0-16-16z"></path></svg> So far we used (mostly) functions from `base R` for our data wrangling business. But the `tidyverse` offers a nice, consistent framework for most of our data wrangling needs, so let's move to the `tidyverse`: ``` r library (dplyr) # most of the tidy data wrangling functions are in this package library (tidyverse) # which is part of this collection ``` Also, our working environment is getting quite messy. 
Let's remove objects we don't need to free up memory: ``` r rm (dt.subset.1, dt.subset.2, dt.subset.3, dt.subset.4, dt.subset.complete, dt.subset.nonas) # rm (list = ls()) # to remove all objects ``` --- # Subsetting (the tidy way) To start with, let's see the tidy ways of subsetting: ``` r # choose columns dt.subset.1 <- select (dt, cntry, vote, polintr, stfdem) # choose rows dt.subset.1 <- filter (dt.subset.1, vote == 1, polintr > 2, stfdem != 0) # choose rows by ordinal position dt.subset.1 <- slice (dt.subset.1, 1:10) ``` --- # The need for a pipe (1) <svg viewBox="0 0 640 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M260.8 291.06c-28.63-22.94-62-35.06-96.4-35.06C87 256 21.47 318.72 1.43 412.06c-3.55 16.6-.43 33.83 8.57 47.3C18.75 472.47 31.83 480 45.88 480H592c-103.21 0-155-37.07-233.19-104.46zm234.65-18.29L468.4 116.2A64 64 0 0 0 392 64.41L200.85 105a64 64 0 0 0-50.35 55.79L143.61 226c6.9-.83 13.7-2 20.79-2 41.79 0 82 14.55 117.29 42.82l98 84.48C450.76 412.54 494.9 448 592 448a48 48 0 0 0 48-48c0-25.39-29.6-119.33-144.55-127.23z"></path></svg> Normally, when we use many functions, we nest them inside each other. For example, this is how it looks when we want to: .pull-left[ 1. Coerce a vector into a character vector 2. Then coerce it into a numeric vector 3. Then take the mean 4. Then take the square root of the mean 5. Then round the number to 2 digits ] .pull-right[ ``` r v = dt[1:100, "stfdem"] round(sqrt( mean( as.numeric( as.character(v) ), na.rm=TRUE) ), digits=2) ## [1] 2.4 ``` ] --- # The need for a pipe (2) Not very easy to read, is it? And it usually looks even worse: ``` r round(sqrt(mean(as.numeric(as.character(v)), na.rm=TRUE)), digits=2) ## [1] 2.4 ``` Of course, we can do the operations one by one, but then we will have to reassign the object every step of the way, which is tedious. 
``` r v = dt[1:100, "stfdem"] v = as.numeric(as.character(v)) m = sqrt(mean (v, na.rm=T)) round(m, digits = 2) ## [1] 2.4 ``` --- # Meet the pipe <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M248 8C111 8 0 119 0 256s111 248 248 248 248-111 248-248S385 8 248 8zm0 448c-110.3 0-200-89.7-200-200S137.7 56 248 56s200 89.7 200 200-89.7 200-200 200zm84-143.4c-20.8 25-51.5 39.4-84 39.4s-63.2-14.3-84-39.4c-8.5-10.2-23.6-11.5-33.8-3.1-10.2 8.5-11.5 23.6-3.1 33.8 30 36 74.1 56.6 120.9 56.6s90.9-20.6 120.9-56.6c8.5-10.2 7.1-25.3-3.1-33.8-10.2-8.4-25.3-7.1-33.8 3.1zM136.5 211c7.7-13.7 19.2-21.6 31.5-21.6s23.8 7.9 31.5 21.6l9.5 17c2.1 3.7 6.2 4.7 9.3 3.7 3.6-1.1 6-4.5 5.7-8.3-3.3-42.1-32.2-71.4-56-71.4s-52.7 29.3-56 71.4c-.3 3.7 2.1 7.2 5.7 8.3 3.4 1.1 7.4-.5 9.3-3.7l9.5-17zM328 152c-23.8 0-52.7 29.3-56 71.4-.3 3.7 2.1 7.2 5.7 8.3 3.5 1.1 7.4-.5 9.3-3.7l9.5-17c7.7-13.7 19.2-21.6 31.5-21.6s23.8 7.9 31.5 21.6l9.5 17c2.1 3.7 6.2 4.7 9.3 3.7 3.6-1.1 6-4.5 5.7-8.3-3.3-42.1-32.2-71.4-56-71.4z"></path></svg> Wouldn't it be nice if we could write the operations in the order that they have to be executed, without having to reassign objects all the time? 
<svg viewBox="0 0 352 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M176 80c-52.94 0-96 43.06-96 96 0 8.84 7.16 16 16 16s16-7.16 16-16c0-35.3 28.72-64 64-64 8.84 0 16-7.16 16-16s-7.16-16-16-16zM96.06 459.17c0 3.15.93 6.22 2.68 8.84l24.51 36.84c2.97 4.46 7.97 7.14 13.32 7.14h78.85c5.36 0 10.36-2.68 13.32-7.14l24.51-36.84c1.74-2.62 2.67-5.7 2.68-8.84l.05-43.18H96.02l.04 43.18zM176 0C73.72 0 0 82.97 0 176c0 44.37 16.45 84.85 43.56 115.78 16.64 18.99 42.74 58.8 52.42 92.16v.06h48v-.12c-.01-4.77-.72-9.51-2.15-14.07-5.59-17.81-22.82-64.77-62.17-109.67-20.54-23.43-31.52-53.15-31.61-84.14-.2-73.64 59.67-128 127.95-128 70.58 0 128 57.42 128 128 0 30.97-11.24 60.85-31.65 84.14-39.11 44.61-56.42 91.47-62.1 109.46a47.507 47.507 0 0 0-2.22 14.3v.1h48v-.05c9.68-33.37 35.78-73.18 52.42-92.16C335.55 260.85 352 220.37 352 176 352 78.8 273.2 0 176 0z"></path></svg> It would! And we will! But first we need a new special operator, the pipe `%>%`. <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M496 448H16c-8.84 0-16 7.16-16 16v32c0 8.84 7.16 16 16 16h480c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zm-304-64l-64-32 64-32 32-64 32 64 64 32-64 32-16 32h208l-86.41-201.63a63.955 63.955 0 0 1-1.89-45.45L416 0 228.42 107.19a127.989 127.989 0 0 0-53.46 59.15L64 416h144l-16-32zm64-224l16-32 16 32 32 16-32 16-16 32-16-32-32-16 32-16z"></path></svg> **Protip:** You can use `CTRL+SHIFT+m` or `CMD+SHIFT+m` to type the pipe in **R Studio**. --- # Using the pipe Let's see the pipe in action: ``` r v = dt[1:100, "stfdem"] v %>% as.character() %>% as.numeric() %>% mean(na.rm=TRUE) %>% sqrt() %>% round(digits=2) ## [1] 2.4 ``` Tidy, indeed! For a short tutorial on using the pipe, [read here](https://www.datacamp.com/community/tutorials/pipe-r-tutorial). 
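To see that piping and nesting give the same answer, here is a small self-contained sketch on a hypothetical toy vector (the object names are made up; `magrittr` is loaded directly only to keep the example independent of the rest of the session):

```r
library(magrittr)  # provides %>%; it is also made available by dplyr

# Hypothetical toy vector of numbers stored as text, with one missing value
x <- c("4", "9", "16", NA)

# Nested version: read from the inside out
nested <- round(sqrt(mean(as.numeric(x), na.rm = TRUE)), digits = 2)

# Piped version: read from left to right, top to bottom
piped <- x %>%
  as.numeric() %>%
  mean(na.rm = TRUE) %>%
  sqrt() %>%
  round(digits = 2)

identical(nested, piped)
```

Each step in the piped version receives the result of the previous step as its first argument, which is why `mean(na.rm = TRUE)` needs no explicit data argument.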
--- # More on pipes Essentially, the tidy pipe (originally from the package `magrittr`) takes the object before the pipe and inserts it as the first argument in the function after the pipe. <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M496 448H16c-8.84 0-16 7.16-16 16v32c0 8.84 7.16 16 16 16h480c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zm-304-64l-64-32 64-32 32-64 32 64 64 32-64 32-16 32h208l-86.41-201.63a63.955 63.955 0 0 1-1.89-45.45L416 0 228.42 107.19a127.989 127.989 0 0 0-53.46 59.15L64 416h144l-16-32zm64-224l16-32 16 32 32 16-32 16-16 32-16-32-32-16 32-16z"></path></svg> **Protip:** You can insert the object before the pipe at any place in a function by using the `.` placeholder. Not so long ago, there was no simple forward pipe in the base R installation. Since then, *R version 4.1.0* has introduced the operator `|>`, which functions as a 'native' pipe. But there are some differences from the tidy `magrittr` pipe. If you want to know more about the differences, read [this post](https://ivelasq.rbind.io/blog/understanding-the-r-pipe/). Personally, I have had no reason to use the native pipe so far. --- # Let's go back to data wrangling (with pipes) --- class: center, inverse, top background-image: url("data:image/png;base64,#figs/pipe.jpg") background-size: contain --- # Let's go back to data wrangling (with pipes) Select and filter: ``` r dt.subset <- dt %>% select (cntry, vote, polintr, stfdem) %>% # selecting variables filter (vote==1, polintr>2, stfdem!=0) %>% # filtering observations slice (1:3) ``` If we want to keep the output from the pipe, we have to remember to assign it to a (new) object. 
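The point about assignment can be checked on a hypothetical toy data frame (the data and object names below are invented for illustration and are not part of the ESS file):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(cntry = c("AT", "AT", "BE", "BE"),
                  vote = c(1, 2, 1, 1),
                  polintr = c(3, 1, 4, 2))

# Without assignment the result is only printed; toy itself is unchanged
toy %>% filter(vote == 1)
nrow(toy)

# With assignment the result is stored in a new object
toy.voters <- toy %>%
  filter(vote == 1) %>%
  select(cntry, polintr)
dim(toy.voters)
```

Assigning to a new name rather than overwriting `toy` keeps the original data available if a later step goes wrong.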
<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M496 448H16c-8.84 0-16 7.16-16 16v32c0 8.84 7.16 16 16 16h480c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zm-304-64l-64-32 64-32 32-64 32 64 64 32-64 32-16 32h208l-86.41-201.63a63.955 63.955 0 0 1-1.89-45.45L416 0 228.42 107.19a127.989 127.989 0 0 0-53.46 59.15L64 416h144l-16-32zm64-224l16-32 16 32 32 16-32 16-16 32-16-32-32-16 32-16z"></path></svg> **Protip:** Use `pull()` to extract a single vector (variable). --- # Creating new variables is easy <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M352 240v32c0 6.6-5.4 12-12 12h-88v88c0 6.6-5.4 12-12 12h-32c-6.6 0-12-5.4-12-12v-88h-88c-6.6 0-12-5.4-12-12v-32c0-6.6 5.4-12 12-12h88v-88c0-6.6 5.4-12 12-12h32c6.6 0 12 5.4 12 12v88h88c6.6 0 12 5.4 12 12zm96-160v352c0 26.5-21.5 48-48 48H48c-26.5 0-48-21.5-48-48V80c0-26.5 21.5-48 48-48h352c26.5 0 48 21.5 48 48zm-48 346V86c0-3.3-2.7-6-6-6H54c-3.3 0-6 2.7-6 6v340c0 3.3 2.7 6 6 6h340c3.3 0 6-2.7 6-6z"></path></svg> To create new variables, use `mutate()`: ``` r dt %>% mutate (weight.rounded = round(pspwght, digits=0), country.ess = paste(cntry, essround, sep="."), vote.factor = factor (vote), vote.dummy.w = ifelse (vote==1, 1, 0)) %>% select (weight.rounded, country.ess, vote.factor, vote.dummy.w) %>% slice(sample (5)) %>% arrange(weight.rounded) ## weight.rounded country.ess vote.factor vote.dummy.w ## 1 0 AT.11 2 0 ## 2 0 AT.11 1 1 ## 3 0 AT.11 1 1 ## 4 1 AT.11 1 1 ## 5 4 AT.11 1 1 ``` --- # Recoding variables <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M184.561 261.903c3.232 13.997-12.123 24.635-24.068 17.168l-40.736-25.455-50.867 81.402C55.606 356.273 70.96 384 96.012 384H148c6.627 0 12 5.373 12 12v40c0 6.627-5.373 12-12 12H96.115c-75.334 
0-121.302-83.048-81.408-146.88l50.822-81.388-40.725-25.448c-12.081-7.547-8.966-25.961 4.879-29.158l110.237-25.45c8.611-1.988 17.201 3.381 19.189 11.99l25.452 110.237zm98.561-182.915l41.289 66.076-40.74 25.457c-12.051 7.528-9 25.953 4.879 29.158l110.237 25.45c8.672 1.999 17.215-3.438 19.189-11.99l25.45-110.237c3.197-13.844-11.99-24.719-24.068-17.168l-40.687 25.424-41.263-66.082c-37.521-60.033-125.209-60.171-162.816 0l-17.963 28.766c-3.51 5.62-1.8 13.021 3.82 16.533l33.919 21.195c5.62 3.512 13.024 1.803 16.536-3.817l17.961-28.743c12.712-20.341 41.973-19.676 54.257-.022zM497.288 301.12l-27.515-44.065c-3.511-5.623-10.916-7.334-16.538-3.821l-33.861 21.159c-5.62 3.512-7.33 10.915-3.818 16.536l27.564 44.112c13.257 21.211-2.057 48.96-27.136 48.96H320V336.02c0-14.213-17.242-21.383-27.313-11.313l-80 79.981c-6.249 6.248-6.249 16.379 0 22.627l80 79.989C302.689 517.308 320 510.3 320 495.989V448h95.88c75.274 0 121.335-82.997 81.408-146.88z"></path></svg> To recode variables, use `recode()` inside `mutate()`: ``` r dt.s <- dt %>% mutate (vote.new = dplyr::recode (as.numeric(vote), '1' = 'voted', '0' = 'no vote', .default = NA_character_), # special tidy NAs stfdem.na = na_if(stfdem, -88), stfdem.newna = replace_na(stfdem, 0)) %>% rename (satdem = stfdem) %>% select (vote, vote.new, stfdem.na, stfdem.newna, satdem) %>% slice(1:5) dt.s ## vote vote.new stfdem.na stfdem.newna satdem ## 1 1 voted 6 6 6 ## 2 1 voted 7 7 7 ## 3 1 voted 6 6 6 ## 4 2 <NA> 6 6 6 ## 5 1 voted 8 8 8 ``` --- # Summarizing variables We can also summarize variables by groups and add the result to the data: ``` r dt.s2 <- dt %>% group_by(cntry) %>% mutate(stfdem.mean = mean(stfdem, na.rm=TRUE), count=n()) %>% select(cntry, vote, stfdem, stfdem.mean, count) %>% ungroup() dt.s2 ## # A tibble: 40,156 × 5 ## cntry vote stfdem stfdem.mean count ## <chr+lbl> <dbl+lbl> <dbl+lbl> <dbl> <int> ## 1 AT [Austria] 1 [Yes] 6 [6] 5.75 2354 ## 2 AT [Austria] 1 [Yes] 7 [7] 5.75 2354 ## 3 AT [Austria] 1 [Yes] 6 [6] 5.75 2354 ## 
4 AT [Austria] 2 [No] 6 [6] 5.75 2354 ## 5 AT [Austria] 1 [Yes] 8 [8] 5.75 2354 ## 6 AT [Austria] 1 [Yes] 3 [3] 5.75 2354 ## 7 AT [Austria] 1 [Yes] 6 [6] 5.75 2354 ## 8 AT [Austria] 1 [Yes] 8 [8] 5.75 2354 ## 9 AT [Austria] 1 [Yes] 8 [8] 5.75 2354 ## 10 AT [Austria] 1 [Yes] 7 [7] 5.75 2354 ## # ℹ 40,146 more rows ``` --- # Summarizing variables There is another way to produce tables of aggregate values: ``` r dt.aggregate <- dt %>% group_by(cntry) %>% summarize(stfdem.mean = mean(stfdem, na.rm=TRUE), count=n()) %>% arrange(desc(stfdem.mean)) # order by the values of a variable in descending order dt.aggregate ## # A tibble: 24 × 3 ## cntry stfdem.mean count ## <chr+lbl> <dbl> <int> ## 1 CH [Switzerland] 7.53 1384 ## 2 NO [Norway] 7.19 1337 ## 3 FI [Finland] 6.78 1563 ## 4 SE [Sweden] 6.59 1230 ## 5 NL [Netherlands] 5.97 1695 ## 6 IS [Iceland] 5.92 842 ## 7 IE [Ireland] 5.85 2017 ## 8 AT [Austria] 5.75 2354 ## 9 DE [Germany] 5.70 2420 ## 10 BE [Belgium] 5.27 1594 ## # ℹ 14 more rows ``` --- # Merging data (1) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M500 128c6.627 0 12-5.373 12-12V44c0-6.627-5.373-12-12-12h-72c-6.627 0-12 5.373-12 12v12H96V44c0-6.627-5.373-12-12-12H12C5.373 32 0 37.373 0 44v72c0 6.627 5.373 12 12 12h12v256H12c-6.627 0-12 5.373-12 12v72c0 6.627 5.373 12 12 12h72c6.627 0 12-5.373 12-12v-12h320v12c0 6.627 5.373 12 12 12h72c6.627 0 12-5.373 12-12v-72c0-6.627-5.373-12-12-12h-12V128h12zm-52-64h32v32h-32V64zM32 64h32v32H32V64zm32 384H32v-32h32v32zm416 0h-32v-32h32v32zm-40-64h-12c-6.627 0-12 5.373-12 12v12H96v-12c0-6.627-5.373-12-12-12H72V128h12c6.627 0 12-5.373 12-12v-12h320v12c0 6.627 5.373 12 12 12h12v256zm-36-192h-84v-52c0-6.628-5.373-12-12-12H108c-6.627 0-12 5.372-12 12v168c0 6.628 5.373 12 12 12h84v52c0 6.628 5.373 12 12 12h200c6.627 0 12-5.372 12-12V204c0-6.628-5.373-12-12-12zm-268-24h144v112H136V168zm240 176H232v-24h76c6.627 0 12-5.372 
12-12v-76h56v112z"></path></svg> It is easy to merge datasets together, when we have a common variable. Just be careful: there are different ways to merge (compare the results across the four tabs): .panelset[ .panel[.panel-name[Inner join] ``` r d1 <- data.frame(cbind ("ID" = c("A","C"), "values" = c(1:2))) d2 <- data.frame(cbind ("ID" = c("A","D"), "values" = c(3:4))) inner_join(d1, d2, by='ID') # only observations that are in both are kept ``` ``` ## ID values.x values.y ## 1 A 1 3 ``` ] .panel[.panel-name[Left join] ``` r d1 <- data.frame(cbind ("ID" = c("A","C"), "values" = c(1:2))) d2 <- data.frame(cbind ("ID" = c("A","D"), "values" = c(3:4))) left_join(d1, d2, by='ID') # all observations from d1 are kept ``` ``` ## ID values.x values.y ## 1 A 1 3 ## 2 C 2 <NA> ``` ] .panel[.panel-name[Right join] ``` r d1 <- data.frame(cbind ("ID" = c("A","C"), "values" = c(1:2))) d2 <- data.frame(cbind ("ID" = c("A","D"), "values" = c(3:4))) right_join(d1, d2, by='ID') # all observations from d2 are kept ``` ``` ## ID values.x values.y ## 1 A 1 3 ## 2 D <NA> 4 ``` ] .panel[.panel-name[Full join] ``` r d1 <- data.frame(cbind ("ID" = c("A","C"), "values" = c(1:2))) d2 <- data.frame(cbind ("ID" = c("A","D"), "values" = c(3:4))) full_join(d1, d2, by='ID') # all observations are kept ``` ``` ## ID values.x values.y ## 1 A 1 3 ## 2 C 2 <NA> ## 3 D <NA> 4 ``` ] ] --- # Merging data (2) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zm-248 50c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"></path></svg> Beware situations where the merging variable has non-unique values 
in one of the datasets. New rows can be silently added to the left-hand side dataset, for example: ``` r d1 <- data.frame(cbind ("ID" = c("A","C"), "values" = c(1:2))) d2 <- data.frame(cbind ("ID" = c("A","A"), "values" = c(3:4))) left_join(d1, d2, by='ID') # now the result has three rows! ``` ``` ## ID values.x values.y ## 1 A 1 3 ## 2 A 1 4 ## 3 C 2 <NA> ``` --- # Combining datasets <svg viewBox="0 0 576 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M564 224c6.627 0 12-5.373 12-12v-72c0-6.627-5.373-12-12-12h-72c-6.627 0-12 5.373-12 12v12h-88v-24h12c6.627 0 12-5.373 12-12V44c0-6.627-5.373-12-12-12h-72c-6.627 0-12 5.373-12 12v12H96V44c0-6.627-5.373-12-12-12H12C5.373 32 0 37.373 0 44v72c0 6.627 5.373 12 12 12h12v160H12c-6.627 0-12 5.373-12 12v72c0 6.627 5.373 12 12 12h72c6.627 0 12-5.373 12-12v-12h88v24h-12c-6.627 0-12 5.373-12 12v72c0 6.627 5.373 12 12 12h72c6.627 0 12-5.373 12-12v-12h224v12c0 6.627 5.373 12 12 12h72c6.627 0 12-5.373 12-12v-72c0-6.627-5.373-12-12-12h-12V224h12zM352 64h32v32h-32V64zm0 256h32v32h-32v-32zM64 352H32v-32h32v32zm0-256H32V64h32v32zm32 216v-12c0-6.627-5.373-12-12-12H72V128h12c6.627 0 12-5.373 12-12v-12h224v12c0 6.627 5.373 12 12 12h12v160h-12c-6.627 0-12 5.373-12 12v12H96zm128 136h-32v-32h32v32zm280-64h-12c-6.627 0-12 5.373-12 12v12H256v-12c0-6.627-5.373-12-12-12h-12v-24h88v12c0 6.627 5.373 12 12 12h72c6.627 0 12-5.373 12-12v-72c0-6.627-5.373-12-12-12h-12v-88h88v12c0 6.627 5.373 12 12 12h12v160zm40 64h-32v-32h32v32zm0-256h-32v-32h32v32z"></path></svg> We can also combine datasets by row (adding observations from two or more datasets): ``` r d1 <- dt [1:100, ] d2 <- dt [101:200, ] d.combined <- bind_rows(d1, d2, .id = "id.dataset") d.combined [c(1:2, 101:102), 1:3] ## id.dataset name essround ## 1 1 ESS11e02 11 ## 2 1 ESS11e02 11 ## 101 2 ESS11e02 11 ## 102 2 ESS11e02 11 ``` The last argument creates an *id* variable that records which dataset each observation 
came from. --- # Batch processing When we want to apply the same operation to more than one variable, we can use `mutate_all()`, `mutate_at()`, `mutate_if()` and their equivalents. ``` r dt [1:2, 19:20] ## actrolga psppipla ## 1 5 4 ## 2 2 3 dt.s <- dt %>% mutate_at (19:20, mean, na.rm=TRUE) %>% mutate_at (19:20, round, digits=1) dt.s [1:2,19:20] ## actrolga psppipla ## 1 2.1 2.2 ## 2 2.1 2.2 ``` --- # Batch renaming We can also rename batches of columns. Note that we can select the names that we want with `starts_with()`, `ends_with()`, `contains()` and `matches()`. ``` r dt <- dt %>% rename_at(vars(starts_with("trst")), ~ paste0("newname", .)) ``` --- # More data <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M448 73.143v45.714C448 159.143 347.667 192 224 192S0 159.143 0 118.857V73.143C0 32.857 100.333 0 224 0s224 32.857 224 73.143zM448 176v102.857C448 319.143 347.667 352 224 352S0 319.143 0 278.857V176c48.125 33.143 136.208 48.572 224 48.572S399.874 209.143 448 176zm0 160v102.857C448 479.143 347.667 512 224 512S0 479.143 0 438.857V336c48.125 33.143 136.208 48.572 224 48.572S399.874 369.143 448 336z"></path></svg> For the next steps, let's get some data straight from the internet. 
Statistics Netherlands (CBS) has its own package that allows direct access to its data:

``` r
library(cbsodataR)
im <- cbs_get_data('60032') %>% # Migration; country of origin / destination, country of birth, sex; 1995-2022
  cbs_add_label_columns() %>%
  cbs_add_date_column()
```

---

# Some reorganization

These operations should all be familiar:

``` r
ims2 <- im %>%
  filter(Geslacht_label != 'Totaal mannen en vrouwen',
         LandVanHerkomstVestiging_label == 'Totaal landen',
         Geboorteland_label == 'Totaal',
         Perioden_freq == 'Y') %>%
  rename (year = Perioden_label, immigration = Immigratie_1, sex = Geslacht_label) %>%
  select (year, immigration, sex)
```

---

# Long to wide and back to long

<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M504.971 359.029c9.373 9.373 9.373 24.569 0 33.941l-80 79.984c-15.01 15.01-40.971 4.49-40.971-16.971V416h-58.785a12.004 12.004 0 0 1-8.773-3.812l-70.556-75.596 53.333-57.143L352 336h32v-39.981c0-21.438 25.943-31.998 40.971-16.971l80 79.981zM12 176h84l52.781 56.551 53.333-57.143-70.556-75.596A11.999 11.999 0 0 0 122.785 96H12c-6.627 0-12 5.373-12 12v56c0 6.627 5.373 12 12 12zm372 0v39.984c0 21.46 25.961 31.98 40.971 16.971l80-79.984c9.373-9.373 9.373-24.569 0-33.941l-80-79.981C409.943 24.021 384 34.582 384 56.019V96h-58.785a12.004 12.004 0 0 0-8.773 3.812L96 336H12c-6.627 0-12 5.373-12 12v56c0 6.627 5.373 12 12 12h110.785c3.326 0 6.503-1.381 8.773-3.812L352 176h32z"></path></svg> There are two functions to help us reshape data from long to wide format and vice versa: `tidyr::pivot_wider()` and `tidyr::pivot_longer()`. Let's try to make them work with the `ims2` dataset that we just derived from `im`.
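
Before turning to the real data, it can help to try the round trip on a toy example that runs without any download. The sketch below is only an illustration: the data frame `toy_long`, its columns and the `n_` prefix are made up and are not part of the CBS data. It also shows that `names_prefix` works in both directions: `pivot_wider()` adds the prefix to the new column names, and `pivot_longer()` strips it again.

```r
library(tidyr)

# Hypothetical toy data in long format (invented values, not the CBS data)
toy_long <- data.frame(
  year  = c(2001, 2001, 2002, 2002),
  sex   = c("men", "women", "men", "women"),
  count = c(10, 12, 11, 15)
)

# Long to wide: one column per category of `sex`, prefixed with "n_"
toy_wide <- pivot_wider(toy_long,
                        names_from = sex,
                        values_from = count,
                        names_prefix = "n_")

# Wide back to long: collapse the two columns and strip the "n_" prefix
toy_back <- pivot_longer(toy_wide,
                         cols = c(n_men, n_women),
                         names_to = "sex",
                         values_to = "count",
                         names_prefix = "n_")

# Check whether the round trip recovers the original values
all.equal(as.data.frame(toy_back), toy_long)
```

If the prefix is not stripped on the way back, the category labels keep it (for example `n_men` instead of `men`), which is easy to overlook when merging or plotting later.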
---

# Pivoting data: long to wide

<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M504.971 359.029c9.373 9.373 9.373 24.569 0 33.941l-80 79.984c-15.01 15.01-40.971 4.49-40.971-16.971V416h-58.785a12.004 12.004 0 0 1-8.773-3.812l-70.556-75.596 53.333-57.143L352 336h32v-39.981c0-21.438 25.943-31.998 40.971-16.971l80 79.981zM12 176h84l52.781 56.551 53.333-57.143-70.556-75.596A11.999 11.999 0 0 0 122.785 96H12c-6.627 0-12 5.373-12 12v56c0 6.627 5.373 12 12 12zm372 0v39.984c0 21.46 25.961 31.98 40.971 16.971l80-79.984c9.373-9.373 9.373-24.569 0-33.941l-80-79.981C409.943 24.021 384 34.582 384 56.019V96h-58.785a12.004 12.004 0 0 0-8.773 3.812L96 336H12c-6.627 0-12 5.373-12 12v56c0 6.627 5.373 12 12 12h110.785c3.326 0 6.503-1.381 8.773-3.812L352 176h32z"></path></svg> Often data is in a *long* format, when we need it *wide*.

``` r
ims2[c(1,dim(ims2)[1]), ]
## # A tibble: 2 × 3
##   year  immigration sex    
##   <fct>       <int> <fct>  
## 1 1995        51481 Mannen 
## 2 2022       208252 Vrouwen
```

``` r
ims2.w <- pivot_wider(ims2,                      # the dataset
                      names_from = sex,          # which variable (with categories) to unpack
                      values_from = immigration, # where to get the values from
                      names_prefix = 'im_')      # optional, to change labels
head(ims2.w, 2)
## # A tibble: 2 × 3
##   year  im_Mannen im_Vrouwen
##   <fct>     <int>      <int>
## 1 1995      51481      44618
## 2 1996      56556      52193
```

---

# Pivoting data: wide to long

<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M504.971 359.029c9.373 9.373 9.373 24.569 0 33.941l-80 79.984c-15.01 15.01-40.971 4.49-40.971-16.971V416h-58.785a12.004 12.004 0 0 1-8.773-3.812l-70.556-75.596 53.333-57.143L352 336h32v-39.981c0-21.438 25.943-31.998 40.971-16.971l80 79.981zM12 176h84l52.781 56.551 53.333-57.143-70.556-75.596A11.999 11.999 0 0 0 122.785 96H12c-6.627 0-12 5.373-12 12v56c0 6.627 5.373 12 12 12zm372 0v39.984c0 21.46
25.961 31.98 40.971 16.971l80-79.984c9.373-9.373 9.373-24.569 0-33.941l-80-79.981C409.943 24.021 384 34.582 384 56.019V96h-58.785a12.004 12.004 0 0 0-8.773 3.812L96 336H12c-6.627 0-12 5.373-12 12v56c0 6.627 5.373 12 12 12h110.785c3.326 0 6.503-1.381 8.773-3.812L352 176h32z"></path></svg> At other times, data is in a *wide* format and we need it *long*.

``` r
ims2.l <- pivot_longer(ims2.w,                          # the dataset
                       cols= c(im_Mannen, im_Vrouwen),  # which variables to collapse
                       names_to = 'sex',                # name of the new variable with the categories
                       values_to = 'immigration')       # name of the new variable with the data
head(ims2.l, 2)
## # A tibble: 2 × 3
##   year  sex        immigration
##   <fct> <chr>            <int>
## 1 1995  im_Mannen        51481
## 2 1995  im_Vrouwen       44618
```

---

# More on data wrangling

<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M336 0H48C21.49 0 0 21.49 0 48v464l192-112 192 112V48c0-26.51-21.49-48-48-48zm0 428.43l-144-84-144 84V54a6 6 0 0 1 6-6h276c3.314 0 6 2.683 6 5.996V428.43z"></path></svg> A good short introduction to data wrangling with the `tidyverse` is available [here](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html).
<svg viewBox="0 0 576 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M480 416v16c0 26.51-21.49 48-48 48H48c-26.51 0-48-21.49-48-48V176c0-26.51 21.49-48 48-48h16v48H54a6 6 0 0 0-6 6v244a6 6 0 0 0 6 6h372a6 6 0 0 0 6-6v-10h48zm42-336H150a6 6 0 0 0-6 6v244a6 6 0 0 0 6 6h372a6 6 0 0 0 6-6V86a6 6 0 0 0-6-6zm6-48c26.51 0 48 21.49 48 48v256c0 26.51-21.49 48-48 48H144c-26.51 0-48-21.49-48-48V80c0-26.51 21.49-48 48-48h384zM264 144c0 22.091-17.909 40-40 40s-40-17.909-40-40 17.909-40 40-40 40 17.909 40 40zm-72 96l39.515-39.515c4.686-4.686 12.284-4.686 16.971 0L288 240l103.515-103.515c4.686-4.686 12.284-4.686 16.971 0L480 208v80H192v-48z"></path></svg> These are great animations of what the tidyverse verbs do: Part I by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/) and Part II by [Andrew Heiss](https://www.andrewheiss.com/blog/2024/04/04/group_by-summarize-ungroup-animations/). <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M464 32H48C21.49 32 0 53.49 0 80v352c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V80c0-26.51-21.49-48-48-48zM224 416H64v-96h160v96zm0-160H64v-96h160v96zm224 160H288v-96h160v96zm0-160H288v-96h160v96z"></path></svg> Here is a [good resource](https://cran.r-project.org/web/packages/labelled/vignettes/labelled.html) for working with SPSS labelled data. This is another useful [blog post](https://martinctc.github.io/blog/working-with-spss-labels-in-r/) for working with survey data. 
---

# Exporting data

<svg viewBox="0 0 576 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M384 121.9c0-6.3-2.5-12.4-7-16.9L279.1 7c-4.5-4.5-10.6-7-17-7H256v128h128zM571 308l-95.7-96.4c-10.1-10.1-27.4-3-27.4 11.3V288h-64v64h64v65.2c0 14.3 17.3 21.4 27.4 11.3L571 332c6.6-6.6 6.6-17.4 0-24zm-379 28v-32c0-8.8 7.2-16 16-16h176V160H248c-13.2 0-24-10.8-24-24V0H24C10.7 0 0 10.7 0 24v464c0 13.3 10.7 24 24 24h336c13.3 0 24-10.7 24-24V352H208c-8.8 0-16-7.2-16-16z"></path></svg> After all this hard work, it would be a shame to lose it. We can save data in multiple formats. R has its own format `.RData`, which is great, but not for people who don't use R.

``` r
save(dt, file='./data_out/mydata.RData')
```

Note that we import `.RData` files with `load(file = "./pathtoyour/mydata.RData")`.

Saving as a `csv` is usually a good idea, especially if you work with others. We can also save `sav` and `dta` files with the `haven` package.

``` r
write.csv (dt, './data_out/mydata.csv')
write_sav(dt, "./data_out/mydata.sav") # write_sav() comes from the haven package
```

---

# How to get in touch?
<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M464 64H48C21.49 64 0 85.49 0 112v288c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V112c0-26.51-21.49-48-48-48zm0 48v40.805c-22.422 18.259-58.168 46.651-134.587 106.49-16.841 13.247-50.201 45.072-73.413 44.701-23.208.375-56.579-31.459-73.413-44.701C106.18 199.465 70.425 171.067 48 152.805V112h416zM48 400V214.398c22.914 18.251 55.409 43.862 104.938 82.646 21.857 17.205 60.134 55.186 103.062 54.955 42.717.231 80.509-37.199 103.053-54.947 49.528-38.783 82.032-64.401 104.947-82.653V400H48z"></path></svg> demetriodor@gmail.com <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M131.5 217.5L55.1 100.1c47.6-59.2 119-91.8 192-92.1 42.3-.3 85.5 10.5 124.8 33.2 43.4 25.2 76.4 61.4 97.4 103L264 133.4c-58.1-3.4-113.4 29.3-132.5 84.1zm32.9 38.5c0 46.2 37.4 83.6 83.6 83.6s83.6-37.4 83.6-83.6-37.4-83.6-83.6-83.6-83.6 37.3-83.6 83.6zm314.9-89.2L339.6 174c37.9 44.3 38.5 108.2 6.6 157.2L234.1 503.6c46.5 2.5 94.4-7.7 137.8-32.9 107.4-62 150.9-192 107.4-303.9zM133.7 303.6L40.4 120.1C14.9 159.1 0 205.9 0 256c0 124 90.8 226.7 209.5 244.9l63.7-124.8c-57.6 10.8-113.2-20.8-139.5-72.5z"></path></svg> [http://dimiter.eu](http://dimiter.eu) <svg viewBox="0 0 484 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <g groupmode="layer" id="layer6" label="icon"> <path id="Shape_1_" class="st1" d="M 324.19873,96 H 215.27696 C 130.71036,96 61.027479,165.68288 61.027479,250.24948 v 6.08879 c 0,30.44398 9.47146,58.18182 25.70825,83.21353 L 5.5518005,416 123.94503,378.7907 c 25.70825,19.61945 58.18182,30.44397 92.685,30.44397 h 107.5687 c 85.91965,0 154.24947,-69.68287 154.24947,-152.8964 v -6.08879 C 478.4482,165.68288 408.76534,96 324.19873,96 Z M 406,276 c 0,46.68076 -35.23395,75.66979 -81.23818,75.66979 H 
213.13392 C 166.45316,351.66979 132,322.68076 132,276 v -40 c 0,-46.68077 34.45321,-81.20125 81.13397,-81.20125 h 111.6279 C 371.44264,154.79875 406,189.31924 406,236 Z" style="stroke-width:1" nodetypes="sssscccssssscsssssscc"></path> </g></svg> @dtoshkov.bsky.social <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> @DToshkov <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 
27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> [github.com/demetriodor](https://github.com/demetriodor/) <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg> Dimiter Toshkov