1.4 The tidyverse

R is a functional language, which means that it uses nested parentheses, which can make code difficult to read.

1.4.1 Pipeline Operator

The pipeline operator |> (formerly %>%) and the dplyr package can be used to remedy the situation.

Hadley Wickham13 provided an example in 2014 to illustrate how it works:

hourly_delay <- filter(
   summarise(
     group_by( 
       filter(
         flights, 
         !is.na(dep_delay)
       ), 
       date, hour
     ), 
     delay = mean(dep_delay), 
     n = n()
   ), 
   n > 10 
)

Without necessarily knowing how each of the internal functions works, we can still get a sense for what the overall nested structure does, and realize (albeit, with a fair amount of work) that the basic object on which we operate is the flights data frame.

The pipeline operator |> removes the need for nesting function calls, in favor of passing data from one function to the next:

library(dplyr)

hourly_delay <- flights |>  
    filter(!is.na(dep_delay)) |> 
    group_by(date, hour) |> 
    summarise(delay = mean(dep_delay),n = n()) |> 
    filter(n > 10)

It is now obvious that the flights data frame is the base object, for instance.

The gap between pseudo-code and “code that runs” is significantly reduced. The beauty of this approach is that the block of code can now be ‘read’ directly: the flights data frame is

  1. filtered (to remove missing values of the dep_delay variable);

  2. grouped by hours within days;

  3. the mean delay is calculated within groups, and

  4. the mean delay is returned for those hours with more than n > 10 flights.

The pipeline rules are simple: the object immediately to the left of the pipeline is passed as the first argument to the function immediately to its right:

  • data |> function is equivalent to function(data)

  • data |> function(arg=value) is equivalent to function(data, arg=value)

For instance:

library(dplyr)
swiss |> summary()
   Fertility      Agriculture     Examination      Education    
 Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
 1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
 Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
 Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
 3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
 Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
    Catholic       Infant.Mortality   threshold     
 Min.   :  2.150   Min.   :10.80    Min.   :0.0000  
 1st Qu.:  5.195   1st Qu.:18.15    1st Qu.:1.0000  
 Median : 15.140   Median :20.00    Median :1.0000  
 Mean   : 41.144   Mean   :19.94    Mean   :0.9362  
 3rd Qu.: 93.125   3rd Qu.:21.70    3rd Qu.:1.0000  
 Max.   :100.000   Max.   :26.60    Max.   :1.0000  

The magrittr vignette provides additional information on the magrittr package, on which dplyr is based.

1.4.2 Tidy Data

The pipeline operator is also compatible with the tidyverse suite of packages, championed by Wickham (including the ever popular ggplot2); cheat sheets are available here.

Tidy data has a specific structure:

  • each variable is a column;

  • each observation is a row;

  • each type of observational unit is a table.

Two tidyr functions are used to reshape tables to a tidy format: gather() and spread().

gather() requires:

  • a data frame to reshape;

  • a key column (against which to reshape);

  • a value column (which will contain the new variable of interest), and

  • the indices of the columns that need to be collapsed.

Consider the following dataset:

cities <- data.frame(
  city=c("Toronto","Montreal","Vancouver",
         "Ottawa","Calgary","Edmonton",
         "Quebec City","Winnipeg","Hamilton"),
  prov=c("Ontario","Quebec","BC",
    "Ontario","Alberta","Alberta",
    "Quebec","Manitoba","Ontario"),
  pop.2016=c(6202225,4291732,2642825,
             1488307,1481806,1418118,
             839311,834678,785184),
  pop.2011=c(5928040,4104074,2463431,
    1371576,1392609,1321441,
    806406,783099,747545)
) 
cities
         city     prov pop.2016 pop.2011
1     Toronto  Ontario  6202225  5928040
2    Montreal   Quebec  4291732  4104074
3   Vancouver       BC  2642825  2463431
4      Ottawa  Ontario  1488307  1371576
5     Calgary  Alberta  1481806  1392609
6    Edmonton  Alberta  1418118  1321441
7 Quebec City   Quebec   839311   806406
8    Winnipeg Manitoba   834678   783099
9    Hamilton  Ontario   785184   747545

It is not presented in a tidy format, because population shows up in two columns.

In a tidy format, it would instead look like:

cities.tidy <- tidyr::gather(cities,"year","population",3:4)
cities.tidy$year <- ifelse(cities.tidy$year=="pop.2016",2016,2011)
cities.tidy
          city     prov year population
1      Toronto  Ontario 2016    6202225
2     Montreal   Quebec 2016    4291732
3    Vancouver       BC 2016    2642825
4       Ottawa  Ontario 2016    1488307
5      Calgary  Alberta 2016    1481806
6     Edmonton  Alberta 2016    1418118
7  Quebec City   Quebec 2016     839311
8     Winnipeg Manitoba 2016     834678
9     Hamilton  Ontario 2016     785184
10     Toronto  Ontario 2011    5928040
11    Montreal   Quebec 2011    4104074
12   Vancouver       BC 2011    2463431
13      Ottawa  Ontario 2011    1371576
14     Calgary  Alberta 2011    1392609
15    Edmonton  Alberta 2011    1321441
16 Quebec City   Quebec 2011     806406
17    Winnipeg Manitoba 2011     783099
18    Hamilton  Ontario 2011     747545

spread(), on the other hand, generates multiple columns from two columns; it requires

  • a data frame to reshape;

  • a key column, and

  • values in the value column to become new values.

For instance, we could reverse the “tidying” of cities.tidy with:

cities.back.to.wide <- tidyr::spread(cities.tidy,year,population)
colnames(cities.back.to.wide) <- c("city","prov","pop.2011","pop.2016")
cities.back.to.wide
         city     prov pop.2011 pop.2016
1     Calgary  Alberta  1392609  1481806
2    Edmonton  Alberta  1321441  1418118
3    Hamilton  Ontario   747545   785184
4    Montreal   Quebec  4104074  4291732
5      Ottawa  Ontario  1371576  1488307
6 Quebec City   Quebec   806406   839311
7     Toronto  Ontario  5928040  6202225
8   Vancouver       BC  2463431  2642825
9    Winnipeg Manitoba   783099   834678

Other useful wrangling functions include separate() and unite(). What do you think these do? (How could you find out?)

1.4.3 The dplyr Package

The dplyr package provides functions to transform tabular data. Its most useful functions are compatible with the pipeline operator |>:

  • select(): to extract a subset of variables from the data frame;

  • filter(): to extract a subset of observations from the data frame;

  • arrange(): to sort the data frame;

  • mutate(): to create new variables from existing variables;

  • summarise(): to create so-called pivot tables;

  • group_by(): … self-evident?

We will showcase these functions with the help of various examples. Try to guess what the outputs would be before looking at them.14

cities |> select(prov,pop.2016)
      prov pop.2016
1  Ontario  6202225
2   Quebec  4291732
3       BC  2642825
4  Ontario  1488307
5  Alberta  1481806
6  Alberta  1418118
7   Quebec   839311
8 Manitoba   834678
9  Ontario   785184
cities |> select(-pop.2016)
         city     prov pop.2011
1     Toronto  Ontario  5928040
2    Montreal   Quebec  4104074
3   Vancouver       BC  2463431
4      Ottawa  Ontario  1371576
5     Calgary  Alberta  1392609
6    Edmonton  Alberta  1321441
7 Quebec City   Quebec   806406
8    Winnipeg Manitoba   783099
9    Hamilton  Ontario   747545
cities |> filter(pop.2016>1000000)
       city    prov pop.2016 pop.2011
1   Toronto Ontario  6202225  5928040
2  Montreal  Quebec  4291732  4104074
3 Vancouver      BC  2642825  2463431
4    Ottawa Ontario  1488307  1371576
5   Calgary Alberta  1481806  1392609
6  Edmonton Alberta  1418118  1321441
cities |> filter(pop.2016>1000000, prov %in% c("Ontario","Quebec"))
      city    prov pop.2016 pop.2011
1  Toronto Ontario  6202225  5928040
2 Montreal  Quebec  4291732  4104074
3   Ottawa Ontario  1488307  1371576
cities |> mutate(pop.increase = pop.2016/pop.2011-1)
         city     prov pop.2016 pop.2011 pop.increase
1     Toronto  Ontario  6202225  5928040   0.04625222
2    Montreal   Quebec  4291732  4104074   0.04572481
3   Vancouver       BC  2642825  2463431   0.07282282
4      Ottawa  Ontario  1488307  1371576   0.08510721
5     Calgary  Alberta  1481806  1392609   0.06405028
6    Edmonton  Alberta  1418118  1321441   0.07316028
7 Quebec City   Quebec   839311   806406   0.04080451
8    Winnipeg Manitoba   834678   783099   0.06586524
9    Hamilton  Ontario   785184   747545   0.05035015
cities |> summarise(median.2011=median(pop.2011), variance.2011=var(pop.2011))
  median.2011 variance.2011
1     1371576  3.209519e+12
cities |> summarise(mean.2016=mean(pop.2016), sum.2016=sum(pop.2016), n=n())
  mean.2016 sum.2016 n
1   2220465 19984186 9
cities |> arrange(pop.2016)
         city     prov pop.2016 pop.2011
1    Hamilton  Ontario   785184   747545
2    Winnipeg Manitoba   834678   783099
3 Quebec City   Quebec   839311   806406
4    Edmonton  Alberta  1418118  1321441
5     Calgary  Alberta  1481806  1392609
6      Ottawa  Ontario  1488307  1371576
7   Vancouver       BC  2642825  2463431
8    Montreal   Quebec  4291732  4104074
9     Toronto  Ontario  6202225  5928040
cities |> arrange(desc(pop.2011))
         city     prov pop.2016 pop.2011
1     Toronto  Ontario  6202225  5928040
2    Montreal   Quebec  4291732  4104074
3   Vancouver       BC  2642825  2463431
4     Calgary  Alberta  1481806  1392609
5      Ottawa  Ontario  1488307  1371576
6    Edmonton  Alberta  1418118  1321441
7 Quebec City   Quebec   839311   806406
8    Winnipeg Manitoba   834678   783099
9    Hamilton  Ontario   785184   747545
cities |> arrange(prov,desc(pop.2016))
         city     prov pop.2016 pop.2011
1     Calgary  Alberta  1481806  1392609
2    Edmonton  Alberta  1418118  1321441
3   Vancouver       BC  2642825  2463431
4    Winnipeg Manitoba   834678   783099
5     Toronto  Ontario  6202225  5928040
6      Ottawa  Ontario  1488307  1371576
7    Hamilton  Ontario   785184   747545
8    Montreal   Quebec  4291732  4104074
9 Quebec City   Quebec   839311   806406
cities |> group_by(prov) |> 
  summarise(mean.2016 = mean(pop.2016))
# A tibble: 5 × 2
  prov     mean.2016
  <chr>        <dbl>
1 Alberta   1449962 
2 BC        2642825 
3 Manitoba   834678 
4 Ontario   2825239.
5 Quebec    2565522.
cities |> mutate(pop.increase = pop.2016/pop.2011-1) |> 
  select(city, pop.increase) |> 
  arrange(desc(pop.increase))
         city pop.increase
1      Ottawa   0.08510721
2    Edmonton   0.07316028
3   Vancouver   0.07282282
4    Winnipeg   0.06586524
5     Calgary   0.06405028
6    Hamilton   0.05035015
7     Toronto   0.04625222
8    Montreal   0.04572481
9 Quebec City   0.04080451

dplyr also comes with “database” functionality (bind_cols(), bind_rows(), union(), intersect(), setdiff(), left_join(), inner_join(), semi_join(), anti_join(), etc.).

Do not hesitate to bookmark, consult, and borrow from the excellent [1] (and from the subsequent modules) for more examples, and to practice, practice, practice: we learn programming by programming.

References

[1]
H. Wickham and G. Grolemund, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.