Subsetting and Aggregations with `dplyr`

Author

Silvie Cinková

Published

August 1, 2025

1 `dplyr`: Operations on data frames

on rows
on columns

Select rows, select columns. According to which properties?

Rows:

Position: first/last n rows, nth to n+20th row,
Condition(s) on values in one or more variable columns (e.g. the row with the maximum population
Sampling: random sample, fraction
Deduplication
Sorting according to values columns
Grouping: divide rows into groups according to values of one or more discrete variables (this is not visible anywhere but you need to do it to get group-wise aggregations. For instance, you have a data frame of some physical measures of individual males and females and you want the mean height of males separately from the mean height of females separately.

Columns:

Column position: you want to address the column by its position rather than name
Column name: you want to address the column by its name
Column selection:
- you want to select a column for an operation for dropping it from the data frame
- you want to transform a column from one data type to another

2 Libraries

library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(gapminder, warn.conflicts = FALSE, quietly = TRUE)
library(magrittr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE) #for diamonds dataset

3 `magrittr` pipe in `dplyr`: `%>%`

René Magritte 1898-1967: Belgian surrealist fascinated by semiotics

$https://commons.wikimedia.org/wiki/File:Duisburg,_Landschaftspark_Duisburg-Nord_--_2020_--_7824-6.jpg$

… but the concept behind is rather the plumbing pipeline

dplyr was designed for “rapid prototyping” in data manipulation. Often you just want to quickly try out a manipulation without saving every partial step. This is why most dplyr code is written in pipes. %>% is the most common pipe operator and dplyr adopted it from the magrittr library (which introduced various pipe operators to R).
R library of workflow pipes named to his tribute.

4 How to use a pipe `%>%`

left is input to the right
. is its placeholder

c("h", "e", "l", "l", "o") %>% 
  toupper(x = .)

[1] "H" "E" "L" "L" "O"

NEVER repeat the input in the argument of the following function. This would just throw an error. Instead, write the . placeholder. If the input is the first argument of the following function, you can leave it out like this:

c("h", "e", "l", "l", "o") %>% toupper()

This is a very common notation - the placeholder is only used when necessary.

5 A series of pipes

paste0(c("H", "E", "L", "L", "O"),  collapse = "-" )

[1] "H-E-L-L-O"

c("h", "e", "l", "l", "o") %>% toupper(x = .) %>% 
  paste0(., sep = "", collapse = "-") %>%
  paste0("Oh, ", ., "!")

[1] "Oh, H-E-L-L-O!"

6 `gapminder` (R dataset)

glimpse(gapminder)

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

We will demonstrate dplyr on the gapminder dataset from the gapminder library. It is a cleaned dataset from gapminder.org. It presents the population size, life expectancy and GDP per capita in 142 countries each five years from 1952 to 2007 (that is 12 records per country). The countries are also associated with five continents: Asia, Europe, Africa, Americas, and Oceania.

7 Subset rows with `dplyr`

filter() pick rows according to a condition
slice() pick rows according to their position
distinct() de-duplicate (pick the first/last occurring)
arrange() sort rows according to values of a variable
group_by() tell dplyr that you want the next operation to run separately on each group of rows with the same value of a categorical variable.

The dplyr library is often explained as pivoted by five “big verbs”. These are the functions filter, arrange, mutate, select, and summarize. We will proceed from rows to columns and list the other most common functions along.

8 `dplyr::filter`

filter(.data = gapminder, year == 2007, country %in% c("Angola", "Bahrain"))

# A tibble: 2 × 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Angola  Africa     2007    42.7 12420476     4797.
2 Bahrain Asia       2007    75.6   708573    29796.

filter(.data = gapminder, is.na(year) | is.na(pop))

# A tibble: 0 × 6
# ℹ 6 variables: country <fct>, continent <fct>, year <int>, lifeExp <dbl>,
#   pop <int>, gdpPercap <dbl>

This function wants the data set and conditions to test on its rows. Note the is.na function in the second example. The opposite would be !is.na. Both test rows on empty values (NA) in the variables. This example would return all rows that have an empty value either in year or in pop.

9 `dplyr::slice`

slice(.data = gapminder, c(2:3, seq(from = 1600, to = nrow(gapminder), by = 50)))

# A tibble: 5 × 6
  country        continent  year lifeExp      pop gdpPercap
  <fct>          <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan    Asia       1957    30.3  9240934      821.
2 Afghanistan    Asia       1962    32.0 10267083      853.
3 United Kingdom Europe     1967    71.4 54959000    14143.
4 Vietnam        Asia       1977    55.8 50533506      714.
5 Zimbabwe       Africa     1987    62.4  9216418      706.

In fact, this is a group of similarly named functions that all give you some rows from a data frame. Some are based on the position of the rows in the data frame, but some are based on values in a column.

10 `slice` with helper function `n()`

n() returns the number of rows.
dplyr helpers work only inside other functions
- No arguments! n(gapminder) will not work

slice(gapminder, 1, 2, n(), n()-1 ) #n: last, 2nd last row

# A tibble: 4 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Zimbabwe    Africa     2007    43.5 12311143      470.
4 Zimbabwe    Africa     2002    40.0 11926563      672.

11 `slice_head`, `slice_tail`

slice_head(gapminder, n = 2)

# A tibble: 2 × 6
  country     continent  year lifeExp     pop gdpPercap
  <fct>       <fct>     <int>   <dbl>   <int>     <dbl>
1 Afghanistan Asia       1952    28.8 8425333      779.
2 Afghanistan Asia       1957    30.3 9240934      821.

dplyr::slice_tail(gapminder, n = 2)

# A tibble: 2 × 6
  country  continent  year lifeExp      pop gdpPercap
  <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
1 Zimbabwe Africa     2002    40.0 11926563      672.
2 Zimbabwe Africa     2007    43.5 12311143      470.

first n rows, last n rows

12 slice_sample

slice_sample(gapminder, n = 3)

# A tibble: 3 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Mongolia    Asia       1962    48.3  1010280     1056.
2 Korea, Rep. Asia       1957    52.7 22611552     1488.
3 Burundi     Africa     1982    47.5  4580410      560.

slice_sample(gapminder, n = 3)

# A tibble: 3 × 6
  country          continent  year lifeExp      pop gdpPercap
  <fct>            <fct>     <int>   <dbl>    <int>     <dbl>
1 Iraq             Asia       1957    48.4  6248643     6229.
2 El Salvador      Americas   1967    55.9  3232927     4359.
3 Korea, Dem. Rep. Asia       1972    64.0 14781241     3702.

random sample of rows

13 `slice_max`, `slice_min`

slice_max(gapminder, order_by = lifeExp, n = 2, with_ties = TRUE)

# A tibble: 2 × 6
  country          continent  year lifeExp       pop gdpPercap
  <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
1 Japan            Asia       2007    82.6 127467972    31656.
2 Hong Kong, China Asia       2007    82.2   6980412    39725.

slice_min(gapminder, order_by = lifeExp, n = 2, with_ties = TRUE)

# A tibble: 2 × 6
  country     continent  year lifeExp     pop gdpPercap
  <fct>       <fct>     <int>   <dbl>   <int>     <dbl>
1 Rwanda      Africa     1992    23.6 7290203      737.
2 Afghanistan Asia       1952    28.8 8425333      779.

rows with the maximum/minimum values in the given variable

14 `arrange`

arrange(gapminder, year) %>% slice(., 1, 2, n(), n()-1 )

# A tibble: 4 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Albania     Europe     1952    55.2  1282697     1601.
3 Zimbabwe    Africa     2007    43.5 12311143      470.
4 Zambia      Africa     2007    42.4 11746035     1271.

gapminder %>% arrange(.data = ., year) %>% slice(., 1, 2, n(), n()-1 )

# A tibble: 4 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Albania     Europe     1952    55.2  1282697     1601.
3 Zimbabwe    Africa     2007    43.5 12311143      470.
4 Zambia      Africa     2007    42.4 11746035     1271.

15 `distinct`

gapminder %>% distinct(continent)

# A tibble: 5 × 1
  continent
  <fct>    
1 Asia     
2 Europe   
3 Africa   
4 Americas 
5 Oceania

Deduplicates rows according to values in columns that you select.

You often want .keep_all = TRUE!

It has default .keep_all = FALSE, which throws away all columns that were not mentioned in the deduplication!

16 Add rows to data frame

Row is never a vector!

ad2002 <- gapminder %>% filter(country %in% c("Albania", "Denmark"), year == 2002)
ad2007 <- gapminder %>% filter(country %in% c("Albania", "Denmark"), year == 2007)
bind_rows(ad2002, ad2007)

# A tibble: 4 × 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Albania Europe     2002    75.7 3508512     4604.
2 Denmark Europe     2002    77.2 5374693    32167.
3 Albania Europe     2007    76.4 3600523     5937.
4 Denmark Europe     2007    78.3 5468120    35278.

17 Subset columns: `select`

gapminder %>% select(c(country, pop, 6)) %>% colnames()

[1] "country"   "pop"       "gdpPercap"

gapminder %>% select(!c(continent, lifeExp, gdpPercap)) %>% colnames()

[1] "country" "year"    "pop"

gapminder %>% select(TIME = year, country, population = pop) %>% colnames()

[1] "TIME"       "country"    "population"

select can do almost everything with columns: select by name or position positively and negatively, as well as rename columns on the fly. Peruse its Help and https://dplyr.tidyverse.org/reference/select.html for more detail; it is a complex function.

18 `rename` and `relocate`

keep all columns even if you do not mention them

gapminder %>% rename(COUNTRY = country) %>% colnames()

[1] "COUNTRY"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

gapminder %>% relocate(gdpPercap, pop) %>% colnames()

[1] "gdpPercap" "pop"       "country"   "continent" "year"      "lifeExp"

19 Helpers select columns by names

contains, matches, starts_with, ends_with

gapminder %>% select(contains("co")) %>% colnames()

[1] "country"   "continent"

gapminder %>% select(matches("ou?n")) %>% colnames()

[1] "country"   "continent"

These are functions that only work with “big” dplyr functions like select, rename, relocate, and mutate. For more detail see https://tidyselect.r-lib.org/reference/language.html. matches even understands so-called regular expressions ($\approx$ more powerful wildcards).

20 Helpers select columns by position

everything, last_col

gapminder %>% select(c(pop, last_col(), everything())) %>% colnames()

[1] "pop"       "gdpPercap" "country"   "continent" "year"      "lifeExp"

21 Helpers select columns by classes or values

where

gapminder %>% select(c(where(~ !is.numeric(.x)))) %>% colnames()

[1] "country"   "continent"

gapminder %>% slice_head(n = 2) %>% 
  select(where(~ is.numeric(.x))) %>%
  select(where(~ max(.x) > 100))

# A tibble: 2 × 3
   year     pop gdpPercap
  <int>   <int>     <dbl>
1  1952 8425333      779.
2  1957 9240934      821.

22 Function notation

on column values - typically you check vector class
Formula notation: ~ somefunction(.x, …)

gapminder %>% slice_head(n = 2) %>% 
  select(where(is.numeric))

# A tibble: 2 × 4
   year lifeExp     pop gdpPercap
  <int>   <dbl>   <int>     <dbl>
1  1952    28.8 8425333      779.
2  1957    30.3 9240934      821.

# short for this formula notation
gapminder %>% slice_head(n = 2) %>% 
  select(where(~ is.numeric(.x)))

# A tibble: 2 × 4
   year lifeExp     pop gdpPercap
  <int>   <dbl>   <int>     <dbl>
1  1952    28.8 8425333      779.
2  1957    30.3 9240934      821.

rather just for fun: feed it almost any function that works for all columns (no error)

gapminder %>% slice_head(n = 2) %>% 
  select(where(~ is.numeric(.x))) %>%
  select(where(~ max(.x) > 100))

# A tibble: 2 × 3
   year     pop gdpPercap
  <int>   <int>     <dbl>
1  1952 8425333      779.
2  1957 9240934      821.

Learn this notation as an idiom: starts with ~, its first argument is .x (mind the dot!).

You have already met something similar ggplot2, in two different contexts:

In geom_smooth: dependent variable ~ independent variable.
In facet_wrap and facet_grid: facets in rows ~, ~ facets in columns, facets in rows ~ facets in columns.

23 Add new columns with `bind_cols`

add vectors or data frames

new_df <- select(gapminder, c(new_pop = pop, new_lifeExp = lifeExp))
new_vec <- gapminder$country # base R column subset
gapminder %>% bind_cols(new_df, new_country = new_vec) %>% glimpse()

Rows: 1,704
Columns: 9
$ country     <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"…
$ continent   <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia…
$ year        <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997…
$ lifeExp     <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40…
$ pop         <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, …
$ gdpPercap   <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134…
$ new_pop     <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, …
$ new_lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40…
$ new_country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"…

This is a mock use case, obviously. You would normally not add a copy of an already existing column. A real use could be that you compute some new values from the existing data frame in a very complicated way, or in an external tool - simply that something would prevent you from computing the new column directly in the data frame with mutate (see later). You must be sure that the column to add is equally long as the columns in your existing data frame and that the values refer to the same year and country, in this particular case (there is no programmatic way to find out). With bind_cols you are not joining tables in the SQL sense. We will learn that in a separate session.

24 `pull` : extract column to create a vector

pull extracts one column as vector/factor

gapminder %>% select(continent) %>% 
  pull() %>% str()

 Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...

gapminder %>% distinct(continent) %>% pull() %>% levels()

[1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

You quite often want to use it when a categorical variable is a character vector and you want to quickly list its categories.

When it is a factor, you ask about its levels.

And here we have finished selecting and filtering.

25 Aggregation with `summarize`

categorical variables: count
numeric variables: mean, median, etc.

gapminder %>% filter(year == 2007, 
                     continent == "Europe") %>%
  summarize(mean_pop = mean(pop),  
            mean_lifeExp = round(mean(lifeExp)))

# A tibble: 1 × 2
   mean_pop mean_lifeExp
      <dbl>        <dbl>
1 19536618.           78

So far, we have worked with raw observations. Now we switch to aggregating values to summary statistics. Imagine all rows with European countries in the year 2007. This would be a table of 30 rows (because the dataset contains data from thirty European countries. In the table above, we aggregated their population size and life expectancy to mean to end up with a table of one row and two columns. When you summarize, you always lose all columns you have not used. It is an entirely separate table from the one with the raw observations: it does not carry the initial information from the raw observations, and there is no way you could reconstruct the original one.

26 grouped aggregations (`group_by`)

gap2000 <- gapminder %>% filter(year > 1997) %>% 
  group_by(continent, year) %>% 
  summarize(mean_pop = mean(pop), 
            mean_lifeExp = mean(lifeExp) )

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

gap2000

# A tibble: 10 × 4
# Groups:   continent [5]
   continent  year   mean_pop mean_lifeExp
   <fct>     <int>      <dbl>        <dbl>
 1 Africa     2002  16033152.         53.3
 2 Africa     2007  17875763.         54.8
 3 Americas   2002  33990910.         72.4
 4 Americas   2007  35954847.         73.6
 5 Asia       2002 109145521.         69.2
 6 Asia       2007 115513752.         70.7
 7 Europe     2002  19274129.         76.7
 8 Europe     2007  19536618.         77.6
 9 Oceania    2002  11727414.         79.7
10 Oceania    2007  12274974.         80.7

The most common aggregation is with a combination of numeric and categorical variables, where the categorical variables have more values. In the example in the previous slide, the data frame contained each country only once, so in fact summarize just dropped the irrelevant columns and computed both mean values on all rows together. But the original data frame contains data from twelve years and five continents, and the logical question to ask would be “What was the average population size and life expectancy for each continent in each year?” For this, we need a preparatory step where to tell dplyr which combinations of categories it should summarize, and this step uses a function called group_by. It takes the names of the variables by which it is supposed to break the statistics.

27 grouped data frame

groups(gap2000)

[[1]]
continent

gap2000 <- gap2000 %>% ungroup()
groups(gap2000)

list()

When you use group_by, the groups remain sticking with the data frame until you override them with a different grouping or ungroup(). This is particularly tricky when you work in a long pipe. You would better ungroup the data frame as soon as you are done with the operation for which you grouped it.

Sometimes you need to group with rowwise instead.

Later on, you will also get acquainted with another grouping function, rowwise. It makes each row a separate group. It would be hard to find a sensible use case with summarizing gapminder. The need for rowwise mostly arises when computing new columns with mutate (also later) with functions that would always input the entire column, one of them being sum.

1 dplyr: Operations on data frames

2 Libraries

3 magrittr pipe in dplyr: %>%

4 How to use a pipe %>%

5 A series of pipes

6 gapminder (R dataset)

7 Subset rows with dplyr

8 dplyr::filter

9 dplyr::slice

10 slice with helper function n()

11 slice_head, slice_tail

12 slice_sample

13 slice_max, slice_min

14 arrange

15 distinct

16 Add rows to data frame

17 Subset columns: select

18 rename and relocate

19 Helpers select columns by names

20 Helpers select columns by position

21 Helpers select columns by classes or values

22 Function notation

23 Add new columns with bind_cols

24 pull : extract column to create a vector

25 Aggregation with summarize

26 grouped aggregations (group_by)

27 grouped data frame

28

1 `dplyr`: Operations on data frames

3 `magrittr` pipe in `dplyr`: `%>%`

4 How to use a pipe `%>%`

6 `gapminder` (R dataset)

7 Subset rows with `dplyr`

8 `dplyr::filter`

9 `dplyr::slice`

10 `slice` with helper function `n()`

11 `slice_head`, `slice_tail`

13 `slice_max`, `slice_min`

14 `arrange`

15 `distinct`

17 Subset columns: `select`

18 `rename` and `relocate`

23 Add new columns with `bind_cols`

24 `pull` : extract column to create a vector

25 Aggregation with `summarize`

26 grouped aggregations (`group_by`)