Exploring a data frame with dplyr and ggplot2.

First steps.

Silvie Cinková

2025-07-28

readr, dplyr, and ggplot2

library(dplyr)
library(readr)
library(ggplot2)
library(glue) # just to make long strings wrap in PDF

Path management

project_path <- "~/NPFL112_2025_ZS/"
datasaving_folder <- "DATA.NPFL112"
output_folder <- "OUTPUT_FILES"
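
A minimal sketch of how such path pieces combine. It assumes a throwaway directory under tempdir() stands in for the real project folder; the demo_* names are hypothetical and not part of the course setup.

```r
# Hypothetical stand-in for project_path, so the sketch runs anywhere
demo_project_path <- file.path(tempdir(), "NPFL112_demo")
demo_output_folder <- "OUTPUT_FILES"

# file.path() joins the parts with "/" on every platform
demo_output_dir <- file.path(demo_project_path, demo_output_folder)

# Create the folder (and any missing parents) only if it is not there yet
if (!dir.exists(demo_output_dir)) {
  dir.create(demo_output_dir, recursive = TRUE)
}
dir.exists(demo_output_dir)
```

Keeping the folder names in variables means that a later change of location touches one line only.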

Read the Gapminder labor cost data set

myfilepath <- file.path(project_path, datasaving_folder,
                        "gapminder_hourly_labour_cost_constant_2017_usd--by--geo--time.csv")
laborcost_df <- read_csv(file = myfilepath, 
                         show_col_types = TRUE)

dplyr::glimpse

  • peek at the dataset, transposed: each column is printed as a row, with its type and first values
glimpse(laborcost_df)
Rows: 548
Columns: 3
$ geo                                  <chr> "arg", "arg", "arm", "arm", "arm"…
$ time                                 <dbl> 2011, 2012, 2011, 2012, 2013, 201…
$ hourly_labour_cost_constant_2017_usd <dbl> 0.92, 1.04, 4.23, 4.59, 6.12, 6.0…

summary

summary(laborcost_df)
     geo                 time      hourly_labour_cost_constant_2017_usd
 Length:548         Min.   :1994   Min.   : 0.000                      
 Class :character   1st Qu.:2005   1st Qu.: 9.867                      
 Mode  :character   Median :2011   Median :18.320                      
                    Mean   :2010   Mean   :19.686                      
                    3rd Qu.:2017   3rd Qu.:26.915                      
                    Max.   :2020   Max.   :48.720                      

summary with categorical columns as factors

      geo           time      hourly_labour_cost_constant_2017_usd
 cze    : 22   Min.   :1994   Min.   : 0.000                      
 svn    : 22   1st Qu.:2005   1st Qu.: 9.867                      
 cyp    : 21   Median :2011   Median :18.320                      
 deu    : 21   Mean   :2010   Mean   :19.686                      
 pol    : 21   3rd Qu.:2017   3rd Qu.:26.915                      
 svk    : 21   Max.   :2020   Max.   :48.720                      
 (Other):420                                                      
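
The call that produced this output is not shown. One way to get a comparable summary is to convert the character column to a factor first. The sketch below uses a small made-up data frame; toy_df and its values are illustrative, not Gapminder data.

```r
library(dplyr)

# Made-up toy data, for illustration only
toy_df <- data.frame(
  geo = c("cze", "cze", "deu", "deu", "deu"),
  time = c(2019, 2020, 2018, 2019, 2020),
  labor_cost = c(10, 11, 30, 31, 32),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, then summarise:
# factor columns are reported as counts per level
toy_factor_df <- mutate(toy_df, across(where(is.character), as.factor))
summary(toy_factor_df)
```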

Rename a column with base R

The column name hourly_labour_cost_constant_2017_usd is too long; shorten it to labor_cost.

colnames(laborcost_df)[colnames(laborcost_df) ==
                         "hourly_labour_cost_constant_2017_usd"] <- "labor_cost"
colnames(laborcost_df)
[1] "geo"        "time"       "labor_cost"

Rename a column with dplyr

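# Alternative to the base R approach: run only one of the two.
# Once the column has been renamed, the old name is gone and rename() fails.
laborcost_df <- rename(.data = laborcost_df,
                       labor_cost = hourly_labour_cost_constant_2017_usd)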
colnames(laborcost_df)
[1] "geo"        "time"       "labor_cost"

Filter rows with dplyr

cze_deu_df <- dplyr::filter(laborcost_df, geo %in% c("cze", "deu"))
write_csv(cze_deu_df, file.path(project_path, datasaving_folder, "gapminder_laborcost_cze_deu.csv"))
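
A minimal sketch of what the %in% condition does, using a made-up toy data frame (toy_df and its values are illustrative only):

```r
library(dplyr)

# Made-up toy data, for illustration only
toy_df <- data.frame(
  geo = c("arg", "cze", "deu", "pol", "cze"),
  labor_cost = c(1, 10, 30, 8, 11)
)

# geo %in% c(...) is TRUE for every row whose geo is one of the listed codes
toy_df$geo %in% c("cze", "deu")

# filter() keeps exactly the rows where the condition is TRUE
toy_cze_deu_df <- filter(toy_df, geo %in% c("cze", "deu"))
toy_cze_deu_df
```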

List distinct values with dplyr::distinct

dplyr::distinct(.data = laborcost_df, geo, .keep_all = FALSE)
geo
arg
arm
aus
aut
aze
bel
bgr
can
che
chl
cri
cyp
cze
deu
dnk
esp
est
fin
fra
gbr
geo
grc
hrv
hun
irl
isl
isr
ita
kaz
ltu
lux
lva
mda
mkd
mlt
mus
nld
nor
nzl
phl
pol
prt
rou
rus
svk
svn
swe
ukr

Plot the data set

  • obviously not a very helpful plot, but anyway…
laborcost_plot <- ggplot(data = laborcost_df, 
       mapping = aes(x = time, 
                     y = labor_cost,
                     color = geo)) + 
  geom_point()
laborcost_plot

Comment on the plot

What do you call plots with points and two axes?

How many variables does the plot capture, and how? What types of variables are they?

Look at the script. Try to dissect it into parts and interpret each one.

Different mapping in the same plot

# labor_cost is mapped to alpha (transparency), hence the variable name
laborcost_plot_alpha <- ggplot(data = laborcost_df,
                               mapping = aes(x = geo,
                                             y = time, 
                                             alpha = labor_cost)) + 
  geom_point()
laborcost_plot_alpha

Comment on this plot as well

How does it capture the variables now? Is it telling a different story?

Save a plot

# SVG output requires the svglite package
ggsave(filename = file.path(project_path, output_folder, "laborcost_plot.svg"), 
       plot = laborcost_plot)
# The device is normally guessed from the file extension;
# set device = "png" or device = grDevices::png explicitly when RStudio hiccups
ggsave(filename = file.path(project_path, output_folder, "laborcost_plot.png"), 
       plot = laborcost_plot,
       device = grDevices::png) 
ggsave(filename = file.path(project_path, output_folder, "laborcost_plot.pdf"),
       plot = laborcost_plot)
list.files(path = file.path(project_path, output_folder), pattern = "laborcost_plot")
[1] "laborcost_plot.pdf" "laborcost_plot.png" "laborcost_plot.svg"

First insights about ggplot2

  • Has its own syntax: components are added together with +

  • A plot is an object (it can be stored in a variable)

  • A plot can be saved to a file; the extension in the file name selects the format

  • Maps variables to the x and y axes, color, transparency…
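
The points above can be sketched on a made-up toy data frame. The names toy_df, toy_plot and toy_file are hypothetical, and the file goes to tempdir() so that nothing in the project folder is touched.

```r
library(ggplot2)

# Made-up toy data, for illustration only
toy_df <- data.frame(time = c(2018, 2019, 2020),
                     labor_cost = c(10, 11, 12))

# Nothing is drawn here: the plot is only stored in a variable
toy_plot <- ggplot(data = toy_df,
                   mapping = aes(x = time, y = labor_cost)) +
  geom_point()

# A stored plot can be extended later with further components
toy_plot_with_line <- toy_plot + geom_line()

# The file extension (.pdf) tells ggsave() which format to write
toy_file <- file.path(tempdir(), "toy_plot.pdf")
ggsave(filename = toy_file, plot = toy_plot_with_line,
       width = 4, height = 3)
file.exists(toy_file)
```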

ggplot2 ≈ an implementation of the Grammar of Graphics

https://ggplot2.tidyverse.org/

  • All plots share the same logic and are built from the same components, arranged in a few layers.

  • Plots are not just drawings: statistical transformations run behind the scenes (e.g. binning and counting for a histogram).

  • Seeing a ggplot2 plot gives you an idea of how the source table is structured.
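
To make the behind-the-scenes transformation visible, the following sketch builds a histogram from made-up values (toy_df is illustrative only) and then asks ggplot2 for the table it computed for that layer.

```r
library(ggplot2)

# Made-up toy data, for illustration only
toy_df <- data.frame(labor_cost = c(1, 2, 2, 3, 3, 3, 8, 9, 9, 10))

# The data hold raw values only; geom_histogram() bins and counts them
toy_hist <- ggplot(data = toy_df, mapping = aes(x = labor_cost)) +
  geom_histogram(binwidth = 2)

# layer_data() returns the table ggplot2 computed for the layer,
# including the per-bin count column that does not exist in toy_df
toy_bins <- layer_data(toy_hist, i = 1)
toy_bins[, c("xmin", "xmax", "count")]
```

The counts add up to the number of rows in toy_df: every raw value lands in exactly one bin.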

Layers of ggplot2

  • Data

  • Aesthetic mappings + Facets (subgraphs)

  • Geometric objects (aka geoms)

  • Statistical transformations (aka stats)

  • Coordinate system

  • Theme
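
A sketch that spells out each of the listed layers in one call. The data frame is made up for illustration, and stat = "identity", coord_cartesian() and theme_minimal() are written out only to show where they go; the first two are what ggplot2 would use for this plot anyway.

```r
library(ggplot2)

# Made-up toy data, for illustration only
toy_df <- data.frame(
  geo = rep(c("cze", "deu"), each = 3),
  time = rep(c(2018, 2019, 2020), times = 2),
  labor_cost = c(10, 11, 12, 30, 31, 32)
)

toy_layers_plot <-
  ggplot(data = toy_df,                              # data
         mapping = aes(x = time, y = labor_cost)) +  # aesthetic mappings
  facet_wrap(vars(geo)) +                            # facets (subgraphs)
  geom_point(stat = "identity") +                    # geom with its stat
  coord_cartesian() +                                # coordinate system
  theme_minimal()                                    # theme
toy_layers_plot
```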

Data

  • data frame with tidy data structure

    • each observation on one row

    • each variable in one column

  • categorical variables stored as character columns are treated as discrete automatically, as if they were factors

Aesthetic scales (aka mappings, aesthetics)

  • axes X, Y

  • shape / linetype

  • color / fill / stroke

  • size / linewidth

  • alpha (transparency)

  • label

Geometric objects aka geoms

plot types, such as:

  • histogram

  • scatterplot

  • barplot

  • boxplot

  • heatmap

  • and many others
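
As a rough illustration of how the geom alone changes the plot type, the sketch below reuses one made-up data frame (toy_df is illustrative only) with three different geoms.

```r
library(ggplot2)

# Made-up toy data, for illustration only
toy_df <- data.frame(
  geo = rep(c("cze", "deu"), each = 3),
  time = rep(c(2018, 2019, 2020), times = 2),
  labor_cost = c(10, 11, 12, 30, 31, 32)
)

# Data and mappings are shared; only the geom differs
toy_base <- ggplot(data = toy_df,
                   mapping = aes(x = time, y = labor_cost, color = geo))
toy_scatterplot <- toy_base + geom_point()  # scatterplot
toy_lineplot <- toy_base + geom_line()      # line plot

# A boxplot needs a categorical x axis, hence a different mapping
toy_boxplot <- ggplot(data = toy_df,
                      mapping = aes(x = geo, y = labor_cost)) +
  geom_boxplot()
```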

A neater example dataset: just Czechia and Germany

  • only two values of a categorical variable, under 50 rows
cze_deu_df <- read_csv(file.path(project_path, datasaving_folder, "gapminder_laborcost_cze_deu.csv"))
ggplot(data = cze_deu_df, 
       mapping = aes(x = time, y = labor_cost, color = geo)) + 
  geom_point(size = 7)

Syntax

ggplot(data, mappings) + geom_…( )

or

ggplot(data) + geom_…(mappings)
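
A sketch of the two forms side by side, on made-up data (toy_df is illustrative only). With a single geom both give the same plot; the difference matters once more geoms are added, because mappings given in ggplot() are inherited by every geom, while mappings given inside a geom apply to that geom only.

```r
library(ggplot2)

# Made-up toy data, for illustration only
toy_df <- data.frame(time = c(2018, 2019, 2020),
                     labor_cost = c(10, 11, 12))

# Form 1: mappings in ggplot() are inherited by all geoms
plot_global <- ggplot(data = toy_df,
                      mapping = aes(x = time, y = labor_cost)) +
  geom_point()

# Form 2: mappings inside the geom apply to this geom only
plot_local <- ggplot(data = toy_df) +
  geom_point(mapping = aes(x = time, y = labor_cost))

# Both forms place the points at the same positions
all.equal(layer_data(plot_global, i = 1)[, c("x", "y")],
          layer_data(plot_local, i = 1)[, c("x", "y")])
```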