Aesthetic scales and geometries with statistical transformations

Different plots for different stories

  • use the ggplot2 cheatsheet (hard copy, or in RStudio under Help > Cheat Sheets in the top menu bar)

  • some demonstrations on diamonds, a dataset that ships with ggplot2

library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
colnames(diamonds)
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"      

Structure of the diamonds data frame

str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Histogram

  • one quantitative variable

  • frequency distribution of value intervals

ggplot(data = diamonds) + 
  geom_histogram(mapping = aes(x = price), binwidth = 1000) + 
  theme(axis.text = element_text(size = 24), 
        axis.title = element_text(size = 24))
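
To see what the histogram counts, the binning can be reproduced as a table. This is a minimal sketch, not what geom_histogram runs internally: it uses cut_width() from ggplot2 with boundary = 0, an assumption chosen to get readable intervals, so the bin edges will not line up exactly with those in the plot. The name price_bins is only illustrative.

```r
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

# Cut price into 1000-USD intervals starting at 0 and count diamonds per interval.
# boundary = 0 is an illustrative choice, not the alignment used in the plot above.
price_bins <- count(diamonds, bin = cut_width(price, width = 1000, boundary = 0))
price_bins
```

Every diamond falls into exactly one interval, so the counts add up to the number of rows in diamonds.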

Density

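  • like a histogram, but a smoothed estimate of the distribution: the area under the curve between two prices approximates how likely you are to pay a sum in that range (in USD) if you pick a random diamond. The height at a single price is a density, not a probability.

ggplot(data = diamonds) + geom_density(mapping = aes(x = price))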
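
The reading of a density curve as "area equals probability" can be checked numerically. The sketch below uses base R's density() with its default bandwidth, which should be close to what the plot shows but is not guaranteed to be identical. The helper trapezoid() and the 1000 to 2000 USD range are illustrative assumptions.

```r
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

# Kernel density estimate of price (base R, default bandwidth rule)
d <- density(diamonds$price)

# Trapezoid rule: approximate area under a curve given as points (x, y)
trapezoid <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# The whole curve encloses an area of about 1
total_area <- trapezoid(d$x, d$y)

# Area between 1000 and 2000 USD, next to the observed share of diamonds in that range
in_range <- d$x >= 1000 & d$x <= 2000
area_1k_2k <- trapezoid(d$x[in_range], d$y[in_range])
share_1k_2k <- mean(diamonds$price >= 1000 & diamonds$price <= 2000)

c(total_area = total_area, area_1k_2k = area_1k_2k, share_1k_2k = share_1k_2k)
```

The two last numbers will not match exactly, because the density is smoothed and the grid does not hit the range limits precisely, but they should be of similar size.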

Empirical Cumulative Distribution Function (ECDF)

  • shows the proportion of diamonds that cost a given price or less, that is, how likely you are to pay at most that price for a random diamond

ggplot(data = diamonds) + geom_density(mapping = aes(x = price), stat = "ecdf")
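
The same quantity can be computed without a plot. A short sketch with base R's ecdf(); the 1000 USD threshold and the name price_ecdf are arbitrary choices for illustration.

```r
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

# ecdf() returns a function: price_ecdf(p) is the share of diamonds costing p or less
price_ecdf <- ecdf(diamonds$price)

price_ecdf(1000)                # share of diamonds costing at most 1000 USD
mean(diamonds$price <= 1000)    # the same share, computed directly
price_ecdf(max(diamonds$price)) # every diamond costs at most the maximum price
```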

Boxplot

ggplot(data = diamonds) + geom_boxplot(mapping = aes(y = price))

Boxplots compared

  • You can break them by a discrete variable (categorical or ordinal)

  • for instance map cut on the x-axis …

ggplot(data = diamonds) + geom_boxplot(mapping = aes(y = price, x = cut))
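
The box in each boxplot spans the first to the third quartile, with the median as the line inside. As a rough cross-check, the same three numbers can be tabulated per cut with dplyr. This is a sketch; the column names q1, median and q3 are made up for this example.

```r
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

# Quartiles of price within each cut: the numbers behind the boxes
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    q1     = quantile(price, 0.25, names = FALSE),
    median = median(price),
    q3     = quantile(price, 0.75, names = FALSE)
  )
price_by_cut
```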

Boxplots compared

  • or map cut on fill or color

ggplot(data = diamonds) + geom_boxplot(mapping = aes(y = price, fill = cut))

Bar plot with counts (default)

Each bar represents one category (value) of the variable; its height shows the number of observations in that category (the default).

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut), stat = "count") 

Override default stat on Y in a bar plot

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = price), 
           stat = "summary", fun = mean) # "summary" is a quoted stat name; mean is an unquoted function
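
An alternative to overriding the stat is to compute the summary first and plot the finished numbers with geom_col, which draws bar heights exactly as given. A minimal sketch; mean_price is a name introduced only for this example.

```r
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

# Mean price per cut, computed before plotting
mean_price <- diamonds %>%
  group_by(cut) %>%
  summarise(price = mean(price))

# geom_col() uses the y values as they are, without any counting or summarising
ggplot(data = mean_price) + geom_col(mapping = aes(x = cut, y = price))
```

Precomputing makes the plotted numbers easy to inspect, at the cost of an extra step.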

Bar positions with another categorical variable: stack

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") # the default position

Bar positions with another categorical variable: fill

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
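
With position = "fill", every bar is rescaled to height 1, so the segments show shares of clarity within each cut instead of counts. The sketch below computes those shares by hand; clarity_share and share are illustrative names.

```r
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

# Count each cut/clarity combination, then divide by the total of its cut
clarity_share <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
clarity_share
```

Within each cut the shares sum to 1, which is why all bars in the fill plot have the same height.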

Bar positions with another categorical variable: dodge

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Scatterplot

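  • best for exploring the association between two continuous variables

# despite the name, this selects by depth: the 40 diamonds with the greatest depth values (ties are kept)
biggest_diamonds <- slice_max(diamonds, order_by = depth, n = 40)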
ggplot(data = biggest_diamonds) + 
  geom_point(mapping = aes(x = depth, y = table) )

Alleviate overplotting with alpha and jitter

ggplot(biggest_diamonds) + 
  geom_point(aes(x = depth, 
                 y = table), 
             alpha = 0.3)

set.seed(2525) # reproducible random numbers
ggplot(biggest_diamonds) + 
  geom_point(aes(x = depth, 
                 y = table), position = "jitter")

alpha is one of the aesthetics, just like color, shape, or the x and y positions. It controls opacity and takes values between 0 and 1: alpha = 0.3, as used above, draws each point at 30% opacity, so places where many points overlap appear darker. Jittering is a technique that adds a small amount of random noise to each point's position. Without jittering, points with identical or rounded values sit exactly on top of each other or line up in rows. Jittering spreads them out a bit.
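
Both techniques can be combined in one layer, and the amount of noise can be set explicitly with position_jitter(). A sketch under assumptions: the width and height of 0.2 are arbitrary values picked for illustration, and the seed argument of position_jitter() is used here to fix the noise for this layer only.

```r
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

biggest_diamonds <- slice_max(diamonds, order_by = depth, n = 40)

# Transparent points, each shifted by at most 0.2 units horizontally and vertically
p <- ggplot(biggest_diamonds) +
  geom_point(aes(x = depth, y = table),
             alpha = 0.3,
             position = position_jitter(width = 0.2, height = 0.2, seed = 2525))
p
```

Keep the jitter small relative to the spacing of the data, otherwise the plot suggests differences that are not in the values.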

set.seed is a function you use when a computation involves random numbers but you want it to be reproducible; that is, you want the same random numbers every time you run the script. Its argument is an arbitrary integer (the seed); it does not have to be random itself. Here it makes sure that the jittered points are always positioned as you see in the figure. You can set a seed, run a function that uses random numbers, and, if you do not like the result, re-run it with a different seed. Iterate until you are happy with the result.
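
The effect of the seed is easy to demonstrate with any random number generator. A minimal sketch using runif(); the seed values and variable names are arbitrary.

```r
# Same seed, same random numbers
set.seed(2525)
first_draw <- runif(3)

set.seed(2525)
second_draw <- runif(3)

identical(first_draw, second_draw) # TRUE

# A different seed gives a different sequence to try out
set.seed(1)
third_draw <- runif(3)
rbind(first_draw, second_draw, third_draw)
```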

Overplotting reduction: Scatterplot with hexagonal bins

ggplot(data = diamonds) + geom_hex(aes(x = depth, y = table), binwidth = c(2, 2)) # bin width and height; needs the hexbin package

Overplotting reduction: regression models

set.seed(559900)
diamonds_sample <- sample_n(tbl = diamonds, size = 1000) 
ggplot(data = diamonds_sample) + 
  geom_smooth(aes(x = depth, y = table), method = "gam", se = TRUE )

Linear regression model

ggplot(data = diamonds_sample) + 
  geom_smooth(aes(x = depth, y = table), method = "lm", se = TRUE, formula = y~x )
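
The line drawn with method = "lm" and formula = y~x corresponds to an ordinary linear model of table on depth. To read off its intercept and slope, the model can be fitted directly. A sketch that repeats the sampling step so it runs on its own; fit is an illustrative name.

```r
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

set.seed(559900)
diamonds_sample <- sample_n(tbl = diamonds, size = 1000)

# Same model as in the plot: y = table, x = depth
fit <- lm(table ~ depth, data = diamonds_sample)
coef(fit) # intercept and slope of the regression line
```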

Scatterplots with discrete variables

ggplot(data = diamonds, mapping = aes(x = clarity, y = color)) + 
  geom_count( )
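
geom_count places one point per combination of the two variables and scales its size by the number of observations. The counts behind the point sizes can be listed with dplyr. A sketch; combo_counts is a name made up for this example.

```r
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)

# Number of diamonds for every clarity/color combination that occurs in the data
combo_counts <- count(diamonds, clarity, color)
combo_counts
```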

Several geoms in one plot: smooth and scatterplot

ggplot(data = diamonds_sample, mapping = aes(x = depth, y = table)) + 
  geom_smooth( method = "lm", se = TRUE, formula = y~x ) +
  geom_point(alpha = 0.2, position = "jitter")

Combination with geom_text

  • geom_text is typically used like a scatterplot with labels instead of points.

  • or to label bars in a bar plot:

ggplot(diamonds, aes(x = cut)) +
  geom_bar() +
  geom_text(
    aes(label = after_stat(count)),
    stat = "count",
    vjust = -0.1, # slightly above the bar
    color = "seagreen"
  ) 

Facets (wrap)

  • subgraphs: yet another way to break by a categorical variable

ggplot(data = diamonds) + 
  geom_smooth(mapping = aes(x = carat, y = price, color = color),
              method = "lm", formula = y ~ x) + 
  facet_wrap(~ clarity, ncol = 4)

Facets (grid)

ggplot(data = diamonds) + geom_boxplot(aes(y = price, fill = cut)) +
  facet_grid(color ~ clarity)