Visualization exercise

1 Examine billionaires

library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
library(readr, warn.conflicts = FALSE, quietly = TRUE)

2 Your task for this session

Write a report about the presented dataset with at least five plots and formulate their messages in words (interpret each plot). You do not need to use many different geoms, but you should make sure that each plot answers a sensible question about the data.

The purpose of this exercise is to motivate you to think about data as a trove of arguments for your imagined cause or insights into your imagined research question. This frame does not need to be very sophisticated; it is merely there to make you think, ask questions of the data, and make meaningful and interpretable plots.

The dataset lists global billionaires over a time span. You could illustrate the consequences of a hypothetical global tax on individual continents, demonstrate societal changes among billionaires over the decades, or pursue anything else you like. Negative results are fine (i.e. your plot does not show a trend that you expected), but please describe what you were expecting to see instead, in terms of the variables rather than the visual elements (i.e., “I expected more billionaires to be under thirty” rather than “I thought the curve would begin at a lower value”).

If you get really stuck, you can peek at the source of this file (the corresponding .qmd file) where you will see the code chunks that generated the example plots below.

3 Your task for the next session

In this session, you are going to generate plenty of ideas and implement their visualizations. Note all questions for which you could not imagine (or implement) an adequate plot. You are very welcome to sketch by hand any plots you would have liked to make. Together we will try to generalize your inputs and find principled solutions.

4 Your file and data

Create a Quarto file named <YOURLOGIN>_E_03_01_01.qmd and save it in my_exercise_scripts.
Replace the YAML header with the one from the file YAML_header_multiformat in your ATRIUM resources folder. Remember to render an HTML file when you are done. Please do not use #| echo: false in your code chunks (leave all your code visible when rendered).

Read the file ~/R_BEGINNERS_SHORT/datasets_ATRIUM/billionaires_combined.tsv and save it in a variable named billionaires_df.

billionaires_df <- read_tsv("~/R_BEGINNERS_SHORT/datasets_ATRIUM/billionaires_combined.tsv")

Rows: 28986 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (17): person, name.x, state, headquarters, source, industry, gender, las...
dbl  (4): time, daily_income, age, birth_comb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore the dataset and keep noting what you are looking for and what you see. Make at least ten plots. Some suggestions with examples follow.
Do not worry about warning messages from ggplot. They often mean that some observations were dropped because they contained empty values in the variables you wanted to plot, or that there were not enough observations to compute a statistic.
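
If you want to know how many rows such a warning refers to, you can count the missing values yourself. The following is a minimal sketch on a tiny invented table, toy_df, standing in for billionaires_df; only the column names age and daily_income are taken from the dataset, all values are made up.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data (invented values); in the exercise use billionaires_df
toy_df <- tibble(
  person       = c("a", "b", "c", "d"),
  age          = c(45, NA, 71, NA),
  daily_income = c(3e5, 5e5, NA, 4e5)
)

# Number of missing values in every column
na_counts <- summarise(toy_df, across(everything(), ~ sum(is.na(.x))))
na_counts

# Rows that a plot of daily_income against age would drop
n_dropped <- nrow(filter(toy_df, is.na(age) | is.na(daily_income)))
n_dropped
```

A row is dropped as soon as one of the plotted variables is missing, so n_dropped can be larger than the count for any single column.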

5 Billionaires counts over time, broken by sex
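
One possible way to approach this plot, as a sketch and not necessarily the code behind the example plot: count the rows per year and sex with count(), then draw one line per sex. The table toy_df and its values are invented; the column names time and sex come from the dataset, and the labels "male" and "female" are an assumption about how sex is coded.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Hypothetical toy data (invented values); in the exercise use billionaires_df
toy_df <- tibble(
  time = c(2010, 2010, 2010, 2020, 2020, 2020, 2020),
  sex  = c("male", "male", "female", "male", "male", "male", "female")
)

# One row per combination of year and sex, with the number of billionaires
counts_df <- count(toy_df, time, sex, name = "n_billionaires")

p_counts <- ggplot(counts_df, aes(x = time, y = n_billionaires, color = sex)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Number of billionaires", color = "Sex")
p_counts
```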

6 Daily income by age, sex and time

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 1912 rows containing non-finite outside the scale range
(`stat_smooth()`).

7 Daily income by age and sex, any time

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 1912 rows containing non-finite outside the scale range
(`stat_smooth()`).

It seems that female billionaires beyond ninety are much richer than male billionaires!
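
A plot of this kind could be sketched as follows. This is an illustration on invented data (toy_df), not the code of the example plot. Rows with missing values are removed before plotting, which is one way to avoid the "Removed ... rows" warning. Note that geom_smooth() without a method argument chooses loess for fewer than 1000 observations and gam for larger data, so the toy example reports loess while the full dataset reports gam.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Hypothetical toy data (invented values); in the exercise use billionaires_df
ages <- seq(30, 90, by = 5)
toy_df <- tibble(
  age          = rep(c(ages, NA), times = 2),
  sex          = rep(c("female", "male"), each = length(ages) + 1),
  daily_income = c(1000 + 20 * ages, 2500,
                   1500 + 10 * ages, 2000)
)

# Keep only rows where both plotted variables are present
complete_df <- filter(toy_df, !is.na(age), !is.na(daily_income))

p_smooth <- ggplot(complete_df, aes(x = age, y = daily_income, color = sex)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +  # default method: loess here, gam from 1000 rows up
  labs(x = "Age", y = "Daily income", color = "Sex")
p_smooth
```

Adding facet_wrap(~ time) to such a plot would give one panel per year, provided the data contain the time column.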

Let us look at a linear model to get the most salient overall trend.

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 1912 rows containing non-finite outside the scale range
(`stat_smooth()`).

The curvy smoothing showed a dip for female billionaires around age 80, but the overall linear trend is that females surpass males in daily income from around age 75 (judging by the confidence intervals shown as ribbons).
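
The linear version can be sketched by passing method = "lm" to geom_smooth(). The example below again uses invented data (toy_df) and, as an extra step that is not part of the original exercise, fits the same two lines with lm() to compute the age at which they cross. The point where the fitted lines cross is not the same as the age from which the confidence ribbons stop overlapping, so treat it only as a rough orientation.

```r
library(ggplot2)

# Hypothetical toy data (invented values); in the exercise use billionaires_df
ages <- seq(30, 90, by = 10)
toy_df <- data.frame(
  age = rep(ages, times = 2),
  sex = rep(c("female", "male"), each = length(ages))
)
toy_df$daily_income <- ifelse(toy_df$sex == "female",
                              1000 + 20 * toy_df$age,
                              1500 + 10 * toy_df$age)

# One straight line per sex, with confidence ribbons
p_lm <- ggplot(toy_df, aes(x = age, y = daily_income, color = sex)) +
  geom_smooth(method = "lm", formula = y ~ x)
p_lm

# The same lines as a model: separate intercept and slope for each sex
fit <- lm(daily_income ~ age * sex, data = toy_df)
b <- coef(fit)

# Age at which the two fitted lines cross
crossing_age <- -b[["sexmale"]] / b[["age:sexmale"]]
crossing_age
```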

8 How to make a subset of the dataset

Example: Create billionaires_2020 by filtering only year 2020.

billionaires_df <- read_tsv(file = "~/R_BEGINNERS_SHORT/datasets_ATRIUM/billionaires_combined.tsv", show_col_types = FALSE)

billionaires_2020 <- filter(billionaires_df, time == 2020)

9 Compare the count of male and female billionaires for each world region

Suggestion: barplot with dodged bars, faceted by year.

Tilt axis text to make it legible

When axis labels overlap, add this code and fiddle with angle, hjust (or vjust):

+ theme(axis.text.x = element_text(angle = 60, hjust = 1))
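
Put together, the suggestion and the tilted labels could look like the sketch below. The data are invented (toy_df is built from a small table of made-up counts); the column names time, sex and world_6region come from the dataset, while the region labels used here are an assumption.

```r
library(ggplot2)

# Hypothetical counts (invented), expanded to one row per person
toy_counts <- data.frame(
  time          = rep(c(2010, 2020), each = 4),
  world_6region = rep(c("america", "europe_central_asia"), times = 2, each = 2),
  sex           = rep(c("female", "male"), times = 4),
  n             = c(1, 4, 1, 3, 2, 6, 1, 4)
)
toy_df <- toy_counts[rep(seq_len(nrow(toy_counts)), times = toy_counts$n),
                     c("time", "world_6region", "sex")]

p_bars <- ggplot(toy_df, aes(x = world_6region, fill = sex)) +
  geom_bar(position = "dodge") +   # bars of both sexes side by side
  facet_wrap(~ time) +             # one panel per year
  labs(x = "World region", y = "Number of billionaires", fill = "Sex") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p_bars
```

geom_bar() counts the rows itself, so the data can stay at one row per person and year.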

10 Sex, age, and residence in 2020 vs. in 2010

residence/continents \(\approx\) world_6region

billionaires_2010vs2020 <- 
  filter(billionaires_df, time %in% c(2010, 2020))
nrow(billionaires_2010vs2020)

[1] 3375

colnames(billionaires_2010vs2020)

 [1] "person"             "time"               "daily_income"      
 [4] "name.x"             "state"              "headquarters"      
 [7] "source"             "industry"           "age"               
[10] "gender"             "last_name"          "sex"               
[13] "permanent_country"  "company"            "birth_comb"        
[16] "countries"          "name.y"             "income_groups"     
[19] "main_religion_2008" "world_6region"      "west_and_rest"

Warning in geom_density(mapping = aes(x = age, color = sex), binwidth = 1):
Ignoring unknown parameters: `binwidth`

Warning: Removed 318 rows containing non-finite outside the scale range
(`stat_density()`).

Warning: Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning: Removed 318 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Warning: Removed 318 rows containing non-finite outside the scale range
(`stat_ydensity()`).

Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
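
Most of the warnings above can be avoided by preparing the data before plotting. The sketch below, on invented data (toy_df), removes rows with a missing age and groups with fewer than two observations, and it controls the smoothness of the density curve with adjust, because geom_density() has no binwidth parameter (adjust multiplies the default bandwidth, bw sets it directly). This is one possible approach, not the code that produced the example plots.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Hypothetical toy data (invented values); in the exercise use billionaires_df
toy_df <- tibble(
  time = c(2010, 2010, 2010, 2010, 2010, 2020, 2020, 2020, 2020, 2020, 2020),
  sex  = c("male", "male", "male", "female", "female",
           "male", "male", "male", "female", "female", "female"),
  age  = c(55, 62, 70, 48, NA, 58, 66, 73, 51, 60, NA)
)

# Keep the two years, drop missing ages, then drop groups with under two rows
known_age_df <- filter(toy_df, time %in% c(2010, 2020), !is.na(age))
plot_df <- ungroup(filter(group_by(known_age_df, time, sex), n() >= 2))

p_density <- ggplot(plot_df, aes(x = age, color = sex)) +
  geom_density(adjust = 1) +  # adjust replaces the unsupported binwidth
  facet_wrap(~ time)
p_density

# The same comparison as boxplots
p_box <- ggplot(plot_df, aes(x = sex, y = age)) +
  geom_boxplot() +
  facet_wrap(~ time)
p_box
```

In the toy data the only female billionaire with a known age in 2010 is removed, so that panel shows males only; whether dropping such small groups is acceptable depends on the question you are asking.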

11 West and rest over time

Are Western billionaires richer than those from other regions? Has this changed over time?

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
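
As an alternative to the smoother reported above, the question can also be approached with a summary per year. The sketch below computes the median daily income for each year and group and draws one line per group. It runs on invented data (toy_df); the column names time, daily_income and west_and_rest come from the dataset, and the labels "west" and "rest" are an assumption about its coding.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Hypothetical toy data (invented values); in the exercise use billionaires_df
toy_df <- tibble(
  time          = rep(c(2000, 2010, 2020), each = 4),
  west_and_rest = rep(c("west", "west", "rest", "rest"), times = 3),
  daily_income  = c(100, 300,  50, 150,
                    200, 400, 150, 250,
                    300, 500, 350, 450)
)

# Median daily income per year and group
summary_df <- summarise(
  group_by(toy_df, time, west_and_rest),
  median_income = median(daily_income, na.rm = TRUE),
  .groups = "drop"
)

p_west <- ggplot(summary_df,
                 aes(x = time, y = median_income, color = west_and_rest)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Median daily income", color = "Group")
p_west
```

The median is less sensitive to a few extremely rich individuals than the mean, which matters for income data; a smoother on the raw points, as in the example plot, shows the spread as well.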