Print path to your current Working Directory
#|name: getwd
getwd()
[1] "/lnet/aic/personal/cinkova/R_BEGINNERS_SHORT"
Set a different Working Directory
setwd("~/folder/subfolder/") # ~ means your home
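As a minimal sketch of the two functions working together: setwd() invisibly returns the previous Working Directory, so you can store it and switch back later. The temporary folder from tempdir() is only a stand-in for a folder of your own, and the variable names are made up for this illustration.

```r
# Remember where we started.
start_dir <- getwd()

# setwd() invisibly returns the previous Working Directory.
previous_dir <- setwd(tempdir())   # tempdir() is just a stand-in folder

getwd()                            # now points to the temporary folder

# Switch back to where we started.
setwd(previous_dir)
```

Storing the old path this way saves you from retyping it when you only need to leave your Working Directory for a moment.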
Make sure that your Working Directory is your home.
Create a new folder in your home. Call it R_BEGINNERS_SHORT.
Enter that folder. Make it your Working Directory (Gear icon → Set As Working Directory).
Create new folders datasets_ATRIUM and my_output_files.
In the File tab, select New project → In an existing directory, and pick R_BEGINNERS_SHORT.
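The folder-creating steps above can also be done from the console. This is only a sketch: it builds the folders inside a temporary directory (course_home is a hypothetical stand-in for your home), so running it does not touch your real files.

```r
# Hypothetical stand-in for your home folder; use "~" to do it for real.
course_home <- tempdir()

# file.path() joins folder names with the right separator.
project_dir <- file.path(course_home, "R_BEGINNERS_SHORT")

# recursive = TRUE also creates missing parent folders.
dir.create(file.path(project_dir, "datasets_ATRIUM"), recursive = TRUE)
dir.create(file.path(project_dir, "my_output_files"), recursive = TRUE)

# Check the result: list the folders inside the project folder.
list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
```

Creating the RStudio project itself is still easiest through the menu, as described above.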
If you follow this procedure, you will not need to adapt file paths in the teaching materials to your user account, except perhaps the user account name.
The .Rproj file stores the project configuration.
The next time you open this project, RStudio tries to restore the workspace from your last session.
library(glue) # enables multiline with \\
URL <- glue("https://raw.githubusercontent.com/open-numbers/ddf--gapminder--\\
            systema_globalis/refs/heads/master/countries-etc-datapoints/ddf--\\
            datapoints--hourly_labour_cost_constant_2017_usd--by--geo--time.csv")
my_destination <- glue("datasets_ATRIUM/\\
                       gapminder_hourly_labour_cost_constant_2017_\\
                       usd--by--geo--time.csv")
download.file(
url = URL,
destfile = my_destination
)
The download.file function is a general-purpose tool for downloading a file from any URL. Often you can copy a download link from a website and use that URL to download the file programmatically.
Downloading data from GitHub is a bit specific. Here I work with data from Gapminder on GitHub. Their repository is very large and this was a largely random pick: https://github.com/open-numbers/ddf--gapminder--systema_globalis/tree/master/countries-etc-datapoints. This repository contains a table that explains each data set, but I am going to select one that is intelligible without reading much metadata. It is a table of the average labor cost in a given country in a given year: https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/refs/heads/master/countries-etc-datapoints/ddf--datapoints--hourly_labour_cost_constant_2017_usd--by--geo--time.csv.
Manually navigate to the file you want and copy its URL. Make sure to use the URL that appears when you hit the Raw button (starting with https://raw.githubusercontent) to download the contents of the file. From the default https://github.com/…. address you would only download an HTML file of the web page you are seeing.
Use the download.file function. Leave all arguments at their defaults, except url and destfile. Put the file into the new, empty datasets_ATRIUM folder. Use the end part of the original file name, give it the prefix gapminder_, and keep doing this with all files that you happen to download from this source. This will help you keep your files organized.
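The naming convention described above can be sketched as a small helper. The function gapminder_destfile and its folder argument are hypothetical, written only for this illustration. It assumes that the part to drop from the original file name is the leading ddf--datapoints--, as in the example file.

```r
# Build a local file name from a Gapminder raw URL (illustrative helper).
gapminder_destfile <- function(url, folder = "datasets_ATRIUM") {
  file_name <- basename(url)                              # last part of the URL
  short_name <- sub("^ddf--datapoints--", "", file_name)  # drop the leading part
  file.path(folder, paste0("gapminder_", short_name))     # add folder and prefix
}

example_url <- paste0(
  "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--",
  "systema_globalis/refs/heads/master/countries-etc-datapoints/",
  "ddf--datapoints--hourly_labour_cost_constant_2017_usd--by--geo--time.csv"
)

gapminder_destfile(example_url)
```

The returned path could then be passed to download.file as destfile, so that every file from this source ends up with the same prefix.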
.csv/.tsv file: plain text with column separators: ; , or tab
Inspect the file by reading it as text (first 3 lines)
mypath <- glue("datasets_ATRIUM/gapminder_hourly_labour_cost_constant_2017_usd\\
               --by--geo--time.csv")
library(readr)
read_lines(
file = mypath,
n_max = 3)
[1] "geo,time,hourly_labour_cost_constant_2017_usd"
[2] "arg,2011,0.92"
[3] "arg,2012,1.04"
readLines(
con = mypath,
n = 3)
[1] "geo,time,hourly_labour_cost_constant_2017_usd"
[2] "arg,2011,0.92"
[3] "arg,2012,1.04"
What you are seeing are the first three lines of a tabular file that we have just read as a text file, assuming no columns or headers. This comes in handy when, for instance, a file is too large to open interactively in a text editor.
A tabular file is a plain-text file where each line is one table row and the columns on each line are separated by the same character throughout the file. The best-known tabular format is comma-separated values (csv). The original U.S. format uses commas. The European csv uses semicolons because the comma is often reserved for the decimal separator (vs. the decimal point in the U.S.). To avoid these issues altogether, it is best to save your files as tsv (tab-separated values).
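To see the difference between the three variants as plain text, here is a small sketch that writes the same tiny table in each format with base-R functions and prints the first data row. The table and the temporary file names are made up for this illustration.

```r
# A tiny example table.
labour <- data.frame(geo  = c("arg", "arg"),
                     time = c(2011, 2012),
                     cost = c(0.92, 1.04))

csv_us <- tempfile(fileext = ".csv")
csv_eu <- tempfile(fileext = ".csv")
tsv    <- tempfile(fileext = ".tsv")

# U.S. csv: comma as separator, decimal point.
write.csv(labour, csv_us, row.names = FALSE, quote = FALSE)
# European csv: semicolon as separator, decimal comma.
write.csv2(labour, csv_eu, row.names = FALSE, quote = FALSE)
# tsv: tab as separator, decimal point.
write.table(labour, tsv, sep = "\t", row.names = FALSE, quote = FALSE)

# Compare the first data row of each file as plain text.
readLines(csv_us)[2]   # "arg,2011,0.92"
readLines(csv_eu)[2]   # "arg;2011;0,92"
readLines(tsv)[2]      # "arg\t2011\t0.92"
```

Note how the European variant changes both the separator and the decimal mark, which is why mixing the two csv flavours so often goes wrong.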
In the code above you see two functions that look similar and whose output looks exactly the same. One is a base-R function, the other is from a tidyverse package called readr. Feel free to choose either and just make a mental note that there is an alternative. Sometimes, when a file is tricky to read in with one function, it works well with the other.
Look at the Help for either function and explore its other arguments using the file you have just loaded.
readr
read_csv, read_csv2, read_tsv: tailored to the common separators , ; and tab
read_delim: you name the separator (aka delimiter), more arguments
read_csv(file = mypath,
n_max = 3) #just top 3 rows
Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): geo
dbl (2): time, hourly_labour_cost_constant_2017_usd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 3
geo time hourly_labour_cost_constant_2017_usd
<chr> <dbl> <dbl>
1 arg 2011 0.92
2 arg 2012 1.04
3 arm 2011 4.23
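For comparison, here is a minimal sketch of read_delim, where you name the separator yourself. It reads a small semicolon-separated file written to a temporary location. The file content and the variable names are made up for this illustration.

```r
library(readr)

# A small semicolon-separated file in a temporary location.
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("geo;time;cost",
             "arg;2011;0.92",
             "arg;2012;1.04"), tmp_file)

# read_delim() needs the separator spelled out in delim.
labour <- read_delim(file = tmp_file,
                     delim = ";",
                     show_col_types = FALSE)  # quiet the column specification
labour
```

The same call with delim = "," or delim = "\t" would cover what read_csv and read_tsv do for you by default.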
read_csv
read_csv(file = mypath,
col_names = c("country", "year", "USD_hour_2017"),
n_max = 3)
Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, year, USD_hour_2017
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 3
country year USD_hour_2017
<chr> <chr> <chr>
1 geo time hourly_labour_cost_constant_2017_usd
2 arg 2011 0.92
3 arg 2012 1.04
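In the output above, the original header line ended up as the first data row and every column was read as text. That is because col_names supplies the names itself, so read_csv treats every line as data. One way around this, sketched here on a small temporary file rather than on the downloaded one, is to skip the original header with skip = 1.

```r
library(readr)

# A temporary copy of the first lines of the file shown above.
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("geo,time,hourly_labour_cost_constant_2017_usd",
             "arg,2011,0.92",
             "arg,2012,1.04"), tmp_file)

labour <- read_csv(file = tmp_file,
                   col_names = c("country", "year", "USD_hour_2017"),
                   skip = 1,                # drop the original header line
                   show_col_types = FALSE)  # quiet the column specification
labour
```

With the header skipped, year and USD_hour_2017 are read as numbers again.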
URL2 <- glue("https://raw.githubusercontent.com/open-numbers/ddf--gapminder--\\
             systema_globalis/refs/heads/master/countries-etc-datapoints/\\
             ddf--datapoints--hourly_labour_cost_constant_2017_usd--by--geo--\\
             time.csv")
read_csv(file = URL2,
n_max = 3)
Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): geo
dbl (2): time, hourly_labour_cost_constant_2017_usd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 3
geo time hourly_labour_cost_constant_2017_usd
<chr> <dbl> <dbl>
1 arg 2011 0.92
2 arg 2012 1.04
3 arm 2011 4.23
URL3 <- glue("https://docs.google.com/spreadsheets/d/1qHalit8s\\
             XC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/export?format=xlsx")
download.file(url = URL3,
destfile =
"datasets_ATRIUM/gapminder_geonames.xlsx",
mode = "wb")
With Windows file formats such as xlsx, and on Windows computers in general, set mode to wb (write binary). Otherwise the file may get corrupted during the download.
readxl reads only local file paths, not URLs.
library(readxl)
read_xlsx(path = "datasets_ATRIUM/gapminder_geonames.xlsx",
n_max = 3) # just three rows
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...5`
# A tibble: 3 × 7
Data: Geographies — v…¹ ...2 ...3 Free data from www.g…² ...5 id version
<chr> <chr> <lgl> <chr> <lgl> <chr> <chr>
1 Updated: July 1, 2021 <NA> NA CC BY 4.0 LICENCE NA geo v2
2 Concept: Geog… NA Are you seeing this o… NA <NA> <NA>
3 Unit: <NA> NA gapm.io/datageo NA <NA> <NA>
# ℹ abbreviated names: ¹`Data: Geographies — v2`,
# ²`Free data from www.gapminder.org`
# readxl::read_xlsx(path = "datasets_ATRIUM/DataGeographies-v2-by-Gapminder.xlsx") #the same file
read_xlsx reads the first sheet by default
readxl::excel_sheets(path = "datasets_ATRIUM/gapminder_geonames.xlsx")
[1] "ABOUT" "list-of-countries-etc" "list-of-regions"
[4] "list-of-income-levels" "global" "geo-names"
readxl::read_xlsx(path = "datasets_ATRIUM/gapminder_geonames.xlsx", sheet = 2,
                  n_max = 3) # or sheet = "list-of-countries-etc"
# A tibble: 3 × 13
geo name four_regions eight_regions six_regions members_oecd_g77 Latitude
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 aus Austra… asia east_asia_pa… east_asia_… oecd -25
2 brn Brunei asia east_asia_pa… east_asia_… g77 4.5
3 khm Cambod… asia east_asia_pa… east_asia_… g77 13
# ℹ 6 more variables: Longitude <dbl>, `UN member since` <dttm>,
# `World bank region` <chr>, `World bank, 4 income groups 2017` <chr>,
# `World bank, 3 income groups 2017` <chr>, UNHCR <chr>
library(googlesheets4)
shURL <- glue("https://docs.google.com/spreadsheets/d/1qHalit8sXC\\
              0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=425865495#gid=425865495")
gs4_deauth() # skip logging in at GoogleDrive
googlesheets4::read_sheet(shURL, sheet = 2,
                          n_max = 3)
✔ Reading from "Data Geographies - v2 - by Gapminder".
✔ Range ''list-of-countries-etc''.
# A tibble: 3 × 13
geo name four_regions eight_regions six_regions members_oecd_g77 Latitude
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 aus Austra… asia east_asia_pa… east_asia_… oecd -25
2 brn Brunei asia east_asia_pa… east_asia_… g77 4.5
3 khm Cambod… asia east_asia_pa… east_asia_… g77 13
# ℹ 6 more variables: Longitude <dbl>, `UN member since` <dttm>,
# `World bank region` <chr>, `World bank, 4 income groups 2017` <chr>,
# `World bank, 3 income groups 2017` <chr>, UNHCR <chr>
gapminder_countries <- readxl::read_xlsx("datasets_ATRIUM/gapminder_geonames.xlsx",
                                         sheet = 2,
                                         n_max = 3)
readr::write_tsv(x = gapminder_countries,
                 file = "my_output_files/gapminder_countries.tsv")
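To check that a saved tsv file can be read back, here is a small round-trip sketch. It uses a made-up two-column table and a temporary folder instead of my_output_files, so it runs without the downloaded Excel file.

```r
library(readr)

# A made-up stand-in for the table read from Excel.
countries <- data.frame(geo = c("aus", "brn", "khm"),
                        Latitude = c(-25, 4.5, 13))

# Save it as tsv in a temporary folder.
out_file <- file.path(tempdir(), "gapminder_countries.tsv")
write_tsv(x = countries, file = out_file)

# Peek at the raw text, then read the table back in.
read_lines(out_file, n_max = 2)
countries_back <- read_tsv(file = out_file, show_col_types = FALSE)
countries_back
```

If the table you read back has the same columns and values as the one you saved, the file is fine to share or to reuse in a later session.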
Create a folder to save your exercise scripts
dir.create(path = "~/R_BEGINNERS_SHORT/my_exercise_scripts/",
mode = '750', recursive = TRUE )
Warning in dir.create(path = "~/R_BEGINNERS_SHORT/my_exercise_scripts/", :
'/home/cinkova/R_BEGINNERS_SHORT/my_exercise_scripts' already exists
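The warning above only says that the folder already exists. Nothing was overwritten. If you would rather not see it, two possible approaches are sketched below on a stand-in path inside a temporary folder: check first with dir.exists, or switch the warning off with showWarnings = FALSE.

```r
# Stand-in path; in the course this would be the folder in your home.
scripts_dir <- file.path(tempdir(), "my_exercise_scripts")

# Option 1: create the folder only when it is missing.
if (!dir.exists(scripts_dir)) {
  dir.create(path = scripts_dir, recursive = TRUE)
}

# Option 2: call dir.create() anyway, but silence the warning.
created_again <- dir.create(path = scripts_dir, showWarnings = FALSE)
created_again   # FALSE, because nothing new was created
```

Either way, the existing folder and its contents stay untouched.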
List files in a folder
just those with qmd in their names
recursive: search in subfolders?
list.files(path = "~/R_BEGINNERS_SHORT", recursive = FALSE, include.dirs = FALSE, pattern = "qmd", full.names = TRUE)
[1] "/home/cinkova/R_BEGINNERS_SHORT/01_Introduction.qmd"
[2] "/home/cinkova/R_BEGINNERS_SHORT/02_HowToRStudio.qmd"
[3] "/home/cinkova/R_BEGINNERS_SHORT/03_RStudioFileManagement.qmd"
[4] "/home/cinkova/R_BEGINNERS_SHORT/04_NavigatingRStudioForProgramming.qmd"
[5] "/home/cinkova/R_BEGINNERS_SHORT/05_VariablesFunctions.qmd"
[6] "/home/cinkova/R_BEGINNERS_SHORT/06_WorkingDirectory.qmd"
[7] "/home/cinkova/R_BEGINNERS_SHORT/07_Exploring_dataframes.qmd"
[8] "/home/cinkova/R_BEGINNERS_SHORT/08_DiversePlots.qmd"
[9] "/home/cinkova/R_BEGINNERS_SHORT/09_Aggregations_with_dplyr.qmd"
[10] "/home/cinkova/R_BEGINNERS_SHORT/10_ggplot2OtherLayers.qmd"
[11] "/home/cinkova/R_BEGINNERS_SHORT/11_Computations_mutate_with_dplyr.qmd"
[12] "/home/cinkova/R_BEGINNERS_SHORT/12_JoiningDplyr.qmd"
[13] "/home/cinkova/R_BEGINNERS_SHORT/index.qmd"
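The pattern argument is a regular expression, so "qmd" matches those letters anywhere in a file name. The sketch below, which uses made-up files in a temporary folder, contrasts it with a stricter pattern that keeps only names ending in .qmd.

```r
# A temporary folder with three made-up files.
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("index.qmd", "notes_qmd_draft.txt", "plot.R")))

# "qmd" matches anywhere in the file name.
loose <- list.files(path = demo_dir, pattern = "qmd")
loose

# "\\.qmd$" matches only names that end in .qmd.
strict <- list.files(path = demo_dir, pattern = "\\.qmd$")
strict
```

In the course folder both patterns give the same result, but the stricter one is safer once other files happen to have qmd in their names.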
mode = octal notation (access rights, Unix only)
With mode = '750' you allow members of your user group (here other students and teachers) to see and execute files in this folder.
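To read such an octal mode, take it digit by digit: the first digit is for the owner, the second for the group, the third for everyone else, and each digit is the sum of 4 (read), 2 (write) and 1 (execute). The helper decode_digit below is hypothetical, written only to illustrate this arithmetic.

```r
# Translate one octal digit into the usual r/w/x notation.
decode_digit <- function(digit) {
  paste0(if (bitwAnd(digit, 4L) > 0) "r" else "-",   # 4 = read
         if (bitwAnd(digit, 2L) > 0) "w" else "-",   # 2 = write
         if (bitwAnd(digit, 1L) > 0) "x" else "-")   # 1 = execute
}

# Split "750" into its three digits and decode each one.
digits <- as.integer(strsplit("750", "")[[1]])
permissions <- vapply(digits, decode_digit, character(1))
names(permissions) <- c("owner", "group", "others")
permissions
```

So with 750 the owner may read, write and execute, the group may read and execute, and everyone else has no access.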