Reading and writing files and the concept of Working Directory
1 Working directory
- folder from which R sees other files and folders
Print path to your current Working Directory
getwd() # in the console/R file
[1] "/lnet/aic/personal/cinkova/NPFL112_2025_ZS"
2 Set a different Working Directory
- ~ … your Home
- .. … one folder up
- . … current folder
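These shortcuts can be tried out safely; a minimal sketch that moves around in a throwaway folder (the name demo is made up) and always returns to where it started:

```r
# Move around with setwd() and check where you are with getwd().
start <- getwd()                       # remember where we are
demo  <- file.path(tempdir(), "demo")  # a throwaway folder for this demo
dir.create(demo, showWarnings = FALSE)
setwd(demo)   # absolute path
setwd("..")   # relative path: one folder up, i.e. into tempdir()
setwd("~")    # your Home folder
setwd(start)  # return to where we started
```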
3 RStudio Projects
The .Rproj file stores the project configuration. When you open this project next time, RStudio tries to restore the workspace from last time.
4 Working directory in Quarto code chunks
- working directory = location of the qmd script file
getwd()
[1] "/lnet/aic/personal/cinkova/SCRIPTS.NPFL112"
Quarto does not care which folder you said was your working directory. The working directory for Quarto is always where the qmd file is.
5 How to render your copy of this file
- store part of the file path in a variable
- combine it with file names by using file.path()
project_path <- "~/NPFL112_2025_ZS" # Silvie's path.
# Replace with something that would work for you.
file.path(project_path, "DATA.NPFL112")
[1] "~/NPFL112_2025_ZS/DATA.NPFL112"
datasaving_folder <- "DATA.NPFL112" # change this
# you must create your own folder you can write in!
I store all materials to this course in a subfolder, and in this script I start to read files from other folders. If you decide to reuse parts of my scripts from the SCRIPTS.NPFL112 folder and still want to be able to render them, all paths need to work in your setup. The best option is to copy the script right into your home folder and replace the project_path variable with just “~”.
6 List files from DATA.NPFL112
list.files(file.path(project_path, "DATA.NPFL112"))
 [1] "billionaires_combined.tsv"
[2] "DataGeographies-v2-by-Gapminder.xlsx"
[3] "ddf--gapminder--systema_globalis_files_from_GitHubAPI.json"
[4] "Founders_Network.csv"
[5] "gapminder_billionaires_ddf--entities--person.csv"
[6] "gapminder_countries.tsv"
[7] "gapminder_ddf--concepts.csv"
[8] "gapminder_ddf--datapoints--daily_income--by--person--time.csv"
[9] "gapminder_firstmarriage.csv"
[10] "gapminder_geonames.xlsx"
[11] "gapminder_hourly_labour_cost_constant_2017_ usd--by--geo--time.csv"
[12] "gapminder_hourly_labour_cost_constant_2017_usd--by--geo--time.csv"
[13] "gapminder_laborcost_cze_deu.csv"
[14] "gapminder_metadata_filenames.tsv"
[15] "GitHubURLs_Gapminder_SystemaGlobalis.tsv"
[16] "jrc_1.tsv"
[17] "jrc_2.tsv"
[18] "jrc_3.tsv"
[19] "jrc_4.tsv"
[20] "jrc_latin_4.tsv"
[21] "jrc_latin.tsv"
[22] "JRC_Names"
[23] "JRC_Names.tsv"
[24] "migrants.tsv"
[25] "titanic.csv"
7 Open a csv file from DATA.NPFL112
marriage <- read.csv(file = file.path(project_path, datasaving_folder, "gapminder_firstmarriage.csv"))
head(marriage)
      country    X2005
1 Afghanistan 17.83968
2 Albania 23.32651
3 Algeria 29.60000
4 Angola NA
5 Argentina 23.26396
6 Armenia 22.98603
You have seen this last time in a bare-R file!
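The odd column name X2005 comes from base read.csv(): its default check.names = TRUE turns the header 2005, which is not a valid R name, into X2005. A small self-contained example showing both behaviours:

```r
# read.csv() mangles headers that are not valid R names: "2005" -> "X2005".
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,2005", "Afghanistan,17.8"), tmp)
names(read.csv(tmp))                       # "country" "X2005"
names(read.csv(tmp, check.names = FALSE))  # "country" "2005"
```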
8 Individual files from (GitHub) URL
When you look for interesting data sets, you will find plenty on the Internet. To download a file from the Internet, you must use its URL. On ordinary websites you just copy it from the address/URL bar at the top, or you right-click a link and select “Copy link”.
This example is slightly more complex but particularly useful for you as data scientists. What you see is not an ordinary website, but the user interface of GitHub. GitHub is a public repository of files, like Google Drive, but optimized for programmers to facilitate collaboration on software. On GitHub, you must switch to a “Raw” view to obtain the correct URL. Otherwise you would download the html code of the entire interface website you are viewing, instead of the selected file!
Downloading individual files is not the typical use case for GitHub (cf. git and GitHub’s API), but that would be an entirely different topic.
9 Download a file (e.g. from GitHub)
library(glue) # enables multiline with \\
URL <- glue("https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/refs/heads/master/countries-etc-datapoints/ddf--datapoints--hourly_labour_cost_constant_2017_usd--by--geo--time.csv")
my_destination <- glue(file.path(project_path, datasaving_folder, "\\
gapminder_hourly_labour_cost_constant_2017_\\
usd--by--geo--time.csv"))
download.file(
url = URL,
destfile = my_destination
)
This code downloads a file from its URL to a destination on your computer (or in your user account on a cloud). The important piece is the download.file function. Do not bother about the glue library; I only had to use it to fit long strings such as the URL on a slide.
The download.file function is universal to download any file from anywhere. Sometimes you can copy a download link from a website and use this URL to download the file programmatically.
This is how to download some data from GitHub, which is a bit specific. Here I work with data from Gapminder on GitHub. Their repository is very large and this was a largely random pick: https://github.com/open-numbers/ddf--gapminder--systema_globalis/tree/master/countries-etc-datapoints. This repository contains a table that explains each data set, but I am going to select one that is intelligible without reading much metadata. It is going to be a table about the average labor cost in a given country in a given year: https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/refs/heads/master/countries-etc-datapoints/ddf--datapoints--hourly_labour_cost_constant_2017_usd--by--geo--time.csv.
Manually navigate to the file you want and copy its URL. Make sure to use the URL that appears when you hit the Raw button (starting with https://raw.githubusercontent) to download the contents of the file. On the default https://github.com/… pages you would only download an HTML file of the website you are viewing.
Use the download.file function. Leave all arguments at their defaults, except url and destfile. Put the file into the folder where you can store your own data. Use the end part of the original file name, give it the prefix gapminder_, and keep doing this with all files that you happen to download from this source. This will help you keep a system in your files.
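After a download it is also worth checking that the file really arrived: a failed download can leave a missing or empty file behind. A self-contained sketch of such a check (the writeLines call merely stands in for a real download.file call, so no network is needed):

```r
# Stand-in for a downloaded file; in real use `dest` would be the
# destfile you passed to download.file().
dest <- tempfile(fileext = ".csv")
writeLines("geo,time,hourly_labour_cost_constant_2017_usd", dest)
# The actual check: does the file exist, and is it non-empty?
if (!file.exists(dest) || file.size(dest) == 0) {
  stop("Download failed or file is empty: ", dest)
}
```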
10 https://www.gapminder.org/
11 Read files with base R or tidyverse
File too big to open in a text editor?
Inspect it by reading it as text lines (first 3 lines)
mypath <- glue(file.path(project_path, datasaving_folder, "gapminder_hourly_labour_cost_constant_2017_\\
usd--by--geo--time.csv"))
library(readr)
read_lines(
file = mypath,
n_max = 3)
[1] "geo,time,hourly_labour_cost_constant_2017_usd"
[2] "arg,2011,0.92"
[3] "arg,2012,1.04"
readLines(
con = mypath,
n = 3)
[1] "geo,time,hourly_labour_cost_constant_2017_usd"
[2] "arg,2011,0.92"
[3] "arg,2012,1.04"
What you are seeing are the first three lines of a tabular file that we have just read as a text file, with no assumption of columns or headers. This comes in handy when you do not know the structure of a file and it is, for instance, too large to open interactively in a text editor. So you read lines of text. In plain-text files, each line ends where the author hit Enter. (So in our terms it is rather a paragraph than a line!)
A tabular file is a plain-text file where each line is one table row and the columns on each line are separated by the same character (throughout the file). The best-known tabular format is comma-separated values (csv). The original U.S. format uses commas. The European csv uses semicolons, because the comma is often reserved as the decimal separator (vs. the decimal point in the U.S.). To skip these issues altogether, it is best to save your files as tsv (tab-separated values).
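readr has both conventions covered. A minimal sketch on literal strings (the I() wrapper tells readr to treat the string as the data itself, not as a file path):

```r
library(readr)
# "European" csv: semicolons between columns, decimal commas.
eu <- I("geo;value\ncze;1,5")
read_csv2(eu)$value   # parsed as the number 1.5
# Tab-separated values avoid the comma/semicolon question entirely.
us <- I("geo\tvalue\ncze\t1.5")
read_tsv(us)$value    # also 1.5
```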
In the code above you see two functions that look similar and whose output looks exactly the same. One is a base-R function; the other is from a tidyverse package called readr. Feel free to choose either and just make a mental note that there is an alternative. Sometimes, when a file is tricky to read in with one function, it goes well with the other.
Look at the Help for either function and explore its other arguments using the file you have just loaded.
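One argument worth exploring is how to skip lines at the top of a file. A self-contained sketch with a literal string standing in for mypath:

```r
library(readr)
txt <- "geo,time,value\narg,2011,0.92\narg,2012,1.04"
# readr: skip the first line (e.g. a header you do not want).
read_lines(I(txt), skip = 1)
# base R: readLines() has no skip argument,
# but you can drop lines after reading.
readLines(textConnection(txt))[-1]
```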
12 Reading a table with readr
- read_csv, read_csv2, read_tsv: tailored to the common separators (comma, semicolon, tab)
- read_delim: you name the separator (aka delimiter), more arguments
read_csv(file = mypath,
n_max = 3) # just top 3 rows
Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): geo
dbl (2): time, hourly_labour_cost_constant_2017_usd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 3
geo time hourly_labour_cost_constant_2017_usd
<chr> <dbl> <dbl>
1 arg 2011 0.92
2 arg 2012 1.04
3 arm 2011 4.23
13 Other arguments in read_csv
read_csv(file = mypath,
col_names = c("country", "year", "USD_hour_2017"),
n_max = 3)
Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, year, USD_hour_2017
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 3
country year USD_hour_2017
<chr> <chr> <chr>
1 geo time hourly_labour_cost_constant_2017_usd
2 arg 2011 0.92
3 arg 2012 1.04
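Note what happened in the output above: the file's original header row became the first data row, and all columns degraded to chr. When you supply your own col_names, also skip the file's header. A sketch on a literal string (the column names are the ones from the slide):

```r
library(readr)
txt <- "geo,time,value\narg,2011,0.92\narg,2012,1.04"
read_csv(I(txt),
         col_names = c("country", "year", "USD_hour"),
         skip = 1)   # drop the original header row
```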
14 Read directly from URL
URL2 <- glue("https://raw.githubusercontent.com/open-numbers/ddf--gapminder--\\
systema_globalis/refs/heads/master/countries-etc-datapoints/\\
ddf--datapoints--hourly_labour_cost_constant_2017_usd--by--geo--\\
time.csv")
read_csv(file = URL2,
n_max = 3)
Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): geo
dbl (2): time, hourly_labour_cost_constant_2017_usd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 3
geo time hourly_labour_cost_constant_2017_usd
<chr> <dbl> <dbl>
1 arg 2011 0.92
2 arg 2012 1.04
3 arm 2011 4.23
15 Download an Excel file
URL3 <- glue("https://docs.google.com/spreadsheets/d/1qHalit8s\\
XC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/export?format=xlsx")
download.file(url = URL3,
destfile =
file.path(project_path, datasaving_folder, "gapminder_geonames.xlsx"),
mode = "wb") # mind the mode
With Windows formats and on Windows computers, set mode to "wb" (write binary). Otherwise the file may get corrupted during transmission.
16 Read Excel
readxl reads only local file paths, not URLs.
library(readxl)
read_xlsx(path = file.path(project_path, datasaving_folder, "gapminder_geonames.xlsx"),
n_max = 3) # just three rows
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...5`
# A tibble: 3 × 7
Data: Geographies — v…¹ ...2 ...3 Free data from www.g…² ...5 id version
<chr> <chr> <lgl> <chr> <lgl> <chr> <chr>
1 Updated: July 1, 2021 <NA> NA CC BY 4.0 LICENCE NA geo v2
2 Concept: Geog… NA Are you seeing this o… NA <NA> <NA>
3 Unit: <NA> NA gapm.io/datageo NA <NA> <NA>
# ℹ abbreviated names: ¹`Data: Geographies — v2`,
# ²`Free data from www.gapminder.org`
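Since readxl cannot read a URL directly, a common pattern is to download to a temporary file first and read from there. A sketch of such a helper (read_xlsx_from_url is a made-up name, not a readxl function):

```r
library(readxl)
read_xlsx_from_url <- function(xlsx_url, ...) {
  tmp <- tempfile(fileext = ".xlsx")
  # mind mode = "wb": xlsx is a binary format
  download.file(xlsx_url, destfile = tmp, mode = "wb")
  on.exit(unlink(tmp))  # clean up the temporary file afterwards
  read_xlsx(path = tmp, ...)
}
```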
# readxl::read_xlsx(path = "datasets_ATRIUM/DataGeographies-v2-by-Gapminder.xlsx") # the same file
17 Excel sheets listed
read_xlsx reads the first sheet by default.
- Have the sheets listed:
readxl::excel_sheets(path = file.path(project_path, datasaving_folder, "gapminder_geonames.xlsx"))
[1] "ABOUT"                 "list-of-countries-etc" "list-of-regions"
[4] "list-of-income-levels" "global" "geo-names"
readxl::read_xlsx(path = file.path(project_path, datasaving_folder, "gapminder_geonames.xlsx"), sheet = 2,
n_max = 3) # or sheet = "list-of-countries-etc"
# A tibble: 3 × 13
geo name four_regions eight_regions six_regions members_oecd_g77 Latitude
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 aus Austra… asia east_asia_pa… east_asia_… oecd -25
2 brn Brunei asia east_asia_pa… east_asia_… g77 4.5
3 khm Cambod… asia east_asia_pa… east_asia_… g77 13
# ℹ 6 more variables: Longitude <dbl>, `UN member since` <dttm>,
# `World bank region` <chr>, `World bank, 4 income groups 2017` <chr>,
# `World bank, 3 income groups 2017` <chr>, UNHCR <chr>
18 Google sheets
- inspect it manually and pick one worksheet
library(googlesheets4)
shURL <- glue("https://docs.google.com/spreadsheets/d/1qHalit8sXC\\
0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=425865495#gid=425865495")
gs4_deauth() # skip logging in at GoogleDrive
googlesheets4::read_sheet(shURL, sheet = 2,
n_max = 3)
✔ Reading from "Data Geographies - v2 - by Gapminder".
✔ Range ''list-of-countries-etc''.
# A tibble: 3 × 13
geo name four_regions eight_regions six_regions members_oecd_g77 Latitude
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 aus Austra… asia east_asia_pa… east_asia_… oecd -25
2 brn Brunei asia east_asia_pa… east_asia_… g77 4.5
3 khm Cambod… asia east_asia_pa… east_asia_… g77 13
# ℹ 6 more variables: Longitude <dbl>, `UN member since` <dttm>,
# `World bank region` <chr>, `World bank, 4 income groups 2017` <chr>,
# `World bank, 3 income groups 2017` <chr>, UNHCR <chr>
19 Saving tabular files
gapminder_countries <- readxl::read_xlsx(file.path(project_path,
datasaving_folder, "gapminder_geonames.xlsx"),
sheet = 2,
n_max = 3)
readr::write_tsv(x = gapminder_countries,
file = file.path(project_path,
datasaving_folder, "gapminder_countries.tsv"))
20 Create a folder
- create a folder to save your exercise scripts
dir.create(path = "HOMEWORK",
mode = '750', recursive = TRUE)
mode = octal notation (Unix-only access rights to the file). With mode = '750' you allow other students and teachers to see and execute files in this folder.
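dir.create() throws a warning when the folder already exists, so it is common to guard it with dir.exists(). A sketch using a throwaway location instead of your real HOMEWORK folder:

```r
# Create the folder only if it is not there yet.
target <- file.path(tempdir(), "HOMEWORK")
if (!dir.exists(target)) {
  dir.create(target, mode = "750", recursive = TRUE)
}
dir.exists(target)  # TRUE
```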
21 List files into a vector
list.files(path = project_path, recursive = FALSE,
include.dirs = FALSE, pattern = "qmd", full.names = TRUE)
character(0)
- list files in a folder
- just those with qmd in their names
- recursive: search in subfolders?
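Mind that pattern is a regular expression, not a filename wildcard: pattern = "qmd" matches "qmd" anywhere in the name. To match only a file extension, anchor it with $. A sketch in a throwaway folder with made-up file names:

```r
d <- file.path(tempdir(), "listing_demo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("a.tsv", "b.tsv.bak", "c.qmd")))
list.files(d, pattern = "tsv")      # matches both a.tsv and b.tsv.bak
list.files(d, pattern = "\\.tsv$")  # matches only a.tsv
```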
22 Set up a file path
- creates a character vector, not a file!
file.path(project_path, "a_new_file.XXX")
[1] "~/NPFL112_2025_ZS/a_new_file.XXX"
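To see the difference: the path exists as a string immediately, but the file itself only appears once something writes it. A sketch using a throwaway location instead of project_path:

```r
p <- file.path(tempdir(), "a_new_file.XXX")
file.exists(p)  # FALSE: so far p is only a character vector
file.create(p)  # now an empty file actually appears on disk
file.exists(p)  # TRUE
```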