Introduction
Inspired by a recent post on how to import a directory of csv files at once using purrr and readr by Garrick, in this post we will try achieving the same using base R with no extra packages, and with data·table, another very popular package and as an added bonus, we will play a bit with benchmarking to see which of the methods is the fastest, including the tidyverse approach in the benchmark.
Let us show how to import all csvs from a folder into a data frame, with nothing but base R
To get the source data, download the zip file from this link and unzip it into a folder, we will refer to the folder path as data_dir
.
Contents
Quick import of all csvs with base R
To import all .csv files from the data_dir
directory and place them into a single data frame called result
, all we have to do is:
filePaths <- list.files(data_dir, "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(filePaths, read.csv))
# View part of the result
head(result)
## Month_Year Hospital_Name Hospital_ID
## 1 Aug-15 AMNCH 1049
## 2 Aug-15 AMNCH 1049
## 3 Aug-15 AMNCH 1049
## 4 Aug-15 Bantry General Hospital 704
## 5 Aug-15 Bantry General Hospital 704
## 6 Aug-15 Bantry General Hospital 704
## Hospital_Department ReferralType TotalReferrals
## 1 Paediatric ENT General Referral 2
## 2 Paediatric Gastroenterology General Referral 4
## 3 Paediatric General Surgery General Referral 4
## 4 Gastroenterology General Referral 12
## 5 General Medicine General Referral 18
## 6 General Surgery General Referral 43
A quick explanation of the code:
list.files
- produces a character vector of the names of the files in the named directory, in our casedata_dir
. We have also passed apattern
argument"\\.csv$"
to make sure we only process files with .csv at the end of the name andfull.names = TRUE
to get the file path and not just the name.read.csv
- reads a file in table format and creates a data frame from its contentlapply(X, FUN, ...)
- Gives us a list of data.frames, one for each of the files found bylist.files
. More generally, it returns a list of the same length asX
, each element of which is the result of applyingFUN
to the corresponding element ofX
. In our caseX
is the vector of file names in data_dir (returned bylist.files
) andFUN
isread.csv
, so we are applyingread.csv
to each of the file pathsrbind
- in our case combines the rows of multiple data frames into one, similarly (even though a bit more rigidly) toUNION
inSQL
do.call
- will combine all the data frames produced bylapply
into one usingrbind
. More generally, it constructs and executes a function call from a name or a function and a list of arguments to be passed to it. In our case the function isrbind
and the list is the list of data frames containing the data loaded from the csvs, produced bylapply
.
Reconstructing the results of the original post
To fully reconstruct the results from the original post, we need to do two extra operations
- Add the source file names to the data frame
- Fix and reformat the dates
To do this, we will simply adjust the FUN
in the lapply
- in the above example, we have only used read.csv
. Below, we will make a small function to do the extra steps:
filePaths <- list.files(data_dir, "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(filePaths, function(path) {
df <- read.csv(path, stringsAsFactors = FALSE)
df[["source"]] <- rep(path, nrow(df))
df[["Month_Year"]] <- as.Date(
paste0(sub("-20", "-", df[["Month_Year"]], fixed = TRUE), "-01"),
format = "%b-%y-%d"
)
df
}))
# View part of the result
head(result)
## Month_Year Hospital_Name Hospital_ID
## 1 2015-08-01 AMNCH 1049
## 2 2015-08-01 AMNCH 1049
## 3 2015-08-01 AMNCH 1049
## 4 2015-08-01 Bantry General Hospital 704
## 5 2015-08-01 Bantry General Hospital 704
## 6 2015-08-01 Bantry General Hospital 704
## Hospital_Department ReferralType TotalReferrals
## 1 Paediatric ENT General Referral 2
## 2 Paediatric Gastroenterology General Referral 4
## 3 Paediatric General Surgery General Referral 4
## 4 Gastroenterology General Referral 12
## 5 General Medicine General Referral 18
## 6 General Surgery General Referral 43
## source
## 1 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 2 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 3 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 4 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 5 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 6 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
Lets look at the extra code in the lapply
:
- Instead of just using
read.csv
, we have defined our own little function that will do the extra work for each of the file paths, which are passed to the function aspath
- We read the data into a data frame called
df
usingread.csv
, and can we specifystringsAsFactors = FALSE
, as the tidyverse packages do this by default, while base R’s default is different - We add a new column
source
with the file name stored inpath
, repeated as many times asdf
has rows. This is a bit overkill here and could be done simpler, but it is quite robust and will also work with 0-row data frames - We transform the
Month_Year
into the requested date format withas.Date
. Note that the relatively uglysub()
part is caused mostly by inconsistency in the source data itself - Using
[[
instead of$
is less pleasing to the eye, but we find it to be good practice, so sacrifice a bit of readability
Alternatives to base R
Using data.table
Another popular package that can help us achieve the same is data.table
, so let’s have a look and reconstruct the results with data.table’s features:
library(data.table)
filePaths <- list.files(data_dir, "\\.csv$", full.names = TRUE)
result <- lapply(filePaths, fread)
names(result) <- filePaths
result <- rbindlist(result, use.names = TRUE, idcol = "source")
result[, Month_Year := as.Date(
paste0(sub("-20", "-", Month_Year, fixed = TRUE), "-01"),
format = "%b-%y-%d"
)]
# View part of the result
head(result)
## source
## 1: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 2: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 3: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 4: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 5: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 6: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## Month_Year Hospital_Name Hospital_ID
## 1: 2015-08-01 AMNCH 1049
## 2: 2015-08-01 AMNCH 1049
## 3: 2015-08-01 AMNCH 1049
## 4: 2015-08-01 Bantry General Hospital 704
## 5: 2015-08-01 Bantry General Hospital 704
## 6: 2015-08-01 Bantry General Hospital 704
## Hospital_Department ReferralType TotalReferrals
## 1: Paediatric ENT General Referral 2
## 2: Paediatric Gastroenterology General Referral 4
## 3: Paediatric General Surgery General Referral 4
## 4: Gastroenterology General Referral 12
## 5: General Medicine General Referral 18
## 6: General Surgery General Referral 43
Where
rbindlist
does the same asdo.call("rbind", l)
on data frames, but much fasterfread
is similar toread.table
(andread.csv
, which usesread.table
) but faster and more convenient':='()
is the data.table syntax to create new columns in a data.table
Using the tidyverse
This is covered in much detail in the post that inspired this one.
TL;DR - Just want the code
No time for reading? Click here to get just the code with commentary
Quick benchmarking
First off we are mostly looking at it for the fun of reacting to Twitter discussion, so take it for what it’s worth, by no means this is what we would call proper benchmarking.
Now that we have seen 3 ways to achieve the same goal, let’s look at speed. Note that we will be friendly to the
tidyverse
and not attach the entire package as is done in the original post, however only those packages that we really need to get a more appropriate benchmark.
Full script run benchmark
First, we will perform an execution of an R script containing just the above code chunks (and the tidyverse one) a thousand times. The timing will also include overhead for launching the process, but this effect is present for all three scenarios and the variance should be safely covered by the fact that we execute 1000 times:
time for i in {1..1000};
do Rscript --vanilla data/r005/benchmarking/base.R &>/dev/null;
done
time for i in {1..1000};
do Rscript --vanilla data/r005/benchmarking/datatable.R &>/dev/null;
done
time for i in {1..1000};
do Rscript --vanilla data/r005/benchmarking/tidyverse.R &>/dev/null;
done
Visualizing the results shows that base R is the clear winner here, largely due to package loading overhead. Any performance benefits of the other packages are not enough to catch up in this very small use case:
If interested, you can look at the scripts ran above:
Benchmarking without package loading overhead
We could argue that it is not fair to include the library
statements in the benchmark, as the overhead can be relatively big considering how small the actual action done by the code is, as we are only processing 4 small files. Here is a benchmark omitting the overhead and only executing the relevant code with the packages pre-loaded, using microbenchmark with a 100 iterations:
Visualizing the results in this case shows that data.table is a winner, with base R being the slowest of the options.