Recently I worked on a task that involved reading and writing large amounts of data, more than 1 TB of CSVs in total, without the standard big data infrastructure. After trying multiple approaches, the one that made this possible was data.table’s reading and writing facilities, fread() and fwrite().
This motivated me to benchmark data.table’s fread() and see how it compares to other options, such as tidyverse’s readr and base R, for reading tabular data from text files such as CSVs.
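As a rough illustration of the kind of comparison meant here, the sketch below writes a small table with fwrite(), reads it back with fread() and with base R's read.csv(), and times both reads. The table contents and column names are made up for the example, and the file is far too small for the timings to mean anything; a real benchmark would use much larger files and repeated runs.

```r
library(data.table)

# Hypothetical sample data, only for illustration
dt <- data.table(
  id    = 1:5,
  value = c(0.5, 1.5, 2.5, 3.5, 4.5),
  label = letters[1:5]
)

# Write the table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
fwrite(dt, tmp)

# Read it back with data.table and with base R
back    <- fread(tmp)
base_df <- read.csv(tmp)

# Time a single read with each approach (not a rigorous benchmark)
time_fread <- system.time(fread(tmp))
time_base  <- system.time(read.csv(tmp))

# Clean up the temporary file
unlink(tmp)
```

The same pattern extends to readr::read_csv() if readr is installed.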
Data manipulation and aggregation are among the classic tasks anyone working with data will come across. We can of course perform data transformation and aggregation with base R, but when speed and memory efficiency come into play, data.table is my package of choice.
In this post we will look at one of the fresh and very useful features that came to data.table only last year: grouping sets. They enable us, for example, to create pivot table-like reports with sub-totals and a grand total quickly and easily.
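A minimal sketch of what such a report can look like, using data.table's rollup(), cube() and groupingsets() functions. The `sales` table and its columns are hypothetical and exist only for this example.

```r
library(data.table)

# Hypothetical sales data, only for illustration
sales <- data.table(
  region  = c("North", "North", "South", "South"),
  product = c("A", "B", "A", "B"),
  amount  = c(10, 20, 30, 40)
)

# rollup(): totals per (region, product), sub-totals per region,
# and a grand total; aggregated-over columns are filled with NA
report <- rollup(
  sales,
  j  = list(total = sum(amount)),
  by = c("region", "product")
)

# cube(): totals for every combination of the grouping columns,
# so it also adds sub-totals per product
full <- cube(
  sales,
  j  = list(total = sum(amount)),
  by = c("region", "product")
)

# groupingsets(): the grouping sets are listed explicitly;
# these are the same three sets that rollup() uses above
custom <- groupingsets(
  sales,
  j    = list(total = sum(amount)),
  by   = c("region", "product"),
  sets = list(c("region", "product"), "region", character())
)

print(report)
```

In the result, a row with NA in `product` is a sub-total for its region, and the row with NA in both columns is the grand total.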
When speed and memory efficiency are important, the data.table package is one of the ways to improve those aspects of our R code dramatically. Including data.table in a package also comes with the added benefit of a light dependency footprint, as data.table itself imports only the methods package, which is part of base R. We must, however, also pay attention to importing and using methods correctly, as data.table handles data.frame subsetting operators in a special way.
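To illustrate why the subsetting operators need attention, the sketch below shows the same `[` call behaving differently on a data.frame and on a data.table. The example data is made up, and this only demonstrates the difference in an interactive session, not the package import setup itself.

```r
library(data.table)

# Hypothetical example data
df <- data.frame(g = c("a", "a", "b"), x = 1:3)
dt <- as.data.table(df)

# A data.table is still a data.frame ...
is_df <- is.data.frame(dt)

# ... but `[` has different semantics:
# on a data.frame, selecting one column by name drops to a vector
df_col <- df[, "x"]

# on a data.table, the same call returns a one-column data.table
dt_col <- dt[, "x"]

# data.table also evaluates expressions in j using the table's columns,
# optionally by group, which a plain data.frame does not support
sums <- dt[, list(total = sum(x)), by = g]
```

Code inside a package that relies on the data.table behaviour therefore needs data.table's `[` method to be the one that is actually dispatched.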