Jozef's Rblog

Optimizing partitioning for Apache Spark database loads via JDBC for performance

Sat, 26 Dec 2020 12:00:00 +0000

Introduction

Apache Spark is a popular open-source analytics engine for big data processing, and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. Apart from working with HDFS-based data storage, a very common task with Spark is interfacing with traditional RDBMS systems such as Oracle, MS SQL Server, and others. A lot of performance can be gained by efficiently partitioning the data for these types of loads.

In this post, we will explore the partitioning options that are available for Spark’s JDBC reading capabilities and investigate how partitioning is implemented in Spark itself to choose the options such that we get the best performing results. We will also show how to use those options from R using the sparklyr package.

Getting test data into a MySQL database

If you are interested only in the partitioning content, feel free to skip this section.

For a fully reproducible example, we will use a local MySQL server instance, which is very accessible thanks to its open-source nature. Let’s populate a database table with some randomly generated data that will be useful to show different partitioning strategies and their impact on performance. We will write this data frame into the MySQL database using R’s {DBI} package and call the newly created table test_table. For the timings below we used a table with 10 million records.

# Set this to 1e7L for timings similar to those shown in the figures
rows <- 1e5L
groups <- 8L
set.seed(1)

mkNum <- function(x) vapply(x, function(s) sum(utf8ToInt(s)), numeric(1))
mkStr <- function() paste(sample(labels(eurodist), 3L), collapse = "")

unif <- floor(runif(rows, min = 0L, max = groups))
state_name <- sample(state.name, rows, replace = TRUE)
state_str  <- replicate(rows, mkStr())

test_df <- data.frame(
  id = seq_len(rows),
  grp_unif = unif,
  grp_skwd = pmin(floor(rexp(rows)), groups - 1L),
  grp_unif_range = (unif + 1L) ^ (unif + 1L),
  state_name = state_name,
  state_value = mkNum(state_name) * (1 + runif(rows)),
  state_srt_1 = state_str,
  state_srt_2 = sample(state_str),
  state_srt_3 = sample(state_str),
  state_srt_4 = sample(state_str),
  state_srt_5 = sample(state_str),
  stringsAsFactors = FALSE
)

con <- DBI::dbConnect(drv = RMySQL::MySQL(), dbname = "testdb", user = "rstudio", password = "pass")
DBI::dbWriteTable(con, "test_table", test_df, overwrite = TRUE)
DBI::dbDisconnect(con)

Partitioning columns with Spark’s JDBC reading capabilities

For this section, we assume that the reader has some knowledge of Spark’s JDBC reading capabilities. We discussed the topic in more detail in the related previous article.

The partitioning options are provided to the DataFrameReader similarly to other options. We will focus on the four key options:

partitionColumn - The name of the column used for partitioning. It must be a numeric, date, or timestamp column from the table in question.
numPartitions - The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections.
lowerBound and upperBound - The bounds used to decide the partition stride. We will talk more about the stride a bit later in the article.

A few important notes need to be made:

If no partitioning options are specified, Spark will use a single executor and create a single non-empty partition. Reading the data will be neither distributed nor parallelized. This can cause significant performance loss in cases where parallelized reading is preferable.

The lowerBound and upperBound options are only used to define how the data is partitioned, not which data is read in. There is a common misconception that using the wrong bounds will filter the data, which is not the case.

Partitioning options

Now with that in mind and the testing table prepared, let us investigate 2 columns that are relevant for partitioning and how their values are distributed. We will then see how using each of the columns for partitioning can impact the performance of the reading process.

The green histogram shows the distribution of values in the grp_unif column, in which the values are evenly distributed between 0 and 7.
The blue histogram shows the distribution of values in the grp_skwd column, in which the values are heavily skewed towards the smaller values, 0 being by far the most prevalent and 7 very rare.

Distribution of record counts for the 2 partitioning columns

Partitioning examples using the interactive Spark shell

To show the partitioning and make example timings, we will use the interactive local Spark shell. We can run the Spark shell and provide it the needed jars using the --jars option and allocate the memory needed for our driver:

/usr/local/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-shell \
  --jars /home/$USER/jars/mysql-connector-java-8.0.21/mysql-connector-java-8.0.21.jar \
  --driver-memory 7g

Now within the Spark shell, we can execute Scala expressions for three scenarios:

no partitioning options provided (baseline)
partitioning using the uniformly distributed column
partitioning using the skewed column

After running these, we can compare the speed and see the benefit we gained by the different partitioning approaches versus the baseline.

// First, set up the data frame without partitioning
val reader_no_partitioning = spark.read.
  format("jdbc").
  option("url", "jdbc:mysql://localhost:3306/testdb").
  option("user", "rstudio").
  option("password", "pass").
  option("driver", "com.mysql.cj.jdbc.Driver").
  option("dbtable", "test_table")

val df_no_partitioning = reader_no_partitioning.load()
df_no_partitioning.cache().count()
df_no_partitioning.unpersist()
  
// Now use the skewed column to partition
val reader_partitioning_skewed = reader_no_partitioning.
  option("partitionColumn", "grp_skwd").
  option("numPartitions", 8).
  option("lowerBound", 0).
  option("upperBound", 4)
val df_partitioning_skewed = reader_partitioning_skewed.load()
df_partitioning_skewed.cache().count()
df_partitioning_skewed.unpersist()

// Now use the uniform column to partition
val reader_partitioning_unif = reader_no_partitioning.
  option("partitionColumn", "grp_unif").
  option("numPartitions", 8).
  option("lowerBound", 0).
  option("upperBound", 8)
val df_partitioning_unif = reader_partitioning_unif.load()
df_partitioning_unif.cache().count()
df_partitioning_unif.unpersist()

Comparing the performance of different partitioning options

Now let us look at how fast each of the read operations was. This is of course by no means a representative benchmark for real-life data loads, but it can provide some insight into optimizing the partitioning. In our experience, the benefits of proper partitioning can be very significant, especially in real-life use cases where the databases sit on external servers and support many concurrent connections.

First, let’s see the total time for the 3 options:

JobId 0 - no partitioning - total time of 2.9 minutes
JobId 1 - partitioning using the grp_skwd column and 8 partitions - 2.1 minutes
JobId 2 - partitioning using the grp_unif column and 8 partitions - 59 seconds

Timing of reading using different partitioning options

To better understand why the partitioning using the grp_unif column was so much faster, let us look at the performance per partition, with the partitioning using grp_skwd on the left and grp_unif on the right:

Investigating timing for each partition

We can see that the durations for each of the partitions for grp_unif are almost identical, whereas for grp_skwd the longest duration is much larger than the shortest one. This is heavily correlated with the sizes of each of the partitions, which points us toward our conclusion when looking at the actual implementation.

Understanding the partitioning implementation

The implementation of the partitioning within Apache Spark can be found in this piece of source code. The single most notable line that is key to understanding the partitioning process and the performance implications is the following:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

In combination with the while loop:

while (i < numPartitions) {
  val lBoundValue = boundValueToString(currentValue)
  val lBound = if (i != 0) s"$column >= $lBoundValue" else null
  currentValue += stride
  val uBoundValue = boundValueToString(currentValue)
  val uBound = if (i != numPartitions - 1) s"$column < $uBoundValue" else null
  val whereClause =
    if (uBound == null) {
      lBound
    } else if (lBound == null) {
      s"$uBound or $column is null"
    } else {
      s"$lBound AND $uBound"
    }
  ans += JDBCPartition(whereClause, i)
  i = i + 1
}

We can see that the data to be read is partitioned by splitting the values in the partitionColumn into numPartitions groups using the stride.
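To make the loop above more tangible, the following R sketch mimics the same logic and prints the WHERE clauses that would be generated for the grp_unif example. It is a simplified re-creation for illustration only: the function name jdbc_partition_clauses() is hypothetical, and the reduction of the partition count for narrow bounds is an assumption based on our reading of the Spark source, as it is not part of the snippet shown above.

```r
# Simplified R re-creation of Spark's JDBC partition WHERE clauses.
# `jdbc_partition_clauses` is a hypothetical helper, not part of any package.
jdbc_partition_clauses <- function(column, lowerBound, upperBound, numPartitions) {
  stopifnot(lowerBound < upperBound, numPartitions >= 1)

  # Assumption based on the Spark source: the partition count is reduced
  # when the bounds span fewer values than the requested number of partitions
  n <- min(numPartitions, upperBound - lowerBound)
  if (n == 1) return(NA_character_)  # single partition, no WHERE clause

  # Scala's Long division truncates, hence trunc()
  stride <- trunc(upperBound / n) - trunc(lowerBound / n)

  clauses <- character(n)
  current <- lowerBound
  for (i in seq_len(n)) {
    # The first partition has no lower bound, the last one no upper bound
    lower <- if (i != 1) sprintf("%s >= %.0f", column, current)
    current <- current + stride
    upper <- if (i != n) sprintf("%s < %.0f", column, current)
    clauses[i] <- if (is.null(upper)) {
      lower
    } else if (is.null(lower)) {
      sprintf("%s or %s is null", upper, column)
    } else {
      sprintf("%s AND %s", lower, upper)
    }
  }
  clauses
}

clauses <- jdbc_partition_clauses("grp_unif", 0, 8, 8)
print(clauses)
```

The first clause has no lower bound and the last clause has no upper bound, which is why rows outside of the provided bounds are still read rather than filtered out.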

Based on this information, we can optimize the column that we choose for the partitioning as well as the values for upperBound and lowerBound such that the intervals for the values of partitionColumn will end up with roughly the same size.

In our example:

The grp_unif column was purposefully generated such that this is the case with the most basic partitioning options, each partition having around 1.25 million records.
The grp_skwd column had partitions with very different sizes, the biggest one with more than 6.3 million records, whereas the smallest one had only around 9 thousand records.
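The effect of the chosen column on the partition sizes can be approximated locally before running any Spark job. The sketch below regenerates columns similar to grp_unif and grp_skwd on a smaller sample and counts how many rows would land in each partition. The helper partition_index() is hypothetical and only assumes the interval logic described above. For simplicity, both columns are split using lowerBound = 0, upperBound = 8 and 8 partitions, which is an assumption of this sketch; the counts are simulated, not measured on the database.

```r
# Hypothetical helper: index (1..n) of the partition a value falls into,
# assuming the first and the last partition are open-ended
partition_index <- function(x, lowerBound, upperBound, numPartitions) {
  stride <- trunc(upperBound / numPartitions) - trunc(lowerBound / numPartitions)
  idx <- floor((x - lowerBound) / stride) + 1
  pmin(pmax(idx, 1), numPartitions)
}

# Simulate columns in the same way as the test data, on a smaller sample
set.seed(1)
rows <- 1e5L
groups <- 8L
grp_unif <- floor(runif(rows, min = 0, max = groups))
grp_skwd <- pmin(floor(rexp(rows)), groups - 1L)

# Count rows per partition (assumed bounds 0 and 8, 8 partitions)
sizes <- function(x) {
  tabulate(partition_index(x, 0, 8, 8), nbins = 8)
}
size_unif <- sizes(grp_unif)
size_skwd <- sizes(grp_skwd)

print(rbind(grp_unif = size_unif, grp_skwd = size_skwd))
```

With the uniform column, all partitions should come out at roughly the same size, while the skewed column should put most of the rows into the first partition, so a single task ends up doing most of the reading.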

Setting up partitioning for JDBC via Spark from R with sparklyr

As we have shown in detail in the previous article, we can use sparklyr’s function spark_read_jdbc() to perform the data loads using JDBC within Spark from R. The key to using partitioning is to correctly adjust the options argument with elements named:

numPartitions
partitionColumn
lowerBound
upperBound

These are mapped one-to-one to the options as described above. Once we have done that, we pass the created options to the call to spark_read_jdbc() along with the other connection options in the options argument.

An oversimplified example of a full load could look like so:

# Setup jars and connect to Spark ----
jars <- dir("~/jars", pattern = "jar$", recursive = TRUE, full.names = TRUE)
config <- sparklyr::spark_config()
config$sparklyr.jars.default <- jars
config[["sparklyr.shell.driver-memory"]] <- "6G"
sc <- sparklyr::spark_connect("local", config = config)

# Create basic JDBC connection options ----
jdbcOpts <- list(
  user = "rstudio",
  password = "pass",
  server = "localhost",
  driver = "com.mysql.cj.jdbc.Driver",
  fetchsize = "100000",
  dbtable = "test_table",
  url = "jdbc:mysql://localhost:3306/testdb"
)

# Create the partitioning options ----
partitioningOpts <- list(
  numPartitions = 8L,
  partitionColumn = "grp_unif",
  lowerBound = 0L,
  upperBound = 8L
)

# Use the options combined to read a table ----
test_tbl <- sparklyr::spark_read_jdbc(
  sc,
  "test_table",
  options = c(jdbcOpts, partitioningOpts),
  memory = FALSE
)

# Print a few records ----
test_tbl

# Disconnect ----
sparklyr::spark_disconnect(sc)

TL;DR, just tell me roughly how to partition

At the risk of oversimplifying and omitting some corner cases, to partition reading from Spark via JDBC, we can provide our DataFrameReader with the options:

option("partitionColumn", column_to_partition)
option("numPartitions", n)
option("lowerBound", x)
option("upperBound", y)

The stride is then calculated as stride = y/n - x/n and the partitions are created by splitting the values of partitionColumn roughly like so:

Partition 1: Rows where column_to_partition ∈ <x, x+stride)
Partition 2: Rows where column_to_partition ∈ <x+stride, x+2*stride)
...
Partition n: Rows where column_to_partition ∈ <y-stride, y)

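We try to set up the values of column_to_partition, n, x, and y such that each of the created partitions is of roughly the same size. Note that the first and the last partition are open-ended, so rows with values below x or above y are not filtered out, they simply end up in those two partitions.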

Running the code in this article

If you have Docker available, running the following should yield a Docker container with RStudio Server exposed on port 8787, so you can open your web browser at http://localhost:8787 to access it and experiment with the code. The user name is rstudio and the password is as you choose below:

docker run -d -p 8787:8787 -e PASSWORD=pass jozefhajnala/jozefio

References

A guide to retrieval and processing of data from relational database systems using Apache Spark and JDBC with R and sparklyr
JDBC To Other Databases in Spark documentation
Discussion on the JDBC partitioning topic at StackOverflow
Tips for using JDBC in Apache Spark SQL by Radek Strnad
Class DataFrameReader as Spark’s Scala API Doc
DataFrameReader implementation at GitHub

A guide to retrieval and processing of data from relational database systems using Apache Spark and JDBC with R and sparklyr

Sat, 15 Aug 2020 12:00:00 +0000

Introduction

The {sparklyr} package lets us connect and use Apache Spark for high-performance, highly parallelized, and distributed computations. We can also use Spark’s capabilities to improve and streamline our data processing pipelines, as Spark supports reading and writing from many popular sources such as Parquet, ORC, etc. and most database systems via JDBC drivers.

In this post, we will explore using R to perform data loads to Spark and optionally R from relational database management systems such as MySQL, Oracle, and MS SQL Server and show how such processes can be simplified. We will also provide reproducible code via a Docker image, such that interested readers can experiment with it easily.

Getting test data into a MySQL database

If you are interested only in the Spark loading part, feel free to skip this section.

For a fully reproducible example, we will use a local MySQL server instance, which is very accessible thanks to its open-source nature. We will use the {DBI} and {RMySQL} packages to connect to the server directly from R and populate a database with data provided by the {nycflights13} package that we will later use for our Spark loads.

Let us write the flights data frame into the MySQL database using {DBI} and call the newly created table test_table:

test_df <- nycflights13::flights

# Create a connection to database `testdb`
con <- DBI::dbConnect(
  drv = RMySQL::MySQL(),
  host = "localhost",
  dbname = "testdb",
  user = "rstudio",
  password = "pass"
)

# Write our `test_df` into a table called `test_table`
DBI::dbWriteTable(con, "test_table", test_df, overwrite = TRUE)

# Close the connection
DBI::dbDisconnect(con)

Now we have our table available and we can focus on the main part of the article.

Using JDBC to connect to database systems from Spark

Getting a JDBC driver and using it with Spark and sparklyr

Since Spark runs via a JVM, the natural way to establish connections to database systems is using Java Database Connectivity (JDBC). To do that, we will need a JDBC driver which will enable us to interact with the database system of our choice. For this example, we are using MySQL, but we provide details on other RDBMS later in the article.

Downloading and extracting the connector jar

With a bit of online search, we can download the driver and extract the contents of the zip file:

mkdir -p $HOME/jars
wget -q -t 3 \
  -O $HOME/jars/mysql-connector.zip \
  https://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-8.0.21.zip 
unzip -q -o \
  -d $HOME/jars \
  $HOME/jars/mysql-connector.zip

Now the file we are most interested in for our use case is the .jar file that contains the classes necessary to establish the connection. Using R, we can locate the extracted jar file(s), for example using the dir() function:

jars <- dir("~/jars", pattern = "jar$", recursive = TRUE, full.names = TRUE)
basename(jars)

## [1] "mysql-connector-java-8.0.21.jar"

Connecting using the jar

Next we need to tell {sparklyr} to use that resource when establishing a Spark connection, for example by adding a sparklyr.jars.default element with the paths to the necessary jar files to the config list. Finally, we establish the Spark connection using our config:

config <- list(sparklyr.jars.default = jars)
sc <- sparklyr::spark_connect("local", config = config)

Retrieving data from a database with sparklyr

With the Spark connection established, we can connect to our MySQL database from Spark and retrieve the data. {sparklyr} provides a handy spark_read_jdbc() function for this exact purpose. The API maps closely to the Scala API, but it is not very explicit in how to set up the connection. The key here is the options argument to spark_read_jdbc(), which will specify all the connection details we need.

Setting the `options` argument of `spark_read_jdbc()`

First, let us create a jdbcConnectionOpts list with the basic connection properties. These are the connection URL and the driver. Below we also explicitly specify the user and password, but these can usually also be provided as part of the URL:

# Connection options
jdbcConnectionOpts <- list(
  url = "jdbc:mysql://localhost:3306/testdb",
  driver = "com.mysql.cj.jdbc.Driver",
  user = "rstudio", 
  password = "pass"
)
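As mentioned above, the credentials can usually be provided as part of the URL instead. The following is a minimal sketch of how such a URL could be assembled for MySQL, assuming the driver accepts user and password as URL parameters. The helper mysql_jdbc_url() and the jdbcConnectionOptsUrl list are hypothetical names used only for this illustration.

```r
# Hypothetical helper building a MySQL JDBC URL with the credentials
# passed as URL parameters (assumes the driver supports this form)
mysql_jdbc_url <- function(host, port, db, user, password) {
  sprintf(
    "jdbc:mysql://%s:%s/%s?user=%s&password=%s",
    host, port, db,
    # Encode special characters so they do not break the URL
    utils::URLencode(user, reserved = TRUE),
    utils::URLencode(password, reserved = TRUE)
  )
}

url <- mysql_jdbc_url("localhost", 3306, "testdb", "rstudio", "pass")

# Connection options without separate user and password elements
jdbcConnectionOptsUrl <- list(
  url = url,
  driver = "com.mysql.cj.jdbc.Driver"
)
print(jdbcConnectionOptsUrl$url)
```

Keep in mind that a URL containing credentials is more likely to end up in logs or error messages, so separate user and password options may be the safer choice.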

The last bit of information we need to provide is the identification of the data we want to extract once the connection is established. For this, we can use one of two options:

dbtable - in case we want to create a Spark DataFrame by extracting contents of a specific table
query - in case we want to create a Spark DataFrame by executing a SQL query

Loading a specific database table

First, let us load the database table that we populated with the flights data earlier and named test_table. Putting it all together, we load the data using spark_read_jdbc():

# Other options specific to the action
jdbcDataOpts <- list(dbtable = "test_table")

# Use spark_read_jdbc() to load the data
test_tbl <- sparklyr::spark_read_jdbc(
  sc = sc,
  name = "test_table",
  options = append(jdbcConnectionOpts, jdbcDataOpts),
  memory = FALSE
)

# Print some records
test_tbl

## # Source: spark<test_table> [?? x 20]
##    row_names  year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr>     <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
##  1 1          2013     1     1      517            515         2      830
##  2 2          2013     1     1      533            529         4      850
##  3 3          2013     1     1      542            540         2      923
##  4 4          2013     1     1      544            545        -1     1004
##  5 5          2013     1     1      554            600        -6      812
##  6 6          2013     1     1      554            558        -4      740
##  7 7          2013     1     1      555            600        -5      913
##  8 8          2013     1     1      557            600        -3      709
##  9 9          2013     1     1      557            600        -3      838
## 10 10         2013     1     1      558            600        -2      753
## # … with more rows, and 12 more variables: sched_arr_time <dbl>,
## #   arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <chr>

We provided the following arguments:

sc is the Spark connection that we established using the config that includes necessary jars
name is a character string with the name to be assigned to the newly generated table within Spark SQL, not the name of the source table we want to read from our database
options is a list with both the connection options and the data-related options, so we use append() to combine the jdbcConnectionOpts and jdbcDataOpts lists into one
memory is a logical that tells Spark whether we want to cache the table into memory. A bit more on that and some performance implications below

Executing a query instead

We mentioned above that apart from just loading a table, we can also choose to execute a SQL query and use its result as the source for our Spark DataFrame. Here is a simple example of that.

# Use `query` instead of `dbtable`
jdbcDataOpts <- list(
  query = "SELECT * FROM test_table WHERE tailnum = 'N14228'"
)

# Use spark_read_jdbc() to load the data
test_qry <- sparklyr::spark_read_jdbc(
  sc = sc,
  name = "test_table",
  options = append(jdbcConnectionOpts, jdbcDataOpts),
  memory = FALSE
)

# Print some records
test_qry

## # Source: spark<test_table> [?? x 20]
##    row_names  year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr>     <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
##  1 1          2013     1     1      517            515         2      830
##  2 6570       2013     1     8     1435           1440        -5     1717
##  3 7111       2013     1     9      717            700        17      812
##  4 7349       2013     1     9     1143           1144        -1     1425
##  5 10593      2013     1    13      835            824        11     1030
##  6 13775      2013     1    16     1829           1730        59     2117
##  7 18967      2013     1    22     1902           1808        54     2214
##  8 19417      2013     1    23     1050           1056        -6     1143
##  9 19648      2013     1    23     1533           1529         4     1641
## 10 21046      2013     1    25      724            720         4     1000
## # … with more rows, and 12 more variables: sched_arr_time <dbl>,
## #   arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <chr>

Note that the only element that changed is the jdbcDataOpts list, which now contains a query element instead of a dbtable element.
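An alternative that avoids the query option is to pass the statement as a parenthesized subquery in dbtable, which Spark's documentation describes as valid, since anything allowed in the FROM clause of a SQL query can be used there. This can be handy with Spark versions that do not offer the query option, or when a custom statement needs to be combined with the partitioning options, as the documentation does not allow specifying query together with partitionColumn. The helper name as_dbtable_subquery() and the alias subq are hypothetical.

```r
# Hypothetical helper: wrap a SELECT statement so it can be passed as `dbtable`
as_dbtable_subquery <- function(sql, alias = "subq") {
  # The subquery needs to be in parentheses and have an alias
  sprintf("(%s) AS %s", sql, alias)
}

# Use `dbtable` with a subquery instead of `query`
jdbcDataOpts <- list(
  dbtable = as_dbtable_subquery(
    "SELECT * FROM test_table WHERE tailnum = 'N14228'"
  )
)
print(jdbcDataOpts$dbtable)
```

The resulting jdbcDataOpts list can be appended to the connection options and passed to spark_read_jdbc() in the same way as in the examples above.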

Other RDBMS

Our toy example with MySQL worked fine, but in practice, we might need to access data in other popular RDBMS, such as Oracle, MS SQL Server, and others. The pattern we have shown above however remains, as the API design is the same regardless of the system in question.

In general, we will need 3 elements to successfully connect:

A JDBC driver specified and the resources provided to {sparklyr} in the config argument of spark_connect(), usually in the form of paths to .jar files containing the necessary resources
A connection URL that will depend on the system and other setup specifics
Last but not least, all the technical and infrastructural prerequisites such as credentials with the proper access rights, the host being accessible from the Spark cluster, etc.

Now for some examples that we have worked with in the past and had success with.

Oracle

Oracle JDBC Driver

The drivers can be downloaded (after login) from Oracle’s website and the driver name usually is "oracle.jdbc.driver.OracleDriver". Make sure you use the appropriate version.

Using fully qualified host identification

hostName <- "0.0.0.0"
portNumber <- "1521"
serviceName <- "service_name"

jdbcConnectionOpts <- list(
  user = "username",
  password = "password",
  driver = "oracle.jdbc.driver.OracleDriver",
  fetchsize = "100000",
  url = paste0(
    "jdbc:oracle:thin:@//",
    hostName, ":", portNumber,
    "/", serviceName
  )
)

Using tnsnames.ora

The tnsnames.ora file is a configuration file that contains network service names mapped to connect descriptors for the local naming method, or net service names mapped to listener protocol addresses. With this in place, we can use just the service name instead of fully qualified host, port, and service identification, for example:

serviceName <- "service_name"

jdbcConnectionOpts <- list(
  user = "username",
  password = "password",
  driver = "oracle.jdbc.driver.OracleDriver",
  fetchsize = "100000",
  url = paste0("jdbc:oracle:thin:@", serviceName)
)

Parsing special data types

Note that the JDBC driver on its own may not be enough to parse all data types in an Oracle database. For instance, parsing the XMLType will very likely require xmlparserv2.jar and xdb.jar along with the proper ojdbc*.jar.

MS SQL Server

MS SQL Server JDBC Driver

The drivers for different JRE versions can be downloaded from the Download Microsoft JDBC Driver for SQL Server website. Again, make sure that the JRE version matches the one you use in your environments.

MS SQL Server connection options

serverName <- "0.0.0.0"
portNumber  <- "1433"
databaseName <- "db_name"

jdbcConnectionOpts <- list(
  user = "username",
  password = "password",
  driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  fetchsize = "100000",
  url = paste0(
    "jdbc:sqlserver://",
    serverName, ":", portNumber,
    ";databaseName=", databaseName
  )
)

Even more RDBMS

Logos of R, sparklyr, Spark and selected RDBMS systems

Vlad Mihalcea wrote a very useful article on JDBC Driver Connection URL strings which has the connection URL details for several other common database systems.

Some notes on performance

The `memory` argument

The memory argument to spark_read_jdbc() can prove very important when performance is of interest. What happens when using the default memory = TRUE is that the table in the Spark SQL context is cached using CACHE TABLE and a SELECT count(*) FROM query is executed on the cached table. This forces Spark to perform the action of loading the entire table into memory.

Depending on our use case, it might be much more beneficial to use memory = FALSE and only cache into Spark memory the parts of the table (or processed results) that we need, as the most time-costly operations usually are data transfers over the network. Transferring as little data as possible from the database into Spark memory may bring significant performance benefits.

This is a bit difficult to show with our toy example, as everything is physically happening inside the same container (and therefore the same file system), but differences can be observed even with this setup and our small dataset:

microbenchmark::microbenchmark(
  times = 10,
  setup = {
    library(dplyr)
    library(sparklyr)
    sparklyr::spark_disconnect_all()
    sc <- sparklyr::spark_connect("local", config = config)
  },
  
  # with memory=TRUE (the default)
  eager = {
    one <- sparklyr::spark_read_jdbc(
      sc = sc,
      name = "test",
      options = append(jdbcConnectionOpts, list(dbtable = "test_table"))
    ) %>%
      filter(tailnum == "N14228") %>%
      select(tailnum, distance) %>%
      compute("test")
  },

  # with memory=FALSE
  lazy = {
    two <- sparklyr::spark_read_jdbc(
      sc = sc,
      name = "test",
      options = append(jdbcConnectionOpts, list(dbtable = "test_table")),
      memory = FALSE
    ) %>% 
      filter(tailnum == "N14228") %>%
      select(tailnum, distance) %>%
      compute("test")
  }
)

# Unit: seconds
#  expr       min       lq     mean   median       uq      max neval
# eager 15.460844 16.24838 17.07560 17.03592 17.88299 18.73005    10
#  lazy  9.821039 10.12435 10.40718 10.42766 10.70024 10.97283    10

We see that the “lazy” approach that does not cache the entire table into memory has yielded the result around 41% faster. This is of course by no means a relevant benchmark for real-life data loads but can provide some insight into optimizing the loads.

Partitioning

Partitioning the data can bring a very significant performance boost and we will look into setting it up and optimizing it in detail in a separate article.

Running the code in this article

docker run -d -p 8787:8787 -e PASSWORD=pass jozefhajnala/jozefio

References

JDBC Driver Connection URL strings
MS SQL Server: Programming Guide for JDBC - Building the Connection URL
Oracle: Database JDBC Developer’s Guide and Reference - Data Sources and URLs

A review of my experience with the Big Data Analysis with Scala and Spark course

Sat, 25 Jul 2020 12:00:00 +0000

Introduction

Apache Spark is an open-source distributed cluster-computing framework implemented in Scala whose version 1.0 was released in 2014. It has since become popular for many computing applications, including machine learning, thanks to, among other aspects, its user-friendly APIs. This popularity also gave rise to many online courses of varied quality.

In this post, I share my personal experience with completing the Big Data Analysis with Scala and Spark course on Coursera in May 2020, briefly walk through the content and write about the course assignments. I wrote down each of the paragraphs as I went through the course, so it is not a retrospective evaluation but more of a “review-style diary” of the process of completing the course.

Disclaimer, what to expect

First off, this post does not mean to be an objective review as your experience will most likely be very different from mine. Before this course, I also completed one of the prerequisites - the Functional Programming Principles in Scala course, which I reviewed here.

This is not a paid review and I have no affiliation nor any benefit whatsoever from Coursera or other parties from writing this review.

Course organization, pre-course preparatory work

Organization

The course is organized into video sessions split into 4 weeks, but since it is fully online you can choose your own pace. I completed the course in one week while being on a standard working schedule. Each week apart from week 3 has a programming assignment that is submitted to Coursera and automatically graded.

I found the assignments executed very well from a technical perspective and had no issues at all with downloading, compiling, running, and submitting them.

Similarly to the other courses in the specialization, you can submit each assignment as many times as you want, so there is no stress making the submission right on the first try. Once the course is completed, you get a certificate.

Pre-course setup

Since I have prior experience with Scala and sbt and I already completed a previous course in the specialization, there was no extra setup overhead.

If you are an R user used to conveniently opening RStudio and easily installing packages, you may be surprised by the difficulty of the whole setup. The course does provide setup videos for major platforms, so with a bit of patience, you should be good to go.

Week 1

Content

Very practically introduces Spark, the motivation behind Spark, and comparison to Hadoop, especially for data science type applications and workflows. Presents the main collections class that Spark works with - RDD and provides a very useful comparison between the RDD API and Scala collections API. This builds upon the topics covered in the previous courses, mainly the Functional Programming Principles in Scala course. It also very nicely covers the differences between transformations and actions on RDDs and how that relates to the differences in expression evaluation between the sequential collections and the lazy evaluation of transformations on RDDs.

The content also covers cluster topology, how the driver and worker nodes are related, and what gets executed where. The importance of having data parallelized in such a way that there is little shuffling between the nodes is also highlighted.

I found the video session on latency especially useful. It compares the speeds of different operations, e.g. referencing memory, reading from disk, and sending packets over the network, in very understandable terms, which motivates good practices in partitioning data and designing processes to minimize the operations that are expensive in terms of time.

This was fantastic content and I binged it in one evening.

Assignment

The assignment is quite fun and practical. The goal is to use full-text data from Wikipedia to produce a very simple metric of how popular a programming language is.

The only issue I had was that a lot of methods that should have been used were only introduced in the content of Week 2, so I had to study their documentation myself to implement the assignment. Had I known that they are introduced in detail in Week 2, I would have watched those sessions first before working on this assignment.

Week 2

Content

Starts with explaining foldLeft, fold, and aggregate. Very good explanations. It would have been great to have them before the 1st assignment. Even a structure similar to the first assignment is mentioned along with distributed key-value pairs (pair RDDs), which support reduceByKey.
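For R users, a rough base R analogue of these operations may help. The sketch below uses a hypothetical toy data frame as a stand-in for a pair RDD. It is plain local R, not Spark code: Reduce() with an initial value plays the role of foldLeft, and split() followed by Reduce() mimics what reduceByKey does per key.

```r
# Toy key-value pairs standing in for a pair RDD (hypothetical data)
pairs <- data.frame(
  key   = c("Scala", "R", "Scala", "Java", "R"),
  value = c(3L, 5L, 4L, 1L, 2L)
)

# foldLeft analogue: start from an initial value, combine left to right
total <- Reduce(function(acc, x) acc + x, pairs$value, 0L)

# reduceByKey analogue: group the values by key, then reduce each group
byKey <- vapply(
  split(pairs$value, pairs$key),
  function(v) Reduce(`+`, v),
  integer(1)
)

total
byKey
```

The important difference is that on an RDD the per-key reduction happens in parallel across partitions, while here everything runs in a single R process.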

The later sessions introduce different available joins on pair RDDs, again showing examples, so the concepts are easy to understand. The explanations are very clear and detailed.

Assignment

This time the goal is to look at StackOverflow questions and answers data and apply k-means to cluster the content by languages. This was a very interesting and fun assignment.
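For readers who have not used k-means before, here is a minimal sketch using stats::kmeans(), which ships with R. The data are two artificial, well-separated groups of points generated for illustration only. This is not the StackOverflow data and not the distributed implementation the assignment asks for.

```r
set.seed(42)

# Two artificial groups of 20 points each in 2D (toy data)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# Cluster into 2 groups, trying several random starts
fit <- kmeans(pts, centers = 2, nstart = 5)

# Cluster sizes and the estimated cluster centers
table(fit$cluster)
fit$centers
```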

Implementing it let me appreciate how R is amazing for exploratory and interactive data science work. Compared to R, debugging the Scala code was challenging, and writing data wrangling code to get the data into proper format took me hours.

For a comparison, here is the Scala code I wrote to get the data in requested formats:

val langs = List(
  "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
  "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy"
)
def langSpread = 50000

val lines = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
val raw   = rawPostings(lines)

/** Parse lines into proper structure */
def rawPostings(lines: RDD[String]): RDD[Posting] =
  lines.map(line => {
    val arr = line.split(",")
    Posting(
      postingType =    arr(0).toInt,
      id =             arr(1).toInt,
      acceptedAnswer = if (arr(2) == "") None else Some(arr(2).toInt),
      parentId =       if (arr(3) == "") None else Some(arr(3).toInt),
      score =          arr(4).toInt,
      tags =           if (arr.length >= 6) Some(arr(5).intern()) else None
    )
  })


/** Group the questions and answers together */
def groupedPostings(
  postings: RDD[Posting]
): RDD[(QID, Iterable[(Question, Answer)])] = {
  val questions = postings.
    filter(thisPosting => thisPosting.postingType == 1).
    map(thisQuestion => (thisQuestion.id, thisQuestion))
  val answers = postings.
    filter(thisPosting => thisPosting.postingType == 2).
    map(thisAnswer => (thisAnswer.parentId.get, thisAnswer))
  questions.join(answers).groupByKey()
}

/** Compute the maximum score for each posting */
def scoredPostings(
  grouped: RDD[(QID, Iterable[(Question, Answer)])]
): RDD[(Question, HighScore)] = {

  def answerHighScore(as: Array[Answer]): HighScore = {
    var highScore = 0
    var i = 0
    while (i < as.length) {
      val score = as(i).score
      if (score > highScore) highScore = score
      i += 1
    }
    highScore
  }

  grouped.map{
    case (_, qaList) => (
        qaList.head._1,
        answerHighScore(qaList.map(x => x._2).toArray)
    )
  }
}

Editing Scala in VS Code

And here is data.table code that can reach very similar results:

library(data.table)

# Read Data -----
so <- fread("http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv")
colNames <- c("postTypeId", "id", "acceptedAnswer", "parentId", "score", "tag")
setnames(so, colNames)

# Select questions and answers -----
que <- so[postTypeId == 1, .(queId = id, queTag = tag)]
ans <- so[postTypeId == 2, .(ansId = id, queId = parentId, ansScore = score)]
langSpread <- 50000L

langs = data.frame(
  index = (0:14) * langSpread,
  queTag = c(
    "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
    "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy"
  )
)

# Merge into final object -----
mg <- merge(que, ans,  by = "queId")
mg <- mg[, .(maxAnsScore = max(ansScore)), by = .(queId, queTag)]
mg <- merge(mg, langs)

Some tweaks were also needed to make the grader happy. Since the grader output is not that detailed and no local unit tests were provided, it took me quite a few submissions to get this right. All in all, it was a fun assignment and it highlighted how much simpler R is for this type of usage.

Week 3

Content

This week focuses on partitioning and shuffling. The video lectures explain the concepts very well and even provide a practical hands-on example of how preventing shuffles can significantly improve the performance of operations on RDDs.

It also looks at optimizing Spark operations with partitioners and at the key differences between wide and narrow dependencies in the context of fault tolerance. Again a concrete example is provided along with the explanations, which I find very helpful.
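To make the idea of a partitioner more tangible, here is a small local sketch in plain R. It assumes integer keys and mimics hash partitioning by taking the key modulo the number of partitions. The data and names are made up for illustration and nothing here talks to Spark.

```r
# Hypothetical toy records: integer keys with numeric values
keys        <- c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L)
values      <- c(10, 20, 30, 40, 50, 60, 70, 80)
nPartitions <- 3L

# Hash partitioning sketch: key modulo the number of partitions
partitionId <- keys %% nPartitions

# All records sharing a key land in the same partition ...
partitions <- split(data.frame(key = keys, value = values), partitionId)

# ... so a per-key sum can be computed inside each partition independently
sums <- lapply(partitions, function(p) tapply(p$value, p$key, sum))
sums
```

Because every key lives in exactly one partition, the per-key sums need no data exchange between partitions, which is the property that lets a pre-partitioned RDD avoid a shuffle.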

Assignment

There is no assignment in Week 3.

Week 4

Content

Once again an extremely useful set of sessions that introduce the DataFrame, DataSet, and Spark SQL APIs. Especially for R and Python users, this week’s content is great as the untyped APIs are those that pyspark and SparkR (and sparklyr) users will interact with the vast majority of the time. The sessions explain how these more high-level APIs relate to the typed RDD API and how the 2 main optimization tools - Catalyst and Tungsten - work to optimize the code that users send via the high-level APIs.

There is also a benchmarking comparison: different RDD approaches, which are not optimized automatically, show performance drops versus the Spark SQL API, which optimizes the query such that even a query written inefficiently by the user executes very fast.

Once again, a fantastic content session to wrap up the course.

Assignment

The final assignment of the course focuses on comparing the DataSet API with the DataFrame and Spark SQL APIs in a very practical manner. Based on data on how people spend their time split across categories such as primary needs, work, and spare time activities, we compute some aggregated statistics using the untyped DataFrame and SQL APIs and the typed DataSet API. I feel this assignment really shows the differences between the APIs well in a practical sense and also allows the student to implement each of the tasks more freely.

Since I had previous experience with the DataFrame and Spark SQL APIs from working with them, I found this assignment much less challenging, but still seeing the three APIs in comparison was useful.

TL;DR - Just give me the overview

The course introduces Apache Spark and the key concepts in a very understandable and practical way
The feel of the course was very hands-on and well-executed, the explanations very clear, making use of practical examples
The assignments are fun, each of them working with a real-life set of data and exploring different Spark concepts and APIs
Overall I was very happy with the course and would love to see a more in-depth sequel

Exploring and plotting positional ice hockey data on goals, penalties and more from R with the {nhlapi} package

Sat, 04 Jul 2020 12:00:00 +0000

Introduction

The National Hockey League (NHL) is considered to be the premier professional ice hockey league in the world, founded in 1917. Like many other sports, the data about teams, players, games, and more are a great resource to dive in and analyze using modern software tools. Thanks to the open NHL API, the data is accessible to everyone and the {nhlapi} R package aims to make that data readily available for analysis to R users.

In this post, we will use the {nhlapi} R package to explore the positional data on in-game events, which will provide us with information on the plays that happened in matches and where they happened in terms of the position on the rink. We will also show ways to plot that information using 2D density charts with {ggplot2}.

Installing the {nhlapi} package

We can install {nhlapi} from CRAN. It has only 1 recursive dependency, so the installation is very light and swift. Alternatively, we can also install the latest development version from the master branch on GitHub using the {remotes} or {devtools} package:

# Current CRAN version:
install.packages("nhlapi")

# Development version from GitHub
#  devtools::install_github("jozefhajnala/nhlapi")
#  remotes::install_github("jozefhajnala/nhlapi")

library(nhlapi)

Now we attach the package using library() or require() and can start exploring the data. All the relevant functions start with the nhl_ prefix so they are easy to find and are well documented, so we can get help by using the help() function in R. For example, in this post we will look at the detailed games’ data, so running help(nhl_games) will provide us with detailed information on the available functions.

Retrieving basic game information

To look at a quick example, we will explore the very first game in the regular season 2017/2018, in which the Toronto Maple Leafs played against the Winnipeg Jets. First, let’s look at the very basic game results using the nhl_games_linescore() function which retrieves a very limited amount of high-level information:

linescore <- nhlapi::nhl_games_linescore(gameIds = 2017020001)[[1]]

# Look at quick info on periods
linescore$periods

  periodType            startTime              endTime num ordinalNum
1    REGULAR 2017-10-04T23:17:19Z 2017-10-04T23:58:23Z   1        1st
2    REGULAR 2017-10-05T00:16:56Z 2017-10-05T00:54:10Z   2        2nd
3    REGULAR 2017-10-05T01:12:37Z 2017-10-05T01:50:38Z   3        3rd
  home.goals home.shotsOnGoal home.rinkSide away.goals away.shotsOnGoal
1          0               17         right          3               11
2          0               10          left          1                8
3          2               10         right          3               12
  away.rinkSide
1          left
2         right
3          left

Getting detailed events data for a game

Now to something more interesting, let's investigate what plays were made during the game and where on the ice they happened. We can use nhl_games_feed() to get the most detailed game data available in the API. To get a picture of the amount of detail, we can print the structure of the retrieved object limited to 3 levels of depth:

gameIds <- 2017020001
gameFeed <- nhlapi::nhl_games_feed(gameIds = gameIds)[[1]]
str(gameFeed, max.level = 3)

List of 6
 $ copyright: chr "NHL and the NHL Shield are registered trademarks of the National Hockey League. NHL and NHL team marks are the "| __truncated__
 $ gamePk   : int 2017020001
 $ link     : chr "/api/v1/game/2017020001/feed/live"
 $ metaData :List of 2
  ..$ wait     : int 10
  ..$ timeStamp: chr "20171006_173713"
 $ gameData :List of 6
  ..$ game    :List of 3
  .. ..$ pk    : int 2017020001
  .. ..$ season: chr "20172018"
  .. ..$ type  : chr "R"
  ..$ datetime:List of 2
  .. ..$ dateTime   : chr "2017-10-04T23:00:00Z"
  .. ..$ endDateTime: chr "2017-10-05T01:50:41Z"
  ..$ status  :List of 5
  .. ..$ abstractGameState: chr "Final"
  .. ..$ codedGameState   : chr "7"
  .. ..$ detailedState    : chr "Final"
  .. ..$ statusCode       : chr "7"
  .. ..$ startTimeTBD     : logi FALSE
  ..$ teams   :List of 2
  .. ..$ away:List of 16
  .. ..$ home:List of 16
  ..$ players :List of 45
  .. ..$ ID8474709:List of 22
  .. ..$ ID8473618:List of 22
  .. ..$ ID8471218:List of 22
  .. ..$ ID8470828:List of 21
  .. ..$ ID8477939:List of 22
  .. ..$ ID8476945:List of 22
  .. ..$ ID8473412:List of 22
  .. ..$ ID8475716:List of 21
  .. ..$ ID8476941:List of 22
  .. ..$ ID8476469:List of 21
  .. ..$ ID8477359:List of 22
  .. ..$ ID8479339:List of 21
  .. ..$ ID8479318:List of 22
  .. ..$ ID8476410:List of 22
  .. ..$ ID8475883:List of 21
  .. ..$ ID8474574:List of 22
  .. ..$ ID8477940:List of 21
  .. ..$ ID8473463:List of 21
  .. ..$ ID8477464:List of 22
  .. ..$ ID8473461:List of 22
  .. ..$ ID8476392:List of 22
  .. ..$ ID8466139:List of 22
  .. ..$ ID8470834:List of 22
  .. ..$ ID8468575:List of 22
  .. ..$ ID8477429:List of 22
  .. ..$ ID8468493:List of 22
  .. ..$ ID8474037:List of 22
  .. ..$ ID8475786:List of 22
  .. ..$ ID8470611:List of 22
  .. ..$ ID8476853:List of 22
  .. ..$ ID8477448:List of 21
  .. ..$ ID8477504:List of 22
  .. ..$ ID8479458:List of 21
  .. ..$ ID8477015:List of 22
  .. ..$ ID8475179:List of 21
  .. ..$ ID8476885:List of 22
  .. ..$ ID8475279:List of 22
  .. ..$ ID8473574:List of 22
  .. ..$ ID8476460:List of 22
  .. ..$ ID8475098:List of 22
  .. ..$ ID8474581:List of 22
  .. ..$ ID8478483:List of 22
  .. ..$ ID8475172:List of 22
  .. ..$ ID8480158:List of 21
  .. ..$ ID8479293:List of 22
  ..$ venue   :List of 3
  .. ..$ id  : int 5058
  .. ..$ name: chr "Bell MTS Place"
  .. ..$ link: chr "/api/v1/venues/5058"
 $ liveData :List of 4
  ..$ plays    :List of 5
  .. ..$ allPlays     :'data.frame':    312 obs. of  28 variables:
  .. ..$ scoringPlays : int [1:9] 93 108 112 157 225 269 284 286 290
  .. ..$ penaltyPlays : int [1:12] 21 43 66 86 117 148 167 183 247 253 ...
  .. ..$ playsByPeriod:'data.frame':    3 obs. of  3 variables:
  .. ..$ currentPlay  :List of 3
  ..$ linescore:List of 10
  .. ..$ currentPeriod             : int 3
  .. ..$ currentPeriodOrdinal      : chr "3rd"
  .. ..$ currentPeriodTimeRemaining: chr "Final"
  .. ..$ periods                   :'data.frame':   3 obs. of  11 variables:
  .. ..$ shootoutInfo              :List of 2
  .. ..$ teams                     :List of 2
  .. ..$ powerPlayStrength         : chr "Even"
  .. ..$ hasShootout               : logi FALSE
  .. ..$ intermissionInfo          :List of 3
  .. ..$ powerPlayInfo             :List of 3
  ..$ boxscore :List of 2
  .. ..$ teams    :List of 2
  .. ..$ officials:'data.frame':    4 obs. of  4 variables:
  ..$ decisions:List of 5
  .. ..$ winner    :List of 3
  .. ..$ loser     :List of 3
  .. ..$ firstStar :List of 3
  .. ..$ secondStar:List of 3
  .. ..$ thirdStar :List of 3
 - attr(*, "url")= chr "https://statsapi.web.nhl.com/api/v1/game/2017020001/feed/live"

Now let's finally look at the data on plays. We can access those via the allPlays data.frame inside the element plays of liveData. The below code chunk will store those in a separate data.frame called plays. We can then filter based on result.event to look for instance only at goals.

plays <- gameFeed$liveData$plays$allPlays
goals <- plays[plays$result.event == "Goal", ]

# Selecting limited columns to keep the print reasonable
goals[, c(2, 5, 6, 12, 15, 18, 26, 23, 24)]

##     result.event
## 94          Goal
## 109         Goal
## 113         Goal
## 158         Goal
## 226         Goal
## 270         Goal
## 285         Goal
## 287         Goal
## 291         Goal
##                                                                     result.description
## 94        Nazem Kadri (1) Wrist Shot, assists: James van Riemsdyk (1), Tyler Bozak (1)
## 109                        James van Riemsdyk (1) Wrist Shot, assists: Tyler Bozak (2)
## 113   William Nylander (1) Wrist Shot, assists: Jake Gardiner (1), Auston Matthews (1)
## 158    Patrick Marleau (1) Backhand, assists: Auston Matthews (2), Mitchell Marner (1)
## 226          Patrick Marleau (2) Wrist Shot, assists: Nazem Kadri (1), Leo Komarov (1)
## 270 Mitchell Marner (1) Wrist Shot, assists: James van Riemsdyk (2), Morgan Rielly (1)
## 285      Mark Scheifele (1) Snap Shot, assists: Patrik Laine (1), Dustin Byfuglien (1)
## 287       Auston Matthews (1) Tip-In, assists: Connor Carrick (1), Andreas Borgman (1)
## 291                        Mathieu Perreault (1) Wrist Shot, assists: Bryan Little (1)
##     result.secondaryType result.strength.name about.period
## 94            Wrist Shot           Power Play            1
## 109           Wrist Shot                 Even            1
## 113           Wrist Shot                 Even            1
## 158             Backhand                 Even            2
## 226           Wrist Shot                 Even            3
## 270           Wrist Shot           Power Play            3
## 285            Snap Shot                 Even            3
## 287               Tip-In                 Even            3
## 291           Wrist Shot                 Even            3
##     about.periodTime           team.name coordinates.x coordinates.y
## 94             15:45 Toronto Maple Leafs            84            -6
## 109            17:40 Toronto Maple Leafs            62             5
## 113            18:23 Toronto Maple Leafs            84           -22
## 158            08:32 Toronto Maple Leafs           -82             2
## 226            00:36 Toronto Maple Leafs            68            12
## 270            08:07 Toronto Maple Leafs            85            -6
## 285            11:31       Winnipeg Jets           -82             8
## 287            11:57 Toronto Maple Leafs            84            -3
## 291            12:57       Winnipeg Jets           -80             1

Now we can see that there are many columns, among them coordinates.x and coordinates.y which tell us the location of the play on the rink, where [0, 0] is the center of the rink.

More involved data retrieval - many games in parallel

Now that we know how to look at the positional data for one match, one very interesting aspect of the data is where plays happen overall. We will now investigate and plot where different plays were happening in the regular season 2017/2018. Looking at ?nhl_games we see that for regular seasons we can usually get all the gameIds in the interval 2017020001:2017021271.

# Define the game ids
gameIds <- 2017020001:2017021271

# Retrieve the data
gameFeeds <- nhlapi::nhl_games_feed(gameIds)
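
The upper bound of that interval can be sanity-checked with a bit of arithmetic. The sketch below assumes that a game id consists of the season start year, a 2-digit game type code (02 taken here to mean regular season) and a 4-digit game number. That layout is my reading of the ids used above, not something guaranteed by the package. In 2017/2018 the league had 31 teams playing 82 regular-season games each.

```r
seasonStart <- 2017L  # the 2017/2018 season
gameType    <- 2L     # assumed code for regular-season games
nTeams      <- 31L    # teams in the league in 2017/2018
gamesEach   <- 82L    # regular-season games per team

# Every game involves two teams, so halve the total
nGames <- nTeams * gamesEach / 2

# Assemble the ids under the assumed layout
gameIdsCheck <- seasonStart * 1e6 + gameType * 1e4 + seq_len(nGames)
nGames
range(gameIdsCheck)
```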

To retrieve the data a bit faster, we can also use the parallel package which is part of the base R installation to retrieve the data in parallel, for example, like so.

# Define the game ids
gameIds <- 2017020001:2017021271

# Create a local cluster using half of the available cores, but at least 1
cl <- parallel::makeCluster(max(1L, parallel::detectCores() %/% 2L))

# Retrieve the data using nhlapi::nhl_games_feed()
gameFeeds <- parallel::parLapplyLB(cl, gameIds, nhlapi::nhl_games_feed)

# Stop the cluster
parallel::stopCluster(cl)

Now we have the data retrieved in a list called gameFeeds. It might be wise to store it on disk such that we do not have to do the long retrieval all the time, for example using saveRDS():

saveRDS(gameFeeds, file.path("~", "gamefeeds_regular_2017.rds"))
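
To actually benefit from the stored file, the retrieval can be wrapped in a small caching helper. The sketch below is one way to do it. cachedRetrieve() is a hypothetical helper written for this illustration, and the usage part works with a temporary file and a stand-in for the slow retrieval instead of real API calls.

```r
# Hypothetical helper: call `retrieve()` only if no cached copy exists yet
cachedRetrieve <- function(path, retrieve) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  res <- retrieve()
  saveRDS(res, path)
  res
}

# Toy usage: the second call reads from disk, so its function never runs
tmp    <- tempfile(fileext = ".rds")
first  <- cachedRetrieve(tmp, function() list(a = 1))
second <- cachedRetrieve(tmp, function() stop("should not be called"))
identical(first, second)
```

With the real data, the second argument could be something like function() nhlapi::nhl_games_feed(gameIds).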

Processing and plotting positional data

Now that the data is safely retrieved, we can process and prepare the data on plays for plotting.

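# Retrieve the data frames with plays from the data
# Note: each element returned by parLapplyLB() above is itself a list of
# length 1 (one feed per game id), hence the gm[[1L]]
getPlaysDf <- function(gm) {
  playsRes <- try(gm[[1L]][["liveData"]][["plays"]][["allPlays"]])
  if (inherits(playsRes, "try-error")) data.frame() else playsRes
}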
plays <- lapply(gameFeeds, getPlaysDf)

# Bind the list into a single data frame
plays <- nhlapi:::util_rbindlist(plays)

# Keep only the records that have coordinates
plays <- plays[!is.na(plays$coordinates.x), ]

# Move the coordinates to non-negative values before plotting
plays$coordx <- plays$coordinates.x + abs(min(plays$coordinates.x))
plays$coordy <- plays$coordinates.y + abs(min(plays$coordinates.y))

Now that we have the data ready in a plays data.frame, we can finally create some cool plots. As an example, the following chunk uses the popular ggplot2 package to plot densities and events, yielding results similar to the ones shown below:

library(ggplot2)

# Look at goals only
goals <- plays[plays$result.event == "Goal", ]

ggplot(goals, aes(x = coordx, y = coordy)) +
  labs(title = "Where are goals scored from") +
  geom_point(alpha = 0.1, size = 0.2) +
  xlim(0, 198) + ylim(0, 84) +
  geom_density_2d_filled(alpha = 0.35, show.legend = FALSE) +
  theme_void()

Some examples of rendered images

With a bit of effort, we can also add a background image of the ice hockey rink to make the density plots more relatable and arrive at some quite informative plots:

Happy exploring!

References

The {nhlapi} package on CRAN
The {nhlapi} package on GitHub

A review of my experience with the Functional Programming Principles in Scala course

Sat, 13 Jun 2020 12:00:00 +0000

Introduction

Functional programming is a programming paradigm where programs are constructed by applying and composing functions, and it is quite popular in data science applications because of some of its useful properties that can help, for example, with scaling computations. One well-known resource to get into functional programming is the Functional Programming Principles in Scala course by École Polytechnique Fédérale de Lausanne.

In this post, I share my personal experience with completing the Functional programming in Scala course on Coursera in May 2020, briefly walk through the content and write about the course assignments. I wrote down each of the paragraphs as I went through the course, so it is not a retrospective evaluation but more of a “review-style diary” of the process of completing the course. Since this blog is oriented towards R, I will also try to make parallels with the R environment that can be relatable to R users and developers.

Disclaimer, what to expect

First off, this post is not meant to be an objective review, as your experience will most likely be very different from mine, based on:

your expectations coming into the course
your prior background
your prior experience with Scala and sbt

I expected to get deeper and more structured knowledge for practical use in Scala with relation to data science and functional programming, as my prior exposure to Scala was mostly maintaining/fixing in an already established code base and creating Spark extensions to work with Spark from R.

I wrote these comments as I went through the content and assignments instead of after finishing, so you might find the tone of the entire article change as the weeks change.

Course organization, pre-course preparatory work

Organization

The course is organized into video sessions split into 6 weeks, but since it is fully online you can choose your own pace. I completed the course in two weeks while being on a standard working schedule. Each week apart from week 5 has a programming assignment that is submitted to Coursera and automatically graded.

I found the assignments executed very well from a technical perspective and had no issues at all with downloading, compiling, running, and submitting them.

You can submit each assignment as many times as you want, so there is no stress making the submission right on the first try. Once the course is completed, you get a certificate.

Pre-course setup

Since I have prior experience with Scala and sbt, the preparation was not difficult and I was able to get going quickly.

Week 1

Content

Briefly presents basic programming paradigms and concepts, the model of evaluation of expressions, call by name and call by value strategies, and focuses on recursion, also introducing tail recursion.
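
Some of these concepts have close relatives in R, which may help R users relate. The sketch below is plain R written for illustration, with made-up function names. It shows that R evaluates function arguments lazily (similar in spirit to call by name), how force() mimics call by value, and what a recursive function with an accumulator looks like. That accumulator shape is what the course calls tail recursion. As far as I know, R itself does not turn such calls into loops automatically.

```r
# Lazy arguments: an argument that is never used is never evaluated
firstOnly <- function(x, y) x
lazyResult <- firstOnly(1, stop("never evaluated"))

# Forcing the argument up front mimics call by value
firstOnlyStrict <- function(x, y) {
  force(y)
  x
}
strictResult <- tryCatch(
  firstOnlyStrict(1, stop("evaluated")),
  error = function(e) "error"
)

# Recursion with an accumulator: the recursive call is the last action
factorialAcc <- function(n, acc = 1) {
  if (n <= 1) acc else factorialAcc(n - 1, acc * n)
}

lazyResult
strictResult
factorialAcc(5)
```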

Assignment

The assignment is purely recursion oriented - Pascal’s triangle and such. I was able to complete the assignment easily, even though I found the final exercise challenging. My issue with the content was that this felt more like school homework and I was coming into the course looking to improve and gain practical skill with Scala.

Week 2

Content

The lectures won me over with constructing a custom class for working with rational numbers. This immediately clicked with me and also was very useful, because it walked through creating classes, defining methods and operators, constructors, requirements, and assertions in a very concise and practical way.

Assignment

A different story. The introduction to the problems goes something like

“We represent a set of integers by its characteristic function and define a type alias for this representation.”

I had a bit of intuition around this (and if you come from a CS background this may come as second nature), but if you have neither of those, you might have some terminology to study before you even understand what the assignment is about. This is fine if it is in line with your expectations of the course, but if you came for something practical, thinking about programming recursive transformations on integer sets represented via characteristic functions may not seem like the best investment of your time.
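
To make the quoted sentence less abstract, here is a minimal sketch of the same idea in R. A set of integers is represented by a function that returns TRUE for members and FALSE otherwise. The function names are my own and this is not the course's Scala code.

```r
# A set of integers is just a function: integer -> logical
singletonSet <- function(elem) function(x) x == elem
contains     <- function(s, x) s(x)

# Set operations combine the underlying membership functions
unionSet     <- function(s, t) function(x) s(x) || t(x)
intersectSet <- function(s, t) function(x) s(x) && t(x)

# The set of all even integers, and the set {1, 2}
evens    <- function(x) x %% 2 == 0
oneOrTwo <- unionSet(singletonSet(1), singletonSet(2))

contains(oneOrTwo, 2)
contains(intersectSet(oneOrTwo, evens), 1)
```

Note that the set of even integers is infinite, yet it is represented without storing any elements, which is the appeal of this representation.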

Week 3

Content

Has a nice explanation of singleton objects and finally, we look at organizing classes, traits, and objects into packages, very nice. Until we are back to recursion again, this time on binary trees.

Assignment

Writing recursive methods on binary trees. I spent way more time thinking about how to do it than programming. The methods are very short once done, but complex to think about, especially if you are not used to thinking about recursive traversal of binary trees.

Also, it got frustrating. The assignment tests were failing because of some predefined timeout that is hard to reproduce locally (you don’t know what tests the grader runs) and you only know when you submit. The issue was to make a recursive method more efficient by placing brackets better around infix operators, which I honestly would not have figured out without reading through the course’s discussion board.

Especially frustrating about this is that the video content only covered trivial cases and the assignment asked for far more complex problems. I was close to just quitting the course at this point.

Week 4

Content

Starts by rewriting Boolean and Integer types as abstract classes - Peano numbers in case of non-negative integers. Quickly flies over subtyping, generics, and pattern matching and shows only trivial examples. The more interesting example, well, go do it yourself! The final video shows a bit of practice with lists programming a recursive insertion sort.

Assignment

Implement Huffman coding methods via binary trees using pattern matching. I had no idea what Huffman coding is, so I had to first research a topic I had no interest in to even understand the assignment. To give a taste, this is one of the exercises:

“Define the function decode which decodes a list of bits which were already encoded using a Huffman tree, given the corresponding coding tree.”

Also, one of the hints, “hint: very simple” was simply priceless.
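
For R users curious what such a decode function does, here is a small sketch in R. The tree representation with nested lists and all function names are hypothetical choices made for this illustration. The sketch uses a simple loop in place of the recursion and pattern matching that the course asks for.

```r
# Hypothetical tree representation: a leaf holds a character,
# a fork holds a left (bit 0) and a right (bit 1) subtree
leaf   <- function(char) list(char = char)
fork   <- function(left, right) list(left = left, right = right)
isLeaf <- function(node) !is.null(node$char)

# Walk down the tree bit by bit, emit a character at each leaf
# and restart from the root
decode <- function(tree, bits) {
  out  <- character(0)
  node <- tree
  for (bit in bits) {
    node <- if (bit == 0) node$left else node$right
    if (isLeaf(node)) {
      out  <- c(out, node$char)
      node <- tree
    }
  }
  paste(out, collapse = "")
}

# Toy code tree with the codes a = 0, b = 10, c = 11
tree <- fork(leaf("a"), fork(leaf("b"), leaf("c")))
decode(tree, c(0, 1, 0, 1, 1, 0))
```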

Weeks 5 and 6

Content

Week 5 looks at methods available for lists and gives more details behind the intuition. The later sessions even show how we can prove some properties of the recursive methods, which I found interesting. In week 6 we go deeper into the collections in Scala and look at for expressions. We solve the n queens problem with Sets and for expressions. One session is dedicated to Maps, Options, and methods such as groupBy.

As an example, we implement addition of polynomials using these concepts. The conclusion session nicely brings things together by walking through one implementation of the conversion of telephone numbers to sentences. The session looking at Map, Filter, and Reduce methods was very relatable to R’s Map(), Filter(), Reduce() and Position() functions as their design is similar to the corresponding methods in Scala (look at ?Reduce in R for more).
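
As a quick reminder of the R side of that parallel, here is a short sketch of the four base R functions on a toy vector:

```r
nums <- 1:10

# Map applies a function to each element and returns a list
squares <- Map(function(x) x^2, nums)

# Filter keeps the elements for which the predicate is TRUE
evenNums <- Filter(function(x) x %% 2 == 0, nums)

# Reduce combines all elements into a single value
total <- Reduce(`+`, nums)

# Position returns the index of the first element matching the predicate
firstBig <- Position(function(x) x > 7, nums)

unlist(squares)
evenNums
total
firstBig
```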

Assignment

There is no assignment in week 5. Week 6 assignment is to compute anagrams of sentences. Compared to the previous assignments I found this one much more fun and interesting, so it felt like a positive ending to the course. Apart from the very first one, this is the only assignment I worked on to get a 100% score as I found it motivating. If more of the course had at least this level of practicality, I would have enjoyed it much more.

Scala in VS Code

Course execution and technical notes

Each assignment comes with a pre-prepared sbt project that can be compiled and partially tested, so it is easy to start working on the assignment
The submission and grading process work conveniently and automatically
The reading materials themselves often refer to Java constructs to explain Scala constructs, which may tell you nothing if you do not have prior experience with Java. For instance: “Traits are like interfaces in Java, but they can also contain concrete members, i.e. method implementations or field definitions.”
Some video lectures are placed in the wrong weeks (the narration says Week 3 but they are actually in Week 2) so it can get a bit confusing
During the first 4 weeks the videos stop for the viewer to fill in the examples, which I guess was meant to be interactive but I found it distracting and always skipped. In the final weeks, this feature was not there, which I found nice.

TL;DR - Just give me the overview

The course introduces key concepts of functional programming in Scala with a strong focus on recursion and walks the students through methods on immutable Scala objects. It also introduces pattern matching, for expressions, subtyping, and generics
The feel of the course was school-like as opposed to more practice-oriented courses
The assignments are challenging and I found them school-like, which I did not prefer as I was looking for more of a practical course

Automating R package checks across platforms with GitHub Actions and Docker in a portable way

Sat, 18 Apr 2020 12:00:00 +0000

Introduction

Automating the execution, testing and deployment of R code is a powerful tool to ensure the reproducibility, quality and overall robustness of the code that we are building. A relatively recent feature in GitHub - GitHub Actions - allows us to do just that without using additional tools such as Travis or Jenkins for our repositories stored on GitHub.

In this post, we will examine using GitHub Actions and Docker to test our R packages across platforms in a portable way and show how this setup works for the CRAN package languageserversetup.

Many different tools, many different syntaxes. And low portability

The motivation behind this post stems mostly from my experience with many different automation tools which we could in simplified terms refer to as CI/CD tools. Some of them, such as Jenkins, Bamboo or Travis, offer a wide variety of features; others, such as GitLab CI and GitHub Actions, are perhaps less feature-rich but offer simplicity and very good out-of-the-box integration with the repository hosting.

What all these tools share, apart from the usefulness of their features, is however a bit less appealing for teams trying to build portable CI/CD pipelines: each has its own syntax.

One good example is the amazing work done to integrate R with Travis. Thanks to this integration, we can work with R relatively well with Travis. It would likely require a similar effort to enable such integration on the other CI/CD tools.

What this means for development teams thinking about CI/CD pipelines is that building portable setups using tool-native syntax can quickly become an endeavor on its own - we have written about some examples of Jenkins-based solutions with regards to environments here and with regards to parallelization here. Porting such a setup built using a specific tool to another tool becomes increasingly difficult.

Containerizing and shell scripting our way to portable setups

Because of the experience described above, when setting up CI pipelines for R packages I find it beneficial and efficient to choose a route of portability instead. When setting up with GitLab CI a few years ago, the approach was:

create a Docker image in which R-related commands will run
write a simple shell script that wraps around it

This process is described in detail in these 2 posts:

Perhaps the biggest advantage of such an approach is that we can simply pick that shell script up and place it into a different tool and, assuming that the new tool supports Docker, everything will run just fine, apart from a few details that still stay tool-based, such as working with environment variables and authentication secrets.

Continuous integration for R-based applications with GitHub Actions

When creating the languageserversetup package, it was very important to test each change across many platforms automatically and since I opted to host the open-source code on GitHub instead of GitLab this time, GitHub Actions seemed like a natural choice for a CI/CD setup.

The current GitHub action for CRAN-like checks looks as follows:

name: check_cran
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v1
    - name: Check for CRAN
      env:
        DOCKER_LOGIN_TOKEN: ${{ secrets.DOCKER_LOGIN_TOKEN }}
        LANGSERVERSETUP_RUN_DEPLOY: false
      run: sh ci/docker_stage.sh ci/check_rhub.R "cran"

As we can see, apart from the skeleton that you get for free, the only line that does some work is the very last one. It tells the GitHub Actions executor to run the shell script docker_stage.sh with 2 arguments:

sh ci/docker_stage.sh ci/check_rhub.R "cran"

This setup is very portable. You could take it almost verbatim and use it within Jenkins, GitLab CI and probably most other CI/CD tools.

The Docker-wrapping shell script

What is the docker_stage.sh script used for? In our case, it is 3-fold:

Run CRAN-like checks automatically
Run containerized deployments
Run and report test coverage

What they have in common is that they all happen in a Docker container and are described with an R script that can be executed via Rscript. That means that this shell script is just a helper that will:

Pull the needed Docker image
Create a container from that image
Copy the code checked-out by the (GitHub Actions) runner into the container
Execute the R script provided as the first command-line argument (ci/check_rhub.R above) and other arguments if needed
Stop and remove the container when done
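The five steps above can be sketched as a small R helper that only assembles the corresponding Docker CLI calls as strings. This is not the content of the actual docker_stage.sh script. The image name, container name and path inside the container are hypothetical placeholders chosen for illustration.

```r
# Hypothetical helper: builds (but does not run) Docker commands that
# correspond to the five steps described above.
docker_stage_commands <- function(script,
                                  args = character(0),
                                  image = "example/r-ci:latest",  # assumed image name
                                  container = "r_ci_stage",       # assumed container name
                                  workdir = "/home/project") {    # assumed path in the container
  # The R script and its arguments, quoted for a POSIX shell
  rscript_call <- paste(
    c("Rscript", script, shQuote(args, type = "sh")),
    collapse = " "
  )
  c(
    pull    = paste("docker pull", image),
    create  = paste("docker run -d --name", container, "-w", workdir, image, "sleep infinity"),
    copy    = paste0("docker cp ./. ", container, ":", workdir),
    execute = paste("docker exec", container, rscript_call),
    remove  = paste("docker rm -f", container)
  )
}

cmds <- docker_stage_commands("ci/check_rhub.R", "cran")
cat(cmds, sep = "\n")
```

Keeping the commands as plain strings makes it easy to inspect what would run before handing them to a shell.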

The R scripts executed within the Docker container

Now the R scripts that are executed within the container can do almost any actions that you require, from checking the package, running unit tests to the execution of your data science models.

One fully automated example using this exact approach is how the sparkfromr.com book is deployed. The repositories are open-sourced, you can read more in this post.

The only important condition is that your Docker container can run that script successfully. In the R world, that mostly entails having R, all the R packages and their system dependencies installed. This is made amazingly easy by the Rocker Project, which provides versioned base R images, but also images with RStudio. For tidyverse fans, they even have an image with the entire tidyverse ready for use.

This is however very easily testable, as the setup using sh ci/docker_stage.sh ci/check_rhub.R "cran" will not only run via the CI/CD tools, but also on your development machine. Note that on Windows, you might need to enable the Windows Subsystem for Linux for that to be fully true.

Setting up the process this way may nudge you to a containerized development process, where you develop the project within a container. In that case, the fact that everything works is just an automatic consequence of the development process and the containerization has no overhead, because we can use that very same image for CI/CD purposes.

The GitHub actions yaml, environment variables and secrets

Of the few elements of the setup that are not fully portable, environment variables and secrets are the most notable. For GitHub Actions, we can set them with the env: clause, for example:

      env:
        DOCKER_LOGIN_TOKEN: ${{ secrets.DOCKER_LOGIN_TOKEN }}
        LANGSERVERSETUP_RUN_DEPLOY: false

The above will set the LANGSERVERSETUP_RUN_DEPLOY environment variable to false and will expose the encrypted secret named DOCKER_LOGIN_TOKEN as an environment variable of the same name. The secrets can be created via your repository’s Settings -> Secrets menu on GitHub.
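On the R side, such variables can be read with Sys.getenv(). Below is a minimal sketch, assuming the variables are visible to the R process. How they are forwarded into the Docker container is not shown here, and the helper names env_flag() and env_secret() are made up for this illustration.

```r
# Read a flag-like environment variable; anything other than "true"
# (case-insensitive), including an unset variable, is treated as FALSE.
env_flag <- function(name) {
  identical(tolower(Sys.getenv(name, unset = "false")), "true")
}

# Read a secret; fail early with a clear message when it is missing or empty.
env_secret <- function(name) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# Simulate what the env: clause of the workflow would provide
Sys.setenv(LANGSERVERSETUP_RUN_DEPLOY = "false")
run_deploy <- env_flag("LANGSERVERSETUP_RUN_DEPLOY")
run_deploy
```

Failing early on a missing secret tends to give a clearer log message than a failed login later in the pipeline.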

A concrete example - Checking an R package automatically using R Hub in 4 steps

Now with all the information above, let us look at a quick walk-through of a setup that will let us check your R package on multiple platforms using R Hub. We need:

An R script that will run and evaluate the check via R Hub - For the package languageserversetup, this looks as follows: ci/check_rhub.R. Note that this script is years old and quite possibly needlessly long and complicated.
A shell script that will run the R script, such as ci/docker_stage.sh
A docker container in which the R script can run. We have covered this in some detail in Preparing a private docker image to use with R-hub
A .yaml file in the .github/workflows directory of your repository, for example, .github/workflows/check_cran.yml

And that is it. Now we will have our package checked each time we push a commit to our repository:

GitHub Action log for package check via R Hub

Other uses - Test coverage reporting and script-based deployments

Since the languageserversetup repository is completely open, you can also look at the other GitHub Actions set up for that repository. Note that all of them use the very same docker_stage.sh script. The only thing that changes is the R script used for each purpose:

Test coverage reporting with covr and codecov.io
- R script running the coverage computation with covr and publishing it to codecov.io
- GitHub Action definition
Debian-based script deployments
- R script running an example deployment and some tests
- GitHub Action definition

TL;DR - just show me the code

An example implementation of package testing with the CRAN package languageserversetup:

GitHub Actions workflows for the languageserversetup package
Docker-based shell script to execute R scripts
R script for package checks with R Hub
R script for reporting test coverage using Codecov.io and covr

An example implementation of bookdown publication publishing with sparkfromr.com

GitHub Actions workflows for the book deployment
Docker-based shell script to deploy the book. Note that there is no need for a separate R script because the action to be done is trivial.

References

Docker images for R on the Rocker Project
Get started with Docker official documentation
GitHub Actions: Creating and storing encrypted secrets
GitHub Actions: Documentation

Setting up R with Visual Studio Code quickly and easily with the languageserversetup package

Sat, 21 Mar 2020 12:00:00 +0000

Introduction

Over the past years, R has been gaining popularity, bringing to life new tools to work with it. Thanks to the amazing work by contributors implementing the Language Server Protocol for R and writing Visual Studio Code Extensions for R, the most popular development environment amongst developers across the world now has very strong support for R as well.

In this post, we will look at the languageserversetup package that aims to make the setup of the R Language Server robust and easy to use by installing it into a separate, independent library and adjusting R startup in a way that initializes the language server when relevant.

Visual Studio Code and R

According to the 2019 StackOverflow developer survey, Visual Studio Code is the most popular development environment across the board, with amazing support for many languages and extensions ranging from improved code editing to advanced version control support and Docker integration.

Until recently the support for R in Visual Studio Code was in my view not comprehensive enough to justify switching from other tools such as RStudio (Server) to using VS Code exclusively. This has changed with the work done by the team implementing the following 3 tools:

The R extension for VS Code
The R LSP Client extension for VS Code
The languageserver package: An implementation of the Language Server Protocol for R

The features now include all that we need to work efficiently, including auto-complete, definition provider, code formatting, code linting, information on functions on hover, color provider, code sections and more.

If you are interested in more steps around the setup and the overview of features I recommend the Writing R in VSCode: A Fresh Start blogpost by Kun Ren. I also recommend that you follow Kun on Twitter if you are interested in the latest developments.

Setup considerations, issues, and tweaks: creating the `languageserversetup` package

With my current team, we have almost fully embraced Visual Studio Code as an IDE for our work in R, which is especially great as the work is multi-language and multi-environment in nature and we can do our development in Scala, R and more, including implementing and testing Jenkins pipelines and designing Docker images without leaving VS Code.

While setting up for the team on multiple systems and platforms, we found the following interesting points. They were my motivation to write a small R package, languageserversetup, that should make the installation and setup of the R language server as easy and painless as possible.

Managing package libraries

One of the specifics of R is that all extensions (packages) are installed into package libraries, be it the packages we develop and use for our applications or the tools we use mostly as means to make our development life easier. We can therefore often end in a situation where we need to use different versions of R packages for different purposes. For example, the languageserver package currently needs R6 (>= 2.4.1), stringr (>= 1.4.0) and more, in total it recursively requires 75 other R packages to be installed. When installing and running the package we can run into conflicting versions of what our current applications need versus what the languageserver package requires to function properly.

Managing library paths

The second consideration, related to the first one, is that if we simply install the language server into the default library, for instance with install.packages, it will change the library to a state that is possibly not desired. We can also run into unexpected crashes, where the languageserver will function properly for a time until one of the non-triggered dependencies with a hidden conflict gets triggered.

A solution - Complete library separation and smart initialization

One possible solution to the above issues is to:

Keep the package libraries of the languageserver and the other libraries that the user uses (perhaps apart from the main system library containing the base and recommended packages that come with the R installation itself) completely separated, including all non-base dependencies
Initialize that library only when the R process in question is triggered by the language server, otherwise, keep the process untouched and use the user libraries as usual
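The two points above can be sketched in a few lines of base R. This is only an illustration of the idea, not the code of the languageserversetup package. The detection rule (looking for "languageserver::run" in the command-line arguments), the function names and the library location are assumptions made for this example.

```r
# Hypothetical check: does this R process look like it was started to run
# the language server? Assumes the editor starts R with a command that
# contains "languageserver::run".
is_langserver_process <- function(args = commandArgs()) {
  any(grepl("languageserver::run", args, fixed = TRUE))
}

# Return the library paths a process should use: the separate library plus
# the system library for language server processes, the current paths
# otherwise.
langserver_lib_paths <- function(lib, args = commandArgs(), current = .libPaths()) {
  if (is_langserver_process(args)) c(lib, .Library) else current
}

lib <- file.path(tempdir(), "languageserver-library")  # assumed location
dir.create(lib, showWarnings = FALSE)

# A regular session keeps its library paths untouched
regular <- langserver_lib_paths(lib, args = c("R", "--no-save"))

# A language server process would get the separate library first
server <- langserver_lib_paths(
  lib,
  args = c("R", "--slave", "-e", "languageserver::run()")
)
```

In a startup file, the result would be passed to .libPaths() so that it takes effect only for the processes that need it.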

Solving it with 2 R commands - the `languageserversetup` package

To make the above solution easily accessible, I have created a small R package called languageserversetup that will do all the work for you. It can be installed from CRAN and it has no dependencies on other R packages:

install.packages("languageserversetup")

Now the entire setup has only 2 steps:

Install the languageserver package and all of its dependencies into a separate independent library (Will ask for confirmation before taking action) using:

languageserversetup::languageserver_install()

Add code to .Rprofile to automatically align the library paths for the language server functionality if the process is an instance of the languageserver, otherwise, the R session will run as usual with library paths unaffected. This is achieved by running (will also ask for confirmation):

languageserversetup::languageserver_add_to_rprofile()

That’s it. Now you can enjoy the functionality without caring about the setup of libraries or any package version conflicts. Thanks to the full separation of libraries, the removal is as trivial as deleting the library directory.

In action with VS Code

Installing languageserversetup and using `languageserver_install()`

Installing the language server

Initializing the functionality with `languageserver_add_to_rprofile()`

Adding the language server to startup

All done, now enjoy the awesomeness!

Technical details

If you are interested in more technical details,

please visit the package’s openly accessible GitHub repository.
the README.md has information on options configuration, installation, uninstallation, platforms and more
the help files for the functions can be accessed from R with ?languageserver_install, ?languageserver_startup, ?languageserver_add_to_rprofile and ?languageserver_remove_from_rprofile for more details on their arguments and customization
for testing, GitHub actions are set up for multiple platforms and to run all CRAN checks on the package on each commit

References

The R extension on VS Code Marketplace
The R LSP Client extension on VS Code Marketplace
The languageserver package on GitHub
The languageserversetup package on GitHub
Kun Ren’s Writing R in VSCode: A Fresh Start blogpost

R is turning 20 years old next Saturday. Here is how much bigger, stronger and faster it got over the years

Sat, 22 Feb 2020 12:00:00 +0000

Introduction

It is almost the 29th of February 2020! A day that is very interesting for R, because it marks 20 years from the release of R v1.0.0, the first official public release of the R programming language.

In this post, we will look back on the 20 years of R with a bit of history and 3 interesting perspectives - how much faster did R get over the years, how many R packages were being released since 2000 and how did the number of package downloads grow.

The first release of R, 29th February 2000

The first official public release of R happened on the 29th of February, 2000. In the release announcement, Peter Dalgaard notes:

“The release of a current major version indicates that we believe that R has reached a level of stability and maturity that makes it suitable for production use. Also, the release of 1.0.0 marks that the base language and the API for extension writers will remain stable for the foreseeable future. In addition we have taken the opportunity to tie up as many loose ends as we could.”

Today, 20 years later, it is quite amazing how true the statement around the API remaining stable has proven. The original release announcement and full release statement are still available online.

You can also still download the very first public version of R. For instance, for Windows you can find it on the Previous Releases of R for Windows page. And it is quite runnable, even under Windows 10.

Further down in history, to 1977

Now to give R justice in terms of age, we need to go even further into history. In the full release statement of R v1.0.0, we can find that

R implements a dialect of the award-winning language S, developed at Bell Laboratories by John Chambers et al.

With some digging we can use the Wayback Machine Internet Archive to find interesting notes on Version 1 of S itself written by John Chambers, where he writes:

Over the summer of 1976, some actual implementation began. The paper record has a gap over this period (maybe we were too busy coding to write things down). My recollection is that by early autumn, a language was available for local use on the Honeywell system in use at Murray Hill. Certainly by early 1977 there was software and a first version of a user’s manual.

As we can see, the ideas and principles behind R are actually much older than 20 years and even 40 years. If you are interested in the history, I recommend watching the very interesting 40 years of S talk from useR! 2016.

Faster - How performant is R today versus 20 years ago?

With the 20th birthday of R approaching, I was curious how much faster the implementation of R got with increasing versions. I wrote some very simple benchmarking code to solve the Longest Collatz sequence problem for the first 1 million numbers with a brute-force-ish algorithm.

I then executed it on the same hardware using 20 different versions of R, starting with the very original 1.0, through 2.0, 3.0 all the way to today’s development version.

Benchmarking code

Below is the code snippet with the implementation to be benchmarked:

col_len <- function(n) {
  len <- 0
  while (n > 1) {
    len <- len + 1
    if ((n %% 2) == 0)
      n <- n / 2
    else {
      n <- (n * 3 + 1) / 2
      len <- len + 1
    }
  }
  len
}

res <- lapply(
  1:10,
  function(i) {
    gc()
    system.time(
      max(sapply(seq(from = 1, to = 999999), col_len))
    )
  }
)
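The res object created above is a list of timing objects, one per repetition. Below is a minimal sketch of how such timings could be reduced to the elapsed seconds and their median. It uses a much smaller input and fewer repetitions so that it finishes quickly, and it is not necessarily how the chart below was produced.

```r
# Same function as in the benchmark above
col_len <- function(n) {
  len <- 0
  while (n > 1) {
    len <- len + 1
    if ((n %% 2) == 0)
      n <- n / 2
    else {
      n <- (n * 3 + 1) / 2
      len <- len + 1
    }
  }
  len
}

# Smaller input and fewer repetitions than in the post, to finish quickly
res <- lapply(
  1:3,
  function(i) {
    gc()
    system.time(
      max(sapply(seq(from = 1, to = 9999), col_len))
    )
  }
)

# Each element is a "proc_time" object; keep only the elapsed seconds
elapsed <- vapply(res, function(x) x[["elapsed"]], numeric(1))
median(elapsed)
```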

Results

Now to the interesting part, the results - the below chart shows the boxplots of time required to execute the code in seconds, with R versions on the horizontal axis.

We can see that the median time to execute the above code to find the longest Collatz sequence amongst the first million numbers was:

February 2000: More than 17 minutes with the first R version, 1.0.0
January 2002: A large performance boost came already with the 1.4.1 release, decreasing the time by almost 4x, to around 4.5 minutes
October 2004: Even more interestingly, my measurements have seen another big improvement with version 2.0.0 - to just 168 seconds, less than 3 minutes. I was not however able to get such good results for any of the later 2.x versions
April 2014: Another speed improvement came 10 years later, with version 3.1 decreasing the time to around 145 seconds
April 2017: Finally, the 3.4 release has seen another significant performance boost, from this version on the time needed to perform this calculation is less than 30 seconds.
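As a rough check of the figures quoted above, the speedup relative to version 1.0.0 can be computed directly. The times below are the rounded values from the list. The first is a lower bound and the last an upper bound, so the final ratio should be read as "more than".

```r
# Approximate median times in seconds, as quoted in the list above
times <- c(
  "1.0.0" = 17 * 60,   # more than 17 minutes
  "1.4.1" = 4.5 * 60,  # around 4.5 minutes
  "2.0.0" = 168,
  "3.1"   = 145,
  "3.4"   = 30         # less than 30 seconds
)

# How many times faster each version is compared to 1.0.0
speedup <- round(times[["1.0.0"]] / times, 1)
speedup
```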

Some details and notes

The above is by no means a proper benchmarking solution and was run purely out of interest. The benchmarks were run on a

Windows-based PC with Intel Core (TM) i5-4590 Processor and 8 GB DDR3 1600 MHz RAM.
using 32-bit versions of R, with no additional packages installed
the following options were used with R 1.0.0: --vsize=900M --nsize=20000k

Some interesting notes on running the same code with a 20-year-old version of R:

There was no message() function available
Integer literals using the L suffix were not accepted
The function do.call() needed a character function name as the first argument
Did not accept = for assignment. It did accept _ though ;-)

Other than that, the code ran with no issues across all the tested versions.

Stronger - How many packages were released over the years?

The power of R comes in no small part from the fact that it is easily extensible and the extensions are easily accessible using The Comprehensive R Archive Network, known to most simply as CRAN.

Next on the list of interesting numbers was to look at how CRAN has grown to the powerhouse with more than 15 000 available packages today. Namely, I looked at the numbers of new packages (first releases to CRAN), and total releases (including newer versions of existing packages) over the years using the pkgsearch package.

Results

Once again, the numbers speak for themselves

In 2000-2004 the number of newly released packages was less than 100
In 2010 CRAN has seen more than 400 new packages
In 2014 more than 1000 packages had their first release
In 2017 over 2000 new packages were added to CRAN
In 2018 and 2019, the number of total CRAN releases was more than 10 000

I would like to take this opportunity to thank the team behind CRAN for making this amazing growth possible.

Bigger - How did downloads of R packages grow?

The size of the user and developer bases of programming languages is difficult to estimate, but we can use a simple proxy to get a picture in terms of growth. RStudio’s CRAN mirror provides a REST API from which we can look at and visualize the number of monthly downloads of R packages in the past 7 years:

Note that the numbers above represent just one of many CRAN mirrors, so the true number of package downloads is much higher. The informational value of the chart is mostly in the growth, which is quite impressive:

January 2013 has seen around 1.1 million
January 2015 it was 7.7 million
January 2017 it was 26.9 million
January 2020 more than 128 million downloads

Thank you for the 20 years

And here is to 20 more.

Cheers!

Resources

The release announcement on stat.ethz.ch
The full release statement at developer.r-project.org
The older version R Installers for Windows

Releasing and open-sourcing the Using Spark from R for performance with arbitrary code series

Sat, 04 Jan 2020 12:00:00 +0000

Introduction

Over the past months, we published and refined a series of posts on Using Spark from R for performance with arbitrary code. Since the posts have grown in size and scope, blog posts were no longer the best medium to share the content in the way most useful to the readers, so we decided to compile a publication instead and open-source it for all readers to use freely.

In this post, we present Using Spark from R for performance, an open-source online publication that will serve as a medium to communicate the current and future installments of the series comprehensively, including instructions on how to use it and a Docker image with all the prerequisites needed to run the code examples.

Who is this book for?

The book is published at sparkfromr.com and it focuses on users who are interested in practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages. This publication focuses on exploring the different interfaces available for communication between R and Spark using the sparklyr package.

We have also created a Docker image that lets you use the code in the book without caring for setting up all the necessary software requirements such as Java, Spark, and all the necessary R packages. A guide to using the book with that image is included as a separate chapter.

What are the main topics currently covered?

The main topics are summarized in the following chapters:

Are the sources also available?

Yes. The content is rendered and published automatically from publicly accessible git repositories, you can find the

Content sources in the sparkfromr GitHub repository
Rendered version in the sparkfrom_deployed GitHub repository
Automatically built Docker image used to render the book on DockerHub
Sources used to build the Docker images in the sparkfrom_docker GitHub repository

All contributions to the above are of course most welcome.

Where can issues be raised?

In case you find any errors and other issues with the book, or simply have requests for improvements or more content features the ideal place to raise them is directly in the GitHub repositories:

For issues in the content of the book, please raise an issue here
For issues related to the Docker image, please raise an issue here

Acknowledgments and thank yous

Creation of this book would not be possible without many openly available resources such as the

R packages around the rmarkdown ecosystem created by Yihui Xie, namely the bookdown package via which this publication is rendered
the project also heavily relies on the Rocker Project which provides Docker images for the R environment thanks to Carl Boettiger, Dirk Eddelbuettel, and Noam Ross
last but not least there would be nothing to write about in this short book if the sparklyr package was not written by Javier Luraschi et al., the R programming language itself maintained by the R core group and the Apache Spark creators and maintainers.

My thanks go to the creators and maintainers of all these amazing open-source tools.

Logos of bookdown, Apache Spark and R

Happy reading!

4 great free tools that can make your R work more efficient, reproducible and robust

Sat, 21 Dec 2019 12:00:00 +0000

Introduction

It is Christmas time again! And just like last year, what better time than this to write about the great tools that are available to all interested in working with R. This post is meant as a praise to a few selected tools and packages that helped me to be more efficient and productive with R in 2019.

In this post, we will praise free tools that can help your work become more efficient, reproducible and productive, namely the data.table package, the Rocker project for R-based Docker images, the base package parallel, and the R-Hub service for package checking.

data.table - R’s unsung powerhouse

One of the packages I find most under-marketed and under-appreciated in the R package ecosystem is data.table. If it is mentioned, it is mostly for its speed and memory efficiency, which is certainly well deserved, but I feel dismissing the other benefits and features is not doing it justice. Here are a few points that I like about data.table that do not get that much exposure.

The concise and generic syntax

I enjoy the fact that data.table’s syntax is very concise and principle-driven. In effect, all you need for most common use cases is to learn using the [] brackets and an amazing world of opportunities will follow. Just one small example on taking 2 data tables, joining them on their common columns, filtering on rows, summarizing a variable grouped by an evaluated expression on 1 line:

# Prepare the packages and data
library(data.table)
flts <- as.data.table(nycflights13::flights)
wthr <- as.data.table(nycflights13::weather)
byCols <- intersect(names(flts), names(wthr))

# Join, filter, group by and aggregate
wthr[flts, on = byCols][origin == "JFK", mean(dep_delay, na.rm = TRUE), precip > 0]

##    precip       V1
## 1:  FALSE 10.92661
## 2:     NA 13.66543
## 3:   TRUE 29.70753

Fully featured data wrangling toolbox

And this is just scratching the surface as data.table also provides functions such as

dcast() and melt() for efficient data reshaping
rbindlist() for fast replacement of do.call("rbind", l)
fsetdiff(), fintersect(), funion() and fsetequal() for fast and easy to use operations on data.tables
rollup(), cube() and groupingsets() to create pivot tables, more on that in a dedicated article
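A few of the functions listed above in action, as a minimal sketch on a tiny made-up table. The column names id, q1 and q2 are purely illustrative.

```r
library(data.table)

# rbindlist(): bind a list of tables, instead of do.call("rbind", l)
l <- list(
  data.table(id = 1:2, q1 = c(10, 20), q2 = c(1, 2)),
  data.table(id = 3L,  q1 = 30,        q2 = 3)
)
dt <- rbindlist(l)

# melt(): wide to long, one row per id and measured variable
long <- melt(dt, id.vars = "id", measure.vars = c("q1", "q2"))

# dcast(): long back to wide
wide <- dcast(long, id ~ variable, value.var = "value")

# fsetdiff(): rows of dt that are not present in the second table
rest <- fsetdiff(dt, dt[id == 1L])
```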

No dependencies

All in all, I consider data.table to be a single package that brings speed, efficiency and conciseness to all data wrangling operations. Another benefit that also often stays unmentioned is the fact that data.table has no dependencies on other non-base R packages, which is beneficial for maintenance, stability, reproducibility, size and deployment speeds.

Fast reading and writing of (compressed) csvs

One additional feature of data.table that I use regularly is the ability to read and write data to and from text files with amazing speeds using the fread() and fwrite() functions. On one project, it gave the team I was a part of such a benefit I wrote an article on it.

Not only is it very fast and convenient, but thanks to a recently added feature, data.table now supports fwrite() directly to gzipped csvs, which saves significant space when writing large amounts of data.
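Below is a minimal sketch of writing a gzipped csv, assuming a data.table version recent enough to include this feature. The file is written to a temporary directory and read back with base R to confirm that the content survived the round trip.

```r
library(data.table)

dt <- data.table(id = 1:1000, value = rep(c("a", "b"), 500))

# The .gz extension makes fwrite() compress the output (compress = "auto")
path <- file.path(tempdir(), "example.csv.gz")
fwrite(dt, path)

# Read back with base R, which handles gzipped connections directly
back <- read.csv(gzfile(path), stringsAsFactors = FALSE)
nrow(back)
```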

For getting started with data.table, I recommend the Introduction to data.table vignette

The Rocker project for R-based Docker images

Containerization is a powerful and useful tool for many purposes, one of them being reproducibility. In the R world, ensuring that our R library contains the exact versions of packages we need can be achieved by using tools such as packrat or its successor renv.

Managing the R package versions can however only get us so far, especially when relying on other system dependencies such as pandoc for rendering our R Markdown documents or Java. And when we need to test our R applications against multiple versions of R itself, things can get very tedious and messy very quickly using just one environment, especially on UNIX-based platforms.

In comes the Rocker project - Docker Containers for the R Environment. Thanks to the efforts of Carl Boettiger, Dirk Eddelbuettel, and Noam Ross, spinning a container with a specific version of R, RStudio or even the tidyverse packages is as easy as launching a terminal and running

docker run --rm -ti rocker/r-base

Want to test your R code using an older version of R, say some Very, Very Secure Dishes from 2016? As easy as

docker run --rm -ti rocker/r-ver:3.2.5

Even more usefully, all the sources to build the Docker images are also available on GitHub, so we can adapt the images for our own usage. For instance

the series of articles on Using Spark from R for performance with arbitrary code on this blog uses a setup adapted from the rocker/r-ver:3.6.1 image
we have also used the images provided by the Rocker project when setting up continuous multi-platform R package building, checking and testing with R-Hub
even to keep the building of this very website stable and reproducible, a Docker image based on the Rocker project is used

On a more generic note, learning Docker is beneficial to R users also when working outside R and there are many great learning resources to do so. For learning Docker I recommend the Get started documentation.

Base package parallel

The internals of the R language are single-threaded, meaning that when writing R code, unless optimized for multi-threaded computation under the hood such as data.table does, our code will only utilize 1 thread, which can pose a challenge to performance even in common daily tasks, especially now that even common, very portable ultrabooks come with processors with 4 or more cores and 8 or more threads.

The R ecosystem provides many ways to take advantage of the multiple threads available. In this post I would like to give more visibility to the parallelization options that come with the base R installation itself, not requiring any extra external dependencies or packages - via the package parallel.

In a very small showcase, let’s look at how much faster we can execute a brute-force-ish solution to the Longest Collatz sequence problem for the first 10 million numbers. First, define the function that will compute the sequence length for a given integer n:

col_len <- function(n) {
  len <- 0L
  while (n > 1) {
    len <- len + 1L
    if ((n %% 2) == 0)
      n <- n / 2
    else {
      n <- (n * 3 + 1) / 2
      len <- len + 1L
    }
  }
  len
}

Running the function for numbers from 1 to 9,999,999 using sapply() and measuring the time on this particular laptop showed that the process finished in around 580 seconds - almost 10 minutes:

which.max(sapply(seq(from = 1, to = 9999999), col_len))

## [1] 8400511

Now we will create a simple cluster on the local machine using all available threads and send the function definition to all the created worker processes:

# Attach the parallel package
library(parallel)
# Create a cluster using all available threads
cl <- makeCluster(detectCores(), methods = FALSE)
# Send the definition of the col_len function to the workers
clusterExport(cl, "col_len")

Next, we execute the function in parallel using the cluster. It is as simple as just using parSapply() instead of sapply() and providing the cluster definition cl as the first argument:

# Execute in parallel using cluster cl
which.max(parSapply(cl, seq(from = 1, to = 9999999), col_len))

## [1] 8400511

After the process is done, it is good practice to stop the cluster:

# Stopping the cluster
stopCluster(cl)

Using all 8 available threads, the time needed to execute the code and get the same results went down to around 90 seconds or 1.5 minutes. We can therefore gain significant time savings with base R alone by executing some of our code in parallel, adjusting the code very minimally and using very familiar syntax.
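To try the approach without waiting several minutes, here is a scaled-down, self-contained sketch. It uses only 2 workers and the first 9,999 numbers (both choices are assumptions made to keep it quick) and checks that the parallel run returns exactly the same sequence lengths as the sequential one.

```r
library(parallel)

col_len <- function(n) {
  len <- 0L
  while (n > 1) {
    len <- len + 1L
    if ((n %% 2) == 0)
      n <- n / 2
    else {
      n <- (n * 3 + 1) / 2
      len <- len + 1L
    }
  }
  len
}

nums <- seq(from = 1, to = 9999)

# Two workers only, so the sketch also runs on small machines
cl <- makeCluster(2L)
clusterExport(cl, "col_len", envir = environment())

seq_res <- sapply(nums, col_len)
par_res <- parSapply(cl, nums, col_len)

stopCluster(cl)

# Both approaches should give identical sequence lengths
identical(seq_res, par_res)

# The starting number with the longest sequence in this range
nums[which.max(par_res)]
```

On an input this small, the overhead of starting the workers can outweigh the gain, so the benefit shows mainly on larger workloads such as the one timed above.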

For more information on using the parallel package, I recommend reading the package’s vignette by running vignette("parallel") or reading online. For more information on High-Performance and Parallel Computing with R, there is a dedicated CRAN Task View.

R-hub for fast and automated multi-platform R package testing

R-hub offers free R CMD check as a service on different platforms. This enables R developers to quickly and efficiently check their R packages to make sure they pass all necessary checks on several platforms. As a bonus, the checks seem to run in a very short time, which means we can have our results at hand in a few minutes.

Using R-hub interactively is as simple as installing the rhub package from CRAN, validating your e-mail by running rhub::validate_email() and running:

cr <- rhub::check()

In an interactive session, this will offer a list of platforms to choose from and check our package against them.

CI/CD running checks on multiple platforms with R-hub

For more introductory information, we recommend the Get started with rhub article. We have written about automating and continuously executing multiplatform checks using GitLab CI/CD integration and Docker images in a separate blog post.

Resources

The Christmas praise post for 2018
The Introduction to data.table vignette
The Get started Docker documentation
The Parallel package vignette
The Get started with rhub article

Thank you for reading and
have a very merry Christmas :o)

Using Spark from R for performance with arbitrary code - Part 5 - Exploring the invoke API from R with Java reflection and examining invokes with logs

Sat, 23 Nov 2019 12:00:00 +0000

Introduction

In the previous parts of this series, we have shown how to write functions as both combinations of dplyr verbs and SQL query generators that can be executed by Spark, and how to use the lower-level API to invoke methods on Java object references from R.

In this fifth part, we will look into more details around sparklyr’s invoke() API, investigate available methods for different classes of objects using the Java reflection API and look under the hood of the sparklyr interface mechanism with invoke logging.

Preparation

The full setup of Spark and sparklyr is not in the scope of this post. Please check the first one for some setup instructions and a ready-made Docker image.

If you have docker available, running

docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio

should make RStudio available by navigating to http://localhost:8787 in your browser. You can then use the user name rstudio and password pass to log in and continue experimenting with the code in this post.

# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Connect and copy the flights dataset to the instance
sc <- sparklyr::spark_connect(master = "local")
tbl_flights <- dplyr::copy_to(sc, nycflights13::flights, "flights")

Examining available methods from R

If you have not done so already, we recommend reading the previous part of this series before this one to get a quick overview of the invoke() API.

Using the Java reflection API to list the available methods

The invoke() interface is powerful, but also a bit hidden from the eyes as we do not immediately know what methods are available for which object classes. We can circumvent that using the getMethods method which (in short) returns an array of Method objects reflecting public member methods of the class.

For instance, retrieving a list of methods for the org.apache.spark.SparkContext class:

mthds <- sc %>% spark_context() %>%
  invoke("getClass") %>%
  invoke("getMethods")
head(mthds)

## [[1]]
## <jobj[55]>
##   java.lang.reflect.Method
##   public org.apache.spark.util.CallSite org.apache.spark.SparkContext.org$apache$spark$SparkContext$$creationSite()
## 
## [[2]]
## <jobj[56]>
##   java.lang.reflect.Method
##   public org.apache.spark.SparkConf org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_conf()
## 
## [[3]]
## <jobj[57]>
##   java.lang.reflect.Method
##   public org.apache.spark.SparkEnv org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_env()
## 
## [[4]]
## <jobj[58]>
##   java.lang.reflect.Method
##   public scala.Option org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_progressBar()
## 
## [[5]]
## <jobj[59]>
##   java.lang.reflect.Method
##   public scala.Option org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_ui()
## 
## [[6]]
## <jobj[60]>
##   java.lang.reflect.Method
##   public org.apache.spark.rpc.RpcEndpointRef org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_heartbeatReceiver()

We can see that the invoke() chain has returned a list of Java object references, each of them of class java.lang.reflect.Method. This is a good result, but the output is not very user-friendly from the R user perspective. Let us write a small wrapper that will return some of the method’s details in a more readable fashion, for instance the return type and an overview of parameters:

getMethodDetails <- function(mthd) {
  returnType <- mthd %>% invoke("getReturnType") %>% invoke("toString")
  params <- mthd %>% invoke("getParameters")
  params <- vapply(params, invoke, "toString", FUN.VALUE = character(1))
  c(returnType = returnType, params = paste(params, collapse = ", "))
}

Finally, to get a nice overview, we can make another helper function that will return a named list of methods for an object’s class, including their return types and overview of parameters:

getAvailableMethods <- function(jobj) {
  mthds <- jobj %>% invoke("getClass") %>% invoke("getMethods")
  nms <- vapply(mthds, invoke, "getName", FUN.VALUE = character(1))
  res <- lapply(mthds, getMethodDetails)
  names(res) <- nms
  res
}
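
The named list returned by such a helper can be awkward to scan, so one option is to flatten it into a data frame. The sketch below does not need a Spark connection: the methodDetails list is a hand-written stand-in for the output of getAvailableMethods(), with illustrative values, and methodsToDataFrame is a hypothetical helper name:

```r
# Hand-written stand-in for the result of getAvailableMethods();
# the entries are illustrative, not retrieved from a live session
methodDetails <- list(
  get = c(returnType = "class java.lang.String",
          params = "java.lang.String arg0"),
  get = c(returnType = "class java.lang.String",
          params = "java.lang.String arg0, java.lang.String arg1"),
  getAll = c(returnType = "class [Lscala.Tuple2;", params = "")
)

# Hypothetical helper: one row per method, overloads kept as separate rows
methodsToDataFrame <- function(methods) {
  data.frame(
    name = names(methods),
    returnType = unname(vapply(methods, `[[`, "returnType", FUN.VALUE = character(1))),
    params = unname(vapply(methods, `[[`, "params", FUN.VALUE = character(1))),
    stringsAsFactors = FALSE
  )
}

methodsDf <- methodsToDataFrame(methodDetails)

# Show all overloads of a single method
methodsDf[methodsDf$name == "get", ]
```

Keeping overloads as separate rows makes it easy to filter by name and compare the parameter lists side by side.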

Investigating DataSet and SparkContext class methods

Using the above defined function we can explore the methods available to a DataFrame reference, showing a few of the names first:

dfMethods <- tbl_flights %>% spark_dataframe() %>%
  getAvailableMethods()

# Show some method names:
dfMethodNames <- sort(unique(names(dfMethods)))
head(dfMethodNames, 20)

##  [1] "agg"                           "alias"                        
##  [3] "apply"                         "as"                           
##  [5] "cache"                         "checkpoint"                   
##  [7] "coalesce"                      "col"                          
##  [9] "collect"                       "collectAsArrowToPython"       
## [11] "collectAsList"                 "collectToPython"              
## [13] "colRegex"                      "columns"                      
## [15] "count"                         "createGlobalTempView"         
## [17] "createOrReplaceGlobalTempView" "createOrReplaceTempView"      
## [19] "createTempView"                "crossJoin"

If we would like to see more details, we can now investigate further, for instance by showing the different parameter interfaces of the agg method:

sort(vapply(
  dfMethods[names(dfMethods) == "agg"], 
  `[[`, "params",
  FUN.VALUE = character(1)
))

##                                                                                                                                  agg 
##                                                                             "java.util.Map<java.lang.String, java.lang.String> arg0" 
##                                                                                                                                  agg 
##                                                              "org.apache.spark.sql.Column arg0, org.apache.spark.sql.Column... arg1" 
##                                                                                                                                  agg 
##                                           "org.apache.spark.sql.Column arg0, scala.collection.Seq<org.apache.spark.sql.Column> arg1" 
##                                                                                                                                  agg 
##                                                            "scala.collection.immutable.Map<java.lang.String, java.lang.String> arg0" 
##                                                                                                                                  agg 
## "scala.Tuple2<java.lang.String, java.lang.String> arg0, scala.collection.Seq<scala.Tuple2<java.lang.String, java.lang.String>> arg1"

Similarly, we can look at a SparkContext class and show some available methods that can be invoked:

scMethods <- sc %>% spark_context() %>%
  getAvailableMethods()
scMethodNames <- sort(unique(names(scMethods)))
head(scMethodNames, 60)

##  [1] "$lessinit$greater$default$3" "$lessinit$greater$default$4"
##  [3] "$lessinit$greater$default$5" "accumulable"                
##  [5] "accumulableCollection"       "accumulator"                
##  [7] "addedFiles"                  "addedJars"                  
##  [9] "addFile"                     "addJar"                     
## [11] "addSparkListener"            "applicationAttemptId"       
## [13] "applicationId"               "appName"                    
## [15] "assertNotStopped"            "binaryFiles"                
## [17] "binaryFiles$default$2"       "binaryRecords"              
## [19] "binaryRecords$default$3"     "broadcast"                  
## [21] "cancelAllJobs"               "cancelJob"                  
## [23] "cancelJobGroup"              "cancelStage"                
## [25] "checkpointDir"               "checkpointDir_$eq"          
## [27] "checkpointFile"              "clean"                      
## [29] "clean$default$2"             "cleaner"                    
## [31] "clearCallSite"               "clearJobGroup"              
## [33] "collectionAccumulator"       "conf"                       
## [35] "createSparkEnv"              "dagScheduler"               
## [37] "dagScheduler_$eq"            "defaultMinPartitions"       
## [39] "defaultParallelism"          "deployMode"                 
## [41] "doubleAccumulator"           "emptyRDD"                   
## [43] "env"                         "equals"                     
## [45] "eventLogCodec"               "eventLogDir"                
## [47] "eventLogger"                 "executorAllocationManager"  
## [49] "executorEnvs"                "executorMemory"             
## [51] "files"                       "getAllPools"                
## [53] "getCallSite"                 "getCheckpointDir"           
## [55] "getClass"                    "getConf"                    
## [57] "getExecutorIds"              "getExecutorMemoryStatus"    
## [59] "getExecutorThreadDump"       "getLocalProperties"

Using helpers to explore the methods

We can also use the helper functions to investigate more. For instance, we see that there is a getConf method available to us. Looking at the object reference however does not provide useful information, so we can list the methods for that class and look for "get" methods that would show us the configuration:

spark_conf <- sc %>% spark_context() %>% invoke("conf")
spark_conf_methods <- spark_conf %>% getAvailableMethods() 
spark_conf_get_methods <- spark_conf_methods %>%
  names() %>%
  grep(pattern = "get", ., value = TRUE) %>%
  sort()
spark_conf_get_methods

##  [1] "get"                 "get"                 "get"                
##  [4] "getAll"              "getAllWithPrefix"    "getAppId"           
##  [7] "getAvroSchema"       "getBoolean"          "getClass"           
## [10] "getDeprecatedConfig" "getDouble"           "getenv"             
## [13] "getExecutorEnv"      "getInt"              "getLong"            
## [16] "getOption"           "getSizeAsBytes"      "getSizeAsBytes"     
## [19] "getSizeAsBytes"      "getSizeAsGb"         "getSizeAsGb"        
## [22] "getSizeAsKb"         "getSizeAsKb"         "getSizeAsMb"        
## [25] "getSizeAsMb"         "getTimeAsMs"         "getTimeAsMs"        
## [28] "getTimeAsSeconds"    "getTimeAsSeconds"    "getWithSubstitution"

We see that there is a getAll method that could prove useful, returning a list of tuples and taking no arguments as input:

# Returns a list of tuples, takes no arguments:
spark_conf_methods[["getAll"]]

##              returnType                  params 
## "class [Lscala.Tuple2;"                      ""

# Invoke the `getAll` method and look at part of the result
spark_confs <- spark_conf %>% invoke("getAll")
spark_confs <- vapply(spark_confs, invoke, "toString", FUN.VALUE = character(1))
sort(spark_confs)[c(2, 3, 12, 14)]

## [1] "(spark.app.name,sparklyr)"         "(spark.driver.host,localhost)"    
## [3] "(spark.spark.port.maxRetries,128)" "(spark.sql.shuffle.partitions,2)"
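
Since the tuples arrive in R as plain "(key,value)" strings, they can be turned into a named character vector with base R. This is a minimal sketch that uses the four strings shown above as input and assumes that the keys themselves contain no commas:

```r
# The tuple strings shown in the output above
spark_confs <- c(
  "(spark.app.name,sparklyr)",
  "(spark.driver.host,localhost)",
  "(spark.spark.port.maxRetries,128)",
  "(spark.sql.shuffle.partitions,2)"
)

# Drop the surrounding parentheses
inner <- substr(spark_confs, 2L, nchar(spark_confs) - 1L)

# Split each string on its first comma into a key and a value
comma <- as.integer(regexpr(",", inner, fixed = TRUE))
conf_values <- substr(inner, comma + 1L, nchar(inner))
names(conf_values) <- substr(inner, 1L, comma - 1L)

# Look up a single setting by name
conf_values[["spark.sql.shuffle.partitions"]]
```

Splitting on the first comma only keeps values that themselves contain commas intact.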

Looking at the Scala documentation for the getAll method, we actually see that there is information missing on our data - the classes of the objects in the tuple, which in this case is scala.Tuple2<java.lang.String,java.lang.String>[].

We could therefore improve our helper to be more detailed in the return value information.
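
One possible direction, sketched here under assumptions, is to ask the Method object for its generic signature through the toGenericString method of java.lang.reflect.Method. The helper names getGenericSignature and getGenericSignatures are hypothetical, and the functions are only defined here, not run, because they need a live Spark connection:

```r
# Hypothetical helper: describes a java.lang.reflect.Method reference,
# including generic type information where the class file provides it
getGenericSignature <- function(mthd) {
  sparklyr::invoke(mthd, "toGenericString")
}

# Hypothetical helper: generic signatures of all public methods of the
# object's class that have the given name
getGenericSignatures <- function(jobj, methodName) {
  mthds <- sparklyr::invoke(sparklyr::invoke(jobj, "getClass"), "getMethods")
  nms <- vapply(mthds, sparklyr::invoke, "getName", FUN.VALUE = character(1))
  vapply(mthds[nms == methodName], getGenericSignature, FUN.VALUE = character(1))
}

# Usage with a live connection (not run here):
# getGenericSignatures(spark_conf, "getAll")
```

How much detail comes back depends on the signature information stored in the compiled classes, so the output should be checked against the Scala documentation.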

Unexported helpers provided by sparklyr

The sparklyr package itself provides facilities similar in nature to those above. Let us look at some of them, even though they are not exported:

sparklyr:::jobj_class(spark_conf)

## [1] "SparkConf" "Object"

sparklyr:::jobj_info(spark_conf)$class

## [1] "org.apache.spark.SparkConf"

capture.output(sparklyr:::jobj_inspect(spark_conf)) %>% head(10)

##  [1] "<jobj[1645]>"                                                                                                                   
##  [2] "  org.apache.spark.SparkConf"                                                                                                   
##  [3] "  org.apache.spark.SparkConf@7ec389e7"                                                                                          
##  [4] "Fields:"                                                                                                                        
##  [5] "<jobj[2490]>"                                                                                                                   
##  [6] "  java.lang.reflect.Field"                                                                                                      
##  [7] "  private final java.util.concurrent.ConcurrentHashMap org.apache.spark.SparkConf.org$apache$spark$SparkConf$$settings"         
##  [8] "<jobj[2491]>"                                                                                                                   
##  [9] "  java.lang.reflect.Field"                                                                                                      
## [10] "  private transient org.apache.spark.internal.config.ConfigReader org.apache.spark.SparkConf.org$apache$spark$SparkConf$$reader"

How sparklyr communicates with Spark, invoke logging

Now that we have an overview of the invoke() interface, we can take a look under the hood of sparklyr and see how it actually communicates with the Spark instance. In fact, the communication is a set of invocations that can be very different depending on which of the approaches we choose for our purposes.

To obtain the information, we use the sparklyr.log.invoke property. We can choose one of the following 3 values based on our preferences:

TRUE will use message() to communicate short info on what is being invoked
"cat" will use cat() to communicate short info on what is being invoked
"callstack" will use message() to communicate short info on what is being invoked and the callstack

We will use TRUE in our article to keep the output short and easily manageable. First, we will close the previous connection and create a new one with the configuration containing the sparklyr.log.invoke set to TRUE, and copy in the flights dataset:

sparklyr::spark_disconnect(sc)

## NULL

config <- sparklyr::spark_config()
config$sparklyr.log.invoke <- TRUE
suppressMessages({
  sc <- sparklyr::spark_connect(master = "local", config = config)
  tbl_flights <- dplyr::copy_to(sc, nycflights13::flights, "flights")
})

Using dplyr verbs translated with dbplyr

Now that the setup is complete, we use the dplyr verb approach to retrieve the count of rows and look at the invocations that this entails:

tbl_flights %>% dplyr::count()

## Invoking sql
## Invoking sql

## Invoking columns

## Invoking isStreaming

## Invoking sql

## Invoking isStreaming

## Invoking sql

## Invoking sparklyr.Utils collect

## Invoking columns

## Invoking schema

## Invoking fields

## Invoking dataType

## Invoking toString

## Invoking name

## Invoking sql

## Invoking columns

## # Source: spark<?> [?? x 1]
##        n
##    <dbl>
## 1 336776

We see multiple invocations of the sql method and also the columns method. This makes sense since the dplyr verb approach actually works by translating the commands into Spark SQL via dbplyr and then sending those translated commands to Spark via that interface.

Using DBI to send queries

Similarly, we can investigate the invocations that happen when we try to retrieve the same results via the DBI interface:

DBI::dbGetQuery(sc, "SELECT count(1) AS n FROM flights")

## Invoking sql

## Invoking isStreaming

## Invoking sparklyr.Utils collect

## Invoking columns

## Invoking schema

## Invoking fields

## Invoking dataType

## Invoking toString

## Invoking name

##        n
## 1 336776

We see slightly fewer invocations compared to the above dplyr approach, but the output is also less processed.

Using the invoke interface

Looking at the invocations that get executed using the invoke() interface:

tbl_flights %>% spark_dataframe() %>% invoke("count")

## Invoking sql

## Invoking count

## [1] 336776

We see that the number of invocations is much lower, and the invocations logged before count come from the first part of the pipe. The invoke("count") part translated to exactly one invocation of the count method. We therefore see that the invoke() interface is indeed a lower-level interface that invokes methods as we request them, with little to no overhead related to translations and other effects.
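
To make the comparison concrete, we can tabulate the log lines from the three outputs above with base R. The log lines are copied from those outputs; the helper name count_invocations is an illustrative assumption:

```r
# Invocation log lines copied from the three outputs above
invocation_logs <- list(
  dplyr = paste("Invoking", c(
    "sql", "sql", "columns", "isStreaming", "sql", "isStreaming", "sql",
    "sparklyr.Utils collect", "columns", "schema", "fields", "dataType",
    "toString", "name", "sql", "columns"
  )),
  DBI = paste("Invoking", c(
    "sql", "isStreaming", "sparklyr.Utils collect", "columns", "schema",
    "fields", "dataType", "toString", "name"
  )),
  invoke = paste("Invoking", c("sql", "count"))
)

# Hypothetical helper: strips the "Invoking " prefix and counts the
# invocations per method, most frequent first
count_invocations <- function(log_lines) {
  methods <- sub("^Invoking ", "", log_lines)
  sort(table(methods), decreasing = TRUE)
}

# Total number of invocations per approach
total_invocations <- lengths(invocation_logs)
total_invocations

# Breakdown for the dplyr approach
dplyr_counts <- count_invocations(invocation_logs$dplyr)
dplyr_counts
```

For the logs shown in this post, this gives 16 invocations for the dplyr approach, 9 for DBI and 2 for the invoke() interface.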

Redirecting the invoke logs

When running R applications that use Spark as a calculation engine, it is useful to get detailed invoke logs for debugging and diagnostic purposes. When implementing such mechanisms, we need to take into consideration how R handles the invoke logs produced by sparklyr. In simple terms, the invoke logs produced when using

TRUE and "callstack" are created using message(), which means they get sent to the stderr() connection by default
"cat" are created using cat(), so they get sent to stdout() connection by default

This info can prove useful when redirecting the log information from standard output and standard error to different logging targets.
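
The difference between the two channels can be illustrated without Spark. In the sketch below, log_with_message and log_with_cat are hypothetical stand-ins that mimic the two logging styles, and they are not sparklyr functions:

```r
# Stand-ins for the two logging styles: message() writes to stderr,
# cat() writes to stdout
log_with_message <- function(method) message("Invoking ", method)
log_with_cat <- function(method) cat("Invoking ", method, "\n", sep = "")

# capture.output() collects standard output by default,
# so it sees the cat() variant
stdout_lines <- capture.output(log_with_cat("sql"))

# Messages are conditions, so a calling handler can route them
# to any logging target; here we collect them into a vector
message_lines <- character()
withCallingHandlers(
  log_with_message("sql"),
  message = function(m) {
    message_lines <<- c(message_lines, conditionMessage(m))
    invokeRestart("muffleMessage")
  }
)

stdout_lines
message_lines
```

In a real application, the handler could write to a log file or a logging package instead of appending to a vector.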

Conclusion

In this part of the series, we have looked at using the Java reflection API with sparklyr’s invoke() interface to get useful insight on available methods for different object types that can be used in the context of Spark, but also in other contexts. Using invoke logging, we have also shown how the different sparklyr interfacing methods communicate with Spark under the hood.

References

The first part of this series
The second part of this series
The third part of this series
The fourth part of this series
A Docker image with R, Spark, sparklyr and Arrow available and its Dockerfile.
Stackoverflow discussion of reflection

Using Spark from R for performance with arbitrary code - Part 4 - Using the lower-level invoke API to manipulate Spark's Java objects from R

Sat, 09 Nov 2019 12:00:00 +0000

Introduction

In the previous parts of this series, we have shown how to write functions as both combinations of dplyr verbs and SQL query generators that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed.

In this fourth part, we will look at how to write R functions that interface with Spark via a lower-level invocation API that lets us use all the functionality that is exposed by the Scala Spark APIs. We will also show how such R calls relate to Scala code.

Preparation

The full setup of Spark and sparklyr is not in the scope of this post. Please check the first one for some setup instructions and a ready-made Docker image.

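If you have Docker available, running the following command should make RStudio available at http://localhost:8787 in your browser, with the user name rstudio and password pass: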

docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio

# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>% 
  select(id, everything())

# Connect
sc <- sparklyr::spark_connect(master = "local")

# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc, 
  df = weather,
  name = "weather",
  overwrite = TRUE
)
# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::flights,
  name = "flights",
  overwrite = TRUE
)

The invoke() API of sparklyr

So far when interfacing with Spark from R, we have used the sparklyr package in three ways:

Writing combinations of dplyr verbs that would be translated to Spark SQL via the dbplyr package and the SQL executed by Spark when requested
Generating Spark SQL code directly and sending it for execution in multiple ways
Combinations of the above two methods

What these methods have in common is that they translate operations written in R to Spark SQL and that SQL code is then sent for execution by our Spark instance.

There is however another approach that we can use with sparklyr, which will be more familiar to users or developers who have worked with packages like rJava or rscala before. Even though arguably less convenient than the APIs provided by the 2 aforementioned packages, sparklyr provides an invocation API that exposes 3 functions:

invoke(jobj, method, ...) to execute a method on a Java object reference
invoke_static(sc, class, method, ...) to execute a static method associated with a Java class
invoke_new(sc, class, ...) to invoke a constructor associated with a Java class
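
As a first, minimal sketch of the three functions working together, the hypothetical helper below calls a static method and a constructor from the Java standard library. It is only defined here, not run, because it needs a live Spark connection passed in as sc:

```r
# Hypothetical demo helper; `sc` is assumed to be an open sparklyr connection
demo_invoke_api <- function(sc) {
  # invoke_new(): call the BigInteger(String) constructor
  big <- sparklyr::invoke_new(sc, "java.math.BigInteger", "1000000000")
  list(
    # invoke_static(): call the static method java.lang.Math.hypot(double, double)
    hypot = sparklyr::invoke_static(sc, "java.lang.Math", "hypot", 10, 20),
    # invoke(): call an instance method on the object reference created above
    big_as_string = sparklyr::invoke(big, "toString")
  )
}

# Usage with a live connection (not run here):
# demo_invoke_api(sc)
```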

Let us have a look at how we can use those functions in practice to efficiently work with Spark from R.

Getting started with the invoke API

We can start with a few very simple examples of invoke() usage, for instance getting the number of rows of the tbl_flights:

# Get the count of rows
tbl_flights %>% spark_dataframe() %>%
  invoke("count")

## [1] 336776

We see one extra operation before invoking the count: spark_dataframe(). This is because the invoke() interface works with Java object references and not tbl objects in remote sources such as tbl_flights. We, therefore, need to convert tbl_flights to a Java object reference, for which we use the spark_dataframe() function.

Now, for something more exciting, let us compute a summary of the variables in tbl_flights using the describe method:

tbl_flights_summary <- tbl_flights %>% spark_dataframe() %>%
  invoke("describe", as.list(colnames(tbl_flights))) %>%
  sdf_register()
tbl_flights_summary

## # Source: spark<?> [?? x 19]
##   summary year  month day   dep_time sched_dep_time dep_delay arr_time
##   <chr>   <chr> <chr> <chr> <chr>    <chr>          <chr>     <chr>   
## 1 count   3367… 3367… 3367… 328521   336776         328521    328063  
## 2 mean    2013… 6.54… 15.7… 1349.10… 1344.25484001… 12.63907… 1502.05…
## 3 stddev  0.0   3.41… 8.76… 488.281… 467.335755734… 40.21006… 533.264…
## 4 min     2013  1     1     1        106            -43.0     1       
## 5 max     2013  12    31    2400     2359           1301.0    2400    
## # … with 11 more variables: sched_arr_time <chr>, arr_delay <chr>,
## #   carrier <chr>, flight <chr>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <chr>, distance <chr>, hour <chr>, minute <chr>

We also see one extra operation after invoking the describe method: sdf_register(). This is because the invoke() interface also returns Java object references and we may like to see a more user-friendly tbl object instead. This is where sdf_register() comes in to register a Spark DataFrame and return a tbl_spark object back to us.

And indeed, we can see that the wrapper sdf_describe() provided by the sparklyr package itself works in a very similar fashion:

sparklyr::sdf_describe

## function (x, cols = colnames(x)) 
## {
##     in_df <- cols %in% colnames(x)
##     if (any(!in_df)) {
##         msg <- paste0("The following columns are not in the data frame: ", 
##             paste0(cols[which(!in_df)], collapse = ", "))
##         stop(msg)
##     }
##     cols <- cast_character_list(cols)
##     x %>% spark_dataframe() %>% invoke("describe", cols) %>% 
##         sdf_register()
## }
## <environment: namespace:sparklyr>

If we so wish, for DataFrame related object references, we can also call collect() to retrieve the results directly, without using sdf_register() first, for instance retrieving the full content of the origin column:

tbl_flights %>% spark_dataframe() %>%
  invoke("select", "origin", list()) %>%
  collect()

## # A tibble: 336,776 x 1
##    origin
##    <chr> 
##  1 EWR   
##  2 LGA   
##  3 JFK   
##  4 JFK   
##  5 LGA   
##  6 EWR   
##  7 EWR   
##  8 LGA   
##  9 JFK   
## 10 LGA   
## # … with 336,766 more rows

It can also be helpful to investigate the schema of our flights DataFrame:

tbl_flights %>% spark_dataframe() %>%
  invoke("schema")

## <jobj[143]>
##   org.apache.spark.sql.types.StructType
##   StructType(StructField(year,IntegerType,true), StructField(month,IntegerType,true), StructField(day,IntegerType,true), StructField(dep_time,IntegerType,true), StructField(sched_dep_time,IntegerType,true), StructField(dep_delay,DoubleType,true), StructField(arr_time,IntegerType,true), StructField(sched_arr_time,IntegerType,true), StructField(arr_delay,DoubleType,true), StructField(carrier,StringType,true), StructField(flight,IntegerType,true), StructField(tailnum,StringType,true), StructField(origin,StringType,true), StructField(dest,StringType,true), StructField(air_time,DoubleType,true), StructField(distance,DoubleType,true), StructField(hour,DoubleType,true), StructField(minute,DoubleType,true), StructField(time_hour,TimestampType,true))

We can also use the invoke interface on other objects, for instance the SparkContext. Let’s for instance retrieve the uiWebUrl of our context:

sc %>% spark_context() %>%
  invoke("uiWebUrl") %>%
  invoke("toString")

## [1] "Some(http://localhost:4040)"
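
The result above is the string form of a Scala Option, hence the Some(...) wrapper. If only the URL is needed, one simple way is to strip the wrapper in R. The helper name unwrap_some is an illustrative assumption:

```r
# Hypothetical helper: extracts the content of "Some(...)" strings and
# returns NA for anything else, such as "None"
unwrap_some <- function(x) {
  is_some <- grepl("^Some\\(.*\\)$", x)
  out <- rep(NA_character_, length(x))
  out[is_some] <- sub("^Some\\((.*)\\)$", "\\1", x[is_some])
  out
}

# The string returned by the invoke chain above
ui_url <- unwrap_some("Some(http://localhost:4040)")
ui_url

# With a live connection, calling invoke("get") on the Option reference
# instead of invoke("toString") may be an alternative (untested assumption)
```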

Grouping and aggregation with invoke chains

Imagine we would like to do simple aggregations of a Spark DataFrame, such as an average of a column grouped by another column. For reference, we can do this very simply using the dplyr approach. Let’s compute the average departure delay by origin of the flight:

tbl_flights %>%
  group_by(origin) %>%
  summarise(avg(dep_delay))

## # Source: spark<?> [?? x 2]
##   origin `avg(dep_delay)`
##   <chr>             <dbl>
## 1 EWR                15.1
## 2 JFK                12.1
## 3 LGA                10.3

Now we will show how to do the same aggregation via the lower level API. Using the Spark shell we would simply do:

flights.
  groupBy("origin").
  agg(avg("dep_delay"))

Translating that into the lower level invoke() API provided by sparklyr looks something like this:

tbl_flights %>%
  spark_dataframe() %>%
  invoke("groupBy", "origin", list()) %>%
  invoke("agg", invoke_static(sc, "org.apache.spark.sql.functions", "expr", "avg(dep_delay)"), list()) %>%
  sdf_register()

What is all that extra code?

Now, compared to the very simple 2 operations in the Scala version, we have some gotchas to examine:

one of the invoke() calls is quite long. Instead of just avg("dep_delay") like in the Scala example, we use invoke_static(sc, "org.apache.spark.sql.functions", "expr", "avg(dep_delay)"). This is because the avg("dep_delay") expression is somewhat of a syntactic sugar provided by Scala, but when calling from R we need to provide the object reference hidden behind that sugar.
the empty list() at the end of the "groupBy" and "agg" invokes. This is needed as a workaround, because some Scala methods take arguments such as String, String* and sparklyr currently does not support variable parameters. We can pass list() to represent an empty String[] in Scala as the needed second argument.

Wrapping the invocations into R functions

Seeing the above example, we can quickly write a useful wrapper to ease the pain a little. First, we can create a small function that will generate the aggregation expression we can use with invoke("agg", ...):

agg_expr <- function(tbl, exprs) {
  sparklyr::invoke_static(
    tbl[["src"]][["con"]],
    "org.apache.spark.sql.functions",
    "expr",
    exprs
  )
}

Next, we can wrap around the entire process to make a more generic aggregation function, using the fact that a remote tibble has the details on sc within its tbl[["src"]][["con"]] element:

grpagg_invoke <- function(tbl, colName, groupColName, aggOperation) {
  # Build the aggregation column, e.g. avg(arr_delay)
  aggColumn <- tbl %>% agg_expr(paste0(aggOperation, "(", colName, ")"))
  tbl %>% spark_dataframe() %>%
    invoke("groupBy", groupColName, list()) %>%
    invoke("agg", aggColumn, list()) %>%
    sdf_register()
}

And finally use our wrapper to get the same results in a more user-friendly way:

tbl_flights %>% 
  grpagg_invoke("arr_delay", groupColName = "origin", aggOperation = "avg")

## # Source: spark<?> [?? x 2]
##   origin `avg(arr_delay)`
##   <chr>             <dbl>
## 1 EWR                9.11
## 2 JFK                5.55
## 3 LGA                5.78

Reconstructing variable normalization

Now we will attempt to construct the variable normalization that we have shown in the previous parts with dplyr verbs and SQL generation - we will normalize the values of a column by first subtracting the mean value and then dividing the values by the standard deviation:

normalize_invoke <- function(tbl, colName) {
  sdf <- tbl %>% spark_dataframe()
  stdCol <- agg_expr(tbl, paste0("stddev_samp(", colName, ")"))
  avgCol <- agg_expr(tbl, paste0("avg(", colName, ")"))
  avgTemp <- sdf %>% invoke("agg", avgCol, list()) %>% invoke("first")
  stdTemp <- sdf %>% invoke("agg", stdCol, list()) %>% invoke("first")
  newCol <- sdf %>%
    invoke("col", colName) %>%
    invoke("minus", as.numeric(avgTemp)) %>%
    invoke("divide", as.numeric(stdTemp))
  sdf %>%
    invoke("withColumn", colName, newCol) %>%
    sdf_register()
}

tbl_weather %>% normalize_invoke("temp")

## # Source: spark<?> [?? x 16]
##       id origin  year month   day  hour   temp  dewp humid wind_dir
##    <int> <chr>  <dbl> <dbl> <int> <int>  <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1 -0.913  26.1  59.4      270
##  2     2 EWR     2013     1     1     2 -0.913  27.0  61.6      250
##  3     3 EWR     2013     1     1     3 -0.913  28.0  64.4      240
##  4     4 EWR     2013     1     1     4 -0.862  28.0  62.2      250
##  5     5 EWR     2013     1     1     5 -0.913  28.0  64.4      260
##  6     6 EWR     2013     1     1     6 -0.974  28.0  67.2      240
##  7     7 EWR     2013     1     1     7 -0.913  28.0  64.4      240
##  8     8 EWR     2013     1     1     8 -0.862  28.0  62.2      250
##  9     9 EWR     2013     1     1     9 -0.862  28.0  62.2      260
## 10    10 EWR     2013     1     1    10 -0.802  28.0  59.6      260
## # … with more rows, and 6 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

The above implementation is just an example and far from optimal, but it also has a few interesting points about it:

Using invoke("first") will actually compute and collect the value into the R session
Those collected values are then sent back during the invoke("minus", as.numeric(avgTemp)) and invoke("divide", as.numeric(stdTemp))

This means that there is unnecessary overhead when sending those values from the Spark instance into R and back, which will have slight performance penalties.
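
One way to avoid that round trip, sketched here under assumptions, is to express the whole normalization as a single Spark SQL expression with window aggregates and pass it to the expr function used earlier. The names normalize_sql_expr and normalize_invoke_lazy are hypothetical, and the second function is only defined, not run, because it needs a live connection:

```r
# Builds a Spark SQL expression that normalizes a column using window
# aggregates over the whole table (an illustrative alternative, not the
# approach used in normalize_invoke() above)
normalize_sql_expr <- function(colName) {
  sprintf(
    "(%1$s - avg(%1$s) OVER ()) / stddev_samp(%1$s) OVER ()",
    colName
  )
}

# Hypothetical variant of normalize_invoke() that does not collect
# the mean and standard deviation into the R session
normalize_invoke_lazy <- function(tbl, colName) {
  newCol <- sparklyr::invoke_static(
    tbl[["src"]][["con"]],
    "org.apache.spark.sql.functions",
    "expr",
    normalize_sql_expr(colName)
  )
  sdf <- sparklyr::spark_dataframe(tbl)
  sparklyr::sdf_register(sparklyr::invoke(sdf, "withColumn", colName, newCol))
}

# The generated expression for the temp column
normalize_sql_expr("temp")
```

A window without partitioning typically moves all rows to a single partition, so this sketch trades the R round trip for a different cost and would need to be benchmarked on real data.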

Where invoke can be better than dplyr translation or SQL

As we have seen in the above examples, working with the invoke() API can prove more difficult than using the intuitive syntax of dplyr or SQL queries. In some use cases, the trade-off may still be worth it. In our practice, these are some examples of such situations:

When Scala’s Spark API is more flexible, powerful or suitable for a particular task and the translation is not as good
When performance is crucial and we can produce more optimal solutions using the invocations
When we know the Scala API well and do not want to invest time to learn the dplyr syntax, but it is easier to translate the Scala calls into a series of invoke() calls
When we need to interact and manipulate other Java objects apart from the standard Spark DataFrames

Conclusion

In this part of the series, we have looked at how to use the lower-level invoke interface provided by sparklyr to manipulate Spark objects and other Java object references. In the following part, we will dig a bit deeper and look into using Java’s reflection API to make the invoke interface more accessible from R, getting detailed invocation logs and more.

References

The first part of this series
The second part of this series
The third part of this series
A Docker image with R, Spark, sparklyr and Arrow available and its Dockerfile.
Wikipedia’s article on Method Chaining

Using Spark from R for performance with arbitrary code - Part 3 - Using R to construct SQL queries and let Spark execute them

Sat, 12 Oct 2019 12:00:00 +0000

Introduction

In the previous part of this series, we looked at writing R functions that can be executed directly by Spark without serialization overhead with a focus on writing functions as combinations of dplyr verbs and investigated how the SQL is generated and Spark plans created.

In this third part, we will look at how to write R functions that generate SQL queries that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed. We also briefly present wrapping these approaches into functions that can be combined with other Spark operations.

Preparation

The full setup of Spark and sparklyr is not in the scope of this post, please check the previous one for some setup instructions and a ready-made Docker image.

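If you have Docker available, the following command starts a container from the ready-made image, with port 8787 published and the password set to pass: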

docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio

# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>% 
  select(id, everything())

# Connect
sc <- sparklyr::spark_connect(master = "local")

# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc, 
  df = weather,
  name = "weather",
  overwrite = TRUE
)
# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::flights,
  name = "flights",
  overwrite = TRUE
)

R functions as Spark SQL generators

There are use cases where it is desirable to express the operations directly with SQL instead of combining dplyr verbs, for example when working within multi-language environments where re-usability is important. We can then send the SQL query directly to Spark to be executed. To create such queries, one option is to write R functions that work as query constructors.

Again using a very simple example, a naive implementation of column normalization could look as follows. Note that the use of SELECT * is discouraged and only here for illustration purposes:

normalize_sql <- function(df, colName, newColName) {
  paste0(
    "SELECT",
    "\n  ", df, ".*", ",",
    "\n  (", colName, " - (SELECT avg(", colName, ") FROM ", df, "))",
    " / ",
    "(SELECT stddev_samp(", colName,") FROM ", df, ") as ", newColName,
    "\n", "FROM ", df
  )
}

Using the weather dataset would then yield the following SQL query when normalizing the temp column:

normalize_temp_query <- normalize_sql("weather", "temp", "normTemp")
cat(normalize_temp_query)

## SELECT
##   weather.*,
##   (temp - (SELECT avg(temp) FROM weather)) / (SELECT stddev_samp(temp) FROM weather) as normTemp
## FROM weather
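The constructor above pastes the table and column names into the query verbatim, which works for simple names such as temp but not for names that contain spaces or clash with reserved words. A minimal sketch of a variant that wraps each identifier in backticks, the quoting style Spark SQL accepts, could look as follows. The names backtick() and normalize_sql_quoted() are hypothetical helpers introduced here for illustration; they are not part of sparklyr, dbplyr or DBI:

```r
# Hypothetical helper: wrap a single identifier in backticks
backtick <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  # Refuse names that contain a backtick instead of trying to escape them
  if (grepl("`", x, fixed = TRUE)) {
    stop("Identifiers containing backticks are not supported: ", x)
  }
  paste0("`", x, "`")
}

# Hypothetical variant of normalize_sql() that quotes every identifier
normalize_sql_quoted <- function(df, colName, newColName) {
  tbl <- backtick(df)
  col <- backtick(colName)
  paste0(
    "SELECT",
    "\n  ", tbl, ".*,",
    "\n  (", col, " - (SELECT avg(", col, ") FROM ", tbl, "))",
    " / ",
    "(SELECT stddev_samp(", col, ") FROM ", tbl, ") AS ", backtick(newColName),
    "\n", "FROM ", tbl
  )
}

cat(normalize_sql_quoted("weather", "temp", "normTemp"))
```

Rejecting backticks outright keeps the sketch short; a production version would more likely rely on the quoting facilities of the database interface in use.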

Now that we have the query created, we can look at how to send it to Spark for execution.

Executing the generated queries via Spark

Using DBI as the interface

The R package DBI provides an interface for communication between R and relational database management systems. We can simply use the dbGetQuery() function to execute our query, for instance:

res <- DBI::dbGetQuery(sc, statement = normalize_temp_query)
head(res)

##   id origin year month day hour  temp  dewp humid wind_dir wind_speed
## 1  1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702
## 2  2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546
## 3  3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780
## 4  4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858
## 5  5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858
## 6  6    EWR 2013     1   1    6 37.94 28.04 67.21      240   11.50780
##   wind_gust precip pressure visib           time_hour   normTemp
## 1       NaN      0   1012.0    10 2013-01-01 06:00:00 -0.9130047
## 2       NaN      0   1012.3    10 2013-01-01 07:00:00 -0.9130047
## 3       NaN      0   1012.5    10 2013-01-01 08:00:00 -0.9130047
## 4       NaN      0   1012.2    10 2013-01-01 09:00:00 -0.8624083
## 5       NaN      0   1011.9    10 2013-01-01 10:00:00 -0.9130047
## 6       NaN      0   1012.4    10 2013-01-01 11:00:00 -0.9737203

As we might have noticed thanks to the way the result is printed, a standard data frame is returned, as opposed to tibbles returned by most sparklyr operations.

It is important to note that using dbGetQuery() automatically computes and collects the results to the R session. This is in contrast with the dplyr approach which constructs the query and only collects the results to the R session when collect() is called, or computes them when compute() is called.
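When eager collection with dbGetQuery() is acceptable but the full result is too large to bring into the R session, one option is to cap the number of rows inside the query itself. The following is only a sketch: limit_sql() is a hypothetical helper, and the subquery alias limited is an arbitrary name chosen here:

```r
# Hypothetical helper: wrap a query so that at most n rows are returned
limit_sql <- function(qry, n) {
  stopifnot(is.character(qry), length(qry) == 1L)
  stopifnot(is.numeric(n), length(n) == 1L, n >= 0, n == floor(n))
  paste0(
    "SELECT * FROM (\n", qry, "\n) AS limited LIMIT ",
    format(n, scientific = FALSE)
  )
}

cat(limit_sql("SELECT * FROM weather", 6L))

# Runs only when a Spark connection `sc` exists, as created in the
# preparation step of this post
if (exists("sc")) {
  res_small <- DBI::dbGetQuery(
    sc,
    statement = limit_sql("SELECT * FROM weather", 6L)
  )
}
```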

We will now examine 2 options to use the prepared query lazily and without collecting the results into the R session.

Invoking sql on a Spark session object

We will focus on the invoke() functionality of sparklyr in the fourth installment of the series, so we will not go into further detail here. If we want a “lazy” SQL query that is not automatically computed and collected when called from R, we can invoke the sql method on a SparkSession class object.

The method takes a string SQL query as input and processes it using Spark, returning the result as a Spark DataFrame. This gives us the ability to only compute and collect the results when desired:

# Use the query "lazily" without execution:
normalized_lazy_ds <- sc %>%
  spark_session() %>%
  invoke("sql",  normalize_temp_query)
normalized_lazy_ds

## <jobj[124]>
##   org.apache.spark.sql.Dataset
##   [id: int, origin: string ... 15 more fields]

# Collect when needed:
normalized_lazy_ds %>% collect()

## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>

Using tbl with dbplyr’s sql

The above method gives us a reference to a Java object as a result, which might be less intuitive to work with for R users. We can also opt to use dbplyr’s sql() function in combination with tbl() to get a more familiar result.

Note that when printing normalized_lazy_tbl below, the query gets partially executed to provide the first few rows. Only when collect() is called is the entire set retrieved to the R session:

# Nothing is executed yet
normalized_lazy_tbl <- normalize_temp_query %>%
  dbplyr::sql() %>%
  tbl(sc, .)

# Print the first few rows
normalized_lazy_tbl

## # Source: spark<SELECT weather.*, (temp - (SELECT avg(temp) FROM weather))
## #   / (SELECT stddev_samp(temp) FROM weather) as normTemp FROM weather>
## #   [?? x 17]
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>

# Collect the entire result to the R session and print
normalized_lazy_tbl %>% collect()

## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>

Wrapping the tbl approach into functions

In the approach above we provided sc in the call to tbl(). When wrapping such processes into a function, it might however be useful to take the specific DataFrame reference as an input instead of the generic Spark connection reference.

In that case, we can use the fact that the connection reference is also stored in the DataFrame reference, in the con sub-element of the src element. For instance, looking at our tbl_weather:

class(tbl_weather[["src"]][["con"]])

## [1] "spark_connection"       "spark_shell_connection"
## [3] "DBIConnection"

Putting this together, we can create a simple wrapper function that lazily sends a SQL query to be processed on a particular Spark DataFrame reference:

lazy_spark_query <- function(tbl, qry) {
  qry %>%
    dbplyr::sql() %>%
    dplyr::tbl(tbl[["src"]][["con"]], .)
}

And use it to do the same as we did above with a single function call:

lazy_spark_query(tbl_weather, normalize_temp_query) %>% 
  collect()

## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>
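Because the wrapper reaches into the src element of its first argument, passing anything other than a Spark DataFrame reference fails with a rather unhelpful message. A hedged sketch of a variant with basic input checks follows; lazy_spark_query_checked() is a hypothetical name, and the check assumes that Spark DataFrame references carry the tbl_spark class, as those created by copy_to() above do:

```r
# Hypothetical variant of lazy_spark_query() with basic input checks
lazy_spark_query_checked <- function(tbl, qry) {
  if (!inherits(tbl, "tbl_spark")) {
    stop("`tbl` must be a Spark DataFrame reference (class tbl_spark)")
  }
  if (!is.character(qry) || length(qry) != 1L || is.na(qry)) {
    stop("`qry` must be a single character string")
  }
  dplyr::tbl(tbl[["src"]][["con"]], dbplyr::sql(qry))
}

# A local data frame is rejected before any Spark call is made
res <- try(lazy_spark_query_checked(mtcars, "SELECT 1"), silent = TRUE)
inherits(res, "try-error")
```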

Combining multiple approaches and functions into lazy datasets

The power of Spark partly comes from the lazy execution and we can take advantage of this in ways that are not immediately obvious. Consider the following function we have shown previously:

lazy_spark_query

## function(tbl, qry) {
##   qry %>%
##     dbplyr::sql() %>%
##     dplyr::tbl(tbl[["src"]][["con"]], .)
## }

Since the output of this function without collection is actually only a translated SQL statement, we can take that output and keep combining it with other operations, for instance:

qry <- normalize_sql("flights", "dep_delay", "dep_delay_norm")
lazy_spark_query(tbl_flights, qry) %>%
  group_by(origin) %>%
  summarise(mean(dep_delay_norm)) %>%
  collect()

## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.

## # A tibble: 3 x 2
##   origin `mean(dep_delay_norm)`
##   <chr>                   <dbl>
## 1 EWR                    0.0614
## 2 JFK                   -0.0131
## 3 LGA                   -0.0570

The crucial advantage is that even though lazy_spark_query() would return the entire updated flights dataset when collected stand-alone, in combination with other operations Spark first figures out how to execute all the operations together efficiently and only then physically executes them, returning only the grouped and aggregated data to the R session.

We can therefore effectively combine multiple approaches to interfacing with Spark while still keeping the benefit of retrieving only very small, aggregated amounts of data to the R session. The effect is quite significant even with a dataset as small as flights (336,776 rows of 19 columns) and with a local Spark instance. The chart below compares executing a query lazily, aggregating within Spark and only retrieving the aggregated data, versus retrieving first and aggregating locally. The third boxplot shows the cost of pure collection on the query itself:

bench <- microbenchmark::microbenchmark(
  times = 20,
  collect_late = lazy_spark_query(tbl_flights, qry) %>%
    group_by(origin) %>%
    summarise(mean(dep_delay_norm)) %>%
    collect(),
  collect_first = lazy_spark_query(tbl_flights, qry) %>%
    collect() %>% 
    group_by(origin) %>%
    summarise(mean(dep_delay_norm)),
  collect_only = lazy_spark_query(tbl_flights, qry) %>%
    collect()
)
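To compare the three variants numerically, the timings can be summarized per expression. The sketch below assumes that the benchmark result behaves as a data frame with an expr column and a time column recorded in nanoseconds, which is how microbenchmark stores its results; median_ms() is a hypothetical helper introduced here:

```r
# Hypothetical helper: median run time per benchmarked expression,
# converted from nanoseconds to milliseconds
median_ms <- function(b) {
  tapply(b[["time"]], b[["expr"]], stats::median) / 1e6
}

# Runs only when the `bench` object from the code above exists
if (exists("bench")) {
  print(median_ms(bench))
  # Draws one box per benchmarked expression
  boxplot(bench)
}
```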

Where SQL can be better than dbplyr translation

When a translation is not there

We have discussed in the first part that the set of operations translated to Spark SQL via dbplyr may not cover all possible use cases. In such a case, the option to write SQL directly is very useful.

When translation does not provide expected results

In some instances using dbplyr to translate R operations to Spark SQL can lead to unexpected results. As one example, consider the following integer division on a column of a local data frame.

# id_div_5 is as expected
weather %>%
  mutate(id_div_5 = id %/% 5L) %>%
  select(id, id_div_5)

## # A tibble: 26,115 x 2
##       id id_div_5
##    <int>    <int>
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with 26,105 more rows

As expected, we get the result of integer division in the id_div_5 column. However, applying the very same operation on a Spark DataFrame yields unexpected results:

# id_div_5 is normal division, not integer division
tbl_weather %>%
  mutate(id_div_5 = id %/% 5L) %>%
  select(id, id_div_5)

## # Source: spark<?> [?? x 2]
##       id id_div_5
##    <int>    <dbl>
##  1     1      0.2
##  2     2      0.4
##  3     3      0.6
##  4     4      0.8
##  5     5      1  
##  6     6      1.2
##  7     7      1.4
##  8     8      1.6
##  9     9      1.8
## 10    10      2  
## # … with more rows

This is due to the fact that translation to integer division is quite difficult to implement: https://github.com/tidyverse/dbplyr/issues/108. We could certainly figure out a way to fix this particular issue, but the workarounds may prove inefficient:

tbl_weather %>%
  mutate(id_div_5 = as.integer(id %/% 5L)) %>%
  select(id, id_div_5)

## # Source: spark<?> [?? x 2]
##       id id_div_5
##    <int>    <int>
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows

# Not too efficient:
tbl_weather %>%
  mutate(id_div_5 = as.integer(id %/% 5L)) %>%
  select(id, id_div_5) %>%
  explain()

## <SQL>
## SELECT `id`, CAST(`id` / 5 AS INT) AS `id_div_5`
## FROM `weather`
## 
## <PLAN>

## == Physical Plan ==
## *(1) Project [id#24, cast((cast(id#24 as double) / 5.0) as int) AS id_div_5#4273]
## +- InMemoryTableScan [id#24]
##       +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##             +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]
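Apart from efficiency, a divide-then-cast workaround is not guaranteed to match R's %/% for negative inputs. The plain R sketch below shows the difference between flooring and truncation; whether Spark's CAST to INT truncates in the same way as as.integer() is an assumption worth verifying on your Spark version. The id column used in this post contains only positive values, so the results above are not affected:

```r
x <- c(-7L, -1L, 1L, 7L)

# R's integer division rounds towards negative infinity
floored <- x %/% 5L
floored
# -2 -1  0  1

# Dividing first and converting to integer afterwards truncates towards zero,
# mirroring the shape of the CAST(`id` / 5 AS INT) query shown above
truncated <- as.integer(x / 5L)
truncated
# -1  0  0  1

# The two approaches agree for the non-negative values only
identical(floored[x >= 0], truncated[x >= 0])
```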

Knowing that Hive provides a built-in DIV arithmetic operator, we can get the desired results simply and efficiently by writing the SQL ourselves:

"SELECT `id`, `id` DIV 5 `id_div_5` FROM `weather`" %>%
  dbplyr::sql() %>%
  tbl(sc, .)

## # Source: spark<SELECT `id`, `id` DIV 5 `id_div_5` FROM `weather`> [?? x
## #   2]
##       id id_div_5
##    <int>    <dbl>
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows

Even though the numeric value of the results is correct here, we may still notice that the class of the returned id_div_5 column is actually numeric instead of integer. Such is the life of developers using data processing interfaces.

When portability is important

Since the languages that provide interfaces to Spark are not limited to R and multi-language setups are quite common, another reason to use SQL statements directly is the portability of such solutions. A SQL statement can be executed by interfaces provided for all languages - Scala, Java, and Python, without the need to rely on R-specific packages such as dbplyr.
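One simple way to share such a statement across languages is to keep it in a plain .sql file that every client reads. The sketch below uses only base R; the file location is a temporary file chosen here to keep the example self-contained, and portable_qry is a literal copy of the query generated earlier in this post:

```r
# A literal copy of the normalization query generated earlier
portable_qry <- paste(
  "SELECT",
  "  weather.*,",
  "  (temp - (SELECT avg(temp) FROM weather)) / (SELECT stddev_samp(temp) FROM weather) as normTemp",
  "FROM weather",
  sep = "\n"
)

# Hypothetical location; in practice this would be a file under version control
sql_file <- tempfile(fileext = ".sql")
writeLines(portable_qry, sql_file)

# Any client can read the file back into a single string
qry_from_file <- paste(readLines(sql_file), collapse = "\n")
identical(portable_qry, qry_from_file)
```

The string read back from the file can then be passed to any of the execution approaches shown above, or to the equivalent call in another language's Spark interface.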

References

The first part of this series
The second part of this series
Documentation on Hive Operators and User-Defined Functions.
A Docker image with R, Spark, sparklyr and Arrow available and its Dockerfile.
The DBI package on CRAN

Using Spark from R for performance with arbitrary code - Part 2 - Constructing functions by piping dplyr verbs

Sat, 21 Sep 2019 12:00:00 +0000

Introduction

In the first part of this series, we looked at how the sparklyr interface communicates with the Spark instance and what this means for performance with regards to arbitrarily defined R functions. We also examined how Apache Arrow can increase the performance of data transfers between the R session and the Spark instance.

In this second part, we will look at how to write R functions that can be executed directly by Spark without serialization overhead that we have shown in the previous installment. We will focus on writing functions as combinations of dplyr verbs that can be translated using dbplyr and investigate how the SQL is generated and Spark plans created.

Preparation

The full setup of Spark and sparklyr is not in the scope of this post, please check the previous one for some setup instructions and a ready-made Docker image.

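If you have Docker available, the following command starts a container from the ready-made image, with port 8787 published and the password set to pass: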

docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio

First, we will attach the needed packages and copy some test data from the nycflights13 package into our local Spark instance:

# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>% 
  select(id, everything())

# Connect
sc <- sparklyr::spark_connect(master = "local")

# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc, 
  df = weather,
  name = "weather",
  overwrite = TRUE
)

# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::flights,
  name = "flights",
  overwrite = TRUE
)

R functions as combinations of dplyr verbs and Spark

One approach to retaining the performance of Spark with arbitrary R functionality is to design our functions carefully, such that when they are used with sparklyr the entire function call can be translated directly to Spark SQL using dbplyr.

This allows us to write, package, test, and document the functions as we normally would, while still getting the performance benefits of Apache Spark.

Let’s look at an example where we would like to do simple transformations of data stored in a column of a data frame, such as normalization of one of the columns. For illustration purposes, we will normalize the values of a column by first subtracting the mean value and then dividing the values by the standard deviation.

Trying it with base R functions

The first attempt could be quite simple: we could take advantage of R’s base function scale() to do the work for us:

normalize_dplyr_scale <- function(df, col, newColName) {
  df %>% mutate(!!newColName := scale({{col}}))
}

This function would work fine with a local data frame such as weather:

weather %>%
  normalize_dplyr_scale(temp, "normTemp") %>%
  select(id, temp, normTemp)

## # A tibble: 26,115 x 3
##       id  temp normTemp[,1]
##    <int> <dbl>        <dbl>
##  1     1  39.0       -0.913
##  2     2  39.0       -0.913
##  3     3  39.0       -0.913
##  4     4  39.9       -0.862
##  5     5  39.0       -0.913
##  6     6  37.9       -0.974
##  7     7  39.0       -0.913
##  8     8  39.9       -0.862
##  9     9  39.9       -0.862
## 10    10  41         -0.802
## # … with 26,105 more rows

However, for a Spark DataFrame this would throw an error. This is because the base R function scale() is not translated by dbplyr at the moment and it is not a Hive built-in function either:

tbl_weather %>%
  normalize_dplyr_scale(temp, "normTemp") %>%
  select(id, temp, normTemp)

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'scale'.

Using a combination of supported dplyr verbs and operations

To run the function successfully, we will need to rewrite it as a combination of functions and operations that are supported by the dbplyr translation to Spark SQL. One example implementation is as follows:

normalize_dplyr <- function(df, col, newColName) {
  df %>% mutate(
    !!newColName := ({{col}} - mean({{col}}, na.rm = TRUE)) /
        sd({{col}}, na.rm = TRUE)
  )
}

Using this function yields the desired results for both local and Spark data frames:

# Local data frame
weather %>%
  normalize_dplyr(temp, "normTemp") %>%
  select(id, temp, normTemp)

## # A tibble: 26,115 x 3
##       id  temp normTemp
##    <int> <dbl>    <dbl>
##  1     1  39.0   -0.913
##  2     2  39.0   -0.913
##  3     3  39.0   -0.913
##  4     4  39.9   -0.862
##  5     5  39.0   -0.913
##  6     6  37.9   -0.974
##  7     7  39.0   -0.913
##  8     8  39.9   -0.862
##  9     9  39.9   -0.862
## 10    10  41     -0.802
## # … with 26,105 more rows

# Spark DataFrame
tbl_weather %>%
  normalize_dplyr(temp, "normTemp") %>%
  select(id, temp, normTemp) %>% 
  collect()

## # A tibble: 26,115 x 3
##       id  temp normTemp
##    <int> <dbl>    <dbl>
##  1     1  39.0   -0.913
##  2     2  39.0   -0.913
##  3     3  39.0   -0.913
##  4     4  39.9   -0.862
##  5     5  39.0   -0.913
##  6     6  37.9   -0.974
##  7     7  39.0   -0.913
##  8     8  39.9   -0.862
##  9     9  39.9   -0.862
## 10    10  41     -0.802
## # … with 26,105 more rows
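The matching results are no coincidence: with its default arguments, scale() subtracts the mean and divides by the sample standard deviation, which is exactly what the manual formula does. A small base R check on a few hand-picked values, including a missing one, illustrates this:

```r
x <- c(39.02, 39.92, 37.94, 41, NA)

# Manual normalization, using the same formula as normalize_dplyr()
manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() with its default arguments, dropped to a plain numeric vector
scaled <- as.numeric(scale(x))

all.equal(manual, scaled)
```

R's sd() computes the sample standard deviation, so the local and the Spark versions of the computation are meant to agree as long as the translated query also uses the sample variant.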

Investigating the SQL translation and its Spark plan

Another advantage of this approach is that we can investigate the plan by which the actions will be executed by Spark using the explain() function from the dplyr package. This will print both the SQL query constructed by dbplyr and the plan generated by Spark, which can help us investigate performance issues:

tbl_weather %>%
  normalize_dplyr(temp, "normTemp") %>%
  dplyr::explain()

## <SQL>
## SELECT `id`, `origin`, `year`, `month`, `day`, `hour`, `temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, `precip`, `pressure`, `visib`, `time_hour`, (`temp` - AVG(`temp`) OVER ()) / stddev_samp(`temp`) OVER () AS `normTemp`
## FROM `weather`
## 
## <PLAN>

## == Physical Plan ==
## *(1) Project [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39, ((temp#30 - _we0#948) / _we1#949) AS normTemp#934]
## +- Window [avg(temp#30) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _we0#948, stddev_samp(temp#30) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _we1#949]
##    +- Exchange SinglePartition
##       +- InMemoryTableScan [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39]
##             +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##                   +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]

If we are only interested in the SQL itself as a character string, we can use dbplyr’s sql_render():

tbl_weather %>%
  normalize_dplyr(temp, "normTemp") %>%
  dbplyr::sql_render() %>%
  unclass()

## [1] "SELECT `id`, `origin`, `year`, `month`, `day`, `hour`, `temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, `precip`, `pressure`, `visib`, `time_hour`, (`temp` - AVG(`temp`) OVER ()) / stddev_samp(`temp`) OVER () AS `normTemp`\nFROM `weather`"

A more complex use case - Joins, group bys, and aggregations

The dplyr syntax makes it very easy to construct more complex aggregations across multiple Spark DataFrames. An example of a function that joins 2 Spark DataFrames and computes a mean of a selected column, grouped by another column can look as follows:

joingrpagg_dplyr <- function(
  df1, df2, 
  joinColNames = intersect(colnames(df1), colnames(df2)),
  col, groupCol
) {
  df1 %>%
    right_join(df2, by = joinColNames) %>%
    group_by({{groupCol}}) %>%
    summarise(mean({{col}})) %>% 
    arrange({{groupCol}})
}

We can then use this function, for instance, to look at the mean arrival delay of flights grouped by visibility. Note that we are only collecting heavily aggregated data, 20 rows in total, so the overhead of data transfer from the Spark instance to the R session is small. Also, just assigning the function call to delay_by_visib does not execute or collect anything; execution only starts when collect() is called:

delay_by_visib <- joingrpagg_dplyr(
  tbl_flights, tbl_weather,
  col = arr_delay, groupCol = visib
)
delay_by_visib %>% collect()

## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.

## # A tibble: 20 x 2
##    visib `mean(arr_delay)`
##    <dbl>             <dbl>
##  1  0                24.9 
##  2  0.06             28.5 
##  3  0.12             45.4 
##  4  0.25             20.8 
##  5  0.5              39.8 
##  6  0.75             41.4 
##  7  1                37.6 
##  8  1.25             65.1 
##  9  1.5              34.7 
## 10  1.75             45.6 
## 11  2                26.3 
## 12  2.5              21.7 
## 13  3                21.7 
## 14  4                17.7 
## 15  5                18.9 
## 16  6                17.3 
## 17  7                16.4 
## 18  8                16.1 
## 19  9                15.6 
## 20 10                 4.32

We can look at the plan and the generated SQL query as well:

delay_by_visib %>% dplyr::explain()

## <SQL>
## SELECT `visib`, AVG(`arr_delay`) AS `mean(arr_delay)`
## FROM (SELECT `RHS`.`year` AS `year`, `RHS`.`month` AS `month`, `RHS`.`day` AS `day`, `LHS`.`dep_time` AS `dep_time`, `LHS`.`sched_dep_time` AS `sched_dep_time`, `LHS`.`dep_delay` AS `dep_delay`, `LHS`.`arr_time` AS `arr_time`, `LHS`.`sched_arr_time` AS `sched_arr_time`, `LHS`.`arr_delay` AS `arr_delay`, `LHS`.`carrier` AS `carrier`, `LHS`.`flight` AS `flight`, `LHS`.`tailnum` AS `tailnum`, `RHS`.`origin` AS `origin`, `LHS`.`dest` AS `dest`, `LHS`.`air_time` AS `air_time`, `LHS`.`distance` AS `distance`, `RHS`.`hour` AS `hour`, `LHS`.`minute` AS `minute`, `RHS`.`time_hour` AS `time_hour`, `RHS`.`id` AS `id`, `RHS`.`temp` AS `temp`, `RHS`.`dewp` AS `dewp`, `RHS`.`humid` AS `humid`, `RHS`.`wind_dir` AS `wind_dir`, `RHS`.`wind_speed` AS `wind_speed`, `RHS`.`wind_gust` AS `wind_gust`, `RHS`.`precip` AS `precip`, `RHS`.`pressure` AS `pressure`, `RHS`.`visib` AS `visib`
## FROM `flights` AS `LHS`
## RIGHT JOIN `weather` AS `RHS`
## ON (`LHS`.`year` = `RHS`.`year` AND `LHS`.`month` = `RHS`.`month` AND `LHS`.`day` = `RHS`.`day` AND `LHS`.`origin` = `RHS`.`origin` AND `LHS`.`hour` = `RHS`.`hour` AND `LHS`.`time_hour` = `RHS`.`time_hour`)
## ) `dbplyr_003`
## GROUP BY `visib`
## ORDER BY `visib`
## 
## <PLAN>

## == Physical Plan ==
## *(6) Sort [visib#38 ASC NULLS FIRST], true, 0
## +- Exchange rangepartitioning(visib#38 ASC NULLS FIRST, 2)
##    +- *(5) HashAggregate(keys=[visib#38], functions=[avg(arr_delay#409)])
##       +- Exchange hashpartitioning(visib#38, 2)
##          +- *(4) HashAggregate(keys=[visib#38], functions=[partial_avg(arr_delay#409)])
##             +- *(4) Project [arr_delay#409, visib#38]
##                +- SortMergeJoin [cast(year#401 as double), cast(month#402 as double), day#403, origin#413, hour#417, time_hour#419], [year#26, month#27, day#28, origin#25, cast(hour#29 as double), time_hour#39], RightOuter
##                   :- *(2) Sort [cast(year#401 as double) ASC NULLS FIRST, cast(month#402 as double) ASC NULLS FIRST, day#403 ASC NULLS FIRST, origin#413 ASC NULLS FIRST, hour#417 ASC NULLS FIRST, time_hour#419 ASC NULLS FIRST], false, 0
##                   :  +- Exchange hashpartitioning(cast(year#401 as double), cast(month#402 as double), day#403, origin#413, hour#417, time_hour#419, 2)
##                   :     +- *(1) Filter (((((isnotnull(month#402) && isnotnull(day#403)) && isnotnull(origin#413)) && isnotnull(year#401)) && isnotnull(time_hour#419)) && isnotnull(hour#417))
##                   :        +- InMemoryTableScan [year#401, month#402, day#403, arr_delay#409, origin#413, hour#417, time_hour#419], [isnotnull(month#402), isnotnull(day#403), isnotnull(origin#413), isnotnull(year#401), isnotnull(time_hour#419), isnotnull(hour#417)]
##                   :              +- InMemoryRelation [year#401, month#402, day#403, dep_time#404, sched_dep_time#405, dep_delay#406, arr_time#407, sched_arr_time#408, arr_delay#409, carrier#410, flight#411, tailnum#412, origin#413, dest#414, air_time#415, distance#416, hour#417, minute#418, time_hour#419], StorageLevel(disk, memory, deserialized, 1 replicas)
##                   :                    +- Scan ExistingRDD[year#401,month#402,day#403,dep_time#404,sched_dep_time#405,dep_delay#406,arr_time#407,sched_arr_time#408,arr_delay#409,carrier#410,flight#411,tailnum#412,origin#413,dest#414,air_time#415,distance#416,hour#417,minute#418,time_hour#419]
##                   +- *(3) Sort [year#26 ASC NULLS FIRST, month#27 ASC NULLS FIRST, day#28 ASC NULLS FIRST, origin#25 ASC NULLS FIRST, cast(hour#29 as double) ASC NULLS FIRST, time_hour#39 ASC NULLS FIRST], false, 0
##                      +- Exchange hashpartitioning(year#26, month#27, day#28, origin#25, cast(hour#29 as double), time_hour#39, 2)
##                         +- InMemoryTableScan [origin#25, year#26, month#27, day#28, hour#29, visib#38, time_hour#39]
##                               +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##                                     +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]

Using the functions with local versus remote datasets

Some of the appeal of the dplyr syntax comes from the fact that we can use the same functions to conveniently manipulate local data frames in memory and, with the very same code, data from remote sources such as relational databases, data.tables and even data within Spark.

This unified front-end, however, comes with some important differences that we must be aware of when porting code that manipulates and computes on local data to remote sources. The same holds for remote Spark DataFrames that we manipulate using dplyr functions.

One example of different behavior is joining. Even the simplest case, an inner join of two tables, can yield a different number of rows for the remote Spark DataFrames and the local R data frames:

bycols <-  c("year", "month", "day", "origin", "hour", "time_hour")

# Look at count of rows of Inner join of the Spark data frames 
tbl_flights %>% inner_join(tbl_weather, by = bycols) %>% count()

## # Source: spark<?> [?? x 1]
##        n
##    <dbl>
## 1 335096

# Look at count of rows of Inner join of the local data frames 
flights %>% inner_join(weather, by = bycols) %>% count()

## # A tibble: 1 x 1
##        n
##    <int>
## 1 335220
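
One mechanism that can produce such differences, though not necessarily the one at play in the counts above, is the treatment of missing join keys. For local data frames, dplyr matches NA keys with each other by default, while SQL engines do not treat NULL keys as equal. A minimal local sketch, using two small made-up tibbles (x and y are illustrative and not part of the nycflights13 data):

```r
library(dplyr)

# Two tiny illustrative tables, each with one missing key
x <- tibble(key = c("a", NA), x_val = 1:2)
y <- tibble(key = c("a", NA), y_val = 3:4)

# Default for local data frames: NA keys match each other
n_default <- nrow(inner_join(x, y, by = "key"))

# SQL-like behavior: missing keys never match
n_sql_like <- nrow(inner_join(x, y, by = "key", na_matches = "never"))

print(c(default = n_default, sql_like = n_sql_like))
```

Setting na_matches = "never" on the local join is one way to check whether missing keys explain a row count difference.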

Another example of differences can arise from handling NA and NaN values:

# Create left joins (the Spark result is collected into the R session)
joined_spark <- tbl_flights %>% left_join(tbl_weather, by = bycols) %>% collect()
joined_local <- flights %>% left_join(weather, by = bycols)

# Look at counts of NA values
joined_local %>% filter(is.na(temp)) %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  1573

joined_spark %>% filter(is.na(temp)) %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  1697

# Look at counts of NaN values
joined_local %>% filter(is.nan(temp)) %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1     0

joined_spark %>% filter(is.nan(temp)) %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  1697
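
The identical counts for is.na() and is.nan() on the collected Spark result are consistent with how R itself treats the two kinds of missing values: is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN. A small base R illustration with a made-up vector:

```r
vals <- c(1.5, NA, NaN)

na_flags  <- is.na(vals)   # FALSE  TRUE  TRUE
nan_flags <- is.nan(vals)  # FALSE FALSE  TRUE

print(rbind(is.na = na_flags, is.nan = nan_flags))
```

So if missing values arrive in R as NaN, they are counted by both filters, whereas values stored as NA are counted only by the is.na() filter.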

Special care must also be taken when dealing with date/time values and their time zones:

# Note the time_hour values are different
weather %>% select(id, time_hour)

## # A tibble: 26,115 x 2
##       id time_hour          
##    <int> <dttm>             
##  1     1 2013-01-01 01:00:00
##  2     2 2013-01-01 02:00:00
##  3     3 2013-01-01 03:00:00
##  4     4 2013-01-01 04:00:00
##  5     5 2013-01-01 05:00:00
##  6     6 2013-01-01 06:00:00
##  7     7 2013-01-01 07:00:00
##  8     8 2013-01-01 08:00:00
##  9     9 2013-01-01 09:00:00
## 10    10 2013-01-01 10:00:00
## # … with 26,105 more rows

tbl_weather %>% select(id, time_hour)

## # Source: spark<?> [?? x 2]
##       id time_hour          
##    <int> <dttm>             
##  1     1 2013-01-01 06:00:00
##  2     2 2013-01-01 07:00:00
##  3     3 2013-01-01 08:00:00
##  4     4 2013-01-01 09:00:00
##  5     5 2013-01-01 10:00:00
##  6     6 2013-01-01 11:00:00
##  7     7 2013-01-01 12:00:00
##  8     8 2013-01-01 13:00:00
##  9     9 2013-01-01 14:00:00
## 10    10 2013-01-01 15:00:00
## # … with more rows
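
The constant five-hour shift is what we would expect when the same instants are displayed in two different time zones. The following base R sketch assumes, purely for illustration, that the local values are in US Eastern time and the Spark values are shown in UTC; the actual session settings are not shown in the output above:

```r
# The first local value, interpreted as US Eastern time (an assumption)
local_time <- as.POSIXct("2013-01-01 01:00:00", tz = "America/New_York")

# The same instant rendered in UTC
utc_text <- format(local_time, tz = "UTC")
print(utc_text)  # "2013-01-01 06:00:00"

# Both representations refer to the same point in time
utc_time <- as.POSIXct("2013-01-01 06:00:00", tz = "UTC")
same_instant <- as.numeric(local_time) == as.numeric(utc_time)
```

Comparing the underlying numeric values, rather than the printed strings, is a simple way to check whether two date/time columns really differ.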

And, rather obviously, a dplyr-based function that uses Hive built-in functions will most likely not run on local data frames, as we have seen previously.

The take-home message

In this part of the series, we have shown that we can take advantage of the performance of Spark while still writing arbitrary R functions by using dplyr syntax, which supports translation to Spark SQL using the dbplyr backend. We have also looked at some important differences when applying the same dplyr transformations to local and remote data sets.

With this approach, we can use R development best practices, testing, and documentation methods in a standard way when writing our R packages, getting the best of both worlds - Apache Spark for performance and R for convenient development of data science applications.

In the next installment, we will look at writing R functions that will be using SQL directly, instead of relying on dbplyr for the translation, and how we can efficiently send them to the Spark instance for execution and optionally retrieve the results to our R session.

References

The first part of this series
Documentation on Hive Operators and User-Defined Functions website.
A Docker image with R, Spark, sparklyr and Arrow available and its Dockerfile.
Overview of the dplyr syntax

Using Spark from R for performance with arbitrary code - Part 1 - Spark SQL translation, custom functions, and Arrow

Sat, 31 Aug 2019 12:00:00 +0000

Introduction

Apache Spark is a popular open-source analytics engine for big data processing and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users.

This series of articles will attempt to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages.

In this first part, we will examine how the sparklyr interface communicates with the Spark instance and what this means for performance with regards to arbitrarily defined R functions. We will also look at how Apache Arrow can improve the performance of object serialization.

Setting up Spark with R and sparklyr

Full instructions on setting up sparklyr are outside the scope of this article. Below we only provide a quick set of instructions to get a local Spark instance working with sparklyr.

Using a ready-made Docker Image

For the purpose of this series, a Docker image was built which you can use to experiment in the following ways by running one of the commands below within a terminal. If you are using RStudio 1.1 or newer, Terminal functionality is built into RStudio itself.

Interactively with R and sparklyr

Running the following should yield an interactive R session with all prerequisites to start working with the sparklyr package using a local Spark instance.

docker run --rm -it jozefhajnala/sparkly:test R

# Start using sparklyr
library(sparklyr)
sc <- spark_connect("local")

Interactively with the Spark shell

Running the following should yield an interactive Scala REPL instance. A Spark context should be available as sc and a Spark session as spark.

docker run --rm -it jozefhajnala/sparkly:test /root/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-shell

Running an example R script

Running the following should execute an example R script using sparklyr with output appearing in the terminal:

docker run --rm jozefhajnala/sparkly:test Rscript /root/.local/spark_script.R

Manual Installation

The following are very basic instructions. For troubleshooting or more detailed step-by-step guides, you can refer to RStudio’s spark website.

install.packages("sparklyr")
install.packages("nycflights13")
sparklyr::spark_install(version = "2.4.3")

Connecting and using a local Spark instance

# Load packages
library(sparklyr)
library(dplyr)
library(nycflights13)

# Connect
sc <- sparklyr::spark_connect(master = "local")

# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::weather,
  name = "weather",
  overwrite = TRUE
)

# Collect it back
tbl_weather %>% collect()

Sparklyr as a Spark interface provider

The sparklyr package is an R interface to Apache Spark. The meaning of the word interface is very important in this context as the way we use this interface can significantly affect the performance benefits we get from using Spark.

To understand the meaning of the above a bit better, we will examine 3 very simple functions that are different in implementation but intend to provide the same results, and how they behave with regards to Spark. We will use datasets from the nycflights13 package for our examples.

An R function translated to Spark SQL

Using the following fun_implemented() function will yield the expected results for both a local data frame nycflights13::weather and the remote Spark object referenced by tbl_weather:

# An R function translated to Spark SQL
fun_implemented <- function(df, col) {
  df %>% mutate({{col}} := tolower({{col}}))
}

fun_implemented(nycflights13::weather, origin)
fun_implemented(tbl_weather, origin)

This is because the R function tolower was translated by dbplyr to Spark SQL function LOWER and the resulting query was sent to Spark to be executed. We can see the actual translated SQL by running sql_render() on the function call:

dbplyr::sql_render(
  fun_implemented(tbl_weather, origin)
)

<SQL> SELECT LOWER(`origin`) AS `origin`, `year`, `month`, `day`, `hour`,
`temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, `precip`,
`pressure`, `visib`, `time_hour`
FROM `weather`

An R function not translated to Spark SQL

Using the following fun_r_only() function will only yield the expected results for a local data frame nycflights13::weather. For the remote Spark object referenced by tbl_weather we will get an error:

# An R function not translated to Spark SQL
fun_r_only <- function(df, col) {
  df %>% mutate({{col}} := casefold({{col}}, upper = FALSE))
}

fun_r_only(nycflights13::weather, origin)
fun_r_only(tbl_weather, origin)

 Error: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input 'AS' expecting ')'(line 1, pos 32)

== SQL ==
SELECT casefold(`origin`, FALSE AS `upper`) AS `origin`, 
`year`, `month`, `day`, `hour`, 
`temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, 
`precip`, `pressure`, `visib`, `time_hour`
--------------------------------^^^
FROM `weather`

This is because there simply is no translation provided by dbplyr for the casefold() function. The generated Spark SQL will therefore not be valid and throw an error once the Spark SQL parser tries to parse it.
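
To see which R calls have a translation without sending anything to Spark, we can ask dbplyr to translate single expressions. The sketch below uses dbplyr's simulated generic connection, which is an assumption made for convenience: it is not the Spark backend that sparklyr registers, so the output may differ in detail from what Spark receives.

```r
library(dbplyr)

# A simulated connection, so no database or Spark instance is needed
con <- simulate_dbi()

# tolower() has a translation and becomes an SQL function call
sql_known <- translate_sql(tolower(origin), con = con)

# casefold() has no translation and is passed through under its R name
sql_unknown <- translate_sql(casefold(origin, upper = FALSE), con = con)

print(sql_known)
print(sql_unknown)
```

Functions that appear in the output under their original R name were not translated and will only work if the remote engine happens to provide a function with that name and calling convention.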

A Hive built-in function not existing in R

On the other hand, using the below fun_hive_builtin() function will only yield the expected results for the remote Spark object referenced by tbl_weather. For the local data frame nycflights13::weather we will get an error:

# A Hive built-in function not existing in R
fun_hive_builtin <- function(df, col) {
  df %>% mutate({{col}} := lower({{col}}))
}

fun_hive_builtin(tbl_weather, origin)
fun_hive_builtin(nycflights13::weather, origin)

Error: Evaluation error: could not find function "lower".

This is because the function lower does not exist in R itself. For a non-existing R function there obviously is no dbplyr translation either. In this case, dbplyr keeps it as-is when translating to SQL, and the SQL will be valid and executed without problems because lower is, in fact, a function built-in to Hive:

dbplyr::sql_render(fun_hive_builtin(tbl_weather, origin))

<SQL> SELECT lower(`origin`) AS `origin`,
`year`, `month`, `day`, `hour`,
`temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`,
`precip`, `pressure`, `visib`, `time_hour`
FROM `weather`

Using non-translated functions with sparklyr

It can easily happen that one of the functions we want to use is neither translated nor a Hive built-in function. In this case, sparklyr provides another interface that allows us to use it - the spark_apply() function. Here is an oversimplified example that will reach our goal with casefold():

fun_r_custom <- function(tbl, colName) {
  tbl[[colName]] <- casefold(tbl[[colName]], upper = FALSE)
  tbl
}

spark_apply(tbl_weather, fun_r_custom, context = {colName <- "origin"})
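
Since the function passed to spark_apply() is plain R that receives and returns a data frame, it can be checked on a small local data frame before sending it to Spark. A minimal sketch, where local_sample is a made-up stand-in for a chunk of the weather data:

```r
# Same function as above, repeated so the example is self-contained
fun_r_custom <- function(tbl, colName) {
  tbl[[colName]] <- casefold(tbl[[colName]], upper = FALSE)
  tbl
}

# A small illustrative data frame standing in for the Spark data
local_sample <- data.frame(
  origin = c("EWR", "JFK"),
  temp = c(39.02, 39.92),
  stringsAsFactors = FALSE
)

local_result <- fun_r_custom(local_sample, "origin")
print(local_result)
```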

What is so important about this distinction?

We have now shown that we can also send code that was not translated by dbplyr to Spark and get it executed without issues using spark_apply(). So what is the catch and where does the importance of the meaning of the word interface come in?

Let us quickly examine the performance of the operations:

mb = microbenchmark::microbenchmark(
  times = 10,
  hive_builtin = fun_hive_builtin(tbl_weather, origin) %>% collect(),
  translated_dplyr = fun_implemented(tbl_weather, origin) %>% collect(),
  spark_apply = spark_apply(tbl_weather, fun_r_custom, context = {colName <- "origin"}) %>% collect()
)

Note that the absolute values here will vary based on the setup. The important message is in the relative differences.

We can see that the operations executed via the SQL translation mechanism of dbplyr were executed in around 0.5 seconds while those via spark_apply took orders of magnitude longer - more than 6 minutes.

What happens when we use custom functions with `spark_apply`

We can now see that the operation with spark_apply() is extremely slow compared to the other two. The key to understanding the difference is to examine how the custom transformations of data using R functions are performed within spark_apply(). In simplified terms, this happens in a few steps:

The data is moved in row format from Spark into the R process through a socket connection. This is inefficient, as multiple data types need to be deserialized for each row
The data gets converted to columnar format, since this is how R data frames are implemented
The R functions are applied to compute the results
The results are again converted to row format, serialized row by row and sent back to Spark over the socket connection
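
The round trip between row format and columnar format can be mimicked in plain R. This is only a conceptual sketch with a made-up three-row data frame; it does not reproduce sparklyr's actual serialization code:

```r
df <- data.frame(
  origin = c("EWR", "JFK", "LGA"),
  temp = c(39.02, 39.92, 41.00),
  stringsAsFactors = FALSE
)

# Row format: one list per row, each holding values of mixed types
rows <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))

# Columnar format: one typed vector per column, as in an R data frame
columnar <- data.frame(
  origin = vapply(rows, function(r) r$origin, character(1)),
  temp = vapply(rows, function(r) r$temp, numeric(1)),
  stringsAsFactors = FALSE
)

# Apply the R transformation on the columnar data
columnar$origin <- casefold(columnar$origin, upper = FALSE)

# Convert back to row format before the results are returned
rows_out <- lapply(seq_len(nrow(columnar)), function(i) {
  as.list(columnar[i, , drop = FALSE])
})
```

Every row is touched individually in both directions, which is where the per-row overhead comes from.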

What happens when we use translated or Hive built-in functions

When using functions that can be translated to Spark SQL, the process is very different:

The call is translated to Spark SQL using the dbplyr backend
The constructed query is sent to Spark for execution using DBI
Only when collect() or compute() is called, the SQL is executed within Spark
Only when collect() is called the results are also sent to the R session
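
The lazy nature of this process can be observed without a Spark instance by using a lazy table on a simulated connection from dbplyr. This is an assumption-laden stand-in for tbl_weather: the table contents and the simulated connection are illustrative, and no data is ever processed here.

```r
library(dplyr)
library(dbplyr)

# A lazy table that only records operations (illustrative stand-in)
lazy_weather <- lazy_frame(
  origin = c("EWR", "JFK"),
  con = simulate_dbi()
)

# Building the pipeline does not execute anything
query <- lazy_weather %>% mutate(origin = tolower(origin))

# The result is still a lazy table holding a query definition
is_lazy <- inherits(query, "tbl_lazy")

# The SQL that would be sent for execution
sql_text <- as.character(sql_render(query))
print(sql_text)
```

With a real Spark connection, it is only the later collect() call that triggers the transfer of results into the R session.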

This means that the transfer of data only happens once and only when collect() is called, which saves a vast amount of overhead.

Which R functionality is currently translated and built-in to Hive

An important question to answer with regards to performance then is what amount of functionality is available using the fast dbplyr backend. As seen above, these features can be categorized into two groups:

R functions translatable to Spark SQL via dbplyr. The full list of such functions is available on RStudio’s sparklyr website
Hive built-in functions that get translated as they are and can be evaluated by Spark. The full list is available on the Hive Operators and User-Defined Functions website.

Making serialization faster with Apache Arrow

What is Apache Arrow and how it improves performance

Our benchmarks have shown that using spark_apply() does not scale well and the penalty of the bottleneck in performance caused by serialization, deserialization, and transfer is too high.

To partially mitigate this we can take advantage of Apache Arrow, a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data.

With Arrow support enabled in sparklyr, Spark performs the row-format to column-format conversion in parallel within Spark. The data is then transferred through the socket, but no custom serialization takes place. All the R process needs to do is copy this data from the socket into its heap, transform it and copy it back to the socket connection.

This makes the process significantly faster:

mb = microbenchmark::microbenchmark(
  times = 10, 
  setup = library(arrow),
  hive_builtin = fun_hive_builtin(tbl_weather, origin) %>% collect(),
  translated_dplyr = fun_implemented(tbl_weather, origin) %>% collect(),
  spark_apply_arrow = spark_apply(tbl_weather, fun_r_custom, context = {colName <- "origin"}) %>% collect()
)

We can see that the timing of spark_apply() decreased from more than 6 minutes to around 4.5 seconds, which is a very significant performance boost. Compared to the other methods, however, we still see an order of magnitude difference.

Notes on the setup of Apache Arrow

It is worth noting that the R implementation of Apache Arrow arrived on CRAN in early August 2019, which means that at the time of writing it has been on CRAN for about 3 weeks. The functionality also depends on the Arrow C++ library, so installation is a bit more difficult than with some other R packages.

Care should also be taken with regards to the compatibility of the C++ library, the arrow R package version and the version of sparklyr. We had good results using the R package arrow version 0.14.1, sparklyr 1.0.2 and the 0.14.1 version of the C++ libraries.
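
To check which versions are present in a given R installation, a small helper along the following lines can be used. The helper name installed_version() is made up for this sketch; it reports NA for packages that are not installed and says nothing about the version of the C++ library.

```r
# Return the installed version of a package as text, or NA if missing
installed_version <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA_character_)
  }
  as.character(utils::packageVersion(pkg))
}

versions <- vapply(c("arrow", "sparklyr"), installed_version, character(1))
print(versions)
```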

The aforementioned Docker image has both the C++ libraries and the R arrow package available for use.

The take-home message

Adding Arrow to the mix certainly significantly improved the performance of our example code, but it is still quite slow compared to the native approach. Based on the above, we could conclude that:

Performance benefits are present mainly when all the computation is performed within Spark and R serves merely as a “messaging agent”, sending commands to Spark to be executed. If there are object serialization and transfer of larger objects present, performance is strongly impacted.

The take-home message from this exercise is that we should strive to only use R code that can be executed within the Spark instance. If we need some data retrieved, it is advisable that this is data that was previously heavily aggregated within Spark and only a small amount is transferred to the R session.

But we still need arbitrary R functions to run fast on Spark

In the next installments of this series, we will investigate a few options that allow us to retain the performance of Spark while still writing arbitrary R functions, that is, R functions that use methods already implemented in the Spark API even when the sparklyr interface does not provide them directly. The options are:

Rewriting the functions as collections of dplyr verbs that all support translation to Spark SQL
Rewriting the functions as series of Scala method invocations
Rewriting the functions into Spark SQL and using DBI to execute directly

References

The Apache Arrow and RStudio’s Spark website
Homepage of Apache Arrow
R Apache Arrow on GitHub
R package arrow on CRAN
Arrow C++ library installation guide
Documentation on Hive Operators and User-Defined Functions website.
A Docker image with R, Spark, sparklyr and Arrow available and its Dockerfile.

Using parallelization, multiple git repositories and setting permissions when automating R applications with Jenkins

Sat, 10 Aug 2019 12:00:00 +0000

Introduction

In the previous post, we focused on setting up declarative Jenkins pipelines with emphasis on parametrizing builds and using environment variables across pipeline stages.

In this post, we look at various tips that can be useful when automating R application testing and continuous integration, with regards to orchestrating parallelization, combining sources from multiple git repositories and ensuring proper access rights for the Jenkins agent.

Running stages in parallel

Parallel computation using R

There are numerous ways to achieve parallel computation in the context of an R application. Those native to R include, for example:

the parallel package, which is included with base R since version 2.14 and very stable, or
the more recent future package
the CRAN Task View: High-Performance and Parallel Computing with R provides a useful and extensive overview of multiple topics, including parallelism with R
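
As a minimal illustration of the first option, the following sketch uses mclapply() from the parallel package to square a few numbers on two worker processes. The worker count is an arbitrary choice for the example, and since forking is not available on Windows, the sketch falls back to a single core there.

```r
library(parallel)

# Forking is not supported on Windows, so use one core there
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply the function to each element, spread across the workers
squares <- mclapply(1:4, function(i) i^2, mc.cores = n_workers)

print(unlist(squares))
```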

Governing parallelism directly within R code requires tackling many aspects, starting with logging and ending with handling conditions and exceptions. We might therefore also be interested in leaving the orchestration of parallelism to a layer above the R application code itself. This approach has both benefits and limitations, so careful consideration should be given before the implementation starts.

Orchestrating parallelization of R jobs with Jenkins

Declarative Jenkins pipelines are one of the ways to orchestrate parallelism, and they offer many options. A very simple example of a parallelized process can look as follows:

pipeline {
  agent any
    stages {
      
      stage('Preparation') {
        steps {
          // Cleanup, Environment setup, etc.
        }
      }
      
      stage('Tests') {
        parallel {
          stage('Unit Tests') {
            steps {
              // Invoke unit tests
            }
          }
          stage('Integration Tests') {
            steps {
              // Invoke integration tests
            }
          }
          stage('Regression Tests') {
            steps {
             // Invoke regression tests
            }
          }
          stage('Technical checks') {
            steps {
              // Invoke Technical checks
            }
          }
        }
      }
      
  }
}

Note the parallel directive, which will ensure that the (sub)stages within it

Unit Tests
Integration Tests
Regression Tests and
Technical checks

will be executed in parallel.

The parallel stages will only start after the first stage, “Preparation”, has finished. This is useful when the parallel stages share a step that must be executed first.

Failing early

If we want to fail the parallel stages early (as soon as any of them fails), we can add failFast true into the parallel stage:

stage('Tests') {
  failFast true
  parallel {
    // ...
  }
}

An example parallel Jenkins pipeline shown by BlueOcean. Image credit https://bit.ly/31e8cAy

Cloning multiple git repositories

In certain situations, we may need to clone not just the main repository that is subject to our multibranch pipeline, but also secondary repositories.

An example of such a setup is when we store modeling parameters for our run in a separate repository, or when configurations governing the runs are stored in a separate repository.

The git directive allows us to clone another repository. Note that if you need to use credentials for the process, those are configured in Jenkins’ credential configuration.

stage('Clone another repository') {
  steps {
    git branch: 'master',
    credentialsId: 'my-credential-id',
    url: 'git@github.com:user/repo.git'
  }
}

Cloning into a separate subdirectory

Note, however, that this will clone the repository into the current working directory, where the main repository subject to the pipeline is likely already checked out. This may have unintended consequences, so a safer approach is to check out the secondary repository into a separate directory. We can achieve this using the dir directive:

stage('Clone another repository to subdir') {
  steps {
    sh 'rm -rf subdir; mkdir subdir'
    dir ('subdir') {
      git branch: 'master',
        credentialsId: 'my-credential-id',
        url: 'git@github.com:user/repo.git'
    }
  }
}

Cleaning up

After the pipeline is done, it may be useful to perform cleanup steps, for example removing unneeded directories. Since we likely want to clean those up regardless of the pipeline results, we can take advantage of the post directive with the always condition, which is executed regardless of the outcome of the pipeline stages.

One example use is to remove the hidden .git directories from both the working directory, where the main repository is checked out, and the "subdir", where we checked out the secondary repository:

post {
  always {
    sh 'rm -rf .git'
    sh 'rm -rf subdir/.git'
  }
}

Changing permissions to allow the Jenkins user to read

One aspect of using Jenkins to execute our R code is to ensure that the Jenkins user executing the code on the worker node has access to all the necessary files. The following is a list of useful Linux commands that can help with the setup. These should, of course, be used with care.

# Add user `jenkins` to group `somegroup`
usermod -a -G somegroup jenkins

# Change group of somedir/ to somegroup, recursively
chgrp -R somegroup somedir/

# Allow group to read `somedir`, recursively
chmod -R g+r somedir/

# Find all directories in a path and allow group to traverse
find /dir/moredir/somedir -type d -exec chmod g+x {} \;
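
Once the permissions are set, it can be handy to verify from within the R job itself that the files are readable by the user running it. The sketch below uses base R's file.access(); the helper name can_read() and the temporary paths are made up for illustration.

```r
# TRUE for each path the current user can read, FALSE otherwise
can_read <- function(paths) {
  # file.access() returns 0 on success and -1 on failure; mode 4 is read
  unname(file.access(paths, mode = 4) == 0)
}

# Illustrative paths: one existing temporary file and one missing file
existing_file <- tempfile(fileext = ".csv")
writeLines("a,b", existing_file)
missing_file <- file.path(tempdir(), "does-not-exist.csv")

readable <- can_read(c(existing_file, missing_file))
print(readable)
```

Running such a check at the start of a stage gives a clearer failure message than an error raised later, somewhere deep in the application code.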

References

Jenkins documentation on parallel blocks
Jenkins documentation on credential configuration
UnixExchange: Traversing directories
StackOverflow: Checkout multiple git repos into same Jenkins workspace
StackOverflow: Checkout Jenkins Pipeline Git SCM with credentials?

Using environment variables and parametrized builds for automating R applications with Jenkins

Sat, 27 Jul 2019 12:00:00 +0000

Introduction

Jenkins is a popular open-source tool that helps teams with automation and implementation of continuous integration and deployment pipelines, comparable to, for example, Atlassian’s Bamboo, GitLab CI or, to some extent, Travis.

In this post, we share some practical lessons learned when integrating R applications via Jenkins for the purpose of continuous integration and regression testing on runner nodes configured using Jenkins via declarative pipelines defined in a Jenkinsfile.

Example jenkins pipeline. Image credit https://bit.ly/2fpnBWI

Propagating environment variables to R sessions

When running R code on a local machine or a remote server from a user perspective, we count on a lot of configuration that is already present potentially without the user even noticing or knowing about the details of that configuration. One example of such configuration is the environment variables that configure some of R’s behavior.

When running R code on a computer that is connected to the Jenkins server as a node (a place where Jenkins sends the jobs to run), those environment variables likely need to be passed to the worker process, including configuration present for example in .Renviron files and .Rprofile files.

To propagate environment variables to all the stages of a declarative pipeline, we can use the environment directive in the pipeline definition. For example, to propagate a path to a user library, an example Jenkinsfile could look as follows:

pipeline {
  environment {
       R_LIBS_USER = '/path/to/lib'
  }
  // ... pipeline continues ...
 }

This will ensure that the environment variables defined will be propagated to all the stages defined in the pipeline.

Note: We might be tempted to simply use export on the variables that need to be propagated to other stages. While this will likely work in a classic setup where we are running multiple R scripts under the same shell, Jenkins runs each of the stages in a separate shell, meaning that export does not ensure that the variables will be propagated to other stages. The same of course applies to using Sys.setenv() from R itself.

Checking and accessing the propagated variables

To test whether our environment variables were propagated as intended, we can use printenv, for example in a stage dedicated to showing the environment variables:

pipeline {
  environment {
       R_LIBS_USER = '/path/to/lib'
  }
  agent any
    stages {
      stage('Show env vars') {
        steps {
          sh 'printenv'
        }
      }
  }
}

From R, we can access the environment variables using Sys.getenv():

# List all environment variables
Sys.getenv()

# Get a specific one
Sys.getenv("R_LIBS_USER")
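
By default, Sys.getenv() returns an empty string for a variable that is not set, which can hide a propagation problem. A small sketch of a stricter accessor follows; the helper name require_env() and the variable names ending in _DEMO are made up for this example.

```r
# Stop with a clear message if a required variable was not propagated
require_env <- function(var_name) {
  value <- Sys.getenv(var_name, unset = NA_character_)
  if (is.na(value)) {
    stop("Environment variable not set: ", var_name)
  }
  value
}

# Simulate a variable propagated by the pipeline (for illustration only)
Sys.setenv(R_LIBS_USER_DEMO = "/path/to/lib")
lib_path <- require_env("R_LIBS_USER_DEMO")

# A variable that is not set is reported as NA instead of ""
not_set <- Sys.getenv("SOME_UNSET_VARIABLE_DEMO", unset = NA_character_)
```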

Using a per-pipeline R library

For continuous integration purposes, it is useful to get our code checked out and tested on each commit. To get our packages installed into a separate library for each branch, one of the options is setting a user library path.

Doing that we can also choose the granularity of the separation we want to achieve. For example, using a library per branch in a multibranch pipeline context:

environment {
  R_LIBS_USER = """${sh(
    returnStdout: true,
    script: 'echo $PWD/test-lib'
  )}""".trim()
}

Using this would mean the same library is used for each build of the same branch. If we need more granularity we can use a library per both branch and build adding the BUILD_ID variable to the path:

environment {
  R_LIBS_USER = """${sh(
    returnStdout: true,
    script: 'echo $PWD/$BUILD_ID/test-lib'
  )}""".trim()
}

Note the need to apply the trim() method on the constructed paths to strip whitespaces/linebreaks that get produced when retrieving the value from standard output.
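
One detail to keep in mind, based on R's documentation of .libPaths() rather than on the pipeline definitions above, is that only directories that exist are included in the library search path. A sketch of a helper that creates the library directory and puts it first on the search path follows; the name ensure_user_lib() and the temporary path are assumptions made for illustration.

```r
# Create the library directory if needed and put it first in .libPaths()
ensure_user_lib <- function(lib = Sys.getenv("R_LIBS_USER")) {
  if (!nzchar(lib)) {
    stop("No library path provided and R_LIBS_USER is not set")
  }
  if (!dir.exists(lib)) {
    dir.create(lib, recursive = TRUE)
  }
  .libPaths(c(lib, .libPaths()))
  invisible(normalizePath(lib))
}

# Illustrative usage with a temporary directory instead of a workspace
demo_lib <- file.path(tempdir(), "demo-build", "test-lib")
lib_used <- ensure_user_lib(demo_lib)
print(.libPaths()[1])
```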

Working with parametrized builds from R

Jenkins also offers the option to parametrize builds, such that parameters of several types can be passed as environment variables to the shell through which the staged jobs are executed.

For usage with R applications, this means we can retrieve such parameters using the Sys.getenv() function. For example, if we create a parameter named r_num_cores in Jenkins, we can easily access its value within the build:

Sys.getenv("r_num_cores")

A small caveat to this is that all the parameters are passed as strings, so in case we want to pass R objects as parameters (for example a vector c(1, 2)), we would need to parse the string values, for example by writing a wrapper function. A naive implementation of such a wrapper can look as follows:

env_get <- function(varName, parse = TRUE) {
  res <- Sys.getenv(varName)
  if (isTRUE(parse)) res <- eval(parse(text = res))
  res
}

It is also worth noting that syntactical differences can require some further tweaking. For example, boolean Jenkins parameters are passed as "true" or "false", so they would not work with the eval(parse(...)) approach unless changed to uppercase first.
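
A slightly more defensive variant of the wrapper could handle the boolean case explicitly and fall back to the raw string when the value is not valid R code. This is only a sketch: the name env_get_typed() and the example variables are made up, and evaluating parameter values as R code should only be done for parameters from trusted sources.

```r
env_get_typed <- function(var_name) {
  value <- Sys.getenv(var_name, unset = NA_character_)
  # Variable not set at all
  if (is.na(value)) {
    return(NULL)
  }
  # Jenkins booleans arrive as lowercase "true" or "false"
  if (value %in% c("true", "false")) {
    return(value == "true")
  }
  # Try to evaluate as R code, otherwise keep the plain string
  tryCatch(
    eval(parse(text = value), envir = baseenv()),
    error = function(e) value
  )
}

# Simulate parameters passed by Jenkins (for illustration only)
Sys.setenv(r_flag = "true", r_vec = "c(1, 2)", r_txt = "some text")

flag <- env_get_typed("r_flag")
vec <- env_get_typed("r_vec")
txt <- env_get_typed("r_txt")
```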

References

Jenkins documentation on Creating multibranch pipelines
Jenkins documentation on Declarative pipelines
Jenkins documentation on Setting environment variables
Jenkins documentation on Parametrized builds

How data.table's fread can save you a lot of time and memory, and take input from shell commands

Sat, 22 Jun 2019 12:00:00 +0000

Introduction

Recently I was involved in a task that included reading and writing quite large amounts of data, totaling more than 1 TB worth of csvs without the standard big data infrastructure. After trying multiple approaches, the one that made this possible was using data.table’s reading and writing facilities - fread() and fwrite().

This motivated me to look at benchmarking data.table’s fread() and how it compares to other packages such as tidyverse’s readr and base R for reading tabular data from text files such as csvs.

Comparing fread, readr’s read_csv and base R

The data.table package is somewhat less well known in the R community, but those who do know it most likely know it for its speed when working with data tables themselves within R. The package however also provides functions for efficient reading and writing of tabular data from and into text files - fread() for fast reading and fwrite() for fast writing.

Another underrated property of fread(), apart from speed, is its memory efficiency, which can be crucial if we need to read in a lot of data without big data infrastructure.

The benchmarked data

As the data for this quick benchmark, we used the Airline on-time performance data for the years 2000 to 2008. This simple code chunk can be used to retrieve and extract the data. The download size is 868 MB in bz2 files. The extracted size is 5.34 GB in csv files, which when combined translates to a data frame with some 59 million rows and 29 columns. This data size is quite limited due to the specs of the machine used, but it is enough to show significant differences between packages.

destDir <- path.expand("~/dataexpo")
years <- 2000:2008
baseUrl <- "http://stat-computing.org/dataexpo/2009"

bz2Names <- file.path(destDir, paste0(years, ".csv.bz2"))
dlUrls   <- file.path(baseUrl, paste0(years, ".csv.bz2"))

if (!dir.exists(destDir)) {
  dir.create(destDir, recursive = TRUE)
}

# download files
mapply(download.file, dlUrls, bz2Names)

# extract
system(paste0(
  "cd ", destDir, "; ",
  "bzip2 -d -k ", paste(bz2Names, collapse = " ")
))
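The extraction step above relies on a bzip2 binary being available on the system. Where that is not the case, a rough alternative is to decompress with base R connections only. The helper name decompress_bz2 and the chunk size are assumptions for this sketch, and the demo runs on a tiny temporary file rather than the real data:

```r
# Sketch: decompress a .bz2 file in chunks using base R connections
decompress_bz2 <- function(src, dest = sub("\\.bz2$", "", src), chunk = 1e5L) {
  input <- bzfile(src, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- file(dest, open = "wb")
  on.exit(close(output), add = TRUE)
  repeat {
    bytes <- readBin(input, what = "raw", n = chunk)
    if (length(bytes) == 0L) break
    writeBin(bytes, output)
  }
  invisible(dest)
}

# Demo on a small temporary file
src <- tempfile(fileext = ".csv.bz2")
con <- bzfile(src, open = "w")
writeLines(c("Year,UniqueCarrier", "2000,AA"), con)
close(con)

dest <- decompress_bz2(src)
readLines(dest)
```

For the benchmark files, something like lapply(bz2Names, decompress_bz2) should play the same role as the system() call, most likely at a slower pace.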

Base R code to be benchmarked

Loading csv data from multiple files into a single data frame with base R is very simple:

dataDir <- path.expand("~/dataexpo")
dataFls <- dir(dataDir, pattern = "csv$", full.names = TRUE)
df <- do.call(rbind, lapply(dataFls, read.csv))

data.table `fread` code to be benchmarked

For data.table, we use rbindlist() for row binding instead of do.call(rbind, ...) and fread() for reading:

library(data.table)
dataDir <- path.expand("~/dataexpo")
dataFls <- dir(dataDir, pattern = "csv$", full.names = TRUE)
dt <- data.table::rbindlist(
  lapply(dataFls, data.table::fread, showProgress = FALSE)
)

`readr::read_csv` code to be benchmarked

The script for readr’s read_csv is also simple, with the small caveat that we need to predefine the column types, as bind_rows does not like to coerce the data. Doing things the tidyverse way, we also use purrr::map_dfr() for row binding and readr::read_csv() for reading:

library(readr)
library(purrr)
library(magrittr)
dataDir <- path.expand("~/dataexpo")
dataFiles <- dir(dataDir, pattern = "csv$", full.names = TRUE)

# bind_rows won't coerce, predefine
col_types <- readr::cols(
  .default = col_double(),
  UniqueCarrier = col_character(),
  TailNum = col_character(),
  Origin = col_character(),
  Dest = col_character(),
  CancellationCode = col_character(),
  CarrierDelay = col_double(),
  WeatherDelay = col_double(),
  NASDelay = col_double(),
  SecurityDelay = col_double(),
  LateAircraftDelay = col_double()
)

df <- dataFiles %>% 
  purrr::map_dfr(
    readr::read_csv,
    col_types = col_types,
    progress = FALSE
  )

The benchmarking method

A simple bash script was used to measure the maximum memory needed (Maximum resident set size to be precise) and to time the run of the script 10 times:

#!/bin/bash
scriptf=$1
printf '%s\n\n' "$scriptf"

# report the maximum resident set size of a single run
/usr/bin/time -v Rscript "$scriptf"  \
 2>&1 >/dev/null | \
 grep -E 'Maximum resident'

# time 10 consecutive runs
time for i in {1..10}; do Rscript "$scriptf" >/dev/null; done
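For a quick check from within an R session, a rough counterpart can be sketched with system.time() and gc(). Note that this is not the method used for the results below: gc() only tracks memory managed by R itself, so the numbers are not comparable to the maximum resident set size reported by /usr/bin/time. The helper name measure is made up for this sketch:

```r
# Sketch: elapsed time and R-level peak memory for one expression
measure <- function(expr) {
  invisible(gc(reset = TRUE))   # reset the "max used" statistics
  timing <- system.time(expr)   # expr is evaluated here, lazily
  mem <- gc()
  list(
    elapsed_sec = timing[["elapsed"]],
    # the last column of gc() output is "max used" in Mb
    max_used_mb = sum(mem[, ncol(mem)])
  )
}

res <- measure(sort(runif(1e6)))
str(res)
```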

The results

The results speak for themselves. Not only was fread() almost 2.5 times faster than readr’s functionality in reading and binding the data, but perhaps even more importantly, the maximum used memory was only 15.25 GB, compared to readr’s 27 GB. Interestingly, even though very slow, base R also spent less memory than the tidyverse suite.

For larger data sets, data.table’s efficiency can save not only very significant amounts of time, but also needed memory, which can have important implications with regards to the cost of the hardware needed for processing.

method	max. memory	avg. time
`utils::read.csv` + `base::rbind`	21.70 GB	8.13 m
`readr::read_csv` + `purrr::map_dfr`	27.02 GB	3.43 m
`data.table::fread` + `rbindlist`	15.25 GB	1.40 m
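The ratios quoted in the text can be recomputed from the table. This is plain arithmetic on the published numbers, with shortened method labels chosen for this sketch:

```r
# Numbers copied from the results table above
bench <- data.frame(
  method       = c("read.csv + rbind", "read_csv + map_dfr", "fread + rbindlist"),
  max_mem_gb   = c(21.70, 27.02, 15.25),
  avg_time_min = c(8.13, 3.43, 1.40)
)

# Express time and memory relative to the fread row
fread_row <- bench$method == "fread + rbindlist"
bench$time_ratio <- round(bench$avg_time_min / bench$avg_time_min[fread_row], 2)
bench$mem_ratio  <- round(bench$max_mem_gb / bench$max_mem_gb[fread_row], 2)
bench
```

On these numbers readr takes about 2.45 times as long as fread() and uses about 1.77 times the memory, while base R takes about 5.81 times as long.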

When your mind gets blown - `fread()` from shell command outputs

And it gets better than that. Consider a scenario where we need to read the data, subset or split into groups and compute on the processed data. The classic approach would be to load the data from files into R as seen above and then do the data processing.

For scenarios like these, fread() provides an even more powerful facility: the cmd argument, which takes a shell command that pre-processes the file(s). If we want to filter the data used above to only look at flights operated by American Airlines, the classic approach would be to read the data in and then filter. With fread() we can, however, use grep first and only have fread() process the output of that command:

library(data.table)
dataDir <- path.expand("~/dataexpo")
dataFiles <- dir(dataDir, pattern = "csv$", full.names = TRUE)

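# All flights by American Airlines
# -h suppresses the file name prefix that grep adds
# to each match when searching multiple files
command <- sprintf(
  "grep -h --text ',AA,' %s",
  paste(dataFiles, collapse = " ")
)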

dt <- data.table::fread(cmd = command)

Looking at our benchmarks, this approach only cost us 1.68GB of memory and about 24 seconds of runtime on average:

method	max. memory	avg. time
`data.table::fread` from `grep`	1.68 GB	0.40 m
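One detail to keep in mind with this approach is that grep also filters out the header line, so the column names are lost. A minimal sketch of one way to restore them is shown below on a tiny temporary file. It assumes the data.table package and a grep binary are available, and the file content is invented for the demonstration:

```r
library(data.table)

# A small stand-in for one of the csv files
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Year,UniqueCarrier,ArrDelay", "2000,AA,5", "2000,UA,7", "2001,AA,3"),
  tmp
)

# Read only the header to get the column names
hdr <- names(fread(tmp, nrows = 0L))

# Filter with grep, then re-attach the names
dt <- fread(
  cmd = paste("grep -h --text ',AA,'", shQuote(tmp)),
  header = FALSE,
  col.names = hdr
)
dt
```

Using shQuote() on the file path is a small safeguard for paths containing spaces.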

Optimizing further

The above is of course only the beginning of potential optimizations. We could probably save a lot of time taking advantage of GNU parallel to process the files with grep much faster. The key here is the flexibility of inputs that fread can process, without splitting the inputs into multiple files and other maintenance-heavy pre-processing.

In a bigger data setting, this can have a significant impact on the cost of a data science project and even investments in big data infrastructure, engineers and maintenance related to managing such a project.

TL;DR - Just show me the code

The benchmarking code can be found on GitLab.

How to interactively examine any R code - 4 ways to not just read the code, but delve into it step-by-step

Sat, 25 May 2019 12:00:00 +0000

Introduction

As pointed out by a recent read the R source post on the R hub’s website, reading the actual code, not just the documentation is a great way to learn more about programming and implementation details. But there is one more activity to get even more hands-on experience and understanding of the code in practice.

In this post, we provide tips on how to interactively debug R code step-by-step and investigate the values of objects in the middle of function execution. We will look at doing this for both exported and non-exported functions from different packages. We will also look at interactively debugging generics and methods, using functionality provided by base R.

Interactively examining functions with `debug()` and `debugonce()`

The 2 key functions we will be using for our interactive investigation of code are debug() and debugonce(). When debug() is called on a function, it will set a debugging flag on that function. When the function is executed, the execution will proceed one step at a time, giving us the option to investigate exactly what is going on in the context of that function call similarly to placing browser() at a certain point in our code.

Let us see a quick example:

debug(order)
order(10:1)

When running the second line, the code execution will stop inside order() and we can freely run the function line by line.

Debugging an R function interactively with debugonce()

When we no longer want to have the function flagged for debugging, call undebug():

undebug(order)
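To check whether a function currently carries the debugging flag, base R provides isdebugged(). A short sketch on a throwaway function (add_one is defined here only for the demonstration):

```r
add_one <- function(x) x + 1

debug(add_one)
isdebugged(add_one)   # the flag is set

undebug(add_one)
isdebugged(add_one)   # the flag is cleared again
```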

Alternatively, if we only want to have the function in debug mode for one execution, we can call debugonce() on the function. This approach may also be safer, as there is no need to call undebug() later:

debugonce(order)
order(10:1)

Debugging non-exported functions using `:::`

The great thing about debug() and debugonce() is that they allow us to interactively investigate not just the code that we are currently writing, but any interpreted R function. To debug functions not even exported from package namespaces, we can use :::. For example, we normally cannot access the list_rmds() function from the blogdown package as it is not exported.

# This will not work
library(blogdown)
debugonce(list_rmds)

## Error in debugonce(list_rmds): object 'list_rmds' not found

# This will not work either
debugonce(blogdown::list_rmds)

## Error: 'list_rmds' is not an exported object from 'namespace:blogdown'

If we need to, we can still debug it using ::: to access it in the package namespace:

# This will work
debugonce(blogdown:::list_rmds)

This is particularly useful when debugging nested calls inside package code, which tend to use unexported functions.
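To find out which objects of a package are only reachable via :::, we can compare the content of its namespace with its exports. The sketch below uses the stats package simply because it is always installed:

```r
ns <- asNamespace("stats")

# Names exported from the namespace, reachable with ::
exported <- getNamespaceExports(ns)

# Everything else in the namespace needs :::
internal <- setdiff(ls(ns, all.names = TRUE), exported)

"aggregate" %in% exported
head(sort(internal))
```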

Conveniently debugging methods with `debugcall()`

Many R functions are implemented as S3 generics that call the proper method based on the class of the argument they dispatch on. A good example of this approach is aggregate(). Looking at its code, we see it only dispatches to the proper method based on the arguments provided:

body(stats::aggregate)

## UseMethod("aggregate")

Using debug(aggregate) would therefore not be very useful for interactive investigation, as we most likely want to look at the method that is called to actually see what is going on.

For this purpose, we can use debugcall(), which will conveniently take us directly to the method. In the following case, it is the data.frame method of the aggregate() generic:

eval(debugcall(
  aggregate(mtcars["hp"], mtcars["carb"], FUN = mean),
  once = TRUE
))

As seen above, we can also use the once = TRUE argument to only debug the call once.
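If we are curious which method a call would dispatch to before debugging it, a small sketch using the class of the first argument and getS3method() can help:

```r
x <- mtcars["hp"]
class(x)

# Look up the aggregate() method for the first class of x
method <- getS3method("aggregate", class(x)[[1L]])
names(formals(method))

# List all known methods of the generic
methods("aggregate")
```

This only looks at the first element of the class vector, which is sufficient for a plain data.frame. Objects with several classes may dispatch on a later one.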

For more technical details, the reference provided by ?debugcall is a great resource. This is also true for ?debug and ?trace which I also strongly recommend reading.

Inserting debugging code anywhere inside a function body with `trace()`

If debugonce() and friends are not sufficient for our purposes and we want to insert advanced debugging code at different places within a function body, we can use trace() to do just that.

Imagine for example we would like to investigate a specific place in the code of the aforementioned stats::aggregate.data.frame method. First, we can explore the function body:

as.list(body(stats::aggregate.data.frame))

## [[1]]
## `{`
## 
## [[2]]
## if (!is.data.frame(x)) x <- as.data.frame(x)
## 
## [[3]]
## FUN <- match.fun(FUN)
## 
## [[4]]
## if (NROW(x) == 0L) stop("no rows to aggregate")
## 
## [[5]]
## if (NCOL(x) == 0L) {
##     x <- data.frame(x = rep(1, NROW(x)))
##     return(aggregate.data.frame(x, by, function(x) 0L)[seq_along(by)])
## }
## 
## [[6]]
## if (!is.list(by)) stop("'by' must be a list")
## 
## [[7]]
## if (is.null(names(by)) && length(by)) names(by) <- paste0("Group.", 
##     seq_along(by)) else {
##     nam <- names(by)
##     ind <- which(!nzchar(nam))
##     names(by)[ind] <- paste0("Group.", ind)
## }
## 
## [[8]]
## if (any(lengths(by) != NROW(x))) stop("arguments must have same length")
## 
## [[9]]
## y <- as.data.frame(by, stringsAsFactors = FALSE)
## 
## [[10]]
## keep <- complete.cases(by)
## 
## [[11]]
## y <- y[keep, , drop = FALSE]
## 
## [[12]]
## x <- x[keep, , drop = FALSE]
## 
## [[13]]
## nrx <- NROW(x)
## 
## [[14]]
## ident <- function(x) {
##     y <- as.factor(x)
##     l <- length(levels(y))
##     s <- as.character(seq_len(l))
##     n <- nchar(s)
##     levels(y) <- paste0(strrep("0", n[l] - n), s)
##     as.character(y)
## }
## 
## [[15]]
## grp <- lapply(y, ident)
## 
## [[16]]
## multi.y <- !drop && ncol(y)
## 
## [[17]]
## if (multi.y) {
##     lev <- lapply(grp, function(e) sort(unique(e)))
##     y <- as.list(y)
##     for (i in seq_along(y)) y[[i]] <- y[[i]][match(lev[[i]], 
##         grp[[i]])]
##     eGrid <- function(L) expand.grid(L, KEEP.OUT.ATTRS = FALSE, 
##         stringsAsFactors = FALSE)
##     y <- eGrid(y)
## }
## 
## [[18]]
## grp <- if (ncol(y)) {
##     names(grp) <- NULL
##     do.call(paste, c(rev(grp), list(sep = ".")))
## } else integer(nrx)
## 
## [[19]]
## if (multi.y) {
##     lev <- as.list(eGrid(lev))
##     names(lev) <- NULL
##     lev <- do.call(paste, c(rev(lev), list(sep = ".")))
##     grp <- factor(grp, levels = lev)
## } else y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
## 
## [[20]]
## nry <- NROW(y)
## 
## [[21]]
## z <- lapply(x, function(e) {
##     ans <- lapply(X = split(e, grp), FUN = FUN, ...)
##     if (simplify && length(len <- unique(lengths(ans))) == 1L) {
##         if (len == 1L) {
##             cl <- lapply(ans, oldClass)
##             cl1 <- cl[[1L]]
##             ans <- unlist(ans, recursive = FALSE)
##             if (!is.null(cl1) && all(vapply(cl, identical, NA, 
##                 y = cl1))) 
##                 class(ans) <- cl1
##         }
##         else if (len > 1L) 
##             ans <- matrix(unlist(ans, recursive = FALSE), nrow = nry, 
##                 ncol = len, byrow = TRUE, dimnames = if (!is.null(nms <- names(ans[[1L]]))) 
##                   list(NULL, nms))
##     }
##     ans
## })
## 
## [[22]]
## len <- length(y)
## 
## [[23]]
## for (i in seq_along(z)) y[[len + i]] <- z[[i]]
## 
## [[24]]
## names(y) <- c(names(by), names(x))
## 
## [[25]]
## row.names(y) <- NULL
## 
## [[26]]
## y

Now we can choose a point in the function body that we would like to explore interactively. For example, the 21st element starting with z <- lapply(x, function(e) { may be of interest. In that case, we can call:

trace(stats::aggregate.data.frame, tracer = browser, at = 21)

## Tracing function "aggregate.data.frame" in package "stats"

## [1] "aggregate.data.frame"

And see that this has added a call to .doTrace() to the function body:

as.list(body(stats::aggregate.data.frame))[[21L]]

## {
##     .doTrace(browser(), "step 21")
##     z <- lapply(x, function(e) {
##         ans <- lapply(X = split(e, grp), FUN = FUN, ...)
##         if (simplify && length(len <- unique(lengths(ans))) == 
##             1L) {
##             if (len == 1L) {
##                 cl <- lapply(ans, oldClass)
##                 cl1 <- cl[[1L]]
##                 ans <- unlist(ans, recursive = FALSE)
##                 if (!is.null(cl1) && all(vapply(cl, identical, 
##                   NA, y = cl1))) 
##                   class(ans) <- cl1
##             }
##             else if (len > 1L) 
##                 ans <- matrix(unlist(ans, recursive = FALSE), 
##                   nrow = nry, ncol = len, byrow = TRUE, dimnames = if (!is.null(nms <- names(ans[[1L]]))) 
##                     list(NULL, nms))
##         }
##         ans
##     })
## }

When we now call the aggregate() function on a data.frame, we will have the code stop at our selected point in the execution of the data.frame method:

aggregate(mtcars["hp"], mtcars["carb"], FUN = mean)

When done debugging, use untrace() to cancel the tracing:

untrace(stats::aggregate.data.frame)

## Untracing function "aggregate.data.frame" in package "stats"
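The tracer does not have to be browser(). As a minimal sketch, we can also insert an unevaluated expression that prints a value and lets the execution continue. The function scale_and_shift is a made-up example defined only for this illustration:

```r
scale_and_shift <- function(x) {
  y <- x * 2
  z <- y + 1
  z
}

# Element 1 of the body is `{`, so step 3 is `z <- y + 1`;
# the tracer runs just before that step, when y already exists
trace(
  scale_and_shift,
  tracer = quote(cat("y is", y, "\n")),
  at = 3,
  print = FALSE
)

res <- scale_and_shift(5)

untrace(scale_and_shift)
```

This can be handy for lightweight logging in non-interactive runs, where stopping in a browser is not an option.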

Happy investigating and debugging!

References

R documentation of the referenced functions

R documentation on debug(), debugonce(), etc.
R documentation on trace(), untrace(), etc.
R documentation on debugcall()

Porting and redirecting a Hugo-based blogdown website to an HTTPS-enabled custom domain and how to do it the easy way

Sat, 11 May 2019 12:00:00 +0000

Introduction

As we wrote in Should you start your R blog now?, blogging has probably never been more accessible to the general population, R users included. Usually, the simplest solution is to host your blog via a service that provides it for free, such as Netlify, GitHub or GitLab Pages. But what if you want to host that awesome blog on your own, HTTPS enabled domain?

In this post, we will look at how to port a Hugo-based website, such as a blogdown blog to our own domain, specifically focusing on GitLab Pages. We will also cover setting up SSL certificates, redirects from www to non-www sites and other details that I had to solve when porting my blogdown blog from GitLab’s hosting.

If you are just starting - there is an easy way

This post is mostly a reminder-to-self of what porting this blog from GitLab Pages hosting to a custom domain entailed. The route I took was heavily influenced by the way I was serving the website at the beginning - using GitLab Pages on a project address. Migrating to a new domain, I wanted to:

Keep all the functionality that a GitLab repository with GitLab CI/CD provides
Serve the content at a custom HTTPS-enabled domain
Redirect to the new domain with minimal to no content duplication
Make sure that the website works on both www and non-www addresses

If you have your blog hosted on GitLab pages and want to port and redirect it to your own HTTPS-enabled domain, while keeping the functionality that GitLab provides, you might find my journey useful.

Blogdown, Hugo & Let’s Encrypt logos

If you are just starting, it is easier to choose a different approach to publish your blog - here are 2 tips to consider if you want to prevent the pain I went through because of my past decisions:

What I would do if starting today

With the knowledge I gained when investigating the process, I would probably take the following route:

Register a custom domain via CloudFlare. This should make non-www/www redirects and getting SSL certificates seamless
Deploy the pages by connecting the GitLab repository to Netlify, which should be as easy as using GitLab CI/CD
Setup deployment to the custom domain via Netlify. This should make the redirects to the custom domain seamless and technically sound

Doing it the simplest way

Serving a Hugo-based website can in principle be even simpler - in fact, all that is really necessary is copying/uploading the contents of the public directory, generated for example with blogdown::build_site(), to the proper place. Everything we do around that is a process that makes our lives nicer at the cost of extra effort.

Serving a page on a custom domain via GitLab Pages

1. Choose and register a domain name

The first step is to choose a domain name (i.e. the web address) for your brand new website. This is completely up to you and the internet is full of tips like this one. Next, register that domain name with a provider of your choice. I have used a local provider for all of my websites for many years, so the choice was easy.

To register your domain name, you can pick from a plethora of providers, each with their own pros and cons.

2. Setup an SSL certificate

Setting up your website such that it can be accessed via HTTPS should be the standard these days, so we also need to set up an SSL certificate. Once again, this should be simple as we can use free Let’s Encrypt certificates to achieve that.

The actual process once again depends on the provider of your domain services - in practice, it should entail just a few clicks in their web UI.

3. Setup GitLab to serve your pages to a custom domain

Setting up GitLab Pages to be served to your own domain is well documented in GitLab’s documentation here and even better here.

If you have chosen CloudFlare as your service to manage the DNS, Nick Zeng wrote a detailed guide on how to setup GitLab pages with a custom domain. GitLab also has links to setting up DNS records for other hosting providers.

After these 3 steps, you should see your website served on your new domain and HTTPS should work just fine. Now onto the not-so-simple issues.

Redirecting the gitlab.io address to the custom domain

Server-side redirects are not supported with GitLab pages

Now that we can see our content on our new domain, we may want to take care of the fact that it is now visible on 2 addresses - the new custom domain and the original GitLab pages address. The traditional way of handling this is with server-side redirects - the server would issue an HTTP 301 Moved Permanently redirect to the new domain. The issue with that is that GitLab Pages does not support server-side redirects.

On the other hand, server-side redirects are supported by GitHub Pages and by Netlify.

JavaScript to the rescue

We can also find a suggestion to use meta refresh tags, but since using them is not always simple and server-side functionality is not available, we can opt for client-side JavaScript to solve our redirection issues.

It may not seem like a good idea from SEO perspective at first, but looking at some research on how Google handles JavaScript redirects it looks like the JavaScript redirects are quickly followed by Google. From an indexing standpoint, they are interpreted as 301s — the end-state URLs replaced the redirected URLs in Google’s index.

An example of a JavaScript implementation using the window.location object that can be used to get the current page address and to redirect the browser to a new page can look as follows:

function replacePath(path, old_d, new_d) {
  path = path.replace(old_d, new_d);
  if (path.includes(new_d)) {
    // only if really on the new domain
    path = path.replace("http:", "https:");
  }
  return path;
}
var newpath = replacePath(
  window.location.href,
  "://jozefhajnala.gitlab.io/r",
  "://jozef.io"
);

// Prevent infinite redirect
if (window.location.href != newpath){
  window.location.replace(newpath);
}

We would obviously replace the mentioned addresses by the desired ones and omit the https replacement if the new domain does not have SSL enabled.

We can test our JavaScript with a very simple function to see if all the URLs will be translated correctly:

// Place all urls into a variable
// this is just an example with a few
var oldLinks = [
  "https://jozefhajnala.gitlab.io/r",
  "https://jozefhajnala.gitlab.io/r/categories/rcase4base",
  "https://jozefhajnala.gitlab.io/r/categories/rstudioaddins",
  "https://jozefhajnala.gitlab.io/r/categories/various"
];

const old_d = "://jozefhajnala.gitlab.io/r";
const new_d = "://jozef.io";

// get the new links
var newLinks = oldLinks.map(x => replacePath(x, old_d, new_d));

function getStatus(url) {
  var req = new XMLHttpRequest();
  req.open("GET", url, false);
  req.send(null);
  return req.status;
}

// check the response statuses
// we want no 404s here, all 200 would be ideal
var statuses = newLinks.map(getStatus);
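The URL translation itself can also be checked offline, for instance from R before deploying the script. The sketch below mirrors the logic of replacePath() with base R string functions. The function name replace_path and the unrelated example.org link are assumptions added for illustration:

```r
# Sketch: R counterpart of the replacePath() logic above
replace_path <- function(path, old_d, new_d) {
  path <- sub(old_d, new_d, path, fixed = TRUE)
  # only switch to https if really on the new domain
  on_new <- grepl(new_d, path, fixed = TRUE)
  path[on_new] <- sub("http:", "https:", path[on_new], fixed = TRUE)
  path
}

old_links <- c(
  "http://jozefhajnala.gitlab.io/r",
  "https://jozefhajnala.gitlab.io/r/categories/various",
  "https://example.org/unrelated"
)

new_links <- replace_path(
  old_links,
  "://jozefhajnala.gitlab.io/r",
  "://jozef.io"
)
new_links
```

Links that do not contain the old domain are returned unchanged, which corresponds to the infinite redirect guard in the JavaScript version.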

Setting canonical links

To be completely sure about duplicate content, if we have several similar versions of the same content, we can choose one version and point the search engines at this version by specifying a canonical URL.

To specify them using Hugo is very simple thanks to the way it provides variables and the partials approach to building themes. Simply add a line like this to your header.html or head.html partial file:

<link rel="canonical" href="{{ .Permalink }}">

The .html files for partials are usually located in the themes/<your_theme>/layouts/partials/ directory.

Redirecting www and non-www URLs to a single address

Another aspect of the move is to make sure that the content is available via both www.example.com and example.com, but not duplicated. Which of those is preferred is once again up to you. One solution would be to tell GitLab to serve the content to both and use the canonical link or use the JavaScript redirect again. However, there is a much nicer solution on offer here since for our own domain we can use server-side redirects.

If your provider of choice is CloudFlare, it seems that this redirect can be done in a few clicks via the web UI.

Using .htaccess

One way to create this redirect is by using a .htaccess file. An example content, if you want to redirect to the www address with HTTPS, can look as follows:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule .* https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule .* https://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

Once you have the file ready, upload it to your site through an FTP client. If you host directly via your own server:

Place the .htaccess file into the proper directory, for example, /var/www/example.com/
Do not forget to activate the Apache mod_rewrite module using sudo a2enmod rewrite
An extra SSL certificate is likely to be needed for HTTPS to work correctly. Generating one with Let’s Encrypt is simple using certbot, described for example here

Read more details on using .htaccess here and more details on using Mod_Rewrite for redirects here.

References

Using custom domains with GitLab Pages

Adding custom domains to GitLab Pages
GitLab Pages custom domains and SSL/TLS Certificates
Setting up DNS Records with popular hosting providers

Redirects and duplicate content

Setting up virtual hosts, securing Apache with SSL, using .htaccess

How To Set Up Apache Virtual Hosts on Ubuntu
How To Secure Apache with Let’s Encrypt on Ubuntu 16.04
How To Use the .htaccess File
How To Set Up Mod_Rewrite

Setting up continuous multi-platform R package building, checking and testing with R-Hub, Docker and GitLab CI/CD for free, with a working example

Sat, 27 Apr 2019 12:00:00 +0000

Introduction

In the previous post, we looked at how to easily automate R analysis, modeling, and development work for free using GitLab’s CI/CD. Together with the fantastic R-hub project, we can use GitLab CI/CD to do much more.

In this post, we will take it to the next level by using R-hub to test our development work on many different platforms such as multiple Linux setups, MS Windows and MacOS. We will also show how to automate and continuously execute those multiplatform checks using GitLab CI/CD integration and Docker images.

For those too busy to read, we also provide a working example implementation in a public GitLab repository.

Using R-hub to build, check and test our R package on many platforms

R-hub is a project supported by the R Consortium that offers free R CMD check as a service on different platforms. This enables us to quickly and efficiently check the R package we are developing to make sure it passes all necessary checks on several platforms. As an added bonus, the checks seem to run in a very short time span, which means we can have our results at hand in a few minutes.

I also recommend that you read the why should you care about R-hub? blog post for more info.

CI/CD running checks on multiple platforms with R-hub

Getting started with R-hub

Getting started with R-hub is also very simple and can be achieved in 3 lines of code, from a package directory or an RStudio project for a package:

# Install the package
install.packages("rhub")

# Validate your e-mail address
# Provide the email argument if not detected automatically
rhub::validate_email()

# In an interactive session, 
# this will offer a list of platforms to choose from
cr <- rhub::check()

Your validated_emails.csv should be saved into the rappdirs::user_data_dir("rhub", "rhub") directory once validate_email() has run successfully.

For more details on getting started, the Get started with rhub post has you covered in detail.

Using and evaluating R-hub check results via R scripts

For continuous integration purposes, we may want to evaluate the results of the check based on the number of errors, warnings, and notes that the check gives for each platform. To achieve this goal, we need to tackle 2 issues:

Getting the results in a non-interactive context

In a non-interactive session, R-hub will run the check asynchronously and end the process we used to request the service, to free up resources. This is great but can pose some challenges in the CI context, as we would have to keep a job around to repeatedly query the R-hub job’s status and process the results once done, or implement a much smarter reporting solution.

Luckily, since for this purpose maximizing efficiency is not our top concern, the simple workaround is to execute the check as-if in an interactive session via the CI tool. This will provide us with the actual results of the check as soon as done and also write the log into our CI’s run log, at the obvious cost of having the process blocked while waiting for the check to finish on R-hub’s servers.

Processing the check results

The public methods for an rhub_check object currently seem to provide only side-effecting results such as printing them in various levels of detail and returning self, so investigating results via code may be challenging.

The simplest current solution is to use the object’s private fields to access the results in the desired format. The below example looks at the status_ private field and returns a data frame with the number of errors, warnings, and notes for each platform. For an object containing only 1 check result it can look as follows:

statuses <- cr[[".__enclos_env__"]][["private"]][["status_"]]
res <- do.call(rbind, lapply(statuses, function(thisStatus) {
  data.frame(
    platform = thisStatus[["platform"]][["name"]],
    errors   = length(thisStatus[["result"]][["errors"]]),
    warnings = length(thisStatus[["result"]][["warnings"]]),
    notes    = length(thisStatus[["result"]][["notes"]]),
    stringsAsFactors = FALSE
  )
}))
res

##             platform errors warnings notes
## 1 debian-gcc-release      0        0     0

Now we have a data frame which we can use to signal the CI/CD job to succeed or fail based on our wishes. For example, if we want to fail if the check discovered any notes, warnings or errors, a simple statement like the following will suffice:

if (any(colSums(res[2L:4L]) > 0)) {
  stop("Some checks resulted in errors, warnings or notes.")
}
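Since running a real check requires a validated e-mail and network access, the summarising logic can be tried out on a mocked list first. The sketch below assumes that each status element carries the platform name and the result fields used in the post. The mock_statuses object and the helper summarise_statuses are invented for illustration and are not part of the rhub package:

```r
# Sketch: summarise a list of check statuses into a data frame
summarise_statuses <- function(statuses) {
  do.call(rbind, lapply(statuses, function(thisStatus) {
    data.frame(
      platform = thisStatus[["platform"]][["name"]],
      errors   = length(thisStatus[["result"]][["errors"]]),
      warnings = length(thisStatus[["result"]][["warnings"]]),
      notes    = length(thisStatus[["result"]][["notes"]]),
      stringsAsFactors = FALSE
    )
  }))
}

# Mocked statuses mimicking only the fields accessed above
mock_statuses <- list(
  list(
    platform = list(name = "debian-gcc-release"),
    result = list(
      errors = character(0),
      warnings = character(0),
      notes = "a mock note"
    )
  ),
  list(
    platform = list(name = "fedora-clang-devel"),
    result = list(
      errors = character(0),
      warnings = character(0),
      notes = character(0)
    )
  )
)

res <- summarise_statuses(mock_statuses)

# Strict policy: any error, warning or note fails the job
strict_fail <- any(colSums(res[c("errors", "warnings", "notes")]) > 0)
# Lenient policy: notes are tolerated
lenient_fail <- any(colSums(res[c("errors", "warnings")]) > 0)
```

Selecting the columns by name rather than by position keeps the failure condition working if more columns are added to the data frame later.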

Putting it together into a script

Now that we have solved the above challenges, we can put it all together into a script that can be later used in the context of a CI/CD job:

# Retrieve passed command line arguments
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 1L) {
  stop("Incorrect number of args, needs 1: platform (string)")
}
platform <- args[[1L]]

# Check if passed platform is valid 
if (!is.element(platform, rhub::platforms()[[1L]])) {
  stop(paste(
    "Given platform not in rhub::platforms()[[1L]]:",
    platform
  ))
}

# Run the check on the selected platform
# Use show_status = TRUE to wait for results
cr <- rhub::check(platform = platform, show_status = TRUE)

# Get the statuses from private field status_
statuses <- cr[[".__enclos_env__"]][["private"]][["status_"]]

# Create and print a data frame with results
res <- do.call(rbind, lapply(statuses, function(thisStatus) {
  data.frame(
    platform = thisStatus[["platform"]][["name"]],
    errors   = length(thisStatus[["result"]][["errors"]]),
    warnings = length(thisStatus[["result"]][["warnings"]]),
    notes    = length(thisStatus[["result"]][["notes"]]),
    stringsAsFactors = FALSE
  )
}))
print(res)

# Fail if any errors, warnings or notes found
if (any(colSums(res[2L:4L]) > 0)) {
  stop("Some checks had errors, warnings or notes. See above for details.")
}

Preparing a private docker image to use with R-hub

If you are new to Docker, Colin Fay has you covered with his Introduction to Docker for R Users blog post.

Creating and testing an image

Thanks to all the hard work done by the maintainers of the Rocker images, our task of creating an image suitable for use with R-hub is very simple. Essentially we only need 2 additions to the r-base image:

The rhub package and a few system dependencies
A validated_emails.csv file placed into the correct directory, providing R-hub with the information on validated e-mail to use for the checks

The following Dockerfile can be used to create such an image for yourself. Just make sure you have your validated_emails.csv file present in the resources folder when running docker build.

To test our docker image, we can use a command like the following to create a container and run R within it in an interactive session:

docker run --rm -it <hub-username>/<repo-name>:<tag> R

Now we can see the list of validated e-mails in that R session:

rhub::list_validated_emails()

##                  email                token
## 1 myemail@somemail.com 00000000000000000000

Pushing the image into a private repository

Now that we have our image created, we need to push it to a repository for GitLab CI to be able to use it. Normally this is very simple:

docker push <hub-username>/<repo-name>:<tag>

However, as we are storing some relatively sensitive data in our image, namely our R-hub token, we should probably make this image private. Thanks to Dockerhub, this process is very easy - just click the proper buttons as shown in this post in the Dockerhub docs. Note that a free Dockerhub account has only 1 private repository available.

Creating a GitLab CI/CD pipeline

For an introduction to using GitLab CI/CD for R work, look at the previous post on How to easily automate R analysis, modeling and development work using CI/CD, with working examples

Setting up a pipeline with .gitlab-ci.yml

Now that we have our private Docker image and the script to run and evaluate our R-hub checks ready, all that is left is to create and set up a CI/CD pipeline. For GitLab CI/CD, this means creating a .gitlab-ci.yml file in the root of our GitLab repository directory. Without much extra talk, that file can look as follows:

image: index.docker.io/jozefhajnala/rhub:rbase

stages:
  - check

variables:
  _R_CHECK_CRAN_INCOMING_: "false"
  _R_CHECK_FORCE_SUGGESTS_: "true"

before_script:
  - apt-get update

check_ubuntu:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "ubuntu-gcc-release"

check_fedora:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "fedora-clang-devel"

check_mswin:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "windows-x86_64-devel"

check_macos:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "macos-elcapitan-release"

This file will:

Make the CI/CD jobs start from the image we have created
Define one stage named check
Set a couple of environment variables for R
Run four jobs, check_ubuntu, check_fedora, check_mswin, and check_macos - each of them using Rscript to execute an R script stored under inst/rhubcheck.R, with a different argument specifying the platform to check on
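The platform name written after the script path in each job reaches the R script as a command-line argument. Below is a minimal sketch of how such an argument could be read and validated; get_platform() is a hypothetical helper written for illustration, and the actual inst/rhubcheck.R may handle its arguments differently.

```r
# Hypothetical helper: read the platform name passed on the command line,
# e.g. via `Rscript inst/rhubcheck.R "ubuntu-gcc-release"`
get_platform <- function(args = commandArgs(trailingOnly = TRUE)) {
  # Fail early if the job did not pass exactly one non-empty argument
  if (length(args) != 1L || !nzchar(args[[1L]])) {
    stop("Expected exactly one argument: the platform to check on.")
  }
  args[[1L]]
}

# Simulate what the script would see in the check_fedora job
platform <- get_platform("fedora-clang-devel")
print(platform)
```

Failing early on a missing argument makes a misconfigured job stop right away instead of submitting a check for an unintended platform.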

Authenticating to use a private repository

Since we have made our Docker image private, GitLab will not be able to use it out of the box. We need to provide it with information on how to authenticate against Dockerhub to be able to pull the private image. There are a few ways to reach this goal; I have set up a variable via the Settings -> CI/CD -> Variables option in GitLab’s web UI:

Creating CI/CD variable with GitLab

The variable name should be DOCKER_AUTH_CONFIG and the value:

{
  "auths": {
    "registry.example.com:5000": {
      "auth": "bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ="
    }
  }
}

Where

"registry.example.com:5000" is replaced by our registry, for example "index.docker.io"
the value for "auth" is replaced by a base64-encoded version of our "<username>:<password>", which we can retrieve for example using R:

base64enc::base64encode(charToRaw("my_username:my_password"))

## [1] "bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ="
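To tie the two pieces together, the full value of the variable could also be assembled in R. This is only a sketch: docker_auth_config() is a hypothetical helper that is not part of any package, and it expects the already base64-encoded "<username>:<password>" string as input.

```r
# Hypothetical helper: build the JSON value for DOCKER_AUTH_CONFIG
# from a registry name and an already base64-encoded credential string
docker_auth_config <- function(registry, auth) {
  sprintf(
    '{\n  "auths": {\n    "%s": {\n      "auth": "%s"\n    }\n  }\n}',
    registry, auth
  )
}

# Example using the dummy credentials encoded above
config <- docker_auth_config(
  registry = "index.docker.io",
  auth = "bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ="
)
cat(config, "\n")
```

The printed text can then be pasted as the value of the variable in GitLab’s web UI. Keep in mind that base64 is an encoding, not encryption, so the value should be treated as a secret.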

And that is all! We are now ready to run our checks using a Docker image stored in a private repository. Once we push the .gitlab-ci.yml and inst/rhubcheck.R files to a GitLab repository, the pipeline will be automatically executed every time we push a commit to that repository.

TL;DR: Just show it to me in action

In case you are only interested in seeing the CI/CD pipeline with R-hub implemented for an R package, look at:

The .gitlab-ci.yml file for the jhaddins package on branch experimental
The Dockerfile used to build the image used in the above .gitlab-ci.yml
An R script that runs the checks via R-hub and evaluates the results
An example of a successful run with checks on 3 platforms
An example output of a check on Windows provided by R-hub

References

R-hub

The why should you care about R-hub? blog post
Get started with R-Hub
R-Hub on the R Consortium website
R-Hub’s reference online
Documentation on rhub_check R6 objects

R work and GitLab

Blog post on automating R analysis, modeling and development using CI/CD, with working examples
GitLab Continuous Integration documentation
GitLab CI/CD environment variables
Using a private container registry with GitLab CI/CD

R work and Docker

Docker images for R on the Rocker Project
Colin Fay’s Introduction to Docker for R Users
Get started with Docker official documentation
Using private repositories in DockerHub

How to easily automate R analysis, modeling and development work using CI/CD, with working examples

Sat, 13 Apr 2019 12:00:00 +0000

Introduction

Automating the execution, testing and deployment of R work is a very powerful tool to ensure the reproducibility, quality and overall robustness of the code that we are building, be it for data analysis and modeling purposes, developing R packages or even blogging. Modern tools also provide a free and easy-to-use way of achieving this goal.

In this post, we will show a quick and simple way to automate R data analysis and package development checking, testing and installation with GitLab CI/CD and provide example files that can be used for testing packages and deploying blogdown-based websites.

A quick overview of CI/CD and GitLab’s approach to it

In this paragraph, we will try to introduce GitLab CI/CD and its prerequisites in very simple and practical terms, at the cost of technical precision. Terms will be linked to relevant pages for those interested in precise definitions. We will also focus on using the CI/CD provided directly on GitLab. It is also possible to use it on your own infrastructure, but this is out of the scope of this introductory post.

What is CI/CD and how to use it with GitLab

Continuous integration (CI) and Continuous deployment (CD) are IT practices that encourage checking and testing code often (e.g. on every change pushed to a repository) and being able to provide the resulting product (e.g. an application) to the users automatically.

For the purpose of this post, we will be focusing on R code and will be happy with the CI/CD

detecting any code changes that we make in our repository
automatically running a set of actions that we define when changes are detected

What is GitLab CI/CD, what can it do

GitLab CI/CD is a service provided by GitLab that makes using basic CI/CD easy even for non-IT professionals, such as R users
Free for both public and private repositories hosted on GitLab
Can execute a wide variety of tasks, ranging from executing custom scripts, deployment of Java applications, building Docker images, checking and testing R packages to publishing blogs and more

What are the prerequisites, how to make it work?

To use GitLab CI/CD your project’s code should be hosted on GitLab
To make it work, you need to create a yaml file called .gitlab-ci.yml with the instructions and push it to the root directory of your project’s repository
Once you do that, instructions in that yaml file will be executed by GitLab automatically each time you push a code change to the repository. Different triggers such as specified times, etc. can also be used

Why GitLab? What if my code is on GitHub?

This post is by no means supposed to be an advertisement for GitLab. I chose it some time ago for 2 very simple reasons:

It allowed for free private repositories, which is now also true for GitHub
The CI/CD is fully integrated, with no need for other tools

If you use GitHub, the favorite CI tool for R code hosted there seems to be Travis. Some examples specific for R can be found here. You can also read a more generic Travis CI Tutorial.

The simplest example with R use

To make the post a bit less abstract and more practical, here is an overly simplified example of GitLab CI/CD used with R, which just runs the current version of R and prints the mtcars dataset:

The repository
The .gitlab-ci.yml file
An overview of pipeline runs
An example output of a run

Let’s have a look at this very simplistic .gitlab-ci.yml:

image: r-base

test:
  script:
  - R -e 'print(datasets::mtcars)'

We see the following:

image: r-base tells GitLab CI/CD to use the r-base Docker image for the run - more on that later
the rest of the yaml tells GitLab CI/CD to run a job named test, its task is to execute a script defined as R -e 'print(datasets::mtcars)', meaning just to run R and print the mtcars dataset

Now let us take a look at a more useful example for developing R packages.

Example pipeline for R package testing and deployment

We can use GitLab CI/CD to automatically

Build our package
Perform the R CMD check and investigate if we have any errors, warnings or notes
Run our unit tests
Check our testing coverage and finally
Install the package and potentially use it to perform more actions

GitLab CI/CD Pipeline for an R package

An example .gitlab-ci.yml with a pipeline based on a Docker image to test an R package can look as follows. Note that this is most likely overkill and too spacious, one could have a pipeline that is way shorter for this purpose:

image: jozefhajnala/rdev:3.4.4

stages:
  - build
  - document
  - check
  - test
  - deploy

variables:
  _R_CHECK_CRAN_INCOMING_: "false"
  _R_CHECK_FORCE_SUGGESTS_: "true"
  CODECOV_TOKEN: "2329aed3-de38-468c-9a06-95564363211c"

before_script:
  - apt-get update

buildbinary:
  stage: build
  script:
    - r -e 'devtools::build(binary = TRUE)'

documentation:
  stage: document
  script:
    - r inst/ci/document.R

checkerrors:
  stage: check
  script:
    - r -e 'if (!identical(devtools::check(document = FALSE, args = "--no-tests")[["errors"]], character(0))) stop("Check with Errors")'

checkwarnings:
  stage: check
  script:
    - r -e 'if (!identical(devtools::check(document = FALSE, args = "--no-tests")[["warnings"]], character(0))) stop("Check with Warnings")'

checknotes:
  stage: check
  script:
    - r -e 'if (!identical(devtools::check(document = FALSE, args = "--no-tests")[["notes"]], character(0))) stop("Check with Notes")'

unittests:
  stage: test
  script:
    - r -e 'if (any(as.data.frame(devtools::test())[["failed"]] > 0)) stop("Some tests failed.")'

codecov:
  stage: test
  script:
    - r -e 'covr::codecov()'

install:
  stage: deploy
  script:
    - r -e 'devtools::install()'

Now let’s again take a look at the content:

image - a docker image to use for the pipeline
stages - defines the ordering of job execution, jobs of the same stage are run in parallel, jobs of the next stage are run after the jobs from the previous stage complete successfully. Stages are the “columns” of the chart below.
variables - used to pass environment variables to the jobs
before_script - used to define commands that should be run before all jobs. An example is installing a needed R package that is not contained in our Docker image.

The rest are job definitions. Each job has a stage, which defines in which stage it is run, and multiple jobs can be included in one stage. A script essentially defines what to do. For R uses, this can usually be:

Rscript -e '<r commands>' to execute R commands specified between the quotes
Rscript pathtoscript.R to execute a script stored in a file
for littler users, we can replace Rscript with r for similar purposes as above
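The three check jobs above each run the check and then look at one element of its result. As a minimal sketch of a shorter alternative, a single job could run the check once and evaluate all three elements together. It assumes the check result is a list with character vectors named errors, warnings and notes, as used in the jobs above; summarise_check() and fail_on_issues() are hypothetical helpers written for illustration.

```r
# Hypothetical helper: count the errors, warnings and notes in a check result
summarise_check <- function(chk) {
  vapply(chk[c("errors", "warnings", "notes")], length, integer(1))
}

# Hypothetical helper: stop (and thereby fail the CI job) on any issue
fail_on_issues <- function(chk) {
  counts <- summarise_check(chk)
  if (any(counts > 0L)) {
    stop(
      "Check found issues: ",
      paste(names(counts), counts, sep = " = ", collapse = ", ")
    )
  }
  invisible(counts)
}

# A made-up clean result, standing in for the output of devtools::check()
clean <- list(
  errors = character(0),
  warnings = character(0),
  notes = character(0)
)
print(summarise_check(clean))
```

The trade-off is that the pipeline would no longer show errors, warnings and notes as separate jobs, so the cause of a failure would need to be read from the job log.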

Docker images for R users and developers

As we have seen, most of the pipelines start with image: <image>. This tells GitLab to use the specified Docker image for the run, which is extremely useful because a suitable Docker image will include all the software that we need to execute our analyses, modeling or other tasks, without us having to install that software within the .yaml file. Examples of such software available in an image for R use are, obviously, R itself and other dependencies such as additional packages.

If you would like to read more about Docker, Colin Fay has you covered with his Introduction to Docker for R Users. For now, let’s just assume that using this image provides GitLab with a place that has R (and all needed packages) installed and can run the specified scripts for us.

One of the great things about Docker images is that they are easy to share and adapt. A huge thank you and kudos go to Carl Boettiger and Dirk Eddelbuettel, who maintain the Rocker project which provides a collection of images suited for different R needs built on Debian.

My personal favorites from the Rocker project are

r-ver images - providing an environment fixed in time, including using a specifically dated MRAN repository. Have a look at their Dockerfiles on GitHub. The image used by my CI pipeline for testing packages is adapted from r-ver:3.4.4.
r-base for the current version of base R. Have a look at the Dockerfile on GitHub

TL;DR: Just show it to me in action

In case you are only interested in seeing the CI/CD pipeline work in action for some R uses, you can look at:

The simplest example using R

The repository
The .gitlab-ci.yml file
An overview of the pipeline runs
An example output of a run

The mentioned R package testing

The .gitlab-ci.yml file for the jhaddins package on branch develop
An overview of the pipeline runs
An example output of a successful run
An example output of a run that discovered a NOTE in the check process

Building a Docker image

The Docker image jozefhajnala/rdev:3.4.4 used above

is based on the following Dockerfile
is also built with GitLab CI/CD, look at the .gitlab-ci.yml file
the build pipeline in action

Publishing a Hugo-based blogdown blog

Also, this blog itself is deployed on a schedule via GitLab CI/CD, using a file very similar to the following:

Example Hugo site deployed with GitLab CI/CD
The .gitlab-ci.yml

References

GitLab Continuous Integration documentation
GitLab CI/CD Pipeline Configuration Reference
Colin Fay’s Introduction to Docker for R Users
Building an R Project with Travis CI
Docker images for R on the Rocker Project

Should you start your R blog now? 6 reasons I found in my first year of R blogging

Sat, 30 Mar 2019 12:00:00 +0000

Introduction

It has been a year since I posted the first post on this blog. Since that time, I have learned many lessons, but the main one is probably that blogging has never been as accessible as it is now.

In this anniversary post, I would like to give you a few reasons to start your own R blog and write about what I have learned in my first year of blogging about R.

The barrier to entry is low and the tools excellent

For many people, writing a blog on their own can seem like a challenge. In the end, you are basically creating a full-blown website, with styles, content, hopefully also with a responsive design. Then you need to setup hosting, publishing, and all the other necessities to actually get the content online. How does one do all that on their own?

As in many other areas, a task that would have been difficult years ago is now much easier: the tools have come a long way, and we need very little technical knowledge to have a blog up and running in under an hour. I wrote about the amazing free tools that I personally use for this blog and I really believe that thanks to those tools blogging about R is very accessible to a wide range of R developers and users.

If you are interested in starting up right now, I would recommend taking a look at the get started chapter of the blogdown: Creating Websites with R Markdown written by Yihui Xie, the author of the blogdown package himself.

Writing is a great way to learn and discover

When I was starting to write the blog, the intention was mainly to provide more exposure to base R functionality, which I felt has too little presence and popularity online, at least relative to other packages with much better presence and marketing - hence the R:case4base section.

Regardless of whether this mission was successful or not, it surprised me how much I learned during the writing process. Be it technical details, alternative ways of implementation, other class methods or even function arguments I never used before. Writing about R requires a lot of reading, which in turn resulted in learning many new approaches and exploring ideas.

Even writing this post I have discovered a cool new R package - prompt by Gábor Csárdi.

Writing will of course force you, well, to write your thoughts down, which is more difficult than it seems, especially if you are not a trained writer already. It also helps you express your thoughts in a more concise way. I find that even if no one except me read the blog posts, the added value of the writing itself is worth the effort.

Last but not least, I have used the posts on my blog as a reference for work since most of the time I write about issues I come across and try to propose solutions. Keeping a blog is a good way to have a written resource that you can get back to when you approach the same challenge at a later point in time. For example, I come back to the post on handling Java exceptions regularly to refresh my memory.

If you think without writing, you only think you’re thinking.
Leslie Lamport

Getting some readers is easier than expected

Apart from the learning experience, most of the people who write are happy when people actually read their blog and, if that is the goal, find it helpful. When starting, my optimistic expectation was that it would start to get some readers and crawl out of obscurity about a year from the first post, provided I keep posting consistently.

To my surprise, getting exposure proved much easier than anticipated, mainly thanks to the amazing R-bloggers, an aggregator of R blogs with a huge reader base. In fact, in the first 3 months since it was added, 40% of all the readers came to this blog from R-bloggers. There are also other aggregators and websites that you can add your content to for some extra exposure, such as R Weekly and Awesome Blogdown.

Resources online that provide some very useful tips include Maëlle’s Get on your soapbox! R blog content and promotion post. I only discovered that the Twitter hashtag for R is #rstats reading this blog post.
Thanks, Maëlle!

The community is amazing

In an effort to gain more exposure, one can also turn to social media. And since my poor skills predispose me to not much more than Twitter (which I still cannot use properly), I try to at least post a tweet when I publish a new post - with variable success. Twitter is a lot of work, in fact much more than one would expect, so I failed miserably on my goals to publish n tweets each week. There probably were months where I did not even open Twitter at all. The good thing about Twitter is that, at least in my experience, there seems to be a very strong correlation between the amount of work you put in and the exposure you get. And the amount variable is fully in your hands.

On an even more positive note, the #rstats community is just full of helpful and nice individuals, so any worries you might have will disappear soon.

It can also happen that you get lucky and some of the #rstats superstars such as Mara Averick will notice and retweet your tweet, which can really help boost your exposure. And you can easily communicate with other well-known figures of the community as well.

Blogging is fun

When push comes to shove, writing a blog is a spare time activity and to invest part of those precious moments, one must enjoy it. And there is a lot to enjoy when creating your own blog, especially with blogdown and hugo, where you have full control over the entire content and infrastructure of the blog. You can enjoy a multitude of activities related to content, design and more. To put this in perspective, my personal favorite time wasters include

endlessly obsessing about tiny details in the css, making sure everything is exactly as I want it to be, resulting in commits with messages like Dont use -1em top margin for pre.r, use 0 instead. Obviously, “as I want it to be” changes pretty much on a monthly basis, with current weather potentially having a significant effect.
trying to make the site light to load, resulting in spending hours on editing the svg representations of Font-Awesome icons to save 75KB of resources on page load. Check the footer of the page source code if you are interested in the result.
related to the above, making interactive charts light to load by writing a wrapper to minimize the rendered highchart size to the necessary JavaScript minimum

Write for yourself, the inspiration will come

One of my worries when starting a blog was that once I was done with the first few posts that I had planned, I would not find more inspiration and topics to write about. After sticking to my schedule of posting every other Saturday, instead of running out of inspiration, it seems that the topics that I want to write about are coming at a pace faster than 2 per month, which essentially means that if this keeps up, I will never run out of ideas. And if that tragic moment comes when I have nothing to write about, I guess the internet will happily keep growing without those few kilobytes I would add.

In terms of popularity, I write mostly what I find interesting and helpful for my future self. Even if I tried to write what others like, my estimation of what others may be interested in reading is so bad I would fail miserably. A case in point, my personal estimation was that the posts on interfacing Java from R would be the most read posts of the year. They took a lot of work and investigation to write and I find them really interesting. As it happens, both of them combined only have 10% of the reads compared to the post I called Christmas praise.

Happy R blogging!

How to create professional reports from R scripts, with custom styles

Sat, 16 Mar 2019 12:00:00 +0000

Introduction

In the practical tips for R Markdown post, we talked briefly about how we can easily create professional reports directly from R scripts, without the need for converting them manually to Rmd and creating code chunks. In this one, we will provide useful tips on advanced options for styling, using themes and producing light-weight HTML reports directly from R scripts. We will also provide a repository with an example R script and rendering code to easily get outputs of different styles and sizes.

Creating reports directly from R scripts

For an introduction on creating nice reports directly from R scripts, look into the relevant section of the previous blog post. In one sentence, we can just call one of the following:

# with knitr directly
knitr::spin("path-to-r-script.R")

# or with rmarkdown
rmarkdown::render("path-to-r-script.R")

to create a report from an R script directly. Both spin() and render() provide a default style that will be used to render an R script to HTML. The same is true for RStudio’s built-in File -> Compile report... functionality, which will call render() in the background when used.

We might, however, be interested in using different styles other than the default one when rendering our R scripts into HTML reports, and there are multiple ways to achieve this.

Including styles the quick, dirty and risky way

The fastest way to include a custom css stored in a file is to simply include a line like the following at the beginning of the R script that we are using spin() on:

#' <link rel="stylesheet" type="text/css" href="path-to-our.css">

This simple approach however has many caveats, as the line is just inserted into the body of the document within a paragraph, completely oblivious to what else was inserted. Unless there is a very good reason, we should use one of the safer and more robust approaches mentioned below.

Using knitr’s spin directly

Under the spin’s hood

Under the hood, spin() calls knit2html(), which passes many useful arguments to markdownToHTML(), the function that actually converts a markdown file to the final HTML format. Unfortunately, many of those useful arguments are not exposed via spin().

Bearing this in mind, we have a few ways to access and provide them with the desired values:

Changing the options that govern the default values and just call spin() as before
Perform the spinning in 2 steps

Changing the options that govern the default values and just call `spin()` as before

As mentioned above, spin() does not expose the arguments of markdownToHTML() directly, so what happens in practice is that the default values for those arguments are used when spin() is called. Some of the interesting arguments are by default selected in the following way:

options = getOption("markdown.HTML.options"),
extensions = getOption("markdown.extensions"),
stylesheet = getOption("markdown.HTML.stylesheet"),
template = getOption("markdown.HTML.template")

Let’s have a look at some interesting default options’ values:

library(markdown)
options()[c(
  "markdown.HTML.options",
  "markdown.extensions",
  "markdown.HTML.stylesheet"
)]

## $markdown.HTML.options
## [1] "use_xhtml"      "smartypants"    "base64_images"  "mathjax"       
## [5] "highlight_code"
## 
## $markdown.extensions
## [1] "no_intra_emphasis" "tables"            "fenced_code"      
## [4] "autolink"          "strikethrough"     "lax_spacing"      
## [7] "space_headers"     "superscript"       "latex_math"       
## 
## $markdown.HTML.stylesheet
## [1] "/usr/local/lib/R/site-library/markdown/resources/markdown.css"

If we want to keep the spinning in one step, we can simply update those options before calling spin (and ideally change them back afterwards). For a somewhat minimalistic HTML output still keeping images self-contained, we can do:

options(
  markdown.extensions = "fenced_code",
  markdown.HTML.options = "base64_images",
  markdown.HTML.stylesheet = "{}"
)
knitr::spin("spin_example.R")
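Changing the options back afterwards can be automated with a small wrapper. The following is a minimal sketch using only base R; with_markdown_options() is a hypothetical helper written for illustration and is not part of knitr or markdown.

```r
# Hypothetical helper: set options, evaluate an expression, then restore
# the previous option values even if the expression fails
with_markdown_options <- function(new_opts, expr) {
  old_opts <- options(new_opts)          # options() returns the old values
  on.exit(options(old_opts), add = TRUE)
  force(expr)                            # evaluated lazily, with new options set
}

# Illustration that does not need knitr: inspect the option inside the call
options(markdown.HTML.stylesheet = "original.css")
inside <- with_markdown_options(
  list(markdown.HTML.stylesheet = "path_to_custom.css"),
  getOption("markdown.HTML.stylesheet")
)
print(inside)
print(getOption("markdown.HTML.stylesheet"))
```

In practice, the second argument would be a call such as knitr::spin("path-to-r-script.R"). Because R evaluates function arguments lazily, that call only runs once the new options are in place.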

To use a custom css stylesheet instead of the one provided by default with the markdown package:

options(markdown.HTML.stylesheet = "path_to_custom.css")
knitr::spin("path-to-r-script.R")

Perform the report creation in 2 steps

The method above works but can seem quite workaround-ish. The method that could be considered more proper is to actually split the production of the final output into 2 steps:

Generate an intermediate .Rmd file via spin(), using spin(..., knit = FALSE)
Run knit2html() on the created .Rmd file with the desired options directly specified as arguments

This allows us to provide additional arguments extensions, stylesheet, header, template and encoding in the second step, instead of relying on the changed options to be passed as defaults.

The below example will embed styles present in path_to_custom.css into the resulting HTML:

# Creates the intermediate path-to-r-script.Rmd
knitr::spin("path-to-r-script.R", knit = FALSE)

# Now create the final HTML output from
# path-to-r-script.Rmd, with desired options
knitr::knit2html(
  input = "path-to-r-script.Rmd",
  stylesheet = "path_to_custom.css"
)

Using both of the above options will actually embed the css directly into the HTML output that is produced, making the output larger in size.

Note that the arguments we are looking to provide to knit2html() are implemented as part of ..., so we will have to name them. To look at the details, study the documentation of markdownToHTML(), to which those arguments get passed.

spin with custom air.css

Using rmarkdown’s render()

To produce an HTML report from an R script we can also use rmarkdown::render() on an R script file. This will create a report with slight differences from the default knit() output. One notable difference for HTML output is that render() will by default include inline base64 representations of fonts and JavaScript sources. It will also include some potentially useful metadata, such as the author’s name and the date of rendering.

The output_format powerhouse

The output of render() is governed mainly by the output_format argument. Most of the time users will pass on just the name of the format, such as "html_document", as most of the options are governed by the yaml metadata present at the beginning of our Rmd files.

For R scripts we usually do not use the yaml metadata. In this case, we can take full advantage of the flexibility of that argument, passing a call to rmarkdown::html_document() with the desired parameters as output_format.

Minimalistic output with render()

To produce a minimalistic HTML output from our path-to-r-script.R script, we can for example specify the following as output_format:

rmarkdown::render(
  "path-to-r-script.R", 
  output_format = rmarkdown::html_document(
    theme = NULL,
    mathjax = NULL,
    highlight = NULL
  )
)

Custom css with render()

Including a custom css stylesheet is equally simple, just provide a css argument with the css file path to the html_document() function:

rmarkdown::render(
  "path-to-r-script.R", 
  output_format = rmarkdown::html_document(
    theme = NULL,
    mathjax = NULL,
    highlight = NULL,
    css = "path_to_custom.css"
  )
)

An interesting property of including custom css styles is that by default the argument self_contained is set to TRUE, meaning that the full stylesheet will be embedded into the output HTML file, including all the external css imported into the one we are using. This means that if your stylesheets import external fonts such as the following, those will also be pasted directly into the output:

@import url(http://fonts.googleapis.com/css?family=Open+Sans:300italic,300);

This behavior is different for spin(), which will paste the @import clause into the output as-is, instead of parsing and pasting the actual content of the provided url.

TL;DR: Just show me the examples

If instead of reading about it you would like to just test it yourself, I created a very simple R project showcasing the mentioned methods and some more available via a GitLab repo.

The project has the following files:

src/path-to-r-script.R - an R script with custom formatted comments to be used as the source for creating reports with knitr::spin() and rmarkdown::render()
rendering_render.R - an R script that uses rmarkdown::render() to create multiple different output reports based on path-to-r-script.R and save them to outputs/
rendering_spin.R - an R script that uses knitr::spin() to create multiple different output reports based on path-to-r-script.R and save them to outputs/
outputs/ - HTML reports generated from the content of path-to-r-script.R by running rendering_spin.R and rendering_render.R
css/ - Example css used for creating outputs/ex_04_spin_air_css.html, all credit for the air.css goes to https://github.com/markdowncss/air

References

HTML document chapter of the R Markdown: The Definitive Guide book
Create R Markdown reports and presentations even better with these 3 practical tips
air.css style used to create the report on the screenshot above

Creating blazing fast pivot tables from R with data.table - now with subtotals using grouping sets

Sat, 02 Mar 2019 12:00:00 +0000

Introduction

Data manipulation and aggregation is one of the classic tasks anyone working with data will come across. We of course can perform data transformation and aggregation with base R, but when speed and memory efficiency come into play, data.table is my package of choice.

In this post we will look at one of the fresh and very useful features that came to data.table only last year - grouping sets, enabling us, for example, to create pivot table-like reports with sub-totals and a grand total quickly and easily.

Basic by-group summaries with data.table

To showcase the functionality, we will use a very slightly modified dataset provided by Hadley Wickham’s nycflights13 package, mainly the flights data frame. Let’s prepare a small dataset suitable for the showcase:

library(data.table)
dataurl <- "https://jozef.io/post/data/"
flights <- readRDS(url(paste0(dataurl, "r006/flights.rds")))
flights <- as.data.table(flights)[month < 3]

Now, for those unfamiliar with data.table, to create a summary of distances flown per month and originating airport, we could simply use:

flights[, sum(distance), by = c("month", "origin")]

##    month origin       V1
## 1:     1    EWR  9524521
## 2:     1    LGA  6359510
## 3:     1    JFK 11304774
## 4:     2    EWR  8725657
## 5:     2    LGA  5917983
## 6:     2    JFK 10331869

To also name the new column nicely, say distance instead of the default V1:

flights[, .(distance = sum(distance)), by = c("month", "origin")]

##    month origin distance
## 1:     1    EWR  9524521
## 2:     1    LGA  6359510
## 3:     1    JFK 11304774
## 4:     2    EWR  8725657
## 5:     2    LGA  5917983
## 6:     2    JFK 10331869

For more on basic data.table operations, look at the Introduction to data.table vignette.

As you have probably noticed, the above gave us the sums of distances by months and origins. When creating reports, readers, especially those coming from Excel, may expect 2 extra perks:

Looking at sub-totals and grand total
Seeing the data in wide format

Since the wide format is just a reshape and data.table has had the dcast() function for that for quite a while now, we will only briefly show it in practice. The focus of this post will be on the new functionality that was only released in data.table v1.11 in May last year - creating the grand- and sub-totals.

Quick pivot tables with subtotals and a grand total

To create a “classic” pivot table as known from Excel, we need to aggregate the data and also compute the subtotals for all combinations of the selected dimensions and a grand total. In comes cube(), the function that will do just that:

# Get subtotals for origin, month and month&origin with `cube()`:
cubed <- data.table::cube(
  flights,
  .(distance = sum(distance)),
  by = c("month", "origin")
)
cubed

##     month origin distance
##  1:     1    EWR  9524521
##  2:     1    LGA  6359510
##  3:     1    JFK 11304774
##  4:     2    EWR  8725657
##  5:     2    LGA  5917983
##  6:     2    JFK 10331869
##  7:     1   <NA> 27188805
##  8:     2   <NA> 24975509
##  9:    NA    EWR 18250178
## 10:    NA    LGA 12277493
## 11:    NA    JFK 21636643
## 12:    NA   <NA> 52164314

As we can see, compared to the simple group by summary we did earlier, we have extra rows in the output:

Rows 7,8 with months 1,2 and origin <NA>, <NA> - these are the subtotals per month across all origins
Rows 9,10,11 with months NA, NA, NA and origins EWR, LGA, JFK - these are the subtotals per origin across all months
Row 12 with NA month and <NA> origin - this is the Grand total across all origins and months

All that is left to get a familiar pivot table shape is to reshape the data to wide format with the aforementioned dcast() function:

# - Origins in columns, months in rows
data.table::dcast(cubed, month ~ origin,  value.var = "distance")

##    month       NA      EWR      JFK      LGA
## 1:    NA 52164314 18250178 21636643 12277493
## 2:     1 27188805  9524521 11304774  6359510
## 3:     2 24975509  8725657 10331869  5917983

# - Origins in rows, months in columns
data.table::dcast(cubed, origin ~ month,  value.var = "distance")

##    origin       NA        1        2
## 1:   <NA> 52164314 27188805 24975509
## 2:    EWR 18250178  9524521  8725657
## 3:    JFK 21636643 11304774 10331869
## 4:    LGA 12277493  6359510  5917983

Pivot table with data.table
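One practical wrinkle of the wide tables above is that the totals appear under NA labels. A minimal sketch of relabelling them before reshaping follows. The toy table and the "Total" label are made up for illustration, and the approach assumes that the grouping columns contain no genuine missing values:

```r
library(data.table)

# Made-up toy data standing in for the flights table
toy <- data.table(
  month    = c(1L, 1L, 2L, 2L),
  origin   = c("EWR", "JFK", "EWR", "JFK"),
  distance = c(10, 20, 30, 40)
)

cubed_toy <- cube(toy, .(distance = sum(distance)), by = c("month", "origin"))

# Subtotal rows carry NA in the dimension that was aggregated away,
# so we give those rows an explicit label
cubed_toy[, month := as.character(month)]
cubed_toy[is.na(month), month := "Total"]
cubed_toy[is.na(origin), origin := "Total"]

# Reshape to wide format as before
pivot <- dcast(cubed_toy, month ~ origin, value.var = "distance")
pivot
```

If the data can contain real NA values in the grouping columns, this relabelling would mix them up with the totals, so it is only a sketch for clean dimensions.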

Using more dimensions

We can use the same approach to create summaries with more than two dimensions, for example, apart from months and origins, we can also look at carriers, simply by adding "carrier" into the by argument:

# With 3 dimensions:
cubed2 <- cube(
  flights, 
  .(distance = sum(distance)),
  by = c("month", "origin", "carrier")
)
cubed2

##      month origin carrier distance
##   1:     1    EWR      UA  5084378
##   2:     1    LGA      UA   729667
##   3:     1    JFK      AA  2013434
##   4:     1    JFK      B6  3672655
##   5:     1    LGA      DL  1678965
##  ---                              
## 153:    NA   <NA>      F9   174960
## 154:    NA   <NA>      HA   293997
## 155:    NA   <NA>      YV    21526
## 156:    NA   <NA>      OO      733
## 157:    NA   <NA>    <NA> 52164314

And dcast() to wide format which suits our needs best:

# For example, with month and carrier in rows, origins in columns:
dcast(cubed2, month + carrier ~ origin,  value.var = "distance")

##     month carrier       NA      EWR      JFK      LGA
##  1:    NA    <NA> 52164314 18250178 21636643 12277493
##  2:    NA      9E  1431961    88706  1271194    72061
##  3:    NA      AA  7171819   789591  3830482  2551746
##  4:    NA      AS   283436   283436       NA       NA
##  5:    NA      B6  9036256   940582  7062702  1032972
##  6:    NA      DL  8729015   465275  4963047  3300693
##  7:    NA      EV  4188259  3940295    48792   199172
##  8:    NA      F9   174960       NA       NA   174960
##  9:    NA      FL   431194       NA       NA   431194
## 10:    NA      HA   293997       NA   293997       NA
## 11:    NA      MQ  2439609   293352   425390  1720867
## 12:    NA      OO      733       NA       NA      733
## 13:    NA      UA 13016872  9770500  1834968  1411404
## 14:    NA      US  1677108   641427   442107   593574
## 15:    NA      VX  1463964       NA  1463964       NA
## 16:    NA      WN  1803605  1037014       NA   766591
## 17:    NA      YV    21526       NA       NA    21526
## 18:     1    <NA> 27188805  9524521 11304774  6359510
## 19:     1      9E   749305    46125   666109    37071
## 20:     1      AA  3773186   415707  2013434  1344045
## 21:     1      AS   148924   148924       NA       NA
## 22:     1      B6  4699834   484431  3672655   542748
## 23:     1      DL  4503241   245277  2578999  1678965
## 24:     1      EV  2178833  2067900    24624    86309
## 25:     1      F9    95580       NA       NA    95580
## 26:     1      FL   226658       NA       NA   226658
## 27:     1      HA   154473       NA   154473       NA
## 28:     1      MQ  1284653   152428   223510   908715
## 29:     1      OO      733       NA       NA      733
## 30:     1      UA  6777189  5084378   963144   729667
## 31:     1      US   858820   339595   219387   299838
## 32:     1      VX   788439       NA   788439       NA
## 33:     1      WN   938403   539756       NA   398647
## 34:     1      YV    10534       NA       NA    10534
## 35:     2    <NA> 24975509  8725657 10331869  5917983
## 36:     2      9E   682656    42581   605085    34990
## 37:     2      AA  3398633   373884  1817048  1207701
## 38:     2      AS   134512   134512       NA       NA
## 39:     2      B6  4336422   456151  3390047   490224
## 40:     2      DL  4225774   219998  2384048  1621728
## 41:     2      EV  2009426  1872395    24168   112863
## 42:     2      F9    79380       NA       NA    79380
## 43:     2      FL   204536       NA       NA   204536
## 44:     2      HA   139524       NA   139524       NA
## 45:     2      MQ  1154956   140924   201880   812152
## 46:     2      UA  6239683  4686122   871824   681737
## 47:     2      US   818288   301832   222720   293736
## 48:     2      VX   675525       NA   675525       NA
## 49:     2      WN   865202   497258       NA   367944
## 50:     2      YV    10992       NA       NA    10992
##     month carrier       NA      EWR      JFK      LGA

Custom grouping sets

So far we have focused on the “default” pivot table shapes with all sub-totals and a grand total, however the cube() function could be considered just a useful special case shortcut for a more generic concept - grouping sets. You can read more on grouping sets with MS SQL Server or with PostgreSQL.

The groupingsets() function allows us to create sub-totals on arbitrary groups of dimensions. Custom subtotals are defined by the sets argument, a list of character vectors, each of them defining one subtotal. Now let us have a look at a few practical examples:

Replicate a simple group by, without any subtotals or grand total

For reference, to replicate a simple group by with grouping sets, we could use:

groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c("month", "origin", "carrier"),
  sets = list(c("month", "origin", "carrier"))
)

Which would give the same results as

flights[, .(distance = sum(distance)), by = c("month", "origin", "carrier")]

Custom subtotals

To give only the subtotals for each of the dimensions:

groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c("month", "origin", "carrier"),
  sets = list(
    c("month"),
    c("origin"),
    c("carrier")
  )
)

##     month origin carrier distance
##  1:     1   <NA>    <NA> 27188805
##  2:     2   <NA>    <NA> 24975509
##  3:    NA    EWR    <NA> 18250178
##  4:    NA    LGA    <NA> 12277493
##  5:    NA    JFK    <NA> 21636643
##  6:    NA   <NA>      UA 13016872
##  7:    NA   <NA>      AA  7171819
##  8:    NA   <NA>      B6  9036256
##  9:    NA   <NA>      DL  8729015
## 10:    NA   <NA>      EV  4188259
## 11:    NA   <NA>      MQ  2439609
## 12:    NA   <NA>      US  1677108
## 13:    NA   <NA>      WN  1803605
## 14:    NA   <NA>      VX  1463964
## 15:    NA   <NA>      FL   431194
## 16:    NA   <NA>      AS   283436
## 17:    NA   <NA>      9E  1431961
## 18:    NA   <NA>      F9   174960
## 19:    NA   <NA>      HA   293997
## 20:    NA   <NA>      YV    21526
## 21:    NA   <NA>      OO      733
##     month origin carrier distance

To give only the subtotals per combinations of 2 dimensions:

groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c("month", "origin", "carrier"),
  sets = list(
    c("month", "origin"),
    c("month", "carrier"),
    c("origin", "carrier")
  )
)

##     month origin carrier distance
##  1:     1    EWR    <NA>  9524521
##  2:     1    LGA    <NA>  6359510
##  3:     1    JFK    <NA> 11304774
##  4:     2    EWR    <NA>  8725657
##  5:     2    LGA    <NA>  5917983
##  6:     2    JFK    <NA> 10331869
##  7:     1   <NA>      UA  6777189
##  8:     1   <NA>      AA  3773186
##  9:     1   <NA>      B6  4699834
## 10:     1   <NA>      DL  4503241
## 11:     1   <NA>      EV  2178833
## 12:     1   <NA>      MQ  1284653
## 13:     1   <NA>      US   858820
## 14:     1   <NA>      WN   938403
## 15:     1   <NA>      VX   788439
## 16:     1   <NA>      FL   226658
## 17:     1   <NA>      AS   148924
## 18:     1   <NA>      9E   749305
## 19:     1   <NA>      F9    95580
## 20:     1   <NA>      HA   154473
## 21:     1   <NA>      YV    10534
## 22:     1   <NA>      OO      733
## 23:     2   <NA>      US   818288
## 24:     2   <NA>      UA  6239683
## 25:     2   <NA>      B6  4336422
## 26:     2   <NA>      AA  3398633
## 27:     2   <NA>      EV  2009426
## 28:     2   <NA>      FL   204536
## 29:     2   <NA>      MQ  1154956
## 30:     2   <NA>      DL  4225774
## 31:     2   <NA>      WN   865202
## 32:     2   <NA>      9E   682656
## 33:     2   <NA>      VX   675525
## 34:     2   <NA>      AS   134512
## 35:     2   <NA>      F9    79380
## 36:     2   <NA>      HA   139524
## 37:     2   <NA>      YV    10992
## 38:    NA    EWR      UA  9770500
## 39:    NA    LGA      UA  1411404
## 40:    NA    JFK      AA  3830482
## 41:    NA    JFK      B6  7062702
## 42:    NA    LGA      DL  3300693
## 43:    NA    EWR      B6   940582
## 44:    NA    LGA      EV   199172
## 45:    NA    LGA      AA  2551746
## 46:    NA    JFK      UA  1834968
## 47:    NA    LGA      B6  1032972
## 48:    NA    LGA      MQ  1720867
## 49:    NA    EWR      AA   789591
## 50:    NA    JFK      DL  4963047
## 51:    NA    EWR      MQ   293352
## 52:    NA    EWR      DL   465275
## 53:    NA    EWR      US   641427
## 54:    NA    EWR      EV  3940295
## 55:    NA    JFK      US   442107
## 56:    NA    LGA      WN   766591
## 57:    NA    JFK      VX  1463964
## 58:    NA    LGA      FL   431194
## 59:    NA    EWR      AS   283436
## 60:    NA    LGA      US   593574
## 61:    NA    JFK      MQ   425390
## 62:    NA    JFK      9E  1271194
## 63:    NA    LGA      F9   174960
## 64:    NA    EWR      WN  1037014
## 65:    NA    JFK      HA   293997
## 66:    NA    JFK      EV    48792
## 67:    NA    EWR      9E    88706
## 68:    NA    LGA      9E    72061
## 69:    NA    LGA      YV    21526
## 70:    NA    LGA      OO      733
##     month origin carrier distance

Grand total

To give only the grand total:

groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c("month", "origin", "carrier"),
  sets = list(
    character(0)
  )
)

##    month origin carrier distance
## 1:    NA   <NA>    <NA> 52164314
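The sets can also be mixed freely. As a small sketch on a made-up toy table (not the flights data), the following asks for the detailed rows, the per-month subtotals and the grand total in one call. It also sets the id argument of groupingsets() to TRUE, which adds a leading grouping column that helps tell the different kinds of rows apart:

```r
library(data.table)

# Made-up toy data standing in for the flights table
toy <- data.table(
  month    = c(1L, 1L, 2L, 2L),
  origin   = c("EWR", "JFK", "EWR", "JFK"),
  distance = c(10, 20, 30, 40)
)

mixed <- groupingsets(
  toy,
  j = .(distance = sum(distance)),
  by = c("month", "origin"),
  sets = list(
    c("month", "origin"),  # detailed rows
    c("month"),            # subtotal per month
    character(0)           # grand total
  ),
  id = TRUE                # add the `grouping` bit mask column
)
mixed
```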

Cube and rollup as special cases of grouping sets

Implementation of cube

We mentioned above that cube() can be considered just a shortcut to a useful special case of groupingsets(). And indeed, looking at the implementation of the data.table method data.table:::cube.data.table, most of what it does is define the sets to represent the given vector and all of its possible subsets, and pass them to groupingsets():

function (x, j, by, .SDcols, id = FALSE, ...) {
  if (!is.data.table(x)) 
    stop("Argument 'x' must be a data.table object")
  if (!is.character(by)) 
    stop("Argument 'by' must be a character vector of column names used in grouping.")
  if (!is.logical(id)) 
    stop("Argument 'id' must be a logical scalar.")
  n = length(by)
  keepBool = sapply(2L^(seq_len(n) - 1L), function(k) rep(c(FALSE, 
    TRUE), times = k, each = ((2L^n)/(2L * k))))
  sets = lapply((2L^n):1L, function(j) by[keepBool[j, ]])
  jj = substitute(j)
  groupingsets.data.table(x, by = by, sets = sets, .SDcols = .SDcols, 
    id = id, jj = jj)
}
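To see what the two lines building keepBool and sets actually produce, they can be run on their own. The snippet below simply lifts them out of the function body above and applies them to a three-element by vector; no data.table functionality is needed for this part:

```r
by <- c("month", "origin", "carrier")
n <- length(by)

# One column per dimension, one row per subset (2^n rows in total)
keepBool <- sapply(
  2L^(seq_len(n) - 1L),
  function(k) rep(c(FALSE, TRUE), times = k, each = ((2L^n) / (2L * k)))
)

# Walk the rows from last to first and keep the flagged dimensions
sets <- lapply((2L^n):1L, function(j) by[keepBool[j, ]])
str(sets)
```

The resulting list has 2^3 = 8 elements, starting with the full vector and ending with character(0), which is the set that yields the grand total.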

This means for example that

cube(flights, sum(distance),  by = c("month", "origin", "carrier"))

##      month origin carrier       V1
##   1:     1    EWR      UA  5084378
##   2:     1    LGA      UA   729667
##   3:     1    JFK      AA  2013434
##   4:     1    JFK      B6  3672655
##   5:     1    LGA      DL  1678965
##  ---                              
## 153:    NA   <NA>      F9   174960
## 154:    NA   <NA>      HA   293997
## 155:    NA   <NA>      YV    21526
## 156:    NA   <NA>      OO      733
## 157:    NA   <NA>    <NA> 52164314

Is equivalent to

groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c("month", "origin", "carrier"),
  sets = list(
    c("month", "origin", "carrier"),
    c("month", "origin"),
    c("month", "carrier"),
    c("month"),
    c("origin", "carrier"),
    c("origin"),
    c("carrier"),
    character(0)
  )
)

##      month origin carrier distance
##   1:     1    EWR      UA  5084378
##   2:     1    LGA      UA   729667
##   3:     1    JFK      AA  2013434
##   4:     1    JFK      B6  3672655
##   5:     1    LGA      DL  1678965
##  ---                              
## 153:    NA   <NA>      F9   174960
## 154:    NA   <NA>      HA   293997
## 155:    NA   <NA>      YV    21526
## 156:    NA   <NA>      OO      733
## 157:    NA   <NA>    <NA> 52164314

Implementation of rollup

The same can be said about rollup(), another shortcut that can be useful. Instead of all possible subsets, it will create a list representing the vector passed to by and its subsets “from right to left”, including the empty vector to get a grand total. Looking at the implementation of the data.table method data.table:::rollup.data.table:

function (x, j, by, .SDcols, id = FALSE, ...) {
  if (!is.data.table(x)) 
    stop("Argument 'x' must be a data.table object")
  if (!is.character(by)) 
    stop("Argument 'by' must be a character vector of column names used in grouping.")
  if (!is.logical(id)) 
    stop("Argument 'id' must be a logical scalar.")
  sets = lapply(length(by):0L, function(i) by[0L:i])
  jj = substitute(j)
  groupingsets.data.table(x, by = by, sets = sets, .SDcols = .SDcols, 
    id = id, jj = jj)
}

For example, the following:

rollup(flights, sum(distance),  by = c("month", "origin", "carrier"))

Is equivalent to

groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c("month", "origin", "carrier"),
  sets = list(
    c("month", "origin", "carrier"),
    c("month", "origin"),
    c("month"),
    character(0)
  )
)
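The equivalence can be checked on a small scale. The sketch below uses a made-up toy table instead of the flights data, builds the sets with the same expression that appears in the rollup() implementation above, and compares both results:

```r
library(data.table)

# Made-up toy data standing in for the flights table
toy <- data.table(
  month    = c(1L, 1L, 2L, 2L),
  origin   = c("EWR", "JFK", "EWR", "JFK"),
  carrier  = c("UA", "AA", "UA", "B6"),
  distance = c(10, 20, 30, 40)
)
by_cols <- c("month", "origin", "carrier")

# The same expression rollup() uses to build its sets "from right to left"
sets <- lapply(length(by_cols):0L, function(i) by_cols[0L:i])

via_rollup <- rollup(toy, .(distance = sum(distance)), by = by_cols)
via_sets <- groupingsets(
  toy,
  j = .(distance = sum(distance)),
  by = by_cols,
  sets = sets
)

all.equal(via_rollup, via_sets)
```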

References

Grouping sets, cube and rollup in PostgreSQL
MS SQL Server
And in Oracle documentation
Introduction to data.table
Efficient reshaping using data.tables

Verbose data.table and uncovering hidden cedta's data table awareness decisions

Sat, 16 Feb 2019 12:00:00 +0000

Introduction

When speed and memory efficiency are important, the data.table package is one of the ways to improve those aspects of our R code dramatically. Using data.table in a package also comes with the added benefit that data.table itself only imports the methods package, which is part of base R. We must also however pay attention to correctly importing and using methods, as data.table handles data.frame subsetting operators in a special way. This post is mostly a lesson learned for future self on how I did not pay attention and what I found out investigating.

TL;DR if you just want something useful

Use options(datatable.verbose = TRUE) to see useful logging information
If you are getting weird errors with subset methods, check whether data frame methods get called instead of the data table ones (e.g. by running traceback() after the error occurs)
If so, check if data.table:::cedta() returns FALSE for your package. And if it does, check if you import data.table in the NAMESPACE file of your package
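The last check in the list above can be sketched with base R alone. The helper name imports_data_table is made up for this illustration; it mirrors the imports condition that cedta() evaluates, using %in% instead of data.table's %chin%, and it works on any installed package:

```r
# Hypothetical helper: does the namespace of an installed package
# import (anything from) data.table?
imports_data_table <- function(pkg) {
  "data.table" %in% names(getNamespaceImports(pkg))
}

# The stats package, for example, does not import data.table
imports_data_table("stats")
```

For your own package, you would pass its name instead of "stats" after installing it.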

A somewhat reproducible example of the issue

Imagine a very simple function that takes a data table and sums a column with a name provided via the y argument, grouped by the column name provided via the by argument. An oversimplified definition and example use with the mtcars dataset could look as follows:

sumData <- function(dt, y, by) dt[, sum(get(y)), by = by]

mtcarsdt <- data.table::as.data.table(datasets::mtcars)
sumData(mtcarsdt, "disp", "gear")

##    gear     V1
## 1:    4 1476.2
## 2:    3 4894.5
## 3:    5 1012.4

So far so good, everything works great. Now we put our awesome function into a nice package called dtexample. Add some roxygen documentation, add data.table into Imports in our DESCRIPTION, try to install our package. All still works. Run R CMD check for good measure and get 0 errors, 0 warnings and 0 notes, like a boss!

Now let’s see our function in action, from within the new package:

dtexample::sumData(mtcarsdt, "disp", "gear")

Error in get(y) : object 'disp' not found

Oops. Something went wrong. Debugging such an issue can be tricky, especially if this happened in a more realistic setting, such as writing the function across multiple days and having a more complicated function than a one-liner. Most often the issue is inside the actual code, especially when passing around more complicated quoted expressions into data table’s subsetting machinery.

Traceback and datatable.verbose to the rescue

Let us look at the traceback() to get some insight into what is going on:

traceback()

## 5: get(y)
## 4: `[.data.frame`(x, i, j)
## 3: `[.data.table`(dt, , sum(get(y)), by = by) at sumData.R#12
## 2: dt[, sum(get(y)), by = by] at sumData.R#12
## 1: dtexample::sumData(dt, "disp", "gear")

Note the 4: despite the object being a data table (which is also confirmed by the third line of the traceback), the data frame method was called. It would also seem that this was deliberate on data table’s side. Let us turn on the datatable.verbose option and see what it has to say:

options(datatable.verbose = TRUE)
dtexample::sumData(mtcarsdt, "disp", "gear")

## cedta decided 'dtexample' wasn't data.table aware. Here is call stack with [[1L]] applied:
## [[1]]
## dtexample::sumData
## 
## [[2]]
## `[`
## 
## [[3]]
## `[.data.table`
## 
## [[4]]
## cedta

Traceback and cedta()

So what is this `cedta()`?

Looking at data table’s verbose output, we immediately notice this message:

cedta decided ‘dtexample’ wasn’t data.table aware. Here is call stack with [[1L]] applied:

So, what is this cedta() and why is it making such decisions? Let us look at how we get from subsetting a data table to a function deciding that our package is not data table aware. Examining the first rows of the body of data.table:::`[.data.table` we can see that the subset method first examines the output of cedta() and if its result is FALSE, calls the data frame methods. This answers our question of why a data frame method was called:

  if (!cedta()) {
    Nargs = nargs() - (!missing(drop))
    ans = if (Nargs < 3L) {
      `[.data.frame`(x, i)
    }
    else if (missing(drop)) 
      `[.data.frame`(x, i, j)
    else `[.data.frame`(x, i, j, drop)
    if (!missing(i) & is.data.table(ans)) 
      setkey(ans, NULL)
    return(ans)
  }

Now looking into data.table:::cedta() itself we see that in case topenv(parent.frame(n)) is not a namespace, cedta() happily returns TRUE. This explains why our function worked when it was defined and run from the global environment. However, in case we are in the context of a namespace, our namespace must satisfy at least one of eight conditions:

  ans = nsname == "data.table" || 
  "data.table" %chin% names(getNamespaceImports(ns)) ||
  (nsname == "utils" && exists(
    "debugger.look",
    parent.frame(n + 1L)
  )) ||
  (nsname == "base" && all(c("FUN", "X") %chin% ls(parent.frame(n)))) ||
  (nsname %chin% cedta.pkgEvalsUserCode && any(
    sapply(sys.calls(), function(x)
      is.name(x[[1L]]) && (x[[1L]] == "eval" || x[[1L]] == "evalq"))
    )
  ) ||
  nsname %chin% cedta.override ||
  isTRUE(ns$.datatable.aware) ||
  tryCatch(
    "data.table" %chin% get(
      ".Depends",
      paste("package", nsname, sep = ":"),
      inherits = FALSE
    ), error = function(e) FALSE
  )

Out of which the most relevant for us is:

"data.table" %chin% names(getNamespaceImports(ns))

When I first saw this, I was like (probably more than 50% of the sentence self-censored):

No way. I could not possibly be so stupid to forget to import data table in the NAMESPACE! (… of course I could)

So, about a minute later, place @import data.table into the roxygen tags, regenerate the NAMESPACE, re-install the package and all works great.
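For illustration, the fixed source file could look roughly like the sketch below. The documentation text is invented for this example; the @import data.table tag is the part that matters. The commented-out line at the end shows an alternative that corresponds to the isTRUE(ns$.datatable.aware) condition in the cedta() code above:

```r
#' Sum a column by group
#'
#' @param dt A data.table.
#' @param y Name of the column to sum, as a character string.
#' @param by Name of the column to group by, as a character string.
#'
#' @import data.table
#' @export
sumData <- function(dt, y, by) dt[, sum(get(y)), by = by]

# Alternative without the roxygen tag (an assumption about what you may
# prefer): declare awareness explicitly somewhere in the package's R code
# .datatable.aware <- TRUE
```

After regenerating the documentation with roxygen, the NAMESPACE file should contain the line import(data.table).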

How could I possibly fail to import anything from data.table and not find out earlier?

I think the reason (apart from plain forgetting the obvious) is a combination of the following:

the subsetting operator is such second nature, that it just did not occur to me to import it with the @importFrom tag and I rarely use @import on entire packages
R CMD check was successful with no notes, warnings or errors, again because even though I usually use qualified calls relatively strictly, qualifying the subsetting operator would seem very unnatural. There was therefore no mention of data.table:: in the entire code and the checking procedure had nothing to complain about
the data table method actually did dispatch correctly, so only after a closer look we see the data frame method kicking in. The first thing to investigate (most of the time correctly) is the actual implementation of what is going on with the expressions inside the subsetting operator, especially when passing around and evaluating quoted expressions

So, if you ever see cedta() making decisions about data table awareness, check your NAMESPACE. Maybe you have just missed the obvious as I did. Happy data tabling!

R Markdown: 3 sources of reproducibility issues and options how to tackle them

Sat, 02 Feb 2019 12:00:00 +0000

Introduction

R Markdown is a great tool to use for creating reports, presentations and even websites that contain evaluated and rendered code. This can help us immensely when presenting data science type of work to audiences, while still being able to version control the content creation process.

One of the challenges that remain is reproducibility of the rendered results. In this post, I will list a few sources of reproducibility issues I came across and how I tried to solve them. As an introductory disclaimer, this post is not an exhaustive guide but merely a retrospective on the issues I faced and how I tackled them when writing posts for this blog using blogdown.

For this post, we would consider an R Markdown document reproducible if we can be sure that it produces identical rendered output as long as the content of the Rmd document, the data used within it and the rendering function stay the same.

This sounds like a reasonable thing to ask, however, there are many ways in which this assumption can be broken. And they are not always trivial - that is, unless your name is Yihui Xie :)

My guess is that you upgraded Pandoc first, saw the diffs, then updated the rmarkdown package (from a very old version), which now defaults the html output format to html4 (which generates <div>) instead of Pandoc's default html5 (which generates <section>).
— Yihui Xie (@xieyihui) January 30, 2019

We will try to categorize some of the reasons into groups.

Output changes caused by code chunks not behaving reproducibly

The first group is the one that we have full control over, as it directly relates to the content of the code chunks in our R Markdown document.

Simple examples that showcase the issue

Obviously, the output can change each time we run this chunk:

```{r}
Sys.time()
```

Another scenario is code chunks that make use of random number generation. If we render an Rmd that contains this code chunk multiple times we will get different results, unless we take precautions:

```{r}
runif(5)
```

Running timing (benchmarking) code is almost certain to produce different results each time it is run, even though the benchmarked code is identical:

```{r}
system.time(runif(1e6))
```

Solution 1 - Remove output change source from the chunks

The most obvious and clean way to tackle the issue is to change our code such that the source of variability is removed. For random number generation, this can be achieved by setting a seed, e.g. using set.seed(). This solution can get more complex as the scope increases - if you are interested in reading more on the topic of reproducibility with RNG, look here.
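As a minimal sketch of the seeding approach, the snippet below draws the same five random numbers twice. The seed value 2019 is arbitrary and chosen only for this illustration:

```r
# Setting the same seed before each draw makes the draws identical
set.seed(2019)
first <- runif(5)

set.seed(2019)
second <- runif(5)

identical(first, second)
```

In an Rmd document, calling set.seed() in a chunk near the top is usually enough for simple cases, as long as the order of the random draws in the document does not change.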

Solution 2 - Run the code once and store the results

For some code, such as benchmarking, fixing the code such that the output does not change is very difficult in principle, therefore we must find a workaround that would ensure the results stay untouched. One approach is to run the code once, store the results and do not run the code again on render. Some ways to do that:

Using the cache=TRUE chunk option

In practice, this can be done nicely by using the cache=TRUE option, which provides this behavior and also makes sure that the cache is updated automatically when the code chunk changes, so the correctness of results is ensured. Exceptions to this exist for some special cases, read the details in the knitr manual for a deeper understanding. Here is an example chunk using that option:

```{r cache=TRUE}
system.time(runif(1e6))
```

Storing a needed representation of the object directly in the Rmd

One property of knitr’s caching that could be considered a downside is that the cache storage uses binary files, which while being a completely natural choice is not the best for version control. Especially when this is a concern and the code chunk outputs are small in size, other options may also be considered.

One such example would be to save a needed representation of the result directly in the Rmd and use eval=FALSE on the code chunk. This comes with trade-offs too, notably we must pay attention to changes in the chunk ourselves, as there would be no automated update similar to the one provided by knitr’s cache mechanism.

As an example, we could rewrite the chunk above into two chunks like so - the first chunk shows the code in the output without running it, the second makes sure that the result that we stored gets shown (but the code does not):

```{r eval=FALSE}
system.time(runif(1e6))
```

```{r echo=FALSE}
# This is pre-calculated and just shown to keep the output static

structure(
  c(0.081, 0.003, 0.084, 0, 0),
  class = "proc_time",
  .Names = c("user.self", "sys.self", "elapsed", "user.child", "sys.child")
)
```

The content of the second chunk can be obtained by using dput(system.time(runif(1e6))). Naturally, this may become quite impractical to use with bigger objects.
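To illustrate that the text produced by dput() really recreates the object, here is a small round-trip sketch. It uses the timing values shown above rather than a fresh system.time() call, so that the example itself stays reproducible; the variable names are made up for the illustration:

```r
# A stored timing result, using the values from the chunk above
timing <- structure(
  c(user.self = 0.081, sys.self = 0.003, elapsed = 0.084,
    user.child = 0, sys.child = 0),
  class = "proc_time"
)

# dput() prints R code that recreates the object; capture it as text
code_text <- paste(capture.output(dput(timing)), collapse = "\n")

# Evaluating that text gives back an identical object
restored <- eval(parse(text = code_text))
identical(timing, restored)
```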

Storing the rendered output directly in the Rmd

Another variation of the above would be to not even bother with obtaining the representation of the output via evaluating a code chunk, but just placing the rendered output itself into the Rmd directly. This comes with the same downsides as the previous approach with some extras. Mainly, we need to create a format-specific output, meaning this approach can be considered only if the output format is fixed and will not change.

For example, we can be reasonably sure that for a blogdown website the output will be HTML. In case we are ok with all those trade-offs, we can place the following into our .Rmd to represent the above code chunks in HTML:

<pre><code>
##    user  system elapsed 
##   0.081   0.003   0.084
</code></pre>

Output changes caused by different package versions

The tricky issues start when our Rmd content is actually reproducible under our current local setup, however the rendered output changes with a different setup, often with changed package versions. A concrete real-life example is when an update of the highcharter package slightly tweaks the output:

Slightly different highchart representation

Solution - Package version management, e.g. with packrat

Solving issues with package versions is a broad topic, so we will only mention one that is relatively easy to use, especially with RStudio - Packrat. Packrat is an R package that works by creating a separate library of packages on a project basis.

We can create separate R projects for Rmd files with shared package dependencies, or even a separate project per Rmd. This will ensure that we always have the intended set of packages loaded and used for them.

As all of the above, apart from the extra overhead of using it, this approach can have its caveats as well. Using packrat to manage a blogdown site or a bookdown book likely means that all of the site posts or book chapters will use a shared package library, which may not be granular enough for all use cases.

Using packrat with a project

Care also has to be taken to make sure that the packrat managed libraries are used when rendering the content, i.e. by ensuring that the packrat/init.R is sourced before the rendering happens.

Output changes caused by changed system dependencies

Going into deeper circles of the dreaded dependency hell, even if we manage our R packages with care, the system dependencies can still cause behavior that would change our output in an unintended way. In the case of R Markdown and knitr the most notable dependency of this type is Pandoc, the powerhouse behind R Markdown.

Why would this happen?

A way this dependency can change is for example when updating RStudio Server, which comes bundled with a certain version of Pandoc:

RStudio Server v1.1.453 comes with Pandoc 1.19.2.1
RStudio Server v1.2.1234 comes with Pandoc 2.3.1
You may have an even newer version of Pandoc installed if you got it separately. As of the date of writing this, the latest stable release of Pandoc is 2.5
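To find out which Pandoc version a given R session would actually use for rendering, the rmarkdown package can be asked directly. The sketch below is guarded so that it also runs where rmarkdown or Pandoc is not available; the variable name pandoc_in_use is made up for the illustration:

```r
# Ask rmarkdown which Pandoc it found, if any
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  pandoc_in_use <- as.character(rmarkdown::pandoc_version())
} else {
  pandoc_in_use <- NA_character_
}
pandoc_in_use
```

Printing this value somewhere in the rendered document, or in the rendering logs, makes it easier to spot when a changed Pandoc version is behind an unexpected diff.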

Solution - To containerization and beyond

Solving system dependencies is a tricky task made easier by a few tools but it is way out of the scope of this post, so we will only briefly list some options:

Containerize your environment with an implementation of Operating-system level virtualization such as the ever so popular Docker
Create a predefined VM setup, for example using configuration management software such as Ansible or Puppet

Create R Markdown reports and presentations even better with these 3 practical tips

Sat, 19 Jan 2019 12:00:00 +0000

Introduction

Including R Markdown in the workflow for presenting and publishing analyses that use code in R or other languages is a great way to make presentations, dashboards or reports good looking, reproducible and version controllable.

In this post, we will look at three simple ways to improve that workflow even further with methods that are lesser known and can make producing results with R Markdown more efficient and reviewing them more interactive.

Live preview of R Markdown files with xaringan’s infinite_moon_reader()

If you are familiar with R notebooks, you probably know that as you edit the notebook in RStudio and save, the preview will automatically update in the RStudio viewer. Similarly for blogdown users, the serve_site() function provides live updates of the blog as the content is edited and saved.

However, if you are producing presentation slides or a more complex html report with R Markdown, you are stuck with re-knitting every time you want to see the updated content in action. Enter the infinite_moon_reader() function from the xaringan package.

Even though the xaringan package focuses on creating slides with the remark.js JavaScript library, this function works to provide a live preview with any single-file html output, be it a report, slides such as ioslides, a shiny document or another format.

If using RStudio, all you need to do to get the live preview is call the function and the default values of the arguments will take care of launching the live preview of the document currently active in the RStudio editor. As if this was not handy enough, the package comes with a premade RStudio addin, so you can get the same functionality just clicking in the IDE, or assigning a keyboard shortcut to it.

It is really that easy

Here is how long it takes to get it up and running, package installation included:

Infinite moon reader

Kind of obviously, this functionality can become a huge time saver, especially if you are tweaking the design of your slides and want to see the results quickly without the need to click/call knit over and over again.

Creating beautiful, multi format reports directly from R scripts

When creating R Markdown documents, the workflow often looks something like the following:

Create a new .Rmd file, edit the metadata
Write some content
Add code chunks, test
Write some more content
Add some more code chunks, test
Rinse and repeat

This works, but when your goal is to first create functioning code that you can run as-is and share with others, creating an R Markdown file from such a script with that approach can become a time consuming and error-prone process of copy-pasting the code into code chunks and maintaining it in two places in case you want to also keep the runnable script version.

In comes knitr’s spin()

The solution to the above problem is very simple once you are aware of it. You can use knitr’s spin() function to produce a beautiful report directly from an R script, with advanced formatting and options still being available - via formatted comments and the function’s parameters.

This way we can keep the script fully runnable as comments do not interfere with running the code, and still be able to produce that nice output R Markdown is known for.

A quick example

A working example is worth more than explanations here, so here we go. Just copy the following, save for example into script.R and run knitr::spin("script.R"):

#' # This is just an R script
#' ## Rendered to a html report with knitr::spin()
#' * just by adding comments we can make a really nice output

#'
#' > And the code runs just like normal, eg. via `Rscript` after all
#' __comments__ are just *comments*.
#'
#' ## The report begins here
#+
knitr::kable(head(mtcars))

#' ## A chart
#+ fig.width=8, fig.height=8
heatmap(cor(mtcars))

#' ## Some tips
#'
#' 1. Optional chunk options are written using `#+`
#' 1. You can write comments between `/*` and `*/`

By default, the result will look something like the following:

Spinned it right round

Compile Report and RMarkdown’s `render()` vs. knitr’s `spin()`

We can achieve similar results in RStudio by clicking on File -> Compile Report..., which is equivalent to using rmarkdown::render() on an R script file. This will call spin() and add some metadata such as title, author and time to the output.

So why bother with spin() at all?

The default behavior of the functions mentioned above differs in some important ways. One difference for HTML output is that render() will by default include inline base64 representations of fonts and JavaScript sources, increasing the output file size from less than 20 KB to more than 600 KB even with the smallest amount of content.

This is why I personally like to call knitr::spin() to keep the output at smaller sizes by default, without having to dig into the options passed to pandoc.

Regardless of the technical details, being able to produce good looking reports directly from R scripts can save a lot of time and error-prone copying, while keeping the content and runnable code in one place, instead of copy-pasting into code chunks of an R Markdown file.

This is of course not to say that R Markdown files are not useful. To the contrary, they are great for many use cases. However, if the content is mostly code with some accompanying text, using spin() can come in really handy.

Advanced chunk options with useful effects

When working with R Markdown, the code chunk options provide helpful modifications to the chunk code’s behavior. The simple and widely used chunk options such as the following are well known, so we mention them only for quick reference:

eval=FALSE - do not evaluate the code in the chunk at all
echo=FALSE - do not show the chunk code in the output file
include=FALSE - run the code, but show neither the code nor its output in the output file
message=FALSE - do not show messages in the output file
warning=FALSE - do not show warnings in the output file
error=TRUE - do not prevent rendering on error and show error messages in the output

`results='asis'` to keep content generated by a chunk unprocessed

Especially when producing HTML output it may be helpful to create functions that produce output we want to include directly in the rendered document without any processing, such as HTML code produced by a pre-made function.

One example I use often is a function makeHighChart() that creates a lightweight JavaScript representation of a chart created via highcharter from an R object. The output of that function is HTML code that should be placed as-is into the output, for which the results='asis' chunk option is made:

# This chunk uses the results='asis' option like so:
# ```{r results='asis'}
# The result is an interactive chart:
jhaddins::makeHighChart(
  highcharter::hcboxplot(mtcars$hp),
  chartname = "examplechart",
  docat = TRUE
)

# This one does not use the results option, it is just
# ```{r}
# The result is not very useful printed HTML:
jhaddins::makeHighChart(
  highcharter::hcboxplot(mtcars$mpg),
  chartname = "examplechart",
  docat = TRUE
)

## <script type="text/javascript">
## $(function () {
##   $('#examplechart').highcharts({
##   title: {     
##     text: null     
##   },     
##   yAxis: {     
##     title: {     
##       text: null     
##     }     
##   },     
##   credits: {     
##     enabled: false     
##   },     
##   exporting: {     
##     enabled: false     
##   },     
##   plotOptions: {     
##     series: {     
##       label: {     
##         enabled: false     
##       },     
##       turboThreshold: 0,     
##       marker: {     
##         symbol: "circle"     
##       },     
##       showInLegend: false     
##     },     
##     treemap: {     
##       layoutAlgorithm: "squarified"     
##     }     
##   },     
##   chart: {     
##     type: "bar"     
##   },     
##   xAxis: {     
##     type: "category",     
##     categories: ""     
##   },     
##   series: [     
##     {     
##       name: null,     
##       data: [     
##         {     
##           name: null,     
##           low: 10.4,     
##           q1: 15.35,     
##           median: 19.2,     
##           q3: 22.8,     
##           high: 33.9     
##         }     
##       ],     
##       type: "boxplot",     
##       id: null     
##     }     
##   ]     
## }     
##   );
## });
## </script>
## 
## <div id="examplechart"></div>

`class.output="some_css_class"` to format chunk output with custom css

For HTML output, we may want to style it with our own css. This option allows us to use defined css classes to style the output produced by that chunk. This can be very convenient if we want to elegantly style some chunks in a different manner. We can also provide multiple css classes in a character vector instead of just one.
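
As a minimal sketch, the option can be tried out from an R script in the spin() format described earlier. The class name `highlighted-output` and its styling are made up for illustration, and the sketch assumes the script is rendered to HTML, for example with rmarkdown::render():

```r
#' <style>
#' .highlighted-output { background-color: #fff3cd; }
#' </style>

#' ## Output styled with a custom css class
#' The class used below is a hypothetical example, defined in the style block above.
#+ class.output="highlighted-output"
hp_summary <- summary(mtcars$hp)
hp_summary
```

Since the chunk option lives in a comment, the script still runs as plain R; the styling only takes effect in the rendered document.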

`cache=TRUE` to render faster and more reproducibly

In case your documents contain calculations that take a lot of time, or just cause unnecessary pain when re-executed with each render, for example when including benchmarking results in posts, it is very convenient to cache the chunk results. This will not only make the rendering faster, but also ensure that the results of the same code will stay the same in the output, even if we re-render the document.

Note that for keeping reproducibility when random number generation is included with caching results, it is advised to also include knitr::opts_chunk$set(cache.extra = knitr::rand_seed) in the document. More details on that are available here.

Resources

Here's why 2019 is a great year to start with R: A story of 10 year old R code then and now

Sat, 05 Jan 2019 12:00:00 +0000

Introduction

It has been more than ten years since I wrote my first R code. And in those years, the R world has changed dramatically, and mostly for the better. I believe that the current time may be one of the best times to start working with R.

In this new year’s post we will look at the R world 10 years ago and today, and provide links to many tools that helped it become a great language to solve and present everyday tasks with a welcoming community of users and developers.

My first exposure to R, more than ten years ago

The year was 2007 and I was studying probability and mathematical statistics at my faculty when one of the professors introduced us to R - a free programming language that we could use to solve many statistical tasks, from simple matrix operations and fitting models, to data visualization. This sounded great, as many other solutions that were traditionally used such as Stata or SPSS were not even free to use, let alone open source.

Now to get a bit of context, my most recent exposures to programming at that time were using Borland’s Delphi 7 and C++ Builder, both mature IDEs with very pleasant and advanced user interfaces and features, where you could literally have a Windows application with a nice UI ready, compiled and running in an hour.

Delphi 7, released in 2002

Rgui times

When I first opened the RGui it felt, well, slightly underwhelming:

Rgui rocking R version 2.6.1, released 26th Nov 2007

But why did you not use RStudio?

Well, the first beta version of RStudio was released about 3 years later in February 2011. By the way, those RStudio blog posts from 2011 still have comment sections available below them and I really enjoy reading through them. Anyway, I was stuck with the Rgui and it was not a pleasant experience. At that time, I disliked that experience so much, I still wrote some of the code in Delphi or C++ Builder.

Which currently popular packages existed?

But dplyr syntax makes everything so easy, why not use that?

Looking at the CRAN snapshots from the beginning of 2008, the latest released R version at that time was R-2.6.1 and there were around 1200 packages available on CRAN. At the time of writing of this post the number of packages available on CRAN reached 13600.

Looking at the top 40 most downloaded packages in the past month, only two of those packages existed on CRAN at that time - ggplot2 and digest - no filter, summarize or group_by for me back then.

StackOverflow, GitHub and Twitter communities

Why did you not just ask StackOverflow, Twitter or check GitHub ?

According to Wikipedia, StackOverflow was launched 15th September 2008 and GitHub on 10th of April 2008, so in the beginning of 2008 neither of today’s two giants even existed.

Not that I was using Twitter at that time, but the first #rstats tweet I was able to find is from 4th April 2009:

RT @ChrisAlbon @drewconway #rstats is the official R statistical language hashtag. #rstats (because #R doesn't cut it)
— brendan o'connor (@brendan642) April 4, 2009

For comparison, R itself was first released 29th of February 2000, a date easily remembered.

The growth of R

There are many ways to look at the growth of a programming language, and this is not meant to be a comprehensive and objective growth assessment. Rather, I took a look at 2 metrics I found interesting that show some trends in the R world.

If you are interested in the topic of programming language popularity, there are indices such as PYPL and TIOBE, and of course they have their criticisms.

Downloads of R packages

RStudio’s CRAN mirror provides a REST API from which we can look at and visualize the number of monthly downloads of R packages in the past 5 years. The chart speaks for itself:

Interest on StackOverflow

Another interesting point of view is the statistics on trends on StackOverflow, paraphrasing their blogpost:

When we see a rapid growth in the number of questions about a technology, it usually reflects a real change in what developers are using and learning.

And how does R look within the StackOverflow trends compared to other languages? Looks like the growth of R is so remarkable, even the data scientists at StackOverflow itself noticed and wrote a blogpost about it in 2017:

R now versus then - A much better world

Going back to that story of my first R code, I think time has made working with R much better than it was before in many ways. I will list just a few of the many reasons why, with links to relevant resources to follow:

Availability of free information and support is great

The amazing amount of free information readily available such as (tidyverse oriented) R for Data Science, or Advanced R books make R more accessible to learn and use
Communities of R users such as the one on StackOverflow make it easy to ask questions and get answers, the #rstats hashtag on Twitter is a good way to interact with the community
Many user and developer blogs on r-bloggers.com and curated selections of content on RWeekly.org can serve as an inspiration and overview of the news in the community

Software tools that make working with R efficient

Tools like RStudio make using R a much more pleasant experience compared to the original RGui, with many useful features and a Server version running in browser
Well documented R packages that make common data science tasks easier and/or more performant such as the popular tidyverse or data.table make it easier to start
R packages that support development, testing and documentation such as devtools, testthat and roxygen2 make R code efficient to develop, test and document
For portability, reproducibility and dependency management, tools such as packrat can make life less painful
Code repository managers such as GitHub, GitLab or others make it easy to share code, collaborate and even perform CI/CD tasks where necessary

Professionally presenting and publishing R results is simple

Tools like RMarkdown, Bookdown, Blogdown and others make it easy to publish the results of your work, be it an interactive dashboard, a paper in pdf, a presentation, even a book or a blog (such as this one)
Many packages for generating interactive charts, maps and animations such as highcharter, leaflet and more help create amazing data visualizations
Shiny takes it to the next level allowing for advanced interactive web applications

Mature interfaces to programming languages, file formats and more

R now has mature interfaces to many programming languages, software libraries, database systems and file formats, just a few examples include Rcpp, rJava, httr, openxlsx, XLConnect, highcharter, jsonlite, xml2, sparklyr and DBI

Guidance on packages per topic on CRAN

CRAN task views provide guidance on R packages per topic, such as Web Technologies and Services, High-Performance and Parallel Computing, Machine Learning & Statistical Learning and many more

If you really came for that ugly old code

I hope this post motivated you to dive a bit deeper into the R world and check some of the many amazing contributions created by developers and users in the R community mentioned above.

But if you really feel like having a good laugh first, feel free to check some of the oldest R scripts I was able to find unedited on GitLab here.

They date somewhere to the end of 2007/beginning of 2008 and, for what it’s worth, should still be runnable.

Thank you for reading and
have a happy new yeaR

5 amazing free tools that can help with publishing R results and blogging

Sat, 22 Dec 2018 12:00:00 +0000

Introduction

It is Christmas time! And what better time than this to write about the great tools that are available to all who like R and would like to publish their R work or even blog about it. This post is meant as a praise to the tools that are helping me to write this blog and make it a very nice experience, allowing me to focus on the content.

In this post we will praise 5 free tools that can help anyone make blogging about R or publishing results of R work a pleasant experience.

RStudio + R Markdown to prepare the content

The first is probably the most obvious, but still worth mentioning. The RStudio IDE is a good productivity tool for all R-related work, however the integration with R Markdown makes it the default environment for me to write the blog posts. I especially enjoy using RStudio Server, which makes it easy to have one aligned environment regardless of where you are (as long as there is an internet connection ;) and what computer you are using (as long as there is a recent version of a web browser).

I often find myself even editing the css style sheets, HTML partials and JavaScript within RStudio itself. And probably the best thing about it is that in combination with Blogdown, you can see all those changes instantly as you make them in RStudio’s Viewer. Using RStudio’s Terminal for the necessary git commands makes RStudio Server a unified tool with all that I need for almost all of the work.

As an honorary mention, R Markdown would probably not be possible without the powerhouse behind it - Pandoc. Pandoc is an open-source document converter, widely used as a writing tool and as a basis for publishing workflows, and also used as a backend by knitr, which is in turn used by R Markdown to generate the rendered outputs from R Markdown documents.

Blogdown (+ Hugo) to make it a nice blog

To write and review the posts for this blog, I almost exclusively use the combination of the RStudio (Server) IDE and the Blogdown package by Yihui Xie. As most readers probably know, both of those are free and very easy to set up. If you have never heard of Blogdown, it is an open-source R package to generate static websites based on R Markdown and Hugo.

But that short explanation does not really do it justice, so you may want to check out Awesome Blogdown for a curated list of blogs built using blogdown. If you want to learn more, there is even a free online book written by the authors of blogdown. In terms of the design, you can find hundreds of themes to choose from in the Hugo theme gallery. This blog uses a customized natrium theme, a simple responsive blog theme for Hugo based on the Lithium theme.

GitLab and GitLab Pages to version control and publish it

GitLab pages enable us to create websites for our GitLab projects, groups, or user account using any static website generator, Hugo included. Since GitLab is my repository manager of choice, allowing for free private repositories, integration with the pages comes very naturally and easily. Essentially all that is needed to make it work is a .gitlab-ci.yml file similar to this one.

There is even a full example of a Hugo page available to see what it may look like, with a nice readme. For advanced use, it is also possible to connect your custom domain and TLS certificates and host the websites on your own GitLab instance. On GitLab.com, the hosting of the sites is free. To read more documentation and watch video tutorials, just click here.

Highcharts and the highcharter R package for interactive charts

I have been using highcharts for interactive charting for projects for years and was very excited when the first version of the highcharter R package providing an interface between R and highcharts arrived on CRAN in 2016.

For the relatively rare occurrences when I need a chart included in a blog post I happily use highcharter mainly thanks to the amazing variability and ease of use provided by the now very mature highcharts JavaScript library. For a taste, just look at the highcharts demo. And yes, they can do pretty highmaps too.

ScreenToGif for moving screen captures

ScreenToGif is an open source tool that allows you to record a selected area of your screen, edit and save it as a gif or video. I find screen recording and showing it as a gif one of the best ways to easily show examples without the need to record a video, which takes much more effort, and this tool does just that very conveniently. You can download it for free from its website and take a look at the code in the GitHub repo.

Resources

RStudio desktop and RStudio Server
R Markdown on GitHub
Blogdown on GitHub
Blogdown and GitLab Pages
Hugo, one of the most popular open-source static site generators
Hugo theme gallery
GitLab pages, a feature that allows you to publish static websites directly from a repository in GitLab
Highcharts makes it easy for developers to set up interactive charts in their web pages
highcharter is an R wrapper for Highcharts JavaScript library and its modules
ScreenToGif, a screen, webcam and sketchboard recorder with an integrated editor.

Thank you for reading and
have a very merry Christmas :o)

How to sort data by one or more columns with base R, dplyr and data.table

Sat, 08 Dec 2018 12:00:00 +0000

Introduction

In this post in the R:case4base series we will examine sorting (ordering) data in base R. We will learn to sort our data based on one or multiple columns, with ascending or descending order and as always look at alternatives to base R, namely the tidyverse’s dplyr and data.table to show how we can achieve the same results.

It is recommended to first have a look at the post on subsetting to understand the concepts underlying the sorting process in more detail.

How to use this article

This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a case4base after all!
If you have no time for reading, you can click here to get just the code with commentary

First, let’s read in yearly data on gross disposable income of households in the EU countries into R (click here to download):

gdi <- read.csv(
  url("https://jozef.io/post/data/ESA2010_GDI.csv"),
  stringsAsFactors = FALSE
)
head(gdi[, 1:6, drop = FALSE])

##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0

Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.

The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.

Subsetting as a mechanism for sorting data

Sorting a data frame is loosely coupled with subsetting. To get the rows of a data frame in reverse order, we can just subset the rows with an index that goes from the last row to the very first (or, more safely, the zeroth) like so:

gdi_reversed_rows <- gdi[nrow(gdi):0, ]
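
To see why the zeroth index is the safer choice, here is a small sketch on a toy data frame (the names `toy` and `empty` are made up for illustration). An index of 0 is silently dropped when subsetting, so it does no harm for a regular data frame, while for a data frame with no rows it avoids creating a spurious row of NAs:

```r
toy <- data.frame(id = 1:3, value = c(10, 20, 30))

# Index 0 is ignored, so 3:0 selects rows 3, 2, 1
toy_reversed <- toy[nrow(toy):0, , drop = FALSE]
toy_reversed$id
## [1] 3 2 1

# A data frame with the same columns but zero rows
empty <- toy[0, , drop = FALSE]

# 0:0 selects nothing, so the result stays empty
rows_with_zero <- nrow(empty[nrow(empty):0, , drop = FALSE])
rows_with_zero
## [1] 0

# 0:1 asks for row 1, which does not exist, so we get a row of NAs
rows_with_one <- nrow(empty[nrow(empty):1, , drop = FALSE])
rows_with_one
## [1] 1
```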

We can take a very similar approach to reverse order the columns:

gdi_reversed_cols <- gdi[, ncol(gdi):0]

Or both rows and columns at the same time. We also add drop = FALSE for safety here; we omitted it in the two examples above for readability:

gdi_reversed <- gdi[nrow(gdi):0, ncol(gdi):0, drop = FALSE]
head(gdi_reversed)

##     Y.2016  Y.2015    Y.2014    Y.2013    Y.2012    Y.2011    Y.2010
## 35      NA      NA        NA        NA        NA        NA        NA
## 34      NA 1631795 1438281.4 1268729.8 1081744.9  971545.3  807128.5
## 33  458641  447094  449119.3  437596.6  428131.2  420404.9  412363.1
## 32 1627136 1606745 1496128.0 1419380.0 1347970.0 1272065.0 1204442.0
## 31      NA      NA 1055733.5  980494.9  934077.3  872900.3  798916.7
## 30 1330854 1298475 1269177.0 1219699.0 1195227.0 1160813.0 1151812.0
##       Y.2009    Y.2008    Y.2007    Y.2006   Y.2005   Y.2004   Y.2003
## 35        NA        NA        NA        NA       NA       NA       NA
## 34  689431.6        NA        NA        NA       NA       NA       NA
## 33  404446.9  399834.1  389468.0  368868.0 352620.1 341709.9 337742.9
## 32 1150829.0 1105563.0 1021911.0  943515.0 975153.0 894892.0 854026.0
## 31  858678.9  909995.1  827339.5  681058.3 631210.9 536194.9 478645.8
## 30 1101109.0 1080225.0 1063178.0 1005630.0 966175.0 926670.0 893528.0
##      Y.2002   Y.2001   Y.2000   Y.1999   Y.1998   Y.1997   Y.1996   Y.1995
## 35       NA       NA       NA       NA       NA       NA       NA       NA
## 34       NA       NA       NA       NA       NA       NA       NA       NA
## 33 335845.6 336581.4 326269.3 312478.7 303239.5 296324.6 291208.4 287865.4
## 32 800130.0 727228.0 704697.0 660196.0 630865.0 582597.0 549694.0 522981.0
## 31 447572.6 400145.0 369181.0       NA       NA       NA       NA       NA
## 30 857352.0 829908.0 789615.0 737419.0 715396.0 691951.0 656455.0 618959.0
##           country
## 35         Serbia
## 34         Turkey
## 33    Switzerland
## 32         Norway
## 31        Iceland
## 30 United Kingdom

Sorting data by contents of a column

To order the rows (countries) by GDI in 2016, we use the function order, which finds the permutation that rearranges the values into ascending order and save that order into a variable called rowidx. Then we simply use rowidx to subset the rows of gdi in the order we wanted:

rowidx <- order(gdi[, "Y.2016"])
rowidx

##  [1] 13  8 16 18 17 26 27  4  9 28 24 22  3 21 33 11  6 23 14 30 12 32  7
## [24] 29  5  2  1 10 15 19 20 25 31 34 35

gdi_sorted <- gdi[rowidx, , drop = FALSE]

# We can of course do it in one go:
gdi_sorted <- gdi[order(gdi[, "Y.2016"]), , drop = FALSE]

# Look at the 2 relevant columns of the result 
gdi_sorted[, c(1, 23)]

##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA
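
On a smaller scale, the relationship between order() and subsetting can be sketched with a made-up vector (the names `x` and `idx` are purely illustrative). The result of order() is not the sorted data itself, but the positions of the elements in the order in which they should be taken:

```r
x <- c(30, 10, 20)

# Positions of the elements from smallest to largest
idx <- order(x)
idx
## [1] 2 3 1

# Subsetting by those positions yields the sorted vector
x[idx]
## [1] 10 20 30

# For a plain vector this gives the same result as sort()
identical(x[idx], sort(x))
## [1] TRUE
```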

To order in descending order, we can use decreasing = TRUE; to see NAs first, we can use na.last = FALSE:

rowidx <- order(gdi[, "Y.2016"], decreasing = TRUE, na.last = FALSE)
gdi[rowidx, c(1, 23), drop = FALSE]

##           country     Y.2016
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00
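
The na.last argument has a third setting not used above: according to the documentation of order(), na.last = NA removes the missing values from the result altogether. A short sketch with a made-up vector `x`:

```r
x <- c(2, NA, 3, 1)

# Descending, with missing values first
idx_na_first <- order(x, decreasing = TRUE, na.last = FALSE)
idx_na_first
## [1] 2 3 1 4

# Ascending, with missing values dropped from the index
idx_no_na <- order(x, na.last = NA)
x[idx_no_na]
## [1] 1 2 3
```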

Sorting by multiple vectors with different order

That looks good, but we may want to order the rows that have NA as GDI in 2016 alphabetically by country (or generalize even further). To use multiple vectors for ordering is also very simple:

rowidx <- order(gdi[, "Y.2016"], gdi[, "country"])
gdi[rowidx, c(1, 23), drop = FALSE]

##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 15         Cyprus         NA
## 10         Greece         NA
## 19        Hungary         NA
## 31        Iceland         NA
## 20          Malta         NA
## 25        Romania         NA
## 35         Serbia         NA
## 34         Turkey         NA
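
One possible way to generalize this further is to sort by an arbitrary set of columns given by name. The helper `sort_by_columns()` below is a hypothetical function written for this illustration, not part of base R, and the `toy` data is made up. It relies on do.call() to pass each selected column to order() as a separate argument:

```r
# Hypothetical helper: sort a data frame by the columns named in `cols`
sort_by_columns <- function(df, cols) {
  # unname() so that column names are not mistaken for arguments of order()
  keys <- unname(as.list(df[, cols, drop = FALSE]))
  df[do.call(order, keys), , drop = FALSE]
}

toy <- data.frame(
  score = c(2, 1, 2, 1),
  name = c("b", "d", "a", "c"),
  stringsAsFactors = FALSE
)

# Sort by score first, ties are resolved by name
sorted <- sort_by_columns(toy, c("score", "name"))
sorted$name
## [1] "c" "d" "a" "b"
```

With the data from this post, sort_by_columns(gdi, c("Y.2016", "country")) should give the same row order as the example above.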

To order by multiple columns in different orders, for numeric vectors we can use a simple -, since a negated numeric vector will order in reverse order. To order our GDI dataset by GDI in 2016 descending and then by country alphabetically:

rowidx <- order(-gdi[, "Y.2016"], gdi[, "country"])
gdi[rowidx, c(1, 23), drop = FALSE]

##           country     Y.2016
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00
## 15         Cyprus         NA
## 10         Greece         NA
## 19        Hungary         NA
## 31        Iceland         NA
## 20          Malta         NA
## 25        Romania         NA
## 35         Serbia         NA
## 34         Turkey         NA
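As a small self-contained check of the negation trick, the sketch below uses a made-up vector `x` (an assumption for illustration, not part of the GDI data). It suggests that negating a numeric vector gives the same index as `decreasing = TRUE`, and that missing values still end up last because `na.last = TRUE` is the default of `order`.

```r
# Hypothetical toy vector, not part of the GDI dataset
x <- c(3, NA, 1, 2)

# Negating the values reverses the order of the non-missing elements
idx_neg <- order(-x)

# The same index obtained via the decreasing argument
idx_dec <- order(x, decreasing = TRUE)

# In both cases the missing value is placed last (na.last = TRUE by default)
x[idx_neg]
identical(idx_neg, idx_dec)
```

The negation approach pays off mainly when several keys need different directions, as in the example above, because each key can be negated independently.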

For non-numeric vectors, we can take advantage of the xtfrm function, which returns a numeric vector which will sort in the same order as the one provided to it. Then we just use - to get a vector that will order in reverse order. To order our GDI dataset by GDI ascending in 2016 and then by country reverse-alphabetically:

rowidx <- order(gdi[, "Y.2016"], -xtfrm(gdi[, "country"]))
gdi[rowidx, c(1, 23), drop = FALSE]

##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 34         Turkey         NA
## 35         Serbia         NA
## 25        Romania         NA
## 20          Malta         NA
## 31        Iceland         NA
## 19        Hungary         NA
## 10         Greece         NA
## 15         Cyprus         NA
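To see what xtfrm actually returns, here is a minimal sketch on a made-up data frame `toy` (a hypothetical example, not the GDI data). It also shows a possible alternative: with `method = "radix"`, `order` accepts one `decreasing` flag per sorting key. Note that the radix method compares strings in the C locale, so results may differ from the xtfrm approach for mixed-case or non-ASCII text.

```r
# Hypothetical toy data, not the GDI dataset
toy <- data.frame(
  country = c("B", "A", "C", "D"),
  value   = c(2, 2, 1, 3),
  stringsAsFactors = FALSE
)

# xtfrm returns numbers that sort in the same order as the strings
xtfrm(toy$country)

# value ascending, country descending, using the negated xtfrm
idx_xtfrm <- order(toy$value, -xtfrm(toy$country))

# Alternative: the radix method accepts one decreasing flag per key
idx_radix <- order(
  toy$value, toy$country,
  decreasing = c(FALSE, TRUE),
  method = "radix"
)

toy[idx_xtfrm, ]
identical(idx_xtfrm, idx_radix)
```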

Alternatives to base R

Using the tidyverse

The dplyr package comes with a set of user-friendly functions that are especially convenient in an interactive setting where we know the column names up front, so we can take advantage of non-standard evaluation:

library(dplyr)
gdi %>% 
  arrange(Y.2016, desc(country)) %>% 
  select(1, 23)

##           country     Y.2016
## 1         Croatia       0.00
## 2         Estonia   12548.30
## 3          Latvia   15737.79
## 4      Luxembourg   20155.80
## 5       Lithuania   24743.49
## 6        Slovenia   24756.63
## 7        Slovakia   48882.91
## 8        Bulgaria   60237.00
## 9         Ireland   97318.90
## 10        Finland  126590.00
## 11       Portugal  128789.39
## 12        Austria  214980.60
## 13        Belgium  243825.50
## 14    Netherlands  357383.00
## 15    Switzerland  458641.00
## 16          Spain  698701.00
## 17        Denmark 1091542.00
## 18         Poland 1136916.00
## 19          Italy 1142273.40
## 20 United Kingdom 1330854.00
## 21         France 1425435.00
## 22         Norway 1627136.00
## 23        Germany 2019917.00
## 24         Sweden 2402587.00
## 25 Czech Republic 2523229.00
## 26   Euro area 19 6736686.43
## 27          EU 28 9454683.60
## 28         Turkey         NA
## 29         Serbia         NA
## 30        Romania         NA
## 31          Malta         NA
## 32        Iceland         NA
## 33        Hungary         NA
## 34         Greece         NA
## 35         Cyprus         NA

If we need to provide the names of the columns instead, we can use arrange_at:

gdi %>% 
  arrange_at("country", desc) %>%
  arrange_at("Y.2016") %>%
  select(1, 23)

##           country     Y.2016
## 1         Croatia       0.00
## 2         Estonia   12548.30
## 3          Latvia   15737.79
## 4      Luxembourg   20155.80
## 5       Lithuania   24743.49
## 6        Slovenia   24756.63
## 7        Slovakia   48882.91
## 8        Bulgaria   60237.00
## 9         Ireland   97318.90
## 10        Finland  126590.00
## 11       Portugal  128789.39
## 12        Austria  214980.60
## 13        Belgium  243825.50
## 14    Netherlands  357383.00
## 15    Switzerland  458641.00
## 16          Spain  698701.00
## 17        Denmark 1091542.00
## 18         Poland 1136916.00
## 19          Italy 1142273.40
## 20 United Kingdom 1330854.00
## 21         France 1425435.00
## 22         Norway 1627136.00
## 23        Germany 2019917.00
## 24         Sweden 2402587.00
## 25 Czech Republic 2523229.00
## 26   Euro area 19 6736686.43
## 27          EU 28 9454683.60
## 28         Turkey         NA
## 29         Serbia         NA
## 30        Romania         NA
## 31          Malta         NA
## 32        Iceland         NA
## 33        Hungary         NA
## 34         Greece         NA
## 35         Cyprus         NA

Using data.table

There are multiple ways to achieve the desired results with data.table. The one syntactically most similar to base R is:

library(data.table)
gdidt <- as.data.table(gdi)
gdidt[order(Y.2016, -country), c(1, 23)]

##            country     Y.2016
##  1:        Croatia       0.00
##  2:        Estonia   12548.30
##  3:         Latvia   15737.79
##  4:     Luxembourg   20155.80
##  5:      Lithuania   24743.49
##  6:       Slovenia   24756.63
##  7:       Slovakia   48882.91
##  8:       Bulgaria   60237.00
##  9:        Ireland   97318.90
## 10:        Finland  126590.00
## 11:       Portugal  128789.39
## 12:        Austria  214980.60
## 13:        Belgium  243825.50
## 14:    Netherlands  357383.00
## 15:    Switzerland  458641.00
## 16:          Spain  698701.00
## 17:        Denmark 1091542.00
## 18:         Poland 1136916.00
## 19:          Italy 1142273.40
## 20: United Kingdom 1330854.00
## 21:         France 1425435.00
## 22:         Norway 1627136.00
## 23:        Germany 2019917.00
## 24:         Sweden 2402587.00
## 25: Czech Republic 2523229.00
## 26:   Euro area 19 6736686.43
## 27:          EU 28 9454683.60
## 28:         Turkey         NA
## 29:         Serbia         NA
## 30:        Romania         NA
## 31:          Malta         NA
## 32:        Iceland         NA
## 33:        Hungary         NA
## 34:         Greece         NA
## 35:         Cyprus         NA
##            country     Y.2016

Another option is to take advantage of the setorderv method provided by data.table. The important distinction is that this will sort the existing data.table in place, changing the source object. The other methods used above leave the source object untouched:

# This will sort the gdidt by reference - changing the input object
setorderv(gdidt, c("Y.2016", "country"), c(1, -1), na.last = TRUE)
# So we now just subset the (already sorted) gdidt
gdidt[, c(1, 23)]

##            country     Y.2016
##  1:        Croatia       0.00
##  2:        Estonia   12548.30
##  3:         Latvia   15737.79
##  4:     Luxembourg   20155.80
##  5:      Lithuania   24743.49
##  6:       Slovenia   24756.63
##  7:       Slovakia   48882.91
##  8:       Bulgaria   60237.00
##  9:        Ireland   97318.90
## 10:        Finland  126590.00
## 11:       Portugal  128789.39
## 12:        Austria  214980.60
## 13:        Belgium  243825.50
## 14:    Netherlands  357383.00
## 15:    Switzerland  458641.00
## 16:          Spain  698701.00
## 17:        Denmark 1091542.00
## 18:         Poland 1136916.00
## 19:          Italy 1142273.40
## 20: United Kingdom 1330854.00
## 21:         France 1425435.00
## 22:         Norway 1627136.00
## 23:        Germany 2019917.00
## 24:         Sweden 2402587.00
## 25: Czech Republic 2523229.00
## 26:   Euro area 19 6736686.43
## 27:          EU 28 9454683.60
## 28:         Turkey         NA
## 29:         Serbia         NA
## 30:        Romania         NA
## 31:          Malta         NA
## 32:        Iceland         NA
## 33:        Hungary         NA
## 34:         Greece         NA
## 35:         Cyprus         NA
##            country     Y.2016

Quick benchmarking

For a quick overview, let's look at a basic benchmark of each of the mentioned methods, without the package loading overhead. To do the benchmarking, we will use a very slightly modified flights data frame provided by Hadley Wickham’s nycflights13 package.

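# Assumed setup (not shown in the original post): data.frame and data.table copies of flights
library(dplyr)
library(data.table)
flights   <- as.data.frame(nycflights13::flights)
flightsdt <- as.data.table(flights)

bench <- microbenchmark::microbenchmark(times = 100,
  base_order   = {flights[order(flights[, "flight"], -xtfrm(flights[, "carrier"])), ] },
  dt_order     = {flightsdt[order(flight, -carrier), ] },
  dplyr_nse    = {flights %>% arrange(flight, desc(carrier)) },
  dplyr_scoped = {flights %>% arrange_at("carrier", desc) %>% arrange_at("flight") }
)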

Under our particular circumstances, base R’s method seems to be the slowest of the options with data.table being the fastest.

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

References

How to work with strings in base R - An overview of 20+ methods for daily use

Sat, 24 Nov 2018 12:00:00 +0000

Introduction

In this post in the R:case4base series we will look at string manipulation with base R, and provide an overview of a wide range of functions for our string working needs.

We will use simple examples to learn to perform basic string operations, concatenate strings, work with substrings, switch cases, quote, find and replace within strings and more. Some interesting bonuses will also be included.

As always, some popular alternatives to base R will also be suggested and many useful references provided for further reading.

Quick overview of the very basics

This post aims to serve as an overview of the functionality provided by base R to work with strings. Note that the term “string” is used somewhat loosely and refers to both character vectors and character strings. In R documentation, references to a character string refer to a character vector of length 1.

Also since this is an overview, we will not examine the details of the functions, but rather list examples with simple, intuitive explanations trading off technical precision.

# String constants can be assigned using
# double quotes 
a <- "this is a character string"
# or single quotes 
b <- 'this is a character string, too'
# To use literal quotes, we can escape with `\`: 
c <- "this is \"it\""

# To make a character vector with multiple elements:
d <- c("this", "vector", "has", "five", "elements")

# To get the length of a character vector
# (how many elements are in a character vector)
length(d)

## [1] 5

# To get the number of characters in elements of a vector
# ("how many characters in each of the elements")
nchar(d)

## [1] 4 6 3 4 8

# To create a missing character value
NA_character_

## [1] NA

# To test if an object is a character vector
is.character("is this a character vector?")

## [1] TRUE

# To convert other objects to character vectors
# Can surprise the unwary
as.character(c(
  42,
  Sys.time(),
  factor("A", levels = LETTERS)
))

## [1] "42"         "1543050000" "1"
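The surprise in the result above comes from c(), which strips the attributes of its arguments before as.character ever sees them: the time becomes a number of seconds and the factor becomes its integer code. A minimal sketch of the factor case, using only base R (the variable name `f` is ours):

```r
f <- factor("A", levels = LETTERS)

# Combined with a number via c(), the factor is reduced to its integer code
as.character(c(42, f))

# Converted on its own, the factor yields its label
as.character(f)

# Converting element by element keeps the label
vapply(list(42, f), as.character, character(1))
```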

# One of the ways to output a vector is `cat`
cat("Show me this")

## Show me this

# To include line breaks use `"\n"`
# To include tabs use `"\t"`:
cat("Break\ta\ta\nline")

## Break    a   a
## line

# When in doubt about an object
# str or summary may help
weirdList <- list(
  "What is this?",
  Sys.time(),
  b = 5L,
  c = c("one", 2),
  d = factor(c("red", "blue")),
  e = NA_character_,
  f = NA_integer_
)

str(weirdList)

## List of 7
##  $  : chr "What is this?"
##  $  : POSIXct[1:1], format: "2018-11-24 09:00:00"
##  $ b: int 5
##  $ c: chr [1:2] "one" "2"
##  $ d: Factor w/ 2 levels "blue","red": 2 1
##  $ e: chr NA
##  $ f: int NA

summary(weirdList)

##   Length Class   Mode     
##   1      -none-  character
##   1      POSIXct numeric  
## b 1      -none-  numeric  
## c 2      -none-  character
## d 2      factor  numeric  
## e 1      -none-  character
## f 1      -none-  numeric

String concatenation

String concatenation is the process of “joining” two strings together and is one of the most common string operations.

Simple concatenation

# We will use these vectors for our examples:
1:3

## [1] 1 2 3

month.name

##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"

# Use paste to concatenate
# R recycles 1:3 4 times to fit the length of month.name
paste(1:3, month.name)

##  [1] "1 January"   "2 February"  "3 March"     "1 April"     "2 May"      
##  [6] "3 June"      "1 July"      "2 August"    "3 September" "1 October"  
## [11] "2 November"  "3 December"

# Specify the sep argument to 
# separate the elements differently
paste(1:3, month.name, sep = ": ")

##  [1] "1: January"   "2: February"  "3: March"     "1: April"    
##  [5] "2: May"       "3: June"      "1: July"      "2: August"   
##  [9] "3: September" "1: October"   "2: November"  "3: December"

# A shorthand for sep = ""
paste0(1:3, month.name)

##  [1] "1January"   "2February"  "3March"     "1April"     "2May"      
##  [6] "3June"      "1July"      "2August"    "3September" "1October"  
## [11] "2November"  "3December"

# Alternatively, sprintf is very useful
sprintf("%s: %s", 1:3, month.name)

##  [1] "1: January"   "2: February"  "3: March"     "1: April"    
##  [5] "2: May"       "3: June"      "1: July"      "2: August"   
##  [9] "3: September" "1: October"   "2: November"  "3: December"

Concatenate a vector into a single character string

# Provide the collapse argument to paste
# to get a character string (length 1 vector):
paste(1:3, month.name, sep = ": ", collapse = ", ")

## [1] "1: January, 2: February, 3: March, 1: April, 2: May, 3: June, 1: July, 2: August, 3: September, 1: October, 2: November, 3: December"

# Or, use toString
toString(paste(1:3, month.name, sep = ": "))

## [1] "1: January, 2: February, 3: March, 1: April, 2: May, 3: June, 1: July, 2: August, 3: September, 1: October, 2: November, 3: December"

String manipulation and properties

String lengths

# How many elements does a vector have?
length(month.name)

## [1] 12

# To get the number of characters in elements of a vector
# ("how many characters in each of the elements?")
nchar(month.name)

##  [1] 7 8 5 5 3 4 4 6 9 7 8 8

# Are the elements non-empty strings?
nzchar(month.name)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Switching to upper/lower case

# Switch to all lower case
tolower(month.name)

##  [1] "january"   "february"  "march"     "april"     "may"      
##  [6] "june"      "july"      "august"    "september" "october"  
## [11] "november"  "december"

# Switch to all upper case
toupper(month.name)

##  [1] "JANUARY"   "FEBRUARY"  "MARCH"     "APRIL"     "MAY"      
##  [6] "JUNE"      "JULY"      "AUGUST"    "SEPTEMBER" "OCTOBER"  
## [11] "NOVEMBER"  "DECEMBER"

# Casefold is a wrapper for S-PLUS compatibility
casefold(month.name, upper = FALSE)

##  [1] "january"   "february"  "march"     "april"     "may"      
##  [6] "june"      "july"      "august"    "september" "october"  
## [11] "november"  "december"

casefold(month.name, upper = TRUE)

##  [1] "JANUARY"   "FEBRUARY"  "MARCH"     "APRIL"     "MAY"      
##  [6] "JUNE"      "JULY"      "AUGUST"    "SEPTEMBER" "OCTOBER"  
## [11] "NOVEMBER"  "DECEMBER"

# Also, custom translation:
chartr("OIZEASGTC", "01234567(", toupper(month.name))

##  [1] "J4NU4RY"   "F3BRU4RY"  "M4R(H"     "4PR1L"     "M4Y"      
##  [6] "JUN3"      "JULY"      "4U6U57"    "53P73MB3R" "0(70B3R"  
## [11] "N0V3MB3R"  "D3(3MB3R"

Removing white spaces

# Remove all leading and trailing whitespaces
trimws(" This has trailing spaces.  ")

## [1] "This has trailing spaces."

# Remove leading whitespaces
trimws(" This has trailing spaces.  ", which = "left")

## [1] "This has trailing spaces.  "

# Remove trailing whitespaces
trimws(" This has trailing spaces.  ", which = "right")

## [1] " This has trailing spaces."

Encoding conversion

# Convert a character vector between encodings
iconv("šibrinkuje", "UTF-8", "ASCII", "?")

## [1] "??ibrinkuje"

Quoting

# Quoting text for fancier printing:
sQuote(month.name)

##  [1] "'January'"   "'February'"  "'March'"     "'April'"     "'May'"      
##  [6] "'June'"      "'July'"      "'August'"    "'September'" "'October'"  
## [11] "'November'"  "'December'"

dQuote(month.name)

##  [1] "\"January\""   "\"February\""  "\"March\""     "\"April\""    
##  [5] "\"May\""       "\"June\""      "\"July\""      "\"August\""   
##  [9] "\"September\"" "\"October\""   "\"November\""  "\"December\""

# Not to be confused with quoting strings for passing to OS shell
system(paste("echo", shQuote("Weird\nstuff")))

# Also not to be confused with quoting expressions
str(quote(1 + 1))

##  language 1 + 1

Retrieving and working with substrings

# Get the first three characters from all the month.names
substr(month.name, 1, 3)

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"

# Get the last three characters from all the month.names
substr(month.name, nchar(month.name) - 2, nchar(month.name))

##  [1] "ary" "ary" "rch" "ril" "May" "une" "uly" "ust" "ber" "ber" "ber"
## [12] "ber"

# Wrapper around substr for S compatibility:
substring(month.name, 1, 3)

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"

# Check whether elements start with a string
startsWith(month.name, "J")

##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [12] FALSE

# Check whether elements end with a string
endsWith(month.name, "ember")

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [12]  TRUE

# Trim character strings to specified display widths.
strtrim(month.name, 3)

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"

# Abbreviate strings to at least minlength characters
abbreviate(month.name, minlength = 3)

##   January  February     March     April       May      June      July 
##     "Jnr"     "Fbr"     "Mrc"     "Apr"     "May"     "Jun"     "Jly" 
##    August September   October  November  December 
##     "Ags"     "Spt"     "Oct"     "Nvm"     "Dcm"

Basic pattern matching and replacement

Pattern matching and replacement using regular expressions is an extremely powerful feature, however covering it in detail is out of scope of this overview.

Check the references for better resources if you are interested. A lot more useful detail can also be found in R’s documentation.

The following is just to show very basic use and list useful functions.

Replace substring with other strings

myStrings <- paste(1:3, month.name, sep = ". ")

# Replace all ones with zeros:
# fixed will match the first argument as is
gsub("1", "0", myStrings, fixed = TRUE)

##  [1] "0. January"   "2. February"  "3. March"     "0. April"    
##  [5] "2. May"       "3. June"      "0. July"      "2. August"   
##  [9] "3. September" "0. October"   "2. November"  "3. December"

# Replace only the first "a" in each element with "A"
sub("a", "A", myStrings, fixed = TRUE)

##  [1] "1. JAnuary"   "2. FebruAry"  "3. MArch"     "1. April"    
##  [5] "2. MAy"       "3. June"      "1. July"      "2. August"   
##  [9] "3. September" "1. October"   "2. November"  "3. December"

# Replace any number with 0
# note that the fixed argument is now FALSE (default)
gsub("[0-9]", "0", myStrings)

##  [1] "0. January"   "0. February"  "0. March"     "0. April"    
##  [5] "0. May"       "0. June"      "0. July"      "0. August"   
##  [9] "0. September" "0. October"   "0. November"  "0. December"

# Replace literal dots with 0
gsub(".", "0", myStrings, fixed = TRUE)

##  [1] "10 January"   "20 February"  "30 March"     "10 April"    
##  [5] "20 May"       "30 June"      "10 July"      "20 August"   
##  [9] "30 September" "10 October"   "20 November"  "30 December"

# Without fixed = TRUE, "." is a regular expression wildcard,
# so this will replace every character with zeros
gsub(".", "0", myStrings)

##  [1] "0000000000"   "00000000000"  "00000000"     "00000000"    
##  [5] "000000"       "0000000"      "0000000"      "000000000"   
##  [9] "000000000000" "0000000000"   "00000000000"  "00000000000"

# Also replace literal dots but without "fixed = TRUE"
# by escaping "." using "\\." instead.
# This will treat "." literally instead of its special meaning
gsub("\\.", "0", myStrings)

##  [1] "10 January"   "20 February"  "30 March"     "10 April"    
##  [5] "20 May"       "30 June"      "10 July"      "20 August"   
##  [9] "30 September" "10 October"   "20 November"  "30 December"

Check if a pattern is present within elements of a character vector

myStrings <- paste(1:3, month.name, sep = ". ")

# Is a pattern present (returns a logical vector)?
grepl("ember", myStrings)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [12]  TRUE

# In which elements is a pattern present (returns indices)?
grep("ember", myStrings)

## [1]  9 11 12

# In which elements is a pattern present (returns the values)?
grep("ember", myStrings, value = TRUE)

## [1] "3. September" "2. November"  "3. December"

Check where the matches are within the elements of a character vector

myStrings <- paste(1:3, month.name, sep = ". ")

# Where is the first "a" located in each of the elements?
# If the pattern is not found in an element, -1 is returned for it
regexpr("a", myStrings)

##  [1]  5  9  5 -1  5 -1 -1 -1 -1 -1 -1 -1
## attr(,"match.length")
##  [1]  1  1  1 -1  1 -1 -1 -1 -1 -1 -1 -1
## attr(,"useBytes")
## [1] TRUE

# Where are all the "a" located in each of the elements?
# If pattern not found in that element, returns -1
gregexpr("a", myStrings)

## [[1]]
## [1] 5 8
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 9
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[5]]
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[6]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[7]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[8]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[9]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[10]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[11]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[12]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE

# Where are all the "a" located in the first element?
gregexpr("a", myStrings[1])

## [[1]]
## [1] 5 8
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE

# or also 
gregexpr("a", myStrings)[[1]]

## [1] 5 8
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE

We skip regexec here as parenthesized sub-expressions are very much out of scope of this post.

Extract the matching substrings

The above regexpr() and gregexpr() tell us where the patterns we are looking for are located. It is often useful to extract the actual substrings that are at those locations and regmatches() does that for us:

myStrings <- paste(1:3, month.name, sep = ". ")

# Find substrings that start with 1 or 2 and end
# in "ber" within myStrings
regmatches(
  myStrings,
  regexpr("^[1-2].*ber$", myStrings)
)

## [1] "1. October"  "2. November"

# Alternatively, the same as a list of the same
# length as myStrings
regmatches(
  myStrings,
  gregexpr("^[1-2].*ber$", myStrings)
)

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## character(0)
## 
## [[9]]
## character(0)
## 
## [[10]]
## [1] "1. October"
## 
## [[11]]
## [1] "2. November"
## 
## [[12]]
## character(0)

# We can also get the non-matched substrings
# using invert = TRUE
regmatches(
  myStrings,
  regexpr("^[1-2].*ber$", myStrings),
  invert = TRUE
)

## [[1]]
## [1] "1. January"
## 
## [[2]]
## [1] "2. February"
## 
## [[3]]
## [1] "3. March"
## 
## [[4]]
## [1] "1. April"
## 
## [[5]]
## [1] "2. May"
## 
## [[6]]
## [1] "3. June"
## 
## [[7]]
## [1] "1. July"
## 
## [[8]]
## [1] "2. August"
## 
## [[9]]
## [1] "3. September"
## 
## [[10]]
## [1] "" ""
## 
## [[11]]
## [1] "" ""
## 
## [[12]]
## [1] "3. December"
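For the curious, regmatches() also works with the output of regexec(), which we skipped above. The following is only a minimal sketch with a pattern of our own choosing: the two parenthesized groups are assumed to capture the leading number and the month name.

```r
myStrings <- paste(1:3, month.name, sep = ". ")

# Two parenthesized groups: the leading number and the month name
m <- regexec("^([0-9]+)\\. (.*)$", myStrings)

# Each list element holds the full match followed by the groups
parts <- regmatches(myStrings, m)
parts[[1]]

# Collect the second group (the month name) from every element
vapply(parts, function(p) p[3], character(1))
```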

Bonuses

Strings

# The Levenshtein distance between strings
adist(c("lazy", "lasso", "lassie"), c("lazy", "lazier", "laser"))

##      [,1] [,2] [,3]
## [1,]    0    3    3
## [2,]    3    4    2
## [3,]    4    3    3

# Repeat elements of a character vector a given number of times 
strrep(c(":)", ":P ", ";) "), 1:3)

## [1] ":)"        ":P :P "    ";) ;) ;) "

# Convert strings to integers of a given base
strtoi(c("101010", "11111000101"), base =  2L)

## [1]   42 1989

strtoi(c("2A", "7C5"), base = 16L)

## [1]   42 1989

# Symbolic Number Coding
cors <- lapply(split(iris, iris$Species), function(x) cor(x[, 1:4]))
lapply(cors, symnum, abbr.colnames = 6)

## $setosa
##              Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1                          
## Sepal.Width  ,      1                   
## Petal.Length               1            
## Petal.Width                .      1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
## 
## $versicolor
##              Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1                          
## Sepal.Width  .      1                   
## Petal.Length ,      .      1            
## Petal.Width  .      ,      ,      1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
## 
## $virginica
##              Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1                          
## Sepal.Width  .      1                   
## Petal.Length +      .      1            
## Petal.Width         .      .      1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

Alternatives to base R

Using the tidyverse’s stringr and glue

stringr is built on top of stringi and focuses on the most important and commonly used string manipulation functions, whereas stringi provides a comprehensive set covering almost anything you can imagine.
glue is used to glue strings to data in R, offering small, fast, dependency-free interpreted string literals.

Using stringi

Stringi is an R package for very fast, correct, consistent, and convenient string/text processing in each locale and any native character encoding.

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

References

Pattern Matching And Replacement R documentation
Regular Expressions As Used In R
Regular Expressions in R Programming for Data Science
Regular Expressions (video)
Handling Strings with R by Gaston Sanchez
Cheat Sheet for basic regular expressions in R

4 ways to be more efficient using RStudio's Code Snippets, with 11 ready to use examples

Sat, 10 Nov 2018 12:00:00 +0000

Introduction

In this post we will look at yet another productivity increasing feature of the RStudio IDE - Code Snippets. Code Snippets let us easily insert and potentially execute predefined pieces of code and work not just for R code, but many other languages as well.

In this post we will cover 4 different ways to increase productivity using Code Snippets and provide 11 real-life examples of their use that you can take advantage of instantly.

How do Code Snippets work

Using, viewing and editing snippets

In RStudio, we can browse and define snippets under Tools -> Global Options... -> Code -> Edit Snippets window
When typing code, the snippet will appear as an auto-complete option (similar to function names) if we type the first few letters of its name
Use Shift+Tab to insert the snippet immediately or pick the snippet from the auto-complete list (by clicking or scrolling on it and pressing Tab)

Note that as there is no auto-completion when editing R Markdown documents, we need to use the Shift+Tab method exclusively in that case.

Four common use-case scenarios

1. Automatically insert boilerplate or template-style code

The first and probably most frequent use of the Code Snippets feature is to quickly insert predefined pieces of code that require a lot of typing with little alteration, a.k.a. boilerplate code. A good illustration is a snippet covering a tryCatch block:

snippet tryc
    ${1:variable} <- tryCatch({
        ${2}
    }, warning = function(w) {
        message(sprintf("Warning in %s: %s", deparse(w[["call"]]), w[["message"]]))
        ${3}
    }, error = function(e) {
        message(sprintf("Error in %s: %s", deparse(e[["call"]]), e[["message"]]))
        ${4}
    }, finally = {
        ${5}
    })

Note that the snippet definition is indented using <tab> instead of spaces.

After defining this snippet and running it, we automatically get a good template for the block and can focus on writing the important parts.
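As a rough illustration of what the filled-in template might look like, the sketch below completes the placeholders with values of our own choosing. The call to log(-1) and the returned values are assumptions made purely for the example, not part of the snippet itself.

```r
# Illustrative fill-in of the tryc template; log(-1) is a hypothetical example
result <- tryCatch({
  log(-1)
}, warning = function(w) {
  message(sprintf("Warning in %s: %s", deparse(w[["call"]]), w[["message"]]))
  NA_real_
}, error = function(e) {
  message(sprintf("Error in %s: %s", deparse(e[["call"]]), e[["message"]]))
  NULL
}, finally = {
  message("Done")
})

# The warning handler ran, so result holds its return value
result
```

Since log(-1) signals a warning, the warning handler's value is what ends up in result, while the finally block runs regardless of the outcome.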

The numbered sections prefixed with $ such as ${2} let us define sections to which the cursor will jump after pressing Tab. We can also use ${1:predefinedvalue} to predefine a value for the sections.

Another example of this type of use may be a testthat block that quickly prepares a unit-testing file:

snippet tt
    context("${1}")

    # ${2} ----------
    test_that(
      "${2}",
      expect_${3}(${4})
    )

2. Pre-fill code to be run quickly

The second use case where Code Snippets come in really handy is in the console, when we want to run a block of code that we execute often. One such example is attaching the packages we use in a particular context. For example, when developing an R package, the following may be handy:

snippet dd
    library('devtools'); library('testthat'); library('pryr')

With this snippet, after typing dd and then pressing Shift+Tab in the console, the library statements will appear and we can just press Enter to run them and attach the mentioned packages. We can of course make separate snippets, for example for attaching the packages we use for interactive data analysis and plotting. This is one way to keep our .Rprofile clean and still have packages easily available when needed.

Another example for this scenario is to quickly run a benchmark comparing two or more pieces of code and visualize the results with a boxplot to get an overview:

snippet mm
    bench <- microbenchmark::microbenchmark(
        times = ${1:100},
        ${2:one} = ${3},
        ${4:two} = ${5}
        )
    if (requireNamespace("highcharter")) {
      highcharter::hcboxplot(bench[["time"]], bench[["expr"]], outliers = FALSE)
    } else {
      boxplot(bench, outline = FALSE)
    }

3. Execute code combined with `rstudioapi`

The one scenario where RStudio really shines is combining multiple features it offers. We can neatly combine the use of snippets, rstudioapi and the Terminal feature that we discussed previously for an amazing variety of productivity boosts.

One practical example, convenient when writing a blogdown site, is to serve a preview of the blog in a separate session via the Terminal and view the site in the RStudio Viewer in one go. This is handy especially in the RStudio Server setting, where serving the site in the same session can make the IDE sluggish:

snippet ss
    `r eval({
      nocon <- function(link = 'http://127.0.0.1:9999') {
        inherits(suppressWarnings(try({
            con <- url(link, open = 'rb')
            close(con)
        }, silent = TRUE)), 'try-error')
      }
      termId <- NULL
      if (nocon()) {
        termId <- rstudioapi::terminalExecute(
          'R -q -e \"blogdown::serve_site(port = 9999,  browser = FALSE)\"',
          show = FALSE
        )
        while (nocon() && !identical(rstudioapi::terminalExitCode(termId), 1L)) {
            Sys.sleep(0.25)
            cat(".")
        }
      }
      if (!is.null(termId) && identical(rstudioapi::terminalExitCode(termId), 1L)) {
        cat(rstudioapi::terminalBuffer(termId), sep = "\n")
      } else {
        rstudioapi::viewer('http://127.0.0.1:9999')
      }
    })`

After pressing ss and Shift+Tab, the site will be served in a separate R Session and previewed in the viewer.

Using eval(expression) as above lets us execute R code in snippets. This gives a lot of flexibility, even more so when combined with eval(parse(text = "code as character string")).
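As a minimal sketch of what eval(parse(text = ...)) does outside of a snippet, the following base R lines evaluate code stored in a character string. The string and the variable names are purely illustrative:

```r
# Code stored as a character string (hypothetical example input)
code_string <- "sum(1:10)"

# parse() turns the string into an unevaluated expression
parsed <- parse(text = code_string)

# eval() then evaluates that expression in the current environment
result <- eval(parsed)
print(result)
```

In a snippet, the string would typically be assembled from what we type, which is where the extra flexibility comes from.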

4. Execute code and paste result at cursor

The fourth option is to inject text following the cursor using $$. A simple but potentially powerful use of this feature is to pass commands to be executed via base R’s system and get the results directly at our cursor:

snippet $$
    `r eval(parse(text = "system('$$', intern = TRUE)"))`

With the above, when typing $$ls into the editor and pressing Shift+Tab, we will see the list of files present in our working directory placed at our cursor.
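To see what the snippet receives from system, here is a small sketch assuming a Unix-like shell where printf is available (on Windows the command would need to be adapted). With intern = TRUE the command output is captured as a character vector, one element per line:

```r
# Assumption: a Unix-like shell with `printf` on the PATH
captured <- system("printf 'one\\ntwo\\n'", intern = TRUE)

# Each line of the command output becomes one element of the vector
print(captured)
print(length(captured))
```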

Another handy use of this feature is to be able to quickly get a reproducible object definition by deparsing it:

snippet $$
    `r paste("$$ <-", deparse(eval(parse(text="$$")), width.cutoff = 500L))`
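The idea behind this snippet can be sketched in plain base R as follows. The object x is a made-up example; deparse turns it into the code that would recreate it:

```r
# A hypothetical object we want a reproducible definition for
x <- c(a = 1, b = 2)

# deparse() returns the code that recreates the object, as character strings
definition <- paste(
  "x <-",
  paste(deparse(x, width.cutoff = 500L), collapse = " ")
)
print(definition)

# Evaluating the generated definition in a fresh environment
# gives back an identical object
env <- new.env()
eval(parse(text = definition), envir = env)
print(identical(env[["x"]], x))
```

For very large or complex objects the deparsed code can span several elements, which is why the sketch collapses them before pasting.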

TL;DR - Just give me the snippets

The promised 11 potentially helpful snippets can be found here.

Resources

Code Snippets by J.J. Allaire at the RStudio support
4 ways to be more productive, using RStudio’s terminal

How to perform merges (joins) on two or more data frames with base R, tidyverse and data.table

Sat, 27 Oct 2018 12:00:00 +0000

Introduction

In this post in the R:case4base series we will look at one of the most common operations on multiple data frames - merge, also known as JOIN in SQL terms.

We will learn how to do the 4 basic types of join - inner, left, right and full join with base R and show how to perform the same with tidyverse’s dplyr and data.table’s methods. A quick benchmark will also be included.

Joins

Merging (joining) two data frames with base R

To showcase the merging, we will use a very slightly modified dataset provided by Hadley Wickham’s nycflights13 package, mainly the flights and weather data frames. Let’s get right into it and simply show how to perform the different types of joins with base R.

First, we prepare the data and store the columns we will merge by (join on) into mergeCols:

dataurl <- "https://jozef.io/post/data/"
weather <- readRDS(url(paste0(dataurl, "r006/weather.rds")))
flights <- readRDS(url(paste0(dataurl, "r006/flights.rds")))

mergeCols <- c("time_hour", "origin")

head(flights)

##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute           time_hour
## 1     15 2013-01-01 05:00:00
## 2     29 2013-01-01 05:00:00
## 3     40 2013-01-01 05:00:00
## 4     45 2013-01-01 05:00:00
## 5      0 2013-01-01 06:00:00
## 6     58 2013-01-01 05:00:00

head(weather)

##   origin year month day hour  temp  dewp humid wind_dir wind_speed
## 1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702
## 2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546
## 3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780
## 4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858
## 5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858
## 6    EWR 2013     1   1    6 37.94 28.04 67.21      240   11.50780
##   wind_gust precip pressure visib           time_hour
## 1        NA      0   1012.0    10 2013-01-01 01:00:00
## 2        NA      0   1012.3    10 2013-01-01 02:00:00
## 3        NA      0   1012.5    10 2013-01-01 03:00:00
## 4        NA      0   1012.2    10 2013-01-01 04:00:00
## 5        NA      0   1011.9    10 2013-01-01 05:00:00
## 6        NA      0   1012.4    10 2013-01-01 06:00:00

Now, we show how to perform the 4 merges (joins):

Inner join

inner <- merge(flights, weather, by = mergeCols)

Left (outer) join

left  <- merge(flights, weather, by = mergeCols, all.x = TRUE)

Right (outer) join

right <- merge(flights, weather, by = mergeCols, all.y = TRUE)

Full (outer) join

full <- merge(flights, weather, by = mergeCols, all = TRUE)

Other join types

# Cross Join (Cartesian product)
cross <- merge(flights, weather, by = NULL)

# Natural Join
natural <- merge(flights, weather)
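To make the differences between the join types easier to see, here is a minimal sketch on two tiny made-up data frames (x, y and their columns are illustrative and not part of the flights data), comparing the number of rows each join returns:

```r
# Two small hypothetical data frames sharing the column `key`
x <- data.frame(key = c("a", "b", "c"), x_val = 1:3)
y <- data.frame(key = c("b", "c", "d"), y_val = 4:6)

inner <- merge(x, y, by = "key")                # keys present in both: b, c
left  <- merge(x, y, by = "key", all.x = TRUE)  # all keys of x: a, b, c
right <- merge(x, y, by = "key", all.y = TRUE)  # all keys of y: b, c, d
full  <- merge(x, y, by = "key", all = TRUE)    # all keys: a, b, c, d
cross <- merge(x, y, by = NULL)                 # every row of x with every row of y

# Row counts per join type
print(sapply(
  list(inner = inner, left = left, right = right, full = full, cross = cross),
  nrow
))

# Rows without a match are filled with NA
print(left)
```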

The arguments of merge

The key arguments of the base merge method for data frames are:

x, y - the 2 data frames to be merged
by - names of the columns to merge on. If the column names are different in the two data frames to merge, we can specify by.x and by.y with the names of the columns in the respective data frames. The by argument can also be specified by number, logical vector or left unspecified, in which case it defaults to the intersection of the names of the two data frames. From a best-practice perspective it is advisable to always specify the argument explicitly, ideally by column names.
all, all.x, all.y - default to FALSE and can be used to specify the type of join we want to perform:
- all = FALSE (the default) - gives an inner join - combines the rows in the two data frames that match on the by columns
- all.x = TRUE - gives a left (outer) join - adds rows that are present in x, even though they do not have a matching row in y to the result for all = FALSE
- all.y = TRUE - gives a right (outer) join - adds rows that are present in y, even though they do not have a matching row in x to the result for all = FALSE
- all = TRUE - gives a full (outer) join. This is a shorthand for all.x = TRUE and all.y = TRUE

Other arguments include

sort - if TRUE (default), results are sorted on the by columns
suffixes - length 2 character vector, specifying the suffixes to be used for making the names of columns in the result which are not used for merging unique
incomparables - for single-column merging only, a vector of values that cannot be matched (see ?match for details)
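A short sketch of by.x, by.y and suffixes in action, using two made-up data frames (orders and customers are hypothetical and only serve to illustrate the arguments):

```r
# The key column is named differently in each data frame,
# and both have a non-key column called `amount`
orders    <- data.frame(cust_id = c(4, 1, 2), amount = c(40, 10, 20))
customers <- data.frame(id = c(1, 2, 3), amount = c(100, 200, 300))

left <- merge(
  orders, customers,
  by.x = "cust_id", by.y = "id",        # differently named key columns
  all.x = TRUE,                         # left join: keep all orders
  suffixes = c("_order", "_customer")   # disambiguate the clashing `amount`
)
print(left)
```

The key column keeps its name from x, and the clashing non-key columns get the suffixes appended.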

Merging multiple data frames

For this example, let us take all the data frames included in the nycflights13 package, slightly updated such that they can be merged with the default value for by, purely for this exercise, and store them into a list called flightsList:

flightsList <- readRDS(url(paste0(dataurl, "r006/nycflights13-list.rds")))
lapply(flightsList, function(x) c(toString(dim(x)), toString(names(x))))

## $flights
## [1] "336776, 19"                                                                                                                                                                     
## [2] "year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour"
## 
## $weather
## [1] "26115, 15"                                                                                                             
## [2] "origin, year, month, day, hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib, time_hour"
## 
## $airlines
## [1] "16, 2"         "carrier, name"
## 
## $airports
## [1] "1458, 8"                                           
## [2] "origin, airportname, lat, lon, alt, tz, dst, tzone"
## 
## $planes
## [1] "3322, 9"                                                                            
## [2] "tailnum, yearmanufactured, type, manufacturer, model, engines, seats, speed, engine"

Since merge is designed to work with 2 data frames, merging multiple data frames can of course be achieved by nesting the calls to merge:

multiFull <- merge(merge(merge(merge(
  flightsList[[1L]],
  flightsList[[2L]], all = TRUE),
  flightsList[[3L]], all = TRUE),
  flightsList[[4L]], all = TRUE),
  flightsList[[5L]], all = TRUE)

We can however achieve this same goal much more elegantly, taking advantage of base R’s Reduce function:

# For Inner Join
multi_inner <- Reduce(
  function(x, y, ...) merge(x, y, ...), 
  flightsList
)

# For Full (Outer) Join
multi_full <- Reduce(
  function(x, y, ...) merge(x, y, all = TRUE, ...),
  flightsList
)

Note that this example is oversimplified and the data was updated such that the default values for by give meaningful joins. For example, in the original planes data frame the column year would have been matched onto the year column of the flights data frame, which is nonsensical as the years have different meanings in the two data frames. This is why we renamed the year column in the planes data frame to yearmanufactured for the above example.
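When the defaults for by are not safe, the same Reduce pattern works with an explicit by. A minimal sketch on a made-up list of three small data frames (the names dfs, id, a, b and c are illustrative):

```r
# Three hypothetical data frames sharing only the `id` column
dfs <- list(
  data.frame(id = 1:3, a = c("x", "y", "z")),
  data.frame(id = 2:4, b = c(TRUE, FALSE, TRUE)),
  data.frame(id = c(1L, 3L, 5L), c = c(0.1, 0.3, 0.5))
)

# Inner join across all elements: only ids present in every data frame remain
inner_all <- Reduce(function(x, y) merge(x, y, by = "id"), dfs)

# Full join across all elements: every id is kept, gaps are filled with NA
full_all <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), dfs)

print(inner_all)
print(full_all)
```

Reduce feeds the result of each merge in as x for the next one, so the list can have any length.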

Alternatives to base R

Using the tidyverse

The dplyr package comes with a set of very user-friendly functions that seem quite self-explanatory:

library(dplyr)
inner_dplyr <- inner_join(flights, weather, by = mergeCols)
left_dplyr  <- left_join(flights,  weather, by = mergeCols)
right_dplyr <- right_join(flights, weather, by = mergeCols)
full_dplyr  <- full_join(flights,  weather, by = mergeCols)

We can also use the “forward pipe” operator %>% that becomes very convenient when merging multiple data frames:

inner_dplyr <- flights %>% inner_join(weather, by = mergeCols)
left_dplyr  <- flights %>% left_join(weather,  by = mergeCols)
right_dplyr <- flights %>% right_join(weather, by = mergeCols)
full_dplyr  <- flights %>% full_join(weather,  by = mergeCols)

Using data.table

The data.table package provides an S3 method for the merge generic that has a very similar structure to the base method for data frames, meaning its use is very convenient for those familiar with that method. In fact the code is exactly the same as the base one for our example use.

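One important difference worth noting is that the default for the by argument is constructed differently with data.table: it uses the shared key columns of the two tables first, then the key columns of x, and only then the common column names.

We however provide it explicitly, therefore this difference does not directly affect our example:

library(data.table)

# Convert the data frames to data.tables by reference before setting the keys
setDT(weather)
setDT(flights)
setkeyv(weather, mergeCols)
setkeyv(flights, mergeCols)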

# Note that this is identical to the code for base 
# The data.table method is called automatically for objects of class data.table
inner_dt <- merge(flights, weather, by = mergeCols)
left_dt  <- merge(flights, weather, by = mergeCols, all.x = TRUE)
right_dt <- merge(flights, weather, by = mergeCols, all.y = TRUE)
full_dt  <- merge(flights, weather, by = mergeCols, all = TRUE)

Alternatively, we can write data.table joins as subsets:

inner_dt <- flights[weather, on = mergeCols, nomatch = 0]
left_dt  <- weather[flights, on = mergeCols]
right_dt <- flights[weather, on = mergeCols]

Quick benchmarking

For a quick overview, let’s look at a basic benchmark without package loading overhead for each of the mentioned packages:

Inner join

bench_inner <- microbenchmark::microbenchmark(times = 100,
  base        = base::merge.data.frame(flights, weather, by = mergeCols),
  base_nosort = base::merge.data.frame(flights, weather, by = mergeCols, sort = FALSE),
  dt_merge    = merge(flights, weather, by = mergeCols),
  dt_subset   = flights[weather, on = mergeCols, nomatch = 0], 
  dplyr       = inner_join(flights, weather, by = mergeCols),
  dplyr_pipe  = flights %>% inner_join(weather, by = mergeCols)
)

Full (outer) join

bench_outer <- microbenchmark::microbenchmark(times = 100,
  base        = base::merge.data.frame(flights, weather, by = mergeCols, all = TRUE),
  base_nosort = base::merge.data.frame(flights, weather, by = mergeCols, all = TRUE, sort = FALSE),
  dt_merge    = merge(flights, weather, by = mergeCols, all = TRUE),
  dplyr       = full_join(flights, weather, by = mergeCols),
  dplyr_pipe  = flights %>% full_join(weather, by = mergeCols)
)

Visualizing the results in this case shows base R comes way behind the two alternatives, even with sort = FALSE.

Note: The benchmarks were run on a standard droplet by DigitalOcean, with 2GB of memory and 2 vCPUs.

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

References

Animated inner join, left join, right join and full join by Garrick Aden-Buie for an easier understanding
Base merge help
Join two tbls together with dplyr
Merge two data.tables
Joining Data in R with dplyr by William Surles
Join (SQL) Wikipedia page
The nycflights13 package on CRAN

Exactly 100 years ago tomorrow, October 28th, 1918 the independence of Czechoslovakia was proclaimed by the Czechoslovak National Council, resulting in the creation of the first democratic state of Czechs and Slovaks in history.

How to import a directory of csvs at once with base R and data.table. Can you guess which way is the fastest?

Sat, 13 Oct 2018 12:00:00 +0000

Introduction

Inspired by a recent post by Garrick on how to import a directory of csv files at once using purrr and readr, in this post we will try to achieve the same using base R with no extra packages, and with data.table, another very popular package. As an added bonus, we will play a bit with benchmarking to see which of the methods is the fastest, including the tidyverse approach in the benchmark.

Let us show how to import all csvs from a folder into a data frame, with nothing but base R

To get the source data, download the zip file from this link and unzip it into a folder. We will refer to the folder path as data_dir.

Quick import of all csvs with base R

To import all .csv files from the data_dir directory and place them into a single data frame called result, all we have to do is:

filePaths <- list.files(data_dir, "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(filePaths, read.csv))

# View part of the result
head(result)

##   Month_Year           Hospital_Name Hospital_ID
## 1     Aug-15                   AMNCH        1049
## 2     Aug-15                   AMNCH        1049
## 3     Aug-15                   AMNCH        1049
## 4     Aug-15 Bantry General Hospital         704
## 5     Aug-15 Bantry General Hospital         704
## 6     Aug-15 Bantry General Hospital         704
##           Hospital_Department     ReferralType TotalReferrals
## 1              Paediatric ENT General Referral              2
## 2 Paediatric Gastroenterology General Referral              4
## 3  Paediatric General Surgery General Referral              4
## 4            Gastroenterology General Referral             12
## 5            General Medicine General Referral             18
## 6             General Surgery General Referral             43

A quick explanation of the code:

list.files - produces a character vector of the names of the files in the named directory, in our case data_dir. We have also passed a pattern argument "\\.csv$" to make sure we only process files with .csv at the end of the name and full.names = TRUE to get the file path and not just the name.
read.csv - reads a file in table format and creates a data frame from its content
lapply(X, FUN, ...) - gives us a list of data frames, one for each of the files found by list.files. More generally, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. In our case X is the vector of file names in data_dir (returned by list.files) and FUN is read.csv, so we are applying read.csv to each of the file paths
rbind - in our case combines the rows of multiple data frames into one, similarly (even though a bit more rigidly) to UNION in SQL
do.call - will combine all the data frames produced by lapply into one using rbind. More generally, it constructs and executes a function call from a name or a function and a list of arguments to be passed to it. In our case the function is rbind and the list is the list of data frames containing the data loaded from the csvs, produced by lapply.
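For readers without the source data at hand, the whole pattern can be tried on throwaway files. This sketch writes two tiny made-up csv files into a temporary directory (the file names and columns are hypothetical) and imports them with the same two lines:

```r
# Create a fresh temporary directory with two small csv files
data_dir <- tempfile("csv_demo_")
dir.create(data_dir)
write.csv(
  data.frame(id = 1:2, value = c("a", "b")),
  file.path(data_dir, "part1.csv"),
  row.names = FALSE
)
write.csv(
  data.frame(id = 3:4, value = c("c", "d")),
  file.path(data_dir, "part2.csv"),
  row.names = FALSE
)

# The same pattern as above: list the files, read each one, bind the rows
filePaths <- list.files(data_dir, "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(filePaths, read.csv))
print(result)
```

Note that rbind expects the data frames to have the same column names, which is the rigidity mentioned above.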

Reconstructing the results of the original post

To fully reconstruct the results from the original post, we need to do two extra operations:

Add the source file names to the data frame
Fix and reformat the dates

To do this, we will simply adjust the FUN in the lapply - in the above example, we have only used read.csv. Below, we will make a small function to do the extra steps:

filePaths <- list.files(data_dir, "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(filePaths, function(path) {
    df <- read.csv(path, stringsAsFactors = FALSE)
    df[["source"]] <- rep(path, nrow(df))
    df[["Month_Year"]] <- as.Date(
      paste0(sub("-20", "-", df[["Month_Year"]], fixed = TRUE), "-01"),
      format = "%b-%y-%d"
    )
    df
}))

# View part of the result
head(result)

##   Month_Year           Hospital_Name Hospital_ID
## 1 2015-08-01                   AMNCH        1049
## 2 2015-08-01                   AMNCH        1049
## 3 2015-08-01                   AMNCH        1049
## 4 2015-08-01 Bantry General Hospital         704
## 5 2015-08-01 Bantry General Hospital         704
## 6 2015-08-01 Bantry General Hospital         704
##           Hospital_Department     ReferralType TotalReferrals
## 1              Paediatric ENT General Referral              2
## 2 Paediatric Gastroenterology General Referral              4
## 3  Paediatric General Surgery General Referral              4
## 4            Gastroenterology General Referral             12
## 5            General Medicine General Referral             18
## 6             General Surgery General Referral             43
##                                                                                          source
## 1 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 2 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 3 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 4 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 5 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 6 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv

Let’s look at the extra code in the lapply:

Instead of just using read.csv, we have defined our own little function that will do the extra work for each of the file paths, which are passed to the function as path
We read the data into a data frame called df using read.csv, and we specify stringsAsFactors = FALSE, as the tidyverse packages do this by default, while base R’s default at the time of writing was TRUE (it changed to FALSE in R 4.0.0)
We add a new column source with the file name stored in path, repeated as many times as df has rows. This is a bit overkill here and could be done simpler, but it is quite robust and will also work with 0-row data frames
We transform the Month_Year into the requested date format with as.Date. Note that the relatively ugly sub() part is caused mostly by inconsistency in the source data itself
Using [[ instead of $ is less pleasing to the eye, but we find it to be good practice, so we sacrifice a bit of readability
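The two less obvious steps can be looked at in isolation. The sketch below assumes the inconsistency in the source data is a mix of two-digit and four-digit years (the raw values are made up), and sets the time locale to "C" so that the English month abbreviations parse regardless of the system language:

```r
# Assumption: month abbreviations are English, so use the C locale for parsing
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical raw values with inconsistent year formats
raw <- c("Aug-15", "Sep-2015")

# Drop the century where present, then append a day so as.Date can parse it
cleaned <- sub("-20", "-", raw, fixed = TRUE)
dates <- as.Date(paste0(cleaned, "-01"), format = "%b-%y-%d")
print(dates)

# rep(path, nrow(df)) also works for a data frame with 0 rows
empty <- data.frame(Month_Year = character(0))
empty[["source"]] <- rep("some-file.csv", nrow(empty))
print(dim(empty))
```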

Alternatives to base R

Using data.table

Another popular package that can help us achieve the same is data.table, so let’s have a look and reconstruct the results with data.table’s features:

library(data.table)
filePaths <- list.files(data_dir, "\\.csv$", full.names = TRUE)
result <- lapply(filePaths, fread)
names(result) <- filePaths
result <- rbindlist(result, use.names = TRUE, idcol = "source")
result[, Month_Year := as.Date(
  paste0(sub("-20", "-", Month_Year, fixed = TRUE), "-01"),
  format = "%b-%y-%d"
)]


# View part of the result
head(result)

##                                                                                           source
## 1: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 2: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 3: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 4: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 5: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 6: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
##    Month_Year           Hospital_Name Hospital_ID
## 1: 2015-08-01                   AMNCH        1049
## 2: 2015-08-01                   AMNCH        1049
## 3: 2015-08-01                   AMNCH        1049
## 4: 2015-08-01 Bantry General Hospital         704
## 5: 2015-08-01 Bantry General Hospital         704
## 6: 2015-08-01 Bantry General Hospital         704
##            Hospital_Department     ReferralType TotalReferrals
## 1:              Paediatric ENT General Referral              2
## 2: Paediatric Gastroenterology General Referral              4
## 3:  Paediatric General Surgery General Referral              4
## 4:            Gastroenterology General Referral             12
## 5:            General Medicine General Referral             18
## 6:             General Surgery General Referral             43

Where

rbindlist does the same as do.call("rbind", l) on data frames, but much faster
fread is similar to read.table (and read.csv, which uses read.table) but faster and more convenient
':=' is the data.table operator to add or update columns in a data.table by reference

Using the tidyverse

This is covered in much detail in the post that inspired this one.

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

Quick benchmarking

First off, we are mostly looking at this for the fun of reacting to a Twitter discussion, so take it for what it’s worth. This is by no means what we would call proper benchmarking.

Now that we have seen 3 ways to achieve the same goal, let’s look at speed. Note that we will be friendly to the tidyverse and not attach the entire package as is done in the original post, but only those packages that we really need, to get a more appropriate benchmark.

Full script run benchmark

First, we will perform an execution of an R script containing just the above code chunks (and the tidyverse one) a thousand times. The timing will also include overhead for launching the process, but this effect is present for all three scenarios and the variance should be safely covered by the fact that we execute 1000 times:

time for i in {1..1000}; 
do Rscript --vanilla data/r005/benchmarking/base.R &>/dev/null;
done


time for i in {1..1000};
do Rscript --vanilla data/r005/benchmarking/datatable.R &>/dev/null;
done


time for i in {1..1000};
do Rscript --vanilla data/r005/benchmarking/tidyverse.R &>/dev/null;
done

Visualizing the results shows that base R is the clear winner here, largely due to package loading overhead. Any performance benefits of the other packages are not enough to catch up in this very small use case:

If interested, you can look at the scripts run above:

Benchmarking without package loading overhead

We could argue that it is not fair to include the library statements in the benchmark, as the overhead can be relatively big considering how small the actual action done by the code is, as we are only processing 4 small files. Here is a benchmark omitting the overhead and only executing the relevant code with the packages pre-loaded, using microbenchmark with 100 iterations:

Visualizing the results in this case shows that data.table is a winner, with base R being the slowest of the options.

References

4 ways to be more productive, using RStudio's terminal

Sat, 29 Sep 2018 12:00:00 +0000

Introduction

RStudio version 1.1 introduced the Terminal functionality, which does not seem to be getting enough deserved attention and love even though it is very well integrated with the rest of the IDE and can be extremely useful for several daily use-cases.

In this post we will try to cover 4 very common scenarios where the Terminal can be very useful and productive, and how to get the most out of it.

RStudio Terminal Fun

In short, the RStudio Terminal provides access to the system shell directly from the RStudio IDE, supporting xterm emulation, full-screen terminal applications, command line operations and more. It also has useful customizable keyboard shortcut bindings to make frequent usage more efficient and enables usage of multiple such Terminals simultaneously.

The experience may vary based on each user’s setup; the observations here come mostly from using RStudio Server on a Linux-based system.

Four common use-cases

1. Execute resource-heavy R code in the Terminal quickly

A very common use case where the Terminal makes my life a lot easier is when I need to execute longer-running or resource-heavy tasks in R. Using the RStudio IDE’s session for such tasks can be challenging because running them can slow the entire IDE down, sometimes even so much that it is barely usable. We can easily prevent this by running such tasks in a separate R process within the Terminal. We could of course do this using PuTTY or other software, however doing it within RStudio brings:

seamless keyboard shortcut integration between the editor window and the Terminal
ability to use multiple Terminals easily
no need to use other software

To run commands in the Terminal, we simply follow these steps:

Shift + Alt + R to open a new terminal
launch R in the Terminal
Ctrl + 1 to focus back to the editor window
Ctrl + Alt + Enter to send commands to be executed directly to the Terminal

We can also do this with multiple Terminals if we need to run multiple such “jobs”, and easily switch between Terminal windows using keyboard shortcuts

Ctrl + Alt + F11 - Previous terminal
Ctrl + Alt + F12 - Next terminal

Note that the shortcuts mentioned above are the defaults and more than likely not Mac-relevant. If you are a Mac user, you can easily find the equivalents, and change them to your liking.

2. Advanced version control directly within RStudio

RStudio has a neat version control integration which is a very nice addition to the IDE, however there are some advanced version control operations that are not possible to handle there directly, git rebase and git push --force being just a couple of examples. Thanks to the Terminal, you can very easily do all those operations without ever leaving your RStudio IDE.

3. Serving your Shiny app/Blogdown site without blocking or slowdown

My favourite use of the Terminal when writing this blog is to serve the site via the Terminal and see the changes I make live, without the IDE being slowed down and laggy, which often happens when serving the site directly from RStudio’s R session. A very similar point also applies when running a Shiny app from within RStudio. This simple use of the Terminal makes things more convenient for me.

Running a Shiny app. We can use a pre-selected port to make viewing later easier:

# Send to the terminal with Ctrl + Alt + Enter:
R -e 'library(shiny); runApp("appdir", port = 9999, launch.browser = FALSE)'
# Then show in the viewer with Ctrl + Enter
rstudioapi::viewer("http://127.0.0.1:9999")

Similarly, serving a Blogdown site:

# Send to the terminal with Ctrl + Alt + Enter:
R -e 'library(blogdown); blogdown::serve_site(port = 9999, browser = FALSE)'
# Then show in the viewer with Ctrl + Enter
rstudioapi::viewer("http://127.0.0.1:9999")

Alternatively, we can also use rstudioapi to send commands to the Terminal:

termId <- rstudioapi::terminalExecute("R -e 'getwd(); library(shiny); runApp(\"appdir\", port = 9999, launch.browser = FALSE)'")
# Then show in the viewer with Ctrl + Enter
rstudioapi::viewer("http://127.0.0.1:9999")
# When done, we can kill that terminal
rstudioapi::terminalKill(termId)

4. Test your bash, python and much more conveniently

Since the Terminal is really just system shell access, you can get very creative with its use. To me, the key here is the keyboard shortcut integration between the editor and the terminal.

Very basic example using the Terminal to run python code. Note that this (somewhat obviously) works around R and the need for an R to Python interface package:

# Ctrl + Alt + Enter to send to the Terminal
# Launch python
python
# Run some python code
1 + 1
# When done with python, exit
exit()

Testing a random bash script

# Ctrl + Alt + Enter to send to the Terminal
echo "Run tmux, split window and run top"
tmux new -s "Fun"
tmux switch -t "Fun"
tmux split-window -h
tmux select-pane -t 0
top

Quick notes

By default, the processes in the Terminal run as child processes of the main rsession process, therefore restarting the R session will kill them. We can work around this using tools like screen or tmux
You can specify which shell the Terminal sessions open with under Tools -> Global Options... -> Terminal
The Terminal can be interfaced with using the rstudioapi package functionality. Read the Interacting with Terminals vignette to learn more.
If the default keyboard shortcuts are not the most convenient for you, they can be updated and more added under the Tools -> Modify Keyboard Shortcuts... menu in RStudio

Resources

Using the RStudio Terminal, a great guide by Gary Ritchie
Interacting with Terminals vignette of the rstudioapi package
Customizing Keyboard Shortcuts in RStudio by Kevin Ushey
Introduction to GNU Screen
The Tao of tmux

The last Saturday of September 20 years ago a key parliamentary election was held in Slovakia, resulting in the end of the reign of Vladimír Mečiar’s government and Slovakia being able to conduct crucial reforms and become a member of the EU and NATO.

3 reasons to not write that new code, and how I failed at it

Sat, 15 Sep 2018 12:00:00 +0000

Introduction

We all know that feeling. We have this great idea about a new project, feature, function, piece of code.

What do we want? Write that amazing new code!
When do we want it? Right NOW!

The aim of this post is to try and give you 3 good reasons to resist that urge and consider other options, be it in your business projects or your private projects, with an example of how I failed and how I tried to remedy that failure, on a very small scale.

He knows

The 3 reasons

1. New code takes time (and money)

Writing new code is an investment. Time and money will be spent on designing, implementation and code review. These introductory investments are however only a minor part of the total cost of writing new code. The code must be well documented and maintained. The code must be integrated to other parts of the systems. Last but not least, the code must be tested, and writing tests usually involves writing, well, more new code.

2. New code means new bugs

Even with our best efforts and testing, bugs will be found and will need fixing. Numbers on this seem to vary a lot; Code Complete by Steve McConnell estimates an industry average of 15-50 bugs per 1 000 lines of code.

3. We write what we know

Perhaps the most compelling reason to reconsider and resist the code-writing is not in the numbers and statistics, but in the simple realization that we usually write new code using our current knowledge.

Pausing for a while and spending time investigating the current best practices and methods of solving the issue we are aiming to solve with our new code may not only save us and our business owners valuable resources, but also increase our knowledge base thanks to that investigation.

Putting it to practice in the R world

So we have this brilliant new idea. Instead of starting to write that shiny new code, we can also start with:

Google - It is more than likely that someone has already stumbled upon this very same, or a very similar problem. How have they implemented it? What functionality have they used? What are the best practices and approaches to tackling similar issues?
Stackoverflow and Rseek for R solutions - Can we find solutions to our problem there? Are those solutions good? Can we build upon them?
Evaluate the options - If we have found any, which of them are the most suitable for us? If stability and maintainability are a major concern, can we find a solution with as few dependencies as possible? If performance is a major concern, are benchmarks available (can we make them)?
Propose a solution - After this research, do we still need to write the new functionality? If so, how much can we build on existing solutions? Are they easy to integrate?
Do we care about dependencies? - The R world is special, one of the reasons for this is CRAN. The number of packages available on CRAN passed 13 000 and it is very convenient to just reach out and grab one more. This approach however has its caveats.
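To make the dependency question a bit more tangible, here is a minimal sketch that lists the recursive dependencies of a package using only base R. It assumes the package is installed locally and resolves dependencies against the installed package database rather than CRAN; stats is only a stand-in for whichever package is being evaluated.

```r
# Recursive hard dependencies (Depends, Imports, LinkingTo) of a package,
# resolved against the locally installed packages rather than CRAN.
# "stats" is only a placeholder for the package being evaluated.
deps <- tools::package_dependencies(
  packages = "stats",
  db = utils::installed.packages(),
  recursive = TRUE
)

# One character vector of dependency names per requested package
deps[["stats"]]
length(deps[["stats"]])
```

The longer this vector gets, the more code outside of our control we take on with that single extra package.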

The simplest example - learning from my own mistakes

How I did it wrong

One of the first RStudio addins I have written for my own use was to run a script open in RStudio with R --vanilla via a keyboard shortcut and open a file with the script’s output in RStudio. If I had to guess, my thought process was likely similar to the following:

I will write a new function to serve as the addin binding
I will write a new function to serve as command executor for both Unix-like systems via system and Windows via shell
I will write a new function to create the command to be executed by the above
Maybe some utilities, like the ones converting ~ to a full path, figure out integrating the 4 together, passing arguments, etc.

So, there I was, some time and 92 lines of code and doc later, with a new useful RStudio addin. Oh and yes, there were also 102 lines of test code, fixed a couple of times, too.

How could I do it better

After a second look a few months later when actually reviewing this supposedly good functionality, I realized that

There is a base function called system2, which seems like a much more user-friendly and easy to use version of system (and shell), with no real need to write system-specific code and even though less configurable than system, still perfectly sufficient for my purpose
I do not actually need to make the command, as extra options can be passed to system2 as arguments, including redirecting output
Oh, and I definitely do not need a function to convert ~ to full path, there is path.expand
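A minimal sketch of how these three observations combine is shown below. It is not the addin's actual code: runScriptVanilla and its default output path are hypothetical names chosen for illustration, and the sketch only redirects standard output.

```r
# Hypothetical helper: run a script with `Rscript --vanilla` and send its
# standard output to a file, relying only on base R
runScriptVanilla <- function(script, outFile = paste0(script, ".Rout")) {
  # path.expand() replaces a leading ~ with the full home directory path
  script <- path.expand(script)
  outFile <- path.expand(outFile)

  # system2() works the same way on Unix-like systems and on Windows;
  # extra options go into `args`, output redirection into `stdout`
  status <- system2(
    command = file.path(R.home("bin"), "Rscript"),
    args = c("--vanilla", shQuote(script)),
    stdout = outFile
  )
  invisible(list(status = status, output = outFile))
}

# Usage: write a tiny script to a temporary file and run it
tmpScript <- tempfile(fileext = ".R")
writeLines('cat("hello from vanilla\\n")', tmpScript)
res <- runScriptVanilla(tmpScript)
readLines(res[["output"]])
```

There is a single function to test here, and no system-specific branching.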

So after a quick rewrite, we end up with a very similar functionality, only we suddenly need 35 rows of code, doc included and the tests shrink to 10 lines, as there is only 1 function to test instead of 4. That is less than a quarter of the original amount of code to be maintained and bug-fixed, with 0 new dependencies added.

This was of course a very trivial example. Real life problems of real-life projects will be much more difficult to solve. However, as complexity scales, the potential amount of time and resources saved will also scale.

Good luck resisting that urge the next time it comes ;-)

cRafty tRicks - No more typing brackets!

Sat, 01 Sep 2018 13:00:00 +0000

Calling functions in R usually involves typing brackets. And since many of our actions in R involve calling a function, we will have to type a lot of brackets working with R. Often it would make our life a lot easier if we could omit the need to type brackets where convenient. We will do exactly that today.

Work in R faster with custom bracketless commands

A good starting example is, well, quitting R altogether. Usually, one may do:

quit()

Which will in turn likely get you an extra question regarding saving a workspace image. So you then finally type n and are done with it. If you want to be a bit faster, you may do:

q("no")

Better, but still an awful lot of typing just to quit R, especially when working in a terminal-like environment with multiple sessions.

Let us be a bit craftier and make R quit just by typing qq

To make a bracketless command, we will (mis)use the fact that typing an object name into R console and pressing enter will often invoke a print method specific for the class of that object.

All we have to do to create our very first bracketless command is to create a custom print method for a funky class made for this single purpose. Then we make an object of that class and type its name to the console:

qq <- structure("no", class = "quitter")
print.quitter <- function(quitter) base::quit("no")

# This will quit your session NOT saving a workspace image!
qq

Oops…I Did It Again

Switching debugging modes with ease

Quitting R quickly is more useful than it may sound when using multiple sessions in a terminal environment, but we can use the above approach to create different useful shortcuts making our life much easier.

One example I use very frequently is to change the error option, which governs how R behaves when encountering non-catastrophic errors such as those generated by stop, etc.

I find setting the option to options(error = utils::recover) very useful for debugging and at the same time very annoying when undesired.
Typing options(error = NULL) to change it back is however even more annoying. Or is it options("error") = NULL? Or maybe even options(error) = NULL?

In comes the gg shortcut:

gg <- structure(FALSE, class = "debuggerclass")
print.debuggerclass <-  function(debugger) {
  if (!identical(getOption("error"), as.call(list(utils::recover)))) {
    options(error = recover)
    message(" * debugging is now ON - option error set to recover")
  } else {
    options(error = NULL)
    message(" * debugging is now OFF - option error set to NULL")
  }
}

Now we switch between the options with ease:

# When in need of debugging
gg

##  * debugging is now ON - option error set to recover

# The option is now set to recover
getOption("error")

## (function () 
## {
##     if (.isMethodsDispatchOn()) {
##         tState <- tracingState(FALSE)
##         on.exit(tracingState(tState))
##     }
##     calls <- sys.calls()
##     from <- 0L
##     n <- length(calls)
##     if (identical(sys.function(n), recover)) 
##         n <- n - 1L
##     for (i in rev(seq_len(n))) {
##         calli <- calls[[i]]
##         fname <- calli[[1L]]
##         if (!is.na(match(deparse(fname)[1L], c("methods::.doTrace", 
##             ".doTrace")))) {
##             from <- i - 1L
##             break
##         }
##     }
##     if (from == 0L) 
##         for (i in rev(seq_len(n))) {
##             calli <- calls[[i]]
##             fname <- calli[[1L]]
##             if (!is.name(fname) || is.na(match(as.character(fname), 
##                 c("recover", "stop", "Stop")))) {
##                 from <- i
##                 break
##             }
##         }
##     if (from > 0L) {
##         if (!interactive()) {
##             try(dump.frames())
##             cat(gettext("recover called non-interactively; frames dumped, use debugger() to view\n"))
##             return(NULL)
##         }
##         else if (identical(getOption("show.error.messages"), 
##             FALSE)) 
##             return(NULL)
##         calls <- limitedLabels(calls[1L:from])
##         repeat {
##             which <- menu(calls, title = "\nEnter a frame number, or 0 to exit  ")
##             if (which) 
##                 eval(substitute(browser(skipCalls = skip), list(skip = 7 - 
##                   which)), envir = sys.frame(which))
##             else break
##         }
##     }
##     else cat(gettext("No suitable frames for recover()\n"))
## })()

# When done debugging
gg

##  * debugging is now OFF - option error set to NULL

# The option is now back to NULL
getOption("error")

## NULL

Making it practical (and a bit less barbaric)

Defining all the shortcuts in the way shown above every time is both tedious and ugly, making a mess in our global environment. We can therefore decrease the tedium and ugliness by:

Adding the definitions into our .Rprofile with a proper notice, which will run the definitions and make the shortcuts available every time we start R in the standard way
Enclosing the definitions into a separate environment attached to the search path, potentially with a command to detach it easily

Such an .Rprofile can look similar to:

message("________________________________________")
message("|                                      |")
message("|      SOURCING CUSTOM .Rprofile       |")
message("|                                      |")
message("|  * qq => quit('no')                  |")
message("|  * gg => toggle error = recover/NULL |")
message("|  * dd => detach this madness         |")
message("|______________________________________|")
message("\n")

customCommands <- new.env()

assign("qq", structure("no", class = "quitterclass"), envir = customCommands)
assign("print.quitterclass", function(quitter) {
  message(" * quitting, not saving workspace")
  base::quit(quitter[1L])
}, envir = customCommands)

assign("gg", structure("", class = "debuggerclass"), envir = customCommands)
assign("print.debuggerclass", function(debugger) {
  if (!identical(getOption("error"), as.call(list(utils::recover)))) {
    options(error = recover)
    message(" * debugging is now ON - option error set to recover")
  } else {
    options(error = NULL)
    message(" * debugging is now OFF - option error set to NULL")
  }
}, envir = customCommands)

assign("dd", structure("", class = "detacherclass"), envir = customCommands)
assign("print.detacherclass", function(detacher) {
  detach(customCommands, unload = TRUE, force = TRUE)
}, envir = customCommands)

attach(customCommands)
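Since every shortcut follows the same pattern (an object of a single-purpose class plus a print method for that class), the pattern can be wrapped into a small helper. The sketch below is only an illustration; makeBracketless and the wd command are hypothetical names that are not part of the .Rprofile above.

```r
# Hypothetical helper: register `name` as a bracketless command that runs `fun`
makeBracketless <- function(name, fun, envir = parent.frame()) {
  cls <- paste0(name, "_command")
  # The object whose name we will type into the console
  assign(name, structure(list(), class = cls), envir = envir)
  # The print method that does the actual work when the object is printed
  assign(paste0("print.", cls), function(x, ...) {
    fun()
    invisible(x)
  }, envir = envir)
  invisible(name)
}

# Define a `wd` command that reports the current working directory
makeBracketless("wd", function() message(" * working directory: ", getwd()))

# Typing `wd` into the console would now be enough;
# in a script we have to call print() explicitly
print(wd)
```

Passing envir = customCommands would place both objects into the attached environment instead of the calling one.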

In terminal environments, shortcuts like this can be even more useful:

Tends to be more useful in the terminal

References

Rprofile chapter of Efficient R programming
Documentation on print
Documentation on options to set and examine a variety of global options.

Today, September 1st 2018 the Constitution of the Slovak Republic celebrates its 26th anniversary. Happy Birthday!

R:case4base - code profiling with base R

Sat, 18 Aug 2018 13:00:00 +0000

Introduction

In this summertime post in the case4base series, we will look at useful tools in base R, which let us profile our code without any extra packages needed to be installed. We will cover simple and easy to use speed profiling, more complex profiling of performance and memory and, as always, look at alternatives to base R as well, with a special shout out to profiling integration in RStudio.

Simple time profiling with `system.time`

Base function system.time returns the difference between two proc.time calls within which it evaluates an expression provided as argument. The simplest usage can be seen below:

system.time(runif(10^8))

##    user  system elapsed 
##   4.376   0.448   4.836
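To illustrate the "difference between two proc.time calls", the following sketch does by hand roughly what system.time does for us. It is a simplification: the real function also offers the gcFirst argument and takes care of errors in the evaluated expression.

```r
# Hand-rolled equivalent of system.time(runif(10^6))
start <- proc.time()
invisible(runif(10^6))
manual <- proc.time() - start

# The wall-clock time in seconds
manual[["elapsed"]]
```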

To process the results, we can of course store them in a variable and examine it, where we can see that the result is in fact a numeric vector with 5 elements and a proc_time class, which is printed in a summarized form by the print.proc_time method. For most of our purposes, we would be interested in the “elapsed” element of the result, giving us the ‘real’ (wall-clock) time that elapsed while the expression was evaluated:

tm <- system.time(runif(10^8))

str(tm)

## Class 'proc_time'  Named num [1:5] 4.376 0.448 4.836 0 0
##   ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...

tm["elapsed"]

## elapsed 
##   4.836

We can also very simply run multiple observations for an expression and investigate the results:

expr <- rep(expression(runif(10^8)), 10L)
tm <- unlist(lapply(expr, function(x) system.time(eval(x))["elapsed"]))
summary(tm)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.779   4.804   4.816   4.828   4.854   4.893

With a little tweaking we can also run it in a separate process to not block our R session:

script <- shQuote(paste(
  'expr <- rep(expression(runif(10^7)), 10L)',
  'tm <- unlist(lapply(expr, function(x) system.time(eval(x))["elapsed"]))',
  'print(summary(tm))',
  sep = ';'
))

system2('Rscript', args = c('-e', script), wait = FALSE)

Profile R execution with `Rprof`

The utils package included in the base R releases contains a very useful pair of functions for profiling by sampling the call stack at fixed time intervals:

Use utils::Rprof() to enable the R profiling, run the code to be profiled and use utils::Rprof(NULL) to disable profiling
Afterwards, use utils::summaryRprof() to investigate the results

The simplest usage is really this straightforward:

# Enable profiling
utils::Rprof()

# Run the code to be profiled
x <- lapply(10^(6:7),  runif)
y <- lapply(x, summary)
z <- sort(x[[2]])

# Disable profiling
utils::Rprof(NULL)

# Read the profiling results and view
res <- utils::summaryRprof()
res[["by.self"]]

The profiling can be customized with arguments such as filename, which specifies the file the results will be written to (and also serves as the off switch if set to NULL or ""), and interval, which governs the time between profiling samples. More can be found in the function’s help.
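As a small sketch of the filename and interval arguments, the example below writes the samples into a temporary file and samples every 5 milliseconds instead of the default 20. The workload and the variable names are arbitrary choices for illustration.

```r
# Write the profiling samples into a temporary file, sampling every 5 ms
profFile <- tempfile(fileext = ".out")
utils::Rprof(filename = profFile, interval = 0.005)

# Some work to be profiled
for (i in seq_len(20L)) {
  x <- sort(runif(5e5))
}

# Disable profiling
utils::Rprof(NULL)

# Summarize the results from the custom file
res <- utils::summaryRprof(filename = profFile)
res[["sample.interval"]]
res[["sampling.time"]]
head(res[["by.total"]])

unlink(profFile)
```

A shorter interval gives more samples for short-running code, at the price of more overhead and a bigger results file.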

Perhaps the most interesting argument is memory.profiling which if set to TRUE will add memory information into the results file:

# Enable profiling with memory profiling
utils::Rprof(filename = "ProfwMemory.out", memory.profiling = TRUE)

# Run the code to be profiled
x <- lapply(10^(6:7),  runif)
y <- lapply(x, summary)
z <- sort(x[[2]])

# Disable profiling
utils::Rprof(NULL)

# Read the profiling results and view results in different ways
utils::summaryRprof(
  filename = "ProfwMemory.out",
  memory = c("stats"),
  lines = "show"
)

utils::summaryRprof(
  filename = "ProfwMemory.out",
  memory = c("both")
)[["by.self"]]

Non-sampling memory use profiling with `Rprofmem`

Base R also offers an option to profile memory use (if R is compiled with R_MEMORY_PROFILING defined) using Rprofmem - a pure memory use profiler. Results are written as simple text into a file, from which they can be read, however the processing of the result may use a bit more polishing here:

# Enable memory profiling
utils::Rprofmem("Rprofmem.out", threshold = 10240)

# Run the code to be profiled
x <- runif(10^5)
y <- runif(10^6)
z <- runif(10^7)

# Disable profiling
utils::Rprofmem(NULL)

# Read the results
readLines("Rprofmem.out")
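One possible way to polish the raw output a little is sketched below. It assumes that each logged allocation is written as the number of bytes, followed by a colon and the call stack, and it simply ignores any other lines. The capabilities check guards against R builds without memory profiling support.

```r
profFile <- tempfile(fileext = ".out")
bytes <- numeric(0)

# Rprofmem() is only usable when R was built with memory profiling support
if (capabilities("profmem")) {
  utils::Rprofmem(profFile, threshold = 10240)
  x <- runif(10^6)
  utils::Rprofmem(NULL)

  # Keep lines of the form "<bytes> :<call stack>" and extract the byte counts
  profLines <- readLines(profFile)
  allocLines <- grep("^[0-9]+ :", profLines, value = TRUE)
  bytes <- as.numeric(sub(" :.*$", "", allocLines))
  unlink(profFile)
}

# Number of logged allocations and the total number of bytes they requested
length(bytes)
sum(bytes)
```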

If our concern is specifically the copying of (large) objects, which negatively impacts the memory requirements of our work, we can (provided that R is compiled with --enable-memory-profiling) use tracemem(object) to mark an object so that a message is printed whenever it is duplicated. untracemem(object) stops tracing the object.
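A minimal sketch of tracking duplication with tracemem follows. The calls are guarded with capabilities("profmem"), as tracemem is not usable in R builds without memory profiling support; the objects x and y are arbitrary.

```r
x <- c(1, 2, 3)

# Start reporting duplications of x, if this build of R supports it
if (capabilities("profmem")) {
  tracemem(x)
}

y <- x      # no copy yet, x and y still share the same memory
y[1] <- 0   # modifying y forces a copy, which is reported when tracing

# Stop tracing
if (capabilities("profmem")) {
  untracemem(x)
}
```

The message appears only at the second assignment, which shows that the copy is made when one of the objects is modified and not when y is created.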

For more details see the references section.

Profiling integration within RStudio

Even though this does not really adhere to the case4base rules, we still mention the RStudio profiling integration, which is done using the profvis package and if successful, works really well and provides informative graphical outputs. All we have to do is either select a chunk of code and click on Profile -> Profile Selected Line(s), or click on Profile -> Start Profiling, run our code and then Profile -> Stop profiling. RStudio should then automatically use profvis to produce an interactive output that allows nice exploration of the results:

RStudio+profvis

Background profiling with base R via an RStudio addin

We have also created and written about an RStudio addin that lets users profile R code selected in RStudio, with the advantage that the profiling runs asynchronously in a separate process not blocking the current R session and also not requiring external packages such as profvis. You can read more about it and get it here.

Alternatives to base R

References

Profiling R code for speed at Writing R Extensions
Profiling R code for memory use at Writing R Extensions
system.time help
Memory profiling in R

RStudio:addins part 5 - Profile your code on keypress in the background, with no dependencies

Sat, 04 Aug 2018 12:00:00 +0000

Introduction

Profiling our code is a very useful tool to determine how well the code performs on different metrics.

The addin we will create in this article will let us use a keyboard shortcut to run profiling on R code selected in RStudio without blocking the session or requiring any external packages.

For a very simple overview, it may be beneficial to look at the time needed for a set of expressions to compute, i.e. how fast the code is. Secondly, and especially important when computing on big datasets in-memory, we can look at the amount of memory utilized, i.e. how much RAM was used.

The addin in action

Profiling options provided by base R

Without going into any detail at all, we have 2 very nice options to profile our code with base R:

base::system.time(expr) - returns CPU and other times that expr used
utils::Rprof - can serve as a switch to enable and disable profiling, with a variety of options, saving the results into a file on disk, by default "Rprof.out"

For the use of our addin, we will utilize the second approach, as we are interested not only in time spent, but also in memory utilization of the profiled expressions.

After finishing the profiling, we will use utils::summaryRprof to summarize the results provided to us by the Rprof functionality mentioned above. To get an overview, we will examine only the total time the selected expressions took to execute and the maximum memory.

The very simplistic implementation can look as follows:

profileExpression <- function(expr) {
  on.exit({
    unlink("Rprof.out")
    utils::Rprof(NULL)
  })

  if (!is.expression(expr)) {
    message("expr must be an expression in profileExpression()")
    return(data.frame(
      totalTime = numeric(0),
      maxMemory = numeric(0)
    ))
  }
  gc()
  utils::Rprof(
    memory.profiling = TRUE,
    interval = 0.01,
    append = FALSE
  )
  evalRes <- try(eval(expr), silent = TRUE)
  utils::Rprof(NULL)
  if (inherits(evalRes, "try-error")) {
    return(data.frame(stringsAsFactors = FALSE,
                      totalTime = "EvalError",
                      maxMemory = "EvalError"
    ))
  }
  res <- utils::summaryRprof(memory = "both")
  data.frame(
    totalTime = max(res[["by.total"]][, 1L]),
    maxMemory = max(res[["by.total"]][, 5L])
  )
}

Since we may be interested in more than one execution of the expressions to be profiled and the profiling will be running in the background, a wrapper executing the profiling itself multiple times may come in handy. Apart from the number of times to execute, which is a very standard argument, we can also attempt to provide a time frame we want to invest into the profiling:

multiProfile <- function(
  expr,
  times = 10L,
  maxtime = getOption("jhaddins_profiler_maxtime", default = NULL)
){
  if (!(is.integer(times) || is.integer(maxtime))) {
    message("Times or maxtime must be integer in multiProfile()")
    return(data.frame(
      totalTime = numeric(0),
      maxMemory = numeric(0)
    ))
  }

  first <- profileExpression(expr)
  if (!is.null(maxtime)) {
    if (is.numeric(first[["totalTime"]])) {
      times <- floor(maxtime / first[["totalTime"]])
    } else {
      message("Eval failed, cannot compute times from maxtime.")
      return(first)
    }
  }
  if (times <= 1L) {
    return(first)
  }
  rest <- do.call(
    rbind,
    lapply(rep(list(expr), times - 1L), profileExpression)
  )
  rbind(first, rest)
}

Asynchronous execution and communication of the results with the session

Since we are only using base R functionality without taking advantage of external packages that would help us execute the profiling asynchronously, we have 3 challenges:

Asynchronous execution of the profiling

We can take advantage of base R’s convenient interface system2, which allows us to invoke OS commands, with the option to run asynchronously by providing wait = FALSE as argument.

Communicating the results between our R session and the one running via system2

To kill two birds with one stone, we can simply use the rstudioapi to navigate to a created file, into which we will later write the profiling results using the asynchronously running process. This way we have the results immediately available within RStudio and we can keep working conveniently on the tasks at hand. Since our application is very simple, we also avoid complications with communication between the processes for example via sockets.

Contents of the workspace

When selecting a code chunk to profile in RStudio, it will likely happen very soon that the execution of expressions included in the selected code will rely on the current state of the global environment (aka. workspace). We can therefore make our functionality more convenient by storing the contents of the global environment on disk and loading it before running the profiler in our asynchronous process.
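Stripped of everything RStudio-related, the three ideas above can be sketched with base R only. This is a simplified illustration and not the addin's code: the file names are temporary files, a plain system.time stands in for the profiler, and the current session just polls for the result file instead of opening it in an editor.

```r
# Forward slashes keep the paths safe when pasted into the child script
toScriptPath <- function(path) chartr("\\", "/", path)
wsFile  <- toScriptPath(tempfile(fileext = ".RData"))
resFile <- toScriptPath(tempfile(fileext = ".txt"))

# 1. Store the objects the measured code relies on (here just `n`)
n <- 1e5L
save(n, file = wsFile)

# 2. Build the child script: load the workspace, time the code, write result
script <- paste(sep = "; ",
  sprintf("load('%s')", wsFile),
  "tm <- system.time(sort(runif(n)))[['elapsed']]",
  sprintf("writeLines(format(tm), '%s')", resFile)
)

# 3. Start the child process without blocking the current session
system2(
  command = file.path(R.home("bin"), "Rscript"),
  args = c("-e", shQuote(script)),
  wait = FALSE
)

# The session stays responsive; here we simply poll until the result arrives
deadline <- Sys.time() + 60
while (!isTRUE(file.size(resFile) > 0) && Sys.time() < deadline) {
  Sys.sleep(0.2)
}
elapsed <- as.numeric(readLines(resFile))
elapsed
```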

A simple example implementation of the thoughts above is once again presented below. Note that this implementation is very bare-bones and could use much polishing, which may happen sometime after publishing this article:

runProfiler <- function(
  inpContext = rstudioapi::getActiveDocumentContext()
){
  force(inpContext)
  inpString <- inpContext[["selection"]][[1L]][["text"]]
  cat(inpString, file = file.path("~/temp.R"))
  expr <- try(parse("~/temp.R"), silent = TRUE)
  if (inherits(expr, "try-error")) {
    message("Selected text cannot be parsed, cannot profile.")
    unlink(file.path("~/temp.R"))
    return(1L)
  }
  save(
    list = ls(all.names = TRUE, envir = .GlobalEnv),
    file = "~/tmp.RData",
    envir = .GlobalEnv
  )
  script <- paste(sep = "; ",
    "load('~/tmp.RData')",
    "res <- jhaddins:::multiProfile(parse('~/temp.R'))",
    "jhaddins:::writeProfileDf(res)",
    "unlink('~/temp.R')",
    "unlink('~/tmp.RData')"
  )
  file.create("~/tmp_prof.txt")
  rstudioapi::navigateToFile("~/tmp_prof.txt")
  system2(
    command = 'Rscript',
    args = c('-e', shQuote(script)),
    wait = FALSE
  )
  message("Profiler running in the background")
}

Results of the profiling

For the use case this simple functionality was developed for, the main interest is in 2 very simple pieces of information - how fast the expressions executed and how much memory was utilized at maximum. This is why the results are extracted and written in an extremely simplistic way, as can be seen below:

“quand il n’y a plus rien à retrancher”

Based on real-life usage we may still improve the presentation (a bit ;) in the future.

The addin formalities

If you have been following this blog for a while, you can safely skip this part. A few things to make our new addin available and easy to use:

Add the addin bindings into inst/addins.dcf

Name: runProfiler
Description: experimental, runProfiler
Binding: runProfiler
Interactive: false

Re-install the package
Assign a keyboard shortcut in the Tools -> Addins -> Browse Addins... -> Keyboard Shortcuts... menu in RStudio:

Assigning a keyboard shortcut to use the Addin

TL;DR - Just give me the package

https://gitlab.com/jozefhajnala/jhaddins.git

References

Profiling R code for speed at Writing R Extensions
Profiling R code for memory use at Writing R Extensions
system.time help
Profvis package with useful graphical overviews.
Microbenchmark package infrastructure to accurately measure and compare the execution time of R expressions
parallel package
callr package - to perform a computation in a separate R process

RStudio:addins part 4 - Unit testing coverage investigation and improvement, made easy

Sat, 21 Jul 2018 14:00:00 +0000

Introduction

A developer always pays his technical debts! And we have a debt to pay to the gods of coding best practices, as we did not present many unit tests for our functions yet. Today we will show how to efficiently investigate and improve unit test coverage for our R code, with focus on functions governing our RStudio addins, which have their own specifics.

As a practical example, we will do a simple restructuring of one of our functions to increase its test coverage from a mere 34% to over 90%.

The pretty rewards for your tests

Fly-through of unit testing in R

Much has been written on the importance of unit testing, so we will not spend more time on convincing the readers, but rather very quickly provide a few references in case the reader is new to unit testing with R. In the later parts of the article we assume that these basics are known.

In a few words

devtools - Makes package development easier by providing R functions that simplify common tasks
testthat - Is the most popular unit testing package for R
covr - Helps track test coverage for R packages and view reports locally or (optionally) upload the results

For a start guide to use testthat within a package, visit the Testing section of R packages by Hadley Wickham. I would also recommend checking out the showcase on the 2.0.0 release of the testthat itself.

Investigating test coverage within a package

For the purpose of investigating the test coverage of a package we can use the covr package. Within an R project, we can call the package_coverage() function to get a nicely printed high-level overview, or we can provide a specific path to a package root directory and call it as follows:

# This looks much prettier in the R console ;)
covr::package_coverage(pkgPath)

## jhaddins Coverage: 59.05%

## R/viewSelection.R: 34.15%

## R/addRoxytag.R: 40.91%

## R/makeCmd.R: 92.86%

For a deeper investigation, converting the results to a data.frame might be very useful. The below shows, for each group of code lines, the number of times the given expression was called while running our tests:

covResults <- covr::package_coverage(pkgPath)
as.data.frame(covResults)[, c(1:3, 5, 11)]

##             filename         functions first_line last_line value
## 1     R/addRoxytag.R            roxyfy         10        12     6
## 2     R/addRoxytag.R            roxyfy         11        11     2
## 3     R/addRoxytag.R            roxyfy         13        15     4
## 4     R/addRoxytag.R            roxyfy         14        14     2
## 5     R/addRoxytag.R            roxyfy         16        16     2
## 6     R/addRoxytag.R            roxyfy         17        17     2
## 7     R/addRoxytag.R            roxyfy         18        18     2
## 8     R/addRoxytag.R        addRoxytag         29        29     0
## 9     R/addRoxytag.R        addRoxytag         30        37     0
## 10    R/addRoxytag.R        addRoxytag         32        34     0
## 11    R/addRoxytag.R        addRoxytag         38        38     0
## 12    R/addRoxytag.R    addRoxytagCode         44        44     0
## 13    R/addRoxytag.R    addRoxytagLink         50        50     0
## 14    R/addRoxytag.R     addRoxytagEqn         56        56     0
## 15       R/makeCmd.R           makeCmd         20        24     5
## 16       R/makeCmd.R           makeCmd         21        21     0
## 17       R/makeCmd.R           makeCmd         23        23     5
## 18       R/makeCmd.R           makeCmd         25        27     5
## 19       R/makeCmd.R           makeCmd         26        26     4
## 20       R/makeCmd.R           makeCmd         28        32     5
## 21       R/makeCmd.R           makeCmd         33        35     5
## 22       R/makeCmd.R           makeCmd         34        34     2
## 23       R/makeCmd.R           makeCmd         36        38     5
## 24       R/makeCmd.R           makeCmd         37        37     1
## 25       R/makeCmd.R           makeCmd         39        39     5
## 26       R/makeCmd.R      replaceTilde         48        50     1
## 27       R/makeCmd.R      replaceTilde         49        49     1
## 28       R/makeCmd.R      replaceTilde         51        51     1
## 29       R/makeCmd.R        executeCmd         61        61     5
## 30       R/makeCmd.R        executeCmd         62        66     5
## 31       R/makeCmd.R        executeCmd         68        72     3
## 32       R/makeCmd.R        executeCmd         69        69     0
## 33       R/makeCmd.R        executeCmd         71        71     3
## 34       R/makeCmd.R runCurrentRscript         90        90     1
## 35       R/makeCmd.R runCurrentRscript         91        91     1
## 36       R/makeCmd.R runCurrentRscript         92        96     1
## 37       R/makeCmd.R runCurrentRscript         93        95     1
## 38       R/makeCmd.R runCurrentRscript         94        94     0
## 39 R/viewSelection.R     viewSelection          7         7     0
## 40 R/viewSelection.R     viewSelection          8        12     0
## 41 R/viewSelection.R     viewSelection         10        10     0
## 42 R/viewSelection.R     viewSelection         13        13     0
## 43 R/viewSelection.R  getFromSysframes         24        24     6
## 44 R/viewSelection.R  getFromSysframes         25        25     3
## 45 R/viewSelection.R  getFromSysframes         26        26     3
## 46 R/viewSelection.R  getFromSysframes         28        28     3
## 47 R/viewSelection.R  getFromSysframes         29        29     3
## 48 R/viewSelection.R  getFromSysframes         30        30     3
## 49 R/viewSelection.R  getFromSysframes         31        31    92
## 50 R/viewSelection.R  getFromSysframes         32        32    92
## 51 R/viewSelection.R  getFromSysframes         33        33    92
## 52 R/viewSelection.R  getFromSysframes         34        34     2
## 53 R/viewSelection.R  getFromSysframes         37        37     1
## 54 R/viewSelection.R        viewObject         56        56     3
## 55 R/viewSelection.R        viewObject         57        57     3
## 56 R/viewSelection.R        viewObject         58        58     3
## 57 R/viewSelection.R        viewObject         61        61     0
## 58 R/viewSelection.R        viewObject         64        64     0
## 59 R/viewSelection.R        viewObject         65        65     0
## 60 R/viewSelection.R        viewObject         66        66     0
## 61 R/viewSelection.R        viewObject         69        69     0
## 62 R/viewSelection.R        viewObject         70        70     0
## 63 R/viewSelection.R        viewObject         71        71     0
## 64 R/viewSelection.R        viewObject         74        74     0
## 65 R/viewSelection.R        viewObject         76        76     0
## 66 R/viewSelection.R        viewObject         77        77     0
## 67 R/viewSelection.R        viewObject         79        79     0
## 68 R/viewSelection.R        viewObject         81        81     0
## 69 R/viewSelection.R        viewObject         82        82     0
## 70 R/viewSelection.R        viewObject         83        83     0
## 71 R/viewSelection.R        viewObject         88        88     0
## 72 R/viewSelection.R        viewObject         89        89     0
## 73 R/viewSelection.R        viewObject         91        91     0
## 74 R/viewSelection.R        viewObject         92        92     0
## 75 R/viewSelection.R        viewObject         93        93     0
## 76 R/viewSelection.R        viewObject         96        96     0
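Once the results are in a data.frame, base R is enough to summarize them further. The sketch below uses a small hand-copied excerpt of the rows above instead of a real coverage object and computes, per function, the share of traced expressions that were hit at least once. Note that this per-expression share is only an illustration and is not the same number as the line-based percentages printed by package_coverage.

```r
# A hand-copied excerpt of the table above (rows 39-45 and 54-58)
covDf <- data.frame(
  functions = c(
    rep("viewSelection", 4L),
    rep("getFromSysframes", 3L),
    rep("viewObject", 5L)
  ),
  value = c(0, 0, 0, 0, 6, 3, 3, 3, 3, 3, 0, 0),
  stringsAsFactors = FALSE
)

# Share of traced expressions per function that were called at least once
hitShare <- aggregate(
  value ~ functions,
  data = covDf,
  FUN = function(v) mean(v > 0)
)
hitShare
```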

Calling covr::zero_coverage with a coverage object returned by package_coverage will provide a data.frame with locations that have 0 test coverage. The nice thing about running it within RStudio is that it outputs the results on the Markers tab in RStudio, where we can easily investigate:

zeroCov <- covr::zero_coverage(covResults)

zero_coverage markers

Test coverage for RStudio addin functions

Investigating our code, let us focus on the results for the viewSelection.R, which has a very weak 34% test coverage. We can analyze exactly which lines have no test coverage in a specific file:

zeroCov[zeroCov$filename == "R/viewSelection.R", "line"]

##  [1]  7  8  9 10 11 12 13 61 64 65 66 69 70 71 74 76 77 79 81 82 83 88 89
## [24] 91 92 93 96

Looking at the code, we can see that the first chunk of lines - 7:13 - represents the viewSelection function, which just calls lapply and invisibly returns NULL. The main weak spot however is the function viewObject, of which we only test the early return in case of an invalid chr argument provided. None of the other functionality is tested.

The reason behind this is that when running the tests, RStudio functionality is not available and therefore we would not be able to test even the not-so-well designed return values, as they are almost always preceded by a call to rstudioapi or other RStudio-related functionality such as the object viewer, because that is what they are designed to do. This means we must restructure the code in such a way that we contain the RStudio-dependent functionality to a necessary minimum, keeping a big majority of the code testable - only calling the side-effecting rstudioapi when actually executing the addin functionality itself.

Rewriting an addin function for better coverage

We will now show one potential way to solve this issue for the particular case of our viewObject function.

The idea behind the solution is to only return the arguments for the call to the RStudio API related functionality, instead of executing them in the function itself - hence the rename to getViewArgs.

This way we can test the function’s return value against the expected arguments and only execute them with do.call in the addin execution wrapper itself. A picture may be worth a thousand words, so here is the diff with relevant changes:

Refactoring for testability
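
For readers who cannot see the diff, the pattern can also be sketched in plain R. The following is a minimal, hypothetical version and not the actual jhaddins code: the names getViewArgsSketch and runViewAddinSketch, the input checks and the use of eval(parse()) are all assumptions made for illustration.

```r
# Hypothetical sketch of the refactoring idea (not the jhaddins implementation)

# Pure function: only builds the arguments for the later call, no side effects
getViewArgsSketch <- function(chr) {
  # Early return for invalid input
  if (!is.character(chr) || length(chr) != 1L || is.na(chr) || !nzchar(chr)) {
    return(invisible(NULL))
  }
  # Assumption: the selected text is evaluated to obtain the object to view
  obj <- tryCatch(eval(parse(text = chr)), error = function(e) NULL)
  if (is.null(obj)) {
    return(invisible(NULL))
  }
  list(what = "View", args = list(x = obj, title = chr))
}

# Thin wrapper: the only place where the side-effecting call is executed
runViewAddinSketch <- function(chr) {
  viewArgs <- getViewArgsSketch(chr)
  if (is.null(viewArgs)) {
    return(invisible(NULL))
  }
  do.call(viewArgs$what, viewArgs$args)
}
```

With this split, everything in getViewArgsSketch can be unit tested without RStudio, and only the few lines of the wrapper remain untestable.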

Testing the rewritten function and gained coverage

Now that our return values are testable across the entire getViewArgs function, we can easily write tests to cover the entire function, a couple examples:

test_that("getViewArgs for function"
        , expect_equal(
            getViewArgs("reshape")
          , list(what = "View", args = list(x = reshape, title = "reshape"))
          )
        )

test_that("getViewArgs for data.frame"
        , expect_equal(
            getViewArgs("datasets::women")
          , list(what = "View",
                 args = list(x = data.frame(
                     height = c(58, 59, 60, 61, 62, 63, 64, 65,
                                66, 67, 68, 69, 70, 71, 72),
                     weight = c(115, 117, 120, 123, 126, 129, 132, 135,
                                139, 142, 146, 150, 154, 159, 164)
                     ),
                   title = "datasets::women"
                   )
            )
          )
        )

Looking at the test coverage provided after our changes, we can see that we are at more than 90% coverage for viewSelection.R:

# This looks much prettier in the R console ;)
covResults <- covr::package_coverage(pkgPath)
covResults

## jhaddins Coverage: 82.05%

## R/addRoxytag.R: 40.91%

## R/viewSelection.R: 90.57%

## R/makeCmd.R: 92.86%

And looking at the lines that are not covered for viewSelection.R, we can indeed see that the uncovered lines left are mainly those of the viewSelection function, which is responsible only for executing the addin itself:

covResults <- as.data.frame(covResults)
covResults[covResults$filename == "R/viewSelection.R" &
             covResults$value == 0, c(1:3, 5, 11)]

##             filename     functions first_line last_line value
## 59 R/viewSelection.R viewSelection          7         7     0
## 60 R/viewSelection.R viewSelection          8        11     0
## 61 R/viewSelection.R viewSelection         10        10     0
## 62 R/viewSelection.R viewSelection         12        12     0
## 74 R/viewSelection.R    viewObject         50        50     0
## 75 R/viewSelection.R    viewObject         51        51     0

In an ideal world we would of course also want to automate the testing of the addin execution itself by examining whether its effects in the RStudio IDE are as expected; however, this is far beyond the scope of this post. For some of our addin functionality we can, however, directly test the side effects, such as when the addin should produce a file with certain content.
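
As a rough illustration of testing such a file-producing side effect, here is a minimal base R sketch. The helper writeTagFile is hypothetical and not part of jhaddins; inside a testthat test, the stopifnot check would typically be replaced by expect_equal.

```r
# Hypothetical addin helper (illustrative only): writes one roxygen-style
# tag line into the file at `path`
writeTagFile <- function(path, tag = "@export") {
  writeLines(paste("#'", tag), con = path)
  invisible(path)
}

# Test the side effect: write to a temporary file and check its content
tmp <- tempfile(fileext = ".R")
writeTagFile(tmp)
stopifnot(identical(readLines(tmp), "#' @export"))
unlink(tmp)
```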

TL;DR - Just give me the package

get the status of the package after this article
or use git clone from https://gitlab.com/jozefhajnala/jhaddins.git

References

Testthat - unit testing for R
Testing chapter of R packages by Hadley Wickham
covr - Track test coverage for your R package

A primer in using Java from R - part 2

Sat, 07 Jul 2018 13:00:00 +0000

Introduction

In this part of the primer we discuss creating and using custom .jar archives within our R scripts and packages, handling of Java exceptions from R and a quick look at performance comparison between the low and high-level interfaces provided by rJava.

In the first part we talked about using the rJava package to create objects, call methods and work with arrays, we examined the various ways to call Java methods and calling Java code from R directly via execution of shell commands.

R <3 Java, or maybe not?

Using rJava with custom built classes

Preparing a .jar archive for use

Getting back to our example with running the main method of our HelloWorldDummy class from the first part of this primer, in practice we most likely want to actually create objects and invoke methods for such classes rather than simply call the main method.

For our resources to be available to rJava, we need to create a .jar archive and add it to the class path. An example of the process can be as follows. Compile our code to create the class file, and jar it:

$ javac DummyJavaClassJustForFun/HelloWorldDummy.java
$ cd DummyJavaClassJustForFun/
$ jar cvf HelloWorldDummy.jar HelloWorldDummy.class

Adding the .jar file to the class path

Within R, attach rJava, initialize the JVM and investigate our current class path using .jclassPath:

library(rJava)
.jinit()

.jclassPath()

Now, we add our newly created .jar to the class path using .jaddClassPath:

.jaddClassPath(paste0(jardir, "HelloWorldDummy.jar"))

If this worked, we can see the added jar(s) in the class path if we call .jclassPath() again.

Creating objects, investigating methods and fields

Now that we have our .jar in the class path, we can create a new Java object from our class:

dummyObj <- .jnew("DummyJavaClassJustForFun/HelloWorldDummy")
str(dummyObj)

## Formal class 'jobjRef' [package "rJava"] with 2 slots
##   ..@ jobj  :<externalptr> 
##   ..@ jclass: chr "DummyJavaClassJustForFun/HelloWorldDummy"

We can also investigate the available constructors, methods and fields for our class (or provide the object as argument, then its class will be queried):

.jconstructors returns a character vector with all constructors for a given class or object
.jmethods returns a character vector with all methods for a given class or object
.jfields returns a character vector with all fields (aka attributes) for a given class or object
.DollarNames returns all fields and methods associated with the object. Method names are followed by ( or () depending on arity.

# Requesting vectors of methods, constructors and fields by class
.jmethods("DummyJavaClassJustForFun/HelloWorldDummy")

##  [1] "public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()"              
##  [2] "public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])"      
##  [3] "public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException"   
##  [4] "public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException"
##  [5] "public final void java.lang.Object.wait() throws java.lang.InterruptedException"           
##  [6] "public boolean java.lang.Object.equals(java.lang.Object)"                                  
##  [7] "public java.lang.String java.lang.Object.toString()"                                       
##  [8] "public native int java.lang.Object.hashCode()"                                             
##  [9] "public final native java.lang.Class java.lang.Object.getClass()"                           
## [10] "public final native void java.lang.Object.notify()"                                        
## [11] "public final native void java.lang.Object.notifyAll()"

.jconstructors("DummyJavaClassJustForFun/HelloWorldDummy")

## [1] "public DummyJavaClassJustForFun.HelloWorldDummy()"

.jfields("DummyJavaClassJustForFun/HelloWorldDummy")

## NULL

# Requesting vectors of methods, constructors and fields by object
.jmethods(dummyObj)

##  [1] "public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()"              
##  [2] "public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])"      
##  [3] "public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException"   
##  [4] "public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException"
##  [5] "public final void java.lang.Object.wait() throws java.lang.InterruptedException"           
##  [6] "public boolean java.lang.Object.equals(java.lang.Object)"                                  
##  [7] "public java.lang.String java.lang.Object.toString()"                                       
##  [8] "public native int java.lang.Object.hashCode()"                                             
##  [9] "public final native java.lang.Class java.lang.Object.getClass()"                           
## [10] "public final native void java.lang.Object.notify()"                                        
## [11] "public final native void java.lang.Object.notifyAll()"

.jconstructors(dummyObj)

## [1] "public DummyJavaClassJustForFun.HelloWorldDummy()"

.jfields(dummyObj)

## NULL

Calling methods 3 different ways

We can now invoke our SayMyName method on this object in the three ways discussed in the first part of this primer:

# low level
lres <- .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")

# high level
hres <- J(dummyObj, method = "SayMyName") 

# convenient $ shorthand
dres <- dummyObj$SayMyName() 

c(lres, hres, dres)

## [1] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [2] "My name is DummyJavaClassJustForFun.HelloWorldDummy"
## [3] "My name is DummyJavaClassJustForFun.HelloWorldDummy"

Very quick look at performance

The low-level interface is much faster, since J has to use reflection to find the most suitable method. The $ operator seems to be the slowest, but it is also very convenient, as it supports code completion:

microbenchmark::microbenchmark(times = 100
, .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")
, J(dummyObj, "SayMyName")
, dummyObj$SayMyName()
)

## Unit: microseconds
##                                                 expr      min       lq
##  .jcall(dummyObj, "Ljava/lang/String;", "SayMyName")   45.503   65.507
##                             J(dummyObj, "SayMyName")  870.890  917.514
##                                 dummyObj$SayMyName() 1148.603 1217.089
##        mean    median       uq      max neval
##    95.20935   77.6195   84.445 1976.195   100
##  1091.08645  963.7035 1064.606 7603.580   100
##  1307.03536 1260.5855 1377.438 1731.829   100

Usage of jars in R packages

To use rJava within an R package, Simon Urbanek, the author of rJava, provides the convenience function .jpackage, which initializes the JVM and registers Java classes and native code contained in the package with it. A quick step-by-step guide to using .jars within a package is as follows:

place our .jars into inst/java/
add Depends: rJava and SystemRequirements: Java into our DESCRIPTION file
add a call to .jpackage(pkgname, lib.loc = libname) into our .onLoad function (or .First.lib), for example like so:

.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc = libname)
}

if possible, add .java source files into /java folder of our package

If you are interested in more detail than provided in this super-quick overview, Tobias Verbeke created a Hello Java World! package with a vignette providing a verbose step-by-step tutorial for interfacing to Java archives inside R packages.

Setting java.parameters

The .jpackage function calls .jinit with the default parameters = getOption("java.parameters"), so if we want to set some of the java parameters, we can do it for example like so:

.onLoad <- function(libname, pkgname) {
  options(java.parameters = c("-Xmx1000m"))
  .jpackage(pkgname, lib.loc = libname)
}

Note that the options call needs to be done before the call to .jpackage, as Java parameters can only be set during JVM initialization. Consequently, this will only work if no other package has already initialized the JVM.
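
One possible refinement, sketched below under the assumption that we want to respect a java.parameters value the user has already set, is to apply our default only when the option is still empty. The helper name setDefaultJavaParams is hypothetical and not part of rJava.

```r
# Hypothetical helper: set java.parameters only if the user has not set it
setDefaultJavaParams <- function(params = "-Xmx1000m") {
  if (is.null(getOption("java.parameters"))) {
    options(java.parameters = params)
  }
  # Return the value that is in effect after the call
  invisible(getOption("java.parameters"))
}

# Intended use inside a package (shown as comments, since it needs rJava):
# .onLoad <- function(libname, pkgname) {
#   setDefaultJavaParams("-Xmx1000m")
#   .jpackage(pkgname, lib.loc = libname)
# }
```

The same caveat applies: if the JVM is already running, changing the option has no effect on it.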

Handling Java exceptions in R

rJava maps Java exceptions to R conditions relayed by the stop function, therefore we can use the standard R mechanisms such as tryCatch to handle the exceptions.

The R condition object, assume we call it e for this, is actually an S3 object (a list) that contains:

call - a language object containing the call resulting in the exception
jobj - an S4 object containing the actual exception object, so we can for example investigate its class: e[["jobj"]]@jclass

tryCatch(
  iOne <- .jnew(class = "java/lang/Integer", 1),
  error = function(e) {
    message("\nLets look at the condition object:")
    str(e)
    
    message("\nClass of the jobj item:")
    print(e[["jobj"]]@jclass)
    
    message("\nClasses of the condition object: ")
    class(e)
  }
)

## 
## Lets look at the condition object:

## List of 3
##  $ message: chr "java.lang.NoSuchMethodError: <init>"
##  $ call   : language .jnew(class = "java/lang/Integer", 1)
##  $ jobj   :Formal class 'jobjRef' [package "rJava"] with 2 slots
##   .. ..@ jobj  :<externalptr> 
##   .. ..@ jclass: chr "java/lang/NoSuchMethodError"
##  - attr(*, "class")= chr [1:9] "NoSuchMethodError" "IncompatibleClassChangeError" "LinkageError" "Error" ...

## 
## Class of the jobj item:

## [1] "java/lang/NoSuchMethodError"

## 
## Classes of the condition object:

## [1] "NoSuchMethodError"            "IncompatibleClassChangeError"
## [3] "LinkageError"                 "Error"                       
## [5] "Throwable"                    "Object"                      
## [7] "Exception"                    "error"                       
## [9] "condition"

Since class(e) is a vector of simple java class names which allows the R code to use direct handlers, we can handle different such classes differently:

withCallingHandlers(
  iOne <- .jnew(class = "java/lang/Integer", 1)
  , error = function(e) {
    message("Meh, just a boring error")
  }
  , NoSuchMethodError = function(e) {
    message("We have a NoSuchMethodError")
  }
  , IncompatibleClassChangeError = function(e) {
    message("We also have a IncompatibleClassChangeError - lets recover")
    recover()
    # recovering here and looking at 
    # 2: .jnew(class = "java/lang/Integer", 1)
    # we see that the issue is in 
    # str(list(...))
    # List of 1
    #  $ : num 1
    # We actually passed a numeric, not integer
    # To fix it, just do
    # .jnew(class = "java/lang/Integer", 1L)
  }
  , LinkageError = function(e) {
    message("Ok, this is getting a bit overwhelming,
               lets smile and end here
               :o)")
  }
)

## Meh, just a boring error

## We have a NoSuchMethodError

## We also have a IncompatibleClassChangeError - lets recover

## recover called non-interactively; frames dumped, use debugger() to view

## Ok, this is getting a bit overwhelming,
##                lets smile and end here
##                :o)

## Error in .jnew(class = "java/lang/Integer", 1): java.lang.NoSuchMethodError: <init>
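
With withCallingHandlers, every matching handler runs and the error still propagates, as the output above shows. By contrast, tryCatch runs only the first matching handler and returns its value. Here is a minimal sketch of that behaviour, using a hand-built condition object with the same class vector as above so that it runs without a JVM; the fakeJavaError helper is purely illustrative and not part of rJava.

```r
# Simulated condition with the class vector rJava produced above (no JVM needed)
fakeJavaError <- function() {
  cond <- structure(
    list(message = "java.lang.NoSuchMethodError: <init>", call = NULL),
    class = c("NoSuchMethodError", "IncompatibleClassChangeError",
              "LinkageError", "Error", "Throwable", "Object",
              "Exception", "error", "condition")
  )
  stop(cond)
}

# Only the first handler whose class matches is executed
handled <- tryCatch(
  fakeJavaError(),
  NoSuchMethodError = function(e) "handled as NoSuchMethodError",
  error = function(e) "handled as generic error"
)

# A handler for a parent Java class also catches the condition
msg <- tryCatch(
  fakeJavaError(),
  LinkageError = function(e) conditionMessage(e)
)
```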

References

Hello Java World! vignette - a tutorial for interfacing to Java archives inside R packages by Tobias Verbeke
rJava basic crashcourse - at the rJava site on rforge, scroll down to the Documentation section
The JNI Type Signatures - at Oracle JNI specs
rJava documentation on CRAN
Calling Java code from R by prof. Darren Wilkinson

A primer in using Java from R - part 1

Sat, 23 Jun 2018 13:00:00 +0000

Introduction

This primer consists of two parts and its goal is to provide a walk-through of using resources developed in Java from R. It is structured more as a “note-to-future-self” than as a proper educational article; I hope, however, that some readers may still find it useful. It will also list a set of references that I found very helpful, for which I thank the respective authors.

The primer is split into 2 posts:

In this first one we talk about using the rJava package to create objects, call methods and work with arrays; we examine the various ways to call Java methods and how to call Java code from R directly via execution of shell commands.
In the second one we discuss creating and using custom .jar archives within our R scripts and packages, handling of Java exceptions from R and a quick look at performance comparison between the low and high-level interfaces provided by rJava.

R <3 Java, or maybe not?

Calling Java from R directly

Calling Java resources from R directly can be achieved using R’s system() function, which invokes the specified OS command. We can either use an already compiled Java class, or invoke the compilation also via a system() call from R. Of course, for any real-world practical use we will probably do the Java coding, compilation and jaring in a Java IDE and provide R with just the final .jar file(s). I however find it helpful to have a small example of the simplest complete case, for which even the following is sufficient. Integrating pre-prepared .jars into R packages will be covered in detail in the second part of this primer.

Let us show that by writing a very silly dummy class with just 2 methods:

main, that prints “Hello World!” + an optional suffix, if provided as argument
SayMyName method, that returns a string constructed from “My name is” and getClass().getName()

This HelloWorldDummy.java file can look as follows:

package DummyJavaClassJustForFun;

public class HelloWorldDummy {

  public String SayMyName() {
   return("My name is " + getClass().getName());
  }
  
  public static void main(String[] args) {
    String stringArg = "And that is it.";
    if (args.length > 0) {
      stringArg = args[0];
    }
    System.out.println("Hello, World. " + stringArg);
  }
}

Compilation and execution via bash commands

Now that we have our dummy class ready, we can put together the commands and test them by just executing via a shell, or for RStudio fans, we can test the commands via RStudio’s cool Terminal feature. First, the compilation command, which may look something like the following, assuming that we are in the correct working directory:

$ javac DummyJavaClassJustForFun/HelloWorldDummy.java

Now that we have the class compiled, we can execute the main method, with and without the argument provided:

$ java DummyJavaClassJustForFun/HelloWorldDummy
$ java DummyJavaClassJustForFun/HelloWorldDummy "I like winter"

In case we need to compile and run with more .jars that are in folder jars/, we specify the folder using -cp (class path):

$ javac -cp "jars/*" DummyJavaClassJustForFun/HelloWorldDummy.java
$ java -cp "jars/*:compile/src" DummyJavaClassJustForFun/HelloWorldDummy

Compilation and execution of Java code from R

Now that we have tested our commands, we can use R to do the compilation via the system function. Do not forget to cd into the correct directory within a single system call if needed:

system('cd data/; javac DummyJavaClassJustForFun/HelloWorldDummy.java')

After that we can also execute the main method, and the main method with one argument specified, just like we did it outside of R, once again using cd to enter the proper working directory if needed:

system('cd data/; java DummyJavaClassJustForFun/HelloWorldDummy')
system('cd data/; java DummyJavaClassJustForFun/HelloWorldDummy "Also I like winter"')

The rJava package - an R to Java interface

The rJava package provides a low-level interface to the Java virtual machine. It allows creation of objects, calling methods and accessing fields of the objects. It also provides functionality to include our Java resources in R packages easily.

We can install it with the classic:

install.packages("rJava")

Note the system requirement Java JDK 1.2 or higher and for JRI/REngine JDK 1.4 or higher. After attaching the package, we also need to initialize a Java Virtual Machine (JVM):

## Attach rJava and Init a JVM
library(rJava)
.jinit()

In case of issues with attaching the package using library, one can refer to this helpful StackOverflow thread.

Creating Java objects with rJava

We will now very quickly go through the basic uses of the package. The .jnew function is used to create a new Java object. Note that the class argument requires a fully qualified class name in Java Native Interface notation.

# Creating a new object of java.lang class String
sHello <- .jnew(class = "java/lang/String", "Hello World!")
# Creating a new object of java.lang class Integer
iOne <- .jnew(class = "java/lang/Integer", "1")

Working with arrays via rJava

# Creating new arrays
iArray <- .jarray(1L:2L)
.jevalArray(iArray)

## [1] 1 2

# Using a list of 2 and lapply
# Integer Matrix int[2][2]
iMatrix <- .jarray(list(iArray, iArray), contents.class = "[I")
lapply(iMatrix, .jevalArray)

## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] 1 2

# Integer Matrix int[2][2]
square <- array(1:4, dim = c(2, 2))
square

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Using dispatch = TRUE to create the array 
# Using simplify = TRUE to return a nice R array
dSquare <- .jarray(square, dispatch = TRUE)
.jevalArray(dSquare, simplify = TRUE)

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Integer Tesseract int[2][2][2][2]
tesseract <- array(1L:16L, dim = c(2, 2, 2, 2))
tesseract

## , , 1, 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]   13   15
## [2,]   14   16

# Use dispatch = TRUE to create the array 
# Use simplify = TRUE to return a nice R array
# Interestingly, this seems weird
dTesseract <- .jarray(tesseract, dispatch = TRUE)
.jevalArray(dTesseract, simplify = TRUE)

## , , 1, 1
## 
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    0
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    8
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]    9    0
## [2,]    0    0
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0   16

Calling Java methods using the rJava package

rJava provides two levels of API:

fast, but inflexible low-level JNI-API in the form of the .jcall function
convenient (at the cost of performance) high-level reflection API based on the $ operator.

In practice, there are three ways available to us from the rJava package enabling us to call Java methods, each of them with their positives and negatives.

The low-level way - `.jcall()`

.jcall(obj, returnSig = "V", method, ...) calls a Java method with the supplied arguments the “low-level” way. A few important notes regarding the usage, for more refer to the R help on .jcall:

requires exact match of argument and return types, doesn’t perform any lookup in the reflection tables
passing sub-classes of the classes present in the method definition requires explicit casting using .jcast
passing null arguments needs a proper class specification with .jnull
vector of length 1 corresponding to a native Java type is considered a scalar, use .jarray to pass a vector as array for safety

# Calling a Java method length on the object low-level way
.jcall(sHello, returnSig = "I", "length")

## [1] 12

# Also we must be careful with the data types:

# This works
.jcall(sHello, returnSig = "C", "charAt", 5L)

## [1] 32

# This does not
.jcall(sHello, returnSig = "C", "charAt", 5)

## Error in .jcall(sHello, returnSig = "C", "charAt", 5): method charAt with signature (D)C not found
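
The error message above mentions the signature (D)C because a plain 5 is a double in R, while charAt expects an int. A small base R sketch of the distinction follows; no rJava is needed to run it, and idx is just an illustrative name.

```r
# R numeric literals are doubles unless suffixed with L
typeof(5)    # "double", matched against a Java double (signature D)
typeof(5L)   # "integer", matched against a Java int (signature I)

# A value computed elsewhere can be converted explicitly before the call
idx <- as.integer(5)
typeof(idx)  # "integer"
```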

The high-level way - `J()`

J(class, method, ...) is the high-level API for accessing Java. It is slower than .jnew or .jcall since it has to use reflection to find the most suitable method.

to call a method, the method argument must be present as a character vector of length 1
if method is missing, J creates a class name reference

# Calling a Java method length on the object high-level way
J(sHello, "length")

## [1] 12

# Also, the high-level will not help here this way
J(sHello, "charAt", 5L)

## Error in .jcall(o, "I", "intValue"): method intValue with signature ()I not found

J(sHello, "charAt", 5)

## Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.NoSuchMethodException: No suitable method for the given parameters

The high-level way with convenience - `$`

Closely connected to the J function, the $ operator for jobjRef Java object references provides convenience access to object attributes and calling Java methods by implementing relevant methods for the completion generator for R.

$ returns either the value of the attribute or calls a method, depending on which name matches first
$<- assigns a value to the corresponding Java attribute

# And via the $ operator
sHello$length()

## [1] 12

# But these still do not work
sHello$charAt(5L)

## Error in .jcall(o, "I", "intValue"): method intValue with signature ()I not found

sHello$charAt(5)

## Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.NoSuchMethodException: No suitable method for the given parameters

Examining methods and fields

.DollarNames returns all fields and methods associated with the object. Method names are followed by ( or () depending on arity:

# vector of all fields and methods associated with sHello
.DollarNames(sHello)

##  [1] "CASE_INSENSITIVE_ORDER" "equals("               
##  [3] "toString()"             "hashCode()"            
##  [5] "compareTo("             "compareTo("            
##  [7] "indexOf("               "indexOf("              
##  [9] "indexOf("               "indexOf("              
## [11] "valueOf("               "valueOf("              
## [13] "valueOf("               "valueOf("              
## [15] "valueOf("               "valueOf("              
## [17] "valueOf("               "valueOf("              
## [19] "valueOf("               "length()"              
## [21] "isEmpty()"              "charAt("               
## [23] "codePointAt("           "codePointBefore("      
## [25] "codePointCount("        "offsetByCodePoints("   
## [27] "getChars("              "getBytes()"            
## [29] "getBytes("              "getBytes("             
## [31] "getBytes("              "contentEquals("        
## [33] "contentEquals("         "equalsIgnoreCase("     
## [35] "compareToIgnoreCase("   "regionMatches("        
## [37] "regionMatches("         "startsWith("           
## [39] "startsWith("            "endsWith("             
## [41] "lastIndexOf("           "lastIndexOf("          
## [43] "lastIndexOf("           "lastIndexOf("          
## [45] "substring("             "substring("            
## [47] "subSequence("           "concat("               
## [49] "replace("               "replace("              
## [51] "matches("               "contains("             
## [53] "replaceFirst("          "replaceAll("           
## [55] "split("                 "split("                
## [57] "join("                  "join("                 
## [59] "toLowerCase("           "toLowerCase()"         
## [61] "toUpperCase()"          "toUpperCase("          
## [63] "trim()"                 "toCharArray()"         
## [65] "format("                "format("               
## [67] "copyValueOf("           "copyValueOf("          
## [69] "intern()"               "wait("                 
## [71] "wait("                  "wait()"                
## [73] "getClass()"             "notify()"              
## [75] "notifyAll()"            "chars()"               
## [77] "codePoints()"

Signatures in JNI notation

Java Type	Signature
boolean	Z
byte	B
char	C
short	S
int	I
long	J
float	F
double	D
type[]	[ type
method type	( arg-types ) ret-type
fully-qualified-class	Lfully-qualified-class ;

In the fully-qualified-class row of the table above, note the L prefix and the ; suffix.

For example, the Java method

long f (int n, String s, int[] arr);

has type signature

(ILjava/lang/String;[I)J
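
To make the notation more tangible, the rules from the table can be sketched as a small base R helper. This is an illustration only: jniType and jniMethodSig are hypothetical names, not functions provided by rJava, and the void entry (V) is added on top of the table because it is the default returnSig of .jcall.

```r
# Hypothetical helpers (not part of rJava) translating Java type names
# into JNI signature notation

jniType <- function(type) {
  primitives <- c(boolean = "Z", byte = "B", char = "C", short = "S",
                  int = "I", long = "J", float = "F", double = "D",
                  void = "V")
  # Each trailing [] adds one array dimension, written as a leading [
  dims <- 0L
  while (grepl("\\[\\]$", type)) {
    type <- sub("\\[\\]$", "", type)
    dims <- dims + 1L
  }
  base <- if (type %in% names(primitives)) {
    primitives[[type]]
  } else {
    # Fully qualified class: L prefix, slashes instead of dots, ; suffix
    paste0("L", gsub(".", "/", type, fixed = TRUE), ";")
  }
  paste0(strrep("[", dims), base)
}

jniMethodSig <- function(argTypes, returnType) {
  args <- vapply(argTypes, jniType, character(1))
  paste0("(", paste(args, collapse = ""), ")", jniType(returnType))
}

# The example method: long f (int n, String s, int[] arr)
jniMethodSig(c("int", "java.lang.String", "int[]"), "long")
```

Note that the class name has to be given fully qualified (java.lang.String rather than String) for the helper to produce a valid signature.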

References

rJava basic crashcourse - at the rJava site on rforge, scroll down to the Documentation section
The JNI Type Signatures - at Oracle JNI specs
rJava documentation on CRAN
Calling Java code from R by prof. Darren Wilkinson
Mapping of types between Java (JNI) and native code
Fixing issues with loading rJava

R:case4base - data aggregation with base R

Sat, 09 Jun 2018 13:00:00 +0000

Introduction

In the previous articles of the R:case4base series we discussed and learned

how to reshape data with base R to a form that is practical for our use and
how to subset data to get the relevant parts of it with base R.

In this one, we will look at aggregation techniques using base R’s stats::aggregate generic function, focusing on the method for data frames. This will allow us to easily and safely create simple aggregations, but also provide a framework for completely custom aggregation functionality defined as separate functions that can be properly documented and unit tested.

How to use this article

This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a case4base after all!
If you have no time for reading, you can click here to get just the code with commentary

First, let’s read in yearly data on gross disposable income of households in the EU countries into R (click here to download) and reshape them to get nice, long format data to work with:

gdi <- read.csv(
  stringsAsFactors = FALSE
, file = "https://jozef.io/post/data/ESA2010_pretty.csv"
)

gdi <- reshape(data = gdi
             , direction = "long" # we are going from wide to long
             , varying = 2:67     # columns that will be stacked into 1
             , idvar = "country"  # identifying the subject in rows
             )

The goal of the article is therefore not really to present these concrete results, but to focus on the technical aspects and usefulness of the presented methods.

Simple aggregations

In this paragraph, we will try to show how to perform simple aggregation on data.frames. As the first example, let us look at the mean gross saving across the years per country:

aggregate(x = gdi["GrossSaving"]
        , by = list(country = gdi[["country"]])
        , FUN = mean
        )

##           country GrossSaving
## 1         Austria  24724.6227
## 2         Belgium  28961.7136
## 3        Bulgaria  -1711.6136
## 4         Croatia          NA
## 5          Cyprus          NA
## 6  Czech Republic 208404.0000
## 7         Denmark  53667.7273
## 8         Estonia    487.1409
## 9           EU 28          NA
## 10   Euro area 19          NA
## 11        Finland   7656.7727
## 12         France 169311.6818
## 13        Germany 265215.6818
## 14         Greece   5289.8464
## 15        Hungary          NA
## 16        Iceland          NA
## 17        Ireland   5831.3136
## 18          Italy 135086.8591
## 19         Latvia    147.1718
## 20      Lithuania    394.4595
## 21     Luxembourg   2510.5136
## 22          Malta          NA
## 23    Netherlands  37810.7727
## 24         Norway 113559.5000
## 25         Poland  45032.8636
## 26       Portugal   9348.6191
## 27        Romania          NA
## 28         Serbia          NA
## 29       Slovakia   2470.1173
## 30       Slovenia   2346.7668
## 31          Spain          NA
## 32         Sweden 207348.7273
## 33    Switzerland  74211.0864
## 34         Turkey          NA
## 35 United Kingdom  79609.8636

As we can see, we provided 3 arguments to aggregate (specifically the aggregate.data.frame method that gets called if the provided x is a data frame):

x - the data we want to aggregate, in our case the GrossSaving column of the gdi data.frame
by - a list of 1 element - country which specifies how the data will be grouped
FUN - function which will be used, in our case arithmetic mean

Simple aggregate

We can also see in our results, that for some countries such as Croatia, Cyprus and more, we have NA as a result. This is because numerical operations on vectors that contain even a single NA value will usually return NA as a result. If we want, we can usually work around this by providing an extra na.rm = TRUE argument to the function, which will strip the NA values before computation:

aggregate(x = gdi["GrossSaving"]
        , by = list(country = gdi[["country"]])
        , FUN = mean
        , na.rm = TRUE
        )

##           country  GrossSaving
## 1         Austria   24724.6227
## 2         Belgium   28961.7136
## 3        Bulgaria   -1711.6136
## 4         Croatia   18301.8727
## 5          Cyprus     438.6838
## 6  Czech Republic  208404.0000
## 7         Denmark   53667.7273
## 8         Estonia     487.1409
## 9           EU 28  924443.4983
## 10   Euro area 19  754148.9800
## 11        Finland    7656.7727
## 12         France  169311.6818
## 13        Germany  265215.6818
## 14         Greece    5289.8464
## 15        Hungary 1220273.5714
## 16        Iceland    -336.9933
## 17        Ireland    5831.3136
## 18          Italy  135086.8591
## 19         Latvia     147.1718
## 20      Lithuania     394.4595
## 21     Luxembourg    2510.5136
## 22          Malta          NaN
## 23    Netherlands   37810.7727
## 24         Norway  113559.5000
## 25         Poland   45032.8636
## 26       Portugal    9348.6191
## 27        Romania     271.2048
## 28         Serbia          NaN
## 29       Slovakia    2470.1173
## 30       Slovenia    2346.7668
## 31          Spain   57683.3333
## 32         Sweden  207348.7273
## 33    Switzerland   74211.0864
## 34         Turkey  129045.3843
## 35 United Kingdom   79609.8636
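The NaN values that remain for Malta and Serbia are what mean returns when no values are left after the NA values are stripped, presumably because those countries have no GrossSaving data at all. A minimal sketch on toy vectors (not the gdi data) shows the three cases:

```r
# A single missing value makes the plain mean NA
m_na <- mean(c(1, NA, 3))

# na.rm = TRUE strips the NA before the computation
m_rm <- mean(c(1, NA, 3), na.rm = TRUE)

# If every value is NA, nothing is left to average and the result is NaN
m_nan <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

print(c(m_na, m_rm, m_nan))
```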

Grouping by more variables and small tweaks

To make things even easier, we can use the fact that data.frames are also lists and we can therefore substitute by = list(country = gdi[["country"]]) by a much simpler and easier to read by = gdi["country"]. Be careful to use single brackets [] for the sub-setting, as only those return the sub-list. Both gdi[["country"]] and gdi$country would give us the vector of countries instead:

is.list(list(country = gdi[["country"]]))

## [1] TRUE

is.list(gdi["country"])

## [1] TRUE

is.list(gdi[["country"]])

## [1] FALSE
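To check that the shorter form gives the same result, here is a small sketch that uses the built-in mtcars data as a stand-in for gdi. The columns mpg and cyl belong to mtcars and are only an assumption for illustration:

```r
# Long form: by is built explicitly as a named list
long <- aggregate(x = mtcars["mpg"]
                , by = list(cyl = mtcars[["cyl"]])
                , FUN = mean
                )

# Short form: a one-column data.frame is already a named list
short <- aggregate(x = mtcars["mpg"]
                 , by = mtcars["cyl"]
                 , FUN = mean
                 )

all.equal(long, short)
```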

We can also group the data by more than one column, or by a column transformed in any way that fits our purposes. The only constraint is that the grouping elements (the elements of the by argument) are each as long as the variables in the data frame x. And of course we can also aggregate more than 1 column at the same time.

As an example, let us

calculate the mean not only for each country, but extend the grouping to decades
calculate the mean for more variables, not just "GrossSaving"

aggregate(x = gdi[c("ConspC", "AGDIpC", "GrossSaving")]
        , by = list(decade = paste0(substr(gdi[["time"]], 1L, 3L), "0s")
                  , country = gdi[["country"]]
                  )
        , FUN = mean
        , na.rm = TRUE
        )

##     decade        country      ConspC      AGDIpC   GrossSaving
## 1    1990s        Austria   19434.288   22578.640   20956.42000
## 2    2000s        Austria   21943.003   25145.214   25327.26000
## 3    2010s        Austria   23375.659   26135.279   26555.28571
## 4    1990s        Belgium   18938.482   21987.036   25395.72000
## 5    2000s        Belgium   21081.202   24088.858   30272.62000
## 6    2010s        Belgium   22594.490   24889.807   29636.12857
## 7    1990s       Bulgaria    3449.757    3494.050     -68.02000
## 8    2000s       Bulgaria    5578.549    5084.613   -2892.99000
## 9    2010s       Bulgaria    7813.849    7535.431   -1197.92857
## 10   1990s        Croatia         NaN         NaN           NaN
## 11   2000s        Croatia   51543.151   54675.474   15148.93750
## 12   2010s        Croatia   52515.373   57510.730   26709.70000
## 13   1990s         Cyprus   12302.576   12804.322     324.48600
## 14   2000s         Cyprus   15719.001   16671.523     727.46300
## 15   2010s         Cyprus   15788.383   15973.670      52.55000
## 16   1990s Czech Republic  159897.720  177018.694  132593.40000
## 17   2000s Czech Republic  200904.419  220895.554  204166.20000
## 18   2010s Czech Republic  228702.080  250919.280  268608.42857
## 19   1990s        Denmark  182933.412  180861.862   25988.60000
## 20   2000s        Denmark  205686.057  202040.782   48128.80000
## 21   2010s        Denmark  216901.757  220419.361   81351.28571
## 22   1990s        Estonia    3871.640    4249.542     270.28000
## 23   2000s        Estonia    6470.004    6500.569     160.78000
## 24   2010s        Estonia    7915.886    8509.247    1108.27143
## 25   1990s          EU 28   15214.920   16667.920  721765.79000
## 26   2000s          EU 28   17108.231   18635.367  890431.08600
## 27   2010s          EU 28   18125.639   19647.553 1001986.61714
## 28   1990s   Euro area 19   17607.510   19749.530  602634.05000
## 29   2000s   Euro area 19   19134.583   21366.280  733063.78800
## 30   2010s   Euro area 19   19740.273   21802.314  805915.67286
## 31   1990s        Finland   17220.232   18555.942    5659.60000
## 32   2000s        Finland   21616.329   23135.134    7536.40000
## 33   2010s        Finland   24631.860   26265.441    9255.28571
## 34   1990s         France   17903.622   20437.534  127231.80000
## 35   2000s         France   20696.734   23594.476  169890.90000
## 36   2010s         France   22088.164   25010.304  198541.28571
## 37   1990s        Germany         NaN   22112.486  215917.60000
## 38   2000s        Germany         NaN   23846.360  256128.00000
## 39   2010s        Germany         NaN   25848.440  313411.00000
## 40   1990s         Greece   12037.446   13388.490    9475.84600
## 41   2000s         Greece   15757.594   16707.839    8902.80000
## 42   2010s         Greece   13893.707   13620.999   -2861.51571
## 43   1990s        Hungary 1266767.368 1457751.030  868855.20000
## 44   2000s        Hungary 1737385.357 1838395.760 1140422.10000
## 45   2010s        Hungary 1727120.847 1864085.342 1646208.00000
## 46   1990s        Iceland         NaN         NaN           NaN
## 47   2000s        Iceland 3665145.798 3246304.470    5208.11000
## 48   2010s        Iceland 3491617.010 3112374.812  -11427.20000
## 49   1990s        Ireland   14151.552   14664.146    2801.42000
## 50   2000s        Ireland   20927.056   21749.417    6089.36000
## 51   2010s        Ireland   21803.019   22959.619    7626.88571
## 52   1990s          Italy   17703.632   20908.074  140031.20000
## 53   2000s          Italy   19631.544   22234.294  143875.11000
## 54   2010s          Italy   18584.590   20404.033  119000.54286
## 55   1990s         Latvia    3268.952    3188.684     -92.53200
## 56   2000s         Latvia    5400.014    5542.577     412.48900
## 57   2010s         Latvia    7088.467    6945.319     -60.63571
## 58   1990s      Lithuania    3260.052    3348.234     235.99800
## 59   2000s      Lithuania    5823.319    5947.266     422.12200
## 60   2010s      Lithuania    7934.150    8047.637     468.12857
## 61   1990s     Luxembourg   27550.836   31879.426    1411.74000
## 62   2000s     Luxembourg   32355.940   37663.168    2240.48000
## 63   2010s     Luxembourg   33700.054   40006.649    3681.11429
## 64   1990s          Malta         NaN         NaN           NaN
## 65   2000s          Malta         NaN         NaN           NaN
## 66   2010s          Malta         NaN         NaN           NaN
## 67   1990s    Netherlands   18829.144   20298.764   32126.40000
## 68   2000s    Netherlands   22457.564   23556.095   34825.10000
## 69   2010s    Netherlands   23204.377   24568.551   46136.28571
## 70   1990s         Norway  198946.604  207770.146   54321.60000
## 71   2000s         Norway  258438.853  269837.239   93655.20000
## 72   2010s         Norway  315219.324  334970.457  184307.00000
## 73   1990s         Poland   15919.330   18408.280   55897.00000
## 74   2000s         Poland   21828.780   22907.936   51046.50000
## 75   2010s         Poland   28529.733   28772.321   28681.85714
## 76   1990s       Portugal   10704.874   11856.422    8673.37200
## 77   2000s       Portugal   12562.477   13609.388   10208.75700
## 78   2010s       Portugal   12298.499   13066.471    8602.17000
## 79   1990s        Romania    8152.276    8427.888       5.88000
## 80   2000s        Romania   13854.047   13125.680  -11695.86000
## 81   2010s        Romania   19617.068   20486.043   20437.41667
## 82   1990s         Serbia         NaN         NaN           NaN
## 83   2000s         Serbia         NaN         NaN           NaN
## 84   2010s         Serbia         NaN         NaN           NaN
## 85   1990s       Slovakia    5050.218    5647.868    1810.30200
## 86   2000s       Slovakia    6824.102    7211.535    2189.37300
## 87   2010s       Slovakia    8479.726    8932.469    3342.47714
## 88   1990s       Slovenia    8573.522    9491.256    1009.57000
## 89   2000s       Slovenia   10719.190   12155.871    2626.34200
## 90   2010s       Slovenia   11666.361   12996.130    2902.51429
## 91   1990s          Spain   13961.670   15200.950   38715.00000
## 92   2000s          Spain   15808.068   17234.555   55932.60000
## 93   2010s          Spain   15180.146   16500.587   62894.14286
## 94   1990s         Sweden  182324.968  183408.674   68070.60000
## 95   2000s         Sweden  223273.148  229308.996  162984.10000
## 96   2010s         Sweden  251975.783  273511.021  370211.14286
## 97   1990s    Switzerland   41046.446   44855.838   53641.70000
## 98   2000s    Switzerland   44297.743   49556.481   69279.00000
## 99   2010s    Switzerland   47207.609   54372.377   95949.34286
## 100  1990s         Turkey         NaN         NaN           NaN
## 101  2000s         Turkey         NaN         NaN   69969.50000
## 102  2010s         Turkey         NaN         NaN  138891.36500
## 103  1990s United Kingdom   14625.190   15250.624   74342.00000
## 104  2000s United Kingdom   18919.282   19157.107   73031.80000
## 105  2010s United Kingdom   19716.279   20172.029   92769.85714

Using aggregate as a framework with custom aggregation functions

Perhaps one of the most useful cases for aggregate is using it as a supporting framework for custom aggregations, since the FUN argument can be set to a function defined to suit specific purposes. This provides a very flexible environment where one can

implement the custom aggregation functions in the most suitable way for the purpose
have unit testing for those functions
documentation and other aspects of implementation in place

And use the aggregate as a reliable executor for such functionality, all using standard base R evaluation principles. An over-simplified example of the above approach could be the following:

We define the aggregation function dummyaggfun

dummyaggfun <- function(v) {
  c(max = max(v)
  , min = min(v)
  , rng = max(v) - min(v)
  )
}

And apply the aggregation

aggregate(gdi["GrossSaving"]
        , by = list(decade = paste0(substr(gdi[["time"]], 1L, 3L), "0s")
                  , country = gdi[["country"]]
                  )
        , FUN = dummyaggfun
        )

##     decade        country GrossSaving.max GrossSaving.min GrossSaving.rng
## 1    1990s        Austria        23226.80        19097.10         4129.70
## 2    2000s        Austria        31618.00        19897.90        11720.10
## 3    2010s        Austria        28755.60        25194.20         3561.40
## 4    1990s        Belgium        27350.10        24448.40         2901.70
## 5    2000s        Belgium        39041.60        25650.80        13390.80
## 6    2010s        Belgium        33126.40        27251.20         5875.20
## 7    1990s       Bulgaria          448.40         -483.00          931.40
## 8    2000s       Bulgaria         -758.60        -7200.60         6442.00
## 9    2010s       Bulgaria         2925.80        -4525.60         7451.40
## 10   1990s        Croatia              NA              NA              NA
## 11   2000s        Croatia              NA              NA              NA
## 12   2010s        Croatia              NA              NA              NA
## 13   1990s         Cyprus          545.50          185.62          359.88
## 14   2000s         Cyprus         1194.23          280.04          914.19
## 15   2010s         Cyprus              NA              NA              NA
## 16   1990s Czech Republic       145286.00       116646.00        28640.00
## 17   2000s Czech Republic       295156.00       156060.00       139096.00
## 18   2010s Czech Republic       293141.00       246605.00        46536.00
## 19   1990s        Denmark        42398.00         9694.00        32704.00
## 20   2000s        Denmark        72548.00        15456.00        57092.00
## 21   2010s        Denmark       111688.00        36971.00        74717.00
## 22   1990s        Estonia          401.20          200.20          201.00
## 23   2000s        Estonia         1115.10         -278.30         1393.40
## 24   2010s        Estonia         1415.10          839.50          575.60
## 25   1990s          EU 28              NA              NA              NA
## 26   2000s          EU 28      1077659.22       769059.51       308599.71
## 27   2010s          EU 28      1029579.38       976054.92        53524.46
## 28   1990s   Euro area 19              NA              NA              NA
## 29   2000s   Euro area 19       879005.73       596298.12       282707.61
## 30   2010s   Euro area 19       822350.15       781605.58        40744.57
## 31   1990s        Finland         6772.00         4436.00         2336.00
## 32   2000s        Finland        10986.00         6200.00         4786.00
## 33   2010s        Finland        10801.00         7534.00         3267.00
## 34   1990s         France       131350.00       119588.00        11762.00
## 35   2000s         France       206161.00       136627.00        69534.00
## 36   2010s         France       206511.00       191738.00        14773.00
## 37   1990s        Germany       217330.00       214836.00         2494.00
## 38   2000s        Germany       291363.00       216433.00        74930.00
## 39   2010s        Germany       345523.00       292290.00        53233.00
## 40   1990s         Greece        11398.81         8234.04         3164.77
## 41   2000s         Greece        11510.19         6390.06         5120.13
## 42   2010s         Greece         2897.71        -7727.10        10624.81
## 43   1990s        Hungary      1012178.00       710576.00       301602.00
## 44   2000s        Hungary      1614306.00       788873.00       825433.00
## 45   2010s        Hungary              NA              NA              NA
## 46   1990s        Iceland              NA              NA              NA
## 47   2000s        Iceland        88500.00       -52886.40       141386.40
## 48   2010s        Iceland              NA              NA              NA
## 49   1990s        Ireland         3219.60         2592.00          627.60
## 50   2000s        Ireland        11973.40         1384.10        10589.30
## 51   2010s        Ireland         9545.40         6374.60         3170.80
## 52   1990s          Italy       163452.00       116367.20        47084.80
## 53   2000s          Italy       156700.60       111087.70        45612.90
## 54   2010s          Italy       124778.90       104720.00        20058.90
## 55   1990s         Latvia           36.17         -206.97          243.14
## 56   2000s         Latvia         1922.58          -81.56         2004.14
## 57   2010s         Latvia          620.12         -555.45         1175.57
## 58   1990s      Lithuania          610.73          -78.62          689.35
## 59   2000s      Lithuania         1003.24         -719.82         1723.06
## 60   2010s      Lithuania         1516.18         -119.85         1636.03
## 61   1990s     Luxembourg         1488.10         1344.60          143.50
## 62   2000s     Luxembourg         2964.30         1584.80         1379.50
## 63   2010s     Luxembourg         4119.00         3192.70          926.30
## 64   1990s          Malta              NA              NA              NA
## 65   2000s          Malta              NA              NA              NA
## 66   2010s          Malta              NA              NA              NA
## 67   1990s    Netherlands        34110.00        28988.00         5122.00
## 68   2000s    Netherlands        47342.00        28712.00        18630.00
## 69   2010s    Netherlands        50314.00        40945.00         9369.00
## 70   1990s         Norway        66426.00        42704.00        23722.00
## 71   2000s         Norway       140538.00        51542.00        88996.00
## 72   2010s         Norway       253022.00       117285.00       135737.00
## 73   1990s         Poland        69410.00        43081.00        26329.00
## 74   2000s         Poland        84850.00        27414.00        57436.00
## 75   2010s         Poland        49574.00        14823.00        34751.00
## 76   1990s       Portugal         9717.60         7907.00         1810.60
## 77   2000s       Portugal        13217.79         8530.86         4686.93
## 78   2010s       Portugal        11929.76         6245.18         5684.58
## 79   1990s        Romania         2749.90        -2122.40         4872.30
## 80   2000s        Romania         1146.30       -24932.80        26079.10
## 81   2010s        Romania              NA              NA              NA
## 82   1990s         Serbia              NA              NA              NA
## 83   2000s         Serbia              NA              NA              NA
## 84   2010s         Serbia              NA              NA              NA
## 85   1990s       Slovakia         2073.54         1132.87          940.67
## 86   2000s       Slovakia         3119.07         1697.54         1421.53
## 87   2010s       Slovakia         4622.03         2627.62         1994.41
## 88   1990s       Slovenia         1214.62          731.94          482.68
## 89   2000s       Slovenia         3578.14         1587.45         1990.69
## 90   2010s       Slovenia         3175.60         2337.12          838.48
## 91   1990s          Spain              NA              NA              NA
## 92   2000s          Spain        93604.00        38368.00        55236.00
## 93   2010s          Spain        74681.00        53982.00        20699.00
## 94   1990s         Sweden       100539.00        50227.00        50312.00
## 95   2000s         Sweden       257867.00        85342.00       172525.00
## 96   2010s         Sweden       452834.00       280354.00       172480.00
## 97   1990s    Switzerland        56395.80        51875.60         4520.20
## 98   2000s    Switzerland        83474.60        59724.20        23750.40
## 99   2010s    Switzerland       104819.20        84475.00        20344.20
## 100  1990s         Turkey              NA              NA              NA
## 101  2000s         Turkey              NA              NA              NA
## 102  2010s         Turkey              NA              NA              NA
## 103  1990s United Kingdom        84296.00        55971.00        28325.00
## 104  2000s United Kingdom       102670.00        58101.00        44569.00
## 105  2010s United Kingdom       126386.00        68648.00        57738.00

Advanced details of aggregate use

Examining the code of aggregate.data.frame will give us a good picture of how the function operates. This could be roughly described in the following way, abstracting from the defensive programming aspects and details and focusing on the functionality itself:

create grp - group labels that are (most likely) numbers stored as character by factorizing the elements of by
create y - a data.frame with the data grouping resulting from processing by, to which the results will be bound
take the input data x (coerced to a data.frame) and column by column split the data into groups according to grp
apply FUN (that was retrieved by match.fun) on the results of the split, assign the results into z
bind the y that has the group labels with z that has the results
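The steps above can be sketched in a few lines of base R. This is a heavily simplified illustration under assumptions (a single grouping column, a FUN returning one number, the built-in mtcars data), not the actual code of aggregate.data.frame:

```r
x  <- mtcars[c("mpg", "hp")]  # data to aggregate
by <- mtcars["cyl"]           # grouping, a list of 1 element

# 1. group labels created by factorizing the element of by
grp <- factor(by[["cyl"]])

# 2. y holds one row per group, the results will be bound to it
y <- data.frame(cyl = as.numeric(levels(grp)))

# 3. + 4. column by column, split the data by grp and apply FUN to each group
FUN <- match.fun(mean)
z <- lapply(x, function(col) vapply(split(col, grp), FUN, numeric(1)))

# 5. bind the group labels in y with the results in z
res <- cbind(y, as.data.frame(z))
rownames(res) <- NULL

# compare with the result of aggregate itself
ref <- aggregate(x = x, by = by, FUN = mean)
all.equal(res, ref)
```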

Providing the FUN argument

One specific detail should be noted: providing FUN as a character string (the name of the function, e.g. FUN = "mean") will make match.fun look the function up by its name in the calling environment, which we may like to avoid. This is easily achieved by providing the FUN argument with the function directly, not via the function’s name (e.g. FUN = mean), as in that case match.fun just returns the provided FUN without further changes.

Argument structure of FUN

The value returned from split is a list of vectors containing the values for the groups. The FUN is provided with the elements of that list via lapply, which are vectors. This is helpful for the setup of the custom FUN. We can also take advantage of the ... concept and dedicate a part of the FUN code to process more provided arguments.
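As a small sketch of both points, the following uses the built-in airquality data (an assumption for illustration, as is the function name rngfun) to show what FUN receives and how extra arguments travel through ...:

```r
# Hypothetical custom aggregation: the range of a vector, with any extra
# arguments (such as na.rm) passed on to max() and min() via ...
rngfun <- function(v, ...) {
  max(v, ...) - min(v, ...)
}

# FUN receives one plain vector per group, as produced by split()
str(split(airquality[["Ozone"]], airquality[["Month"]]))

# Without na.rm, months with missing Ozone values yield NA
res_na <- aggregate(airquality["Ozone"]
                  , by = airquality["Month"]
                  , FUN = rngfun
                  )

# Extra arguments given to aggregate() are handed over to FUN
res <- aggregate(airquality["Ozone"]
               , by = airquality["Month"]
               , FUN = rngfun
               , na.rm = TRUE
               )
res
```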

Aggregate’s methods for other object classes

So far we have mostly used the aggregate.data.frame method. However, aggregate is a generic function with methods for multiple classes of objects. Here is a very quick overview:

aggregate.default - the default method, which uses the time series method if x is a time series, and otherwise coerces x to a data.frame and calls the data.frame method
aggregate.ts - the time series method, is further discussed in R’s help on ?aggregate. Investigation of the code is also very advisable.
aggregate.formula - the formula method, is a standard formula interface to aggregate.data.frame
aggregate.data.frame - is discussed in this article
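To illustrate the relation between the formula and data.frame methods, here is a minimal sketch on the built-in mtcars data (used as an assumption in place of gdi). Note that the formula method handles missing values through its na.action argument, so results can differ from the data.frame method when the data contains NA values:

```r
# Formula interface: aggregate mpg grouped by cyl
via_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# The same aggregation using the data.frame method
via_df <- aggregate(x = mtcars["mpg"], by = mtcars["cyl"], FUN = mean)

all.equal(via_formula, via_df)
```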

Alternatives to base R

dplyr::summarize and friends
using data.table

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

Exercises

Looking at the aggregate(state.x77, list(Region = state.region), mean) example in ?aggregate, how does R know how to match the states to the regions? Would the example still work if the data in state.x77 were sorted differently?
What is the difference between aggregate(x = gdi["GrossSaving"], by = gdi["country"], FUN = mean) and aggregate(x = gdi[["GrossSaving"]], by = gdi["country"], FUN = mean)? What is the issue with the latter? Looking at the code of aggregate.data.frame, why does the latter still work?

References

aggregate at rdocumentation.org
split at rdocumentation.org
discussion on ... (ellipsis) on stack overflow
original eurostat data source

Exercise answers

At the bottom of the code for the article

RStudio:addins part 3 - View objects, files, functions and more with 1 keypress

Sat, 26 May 2018 14:00:00 +0000

Introduction

In this post in the RStudio:addins series we will try to make our work more efficient with an addin for better inspection of objects, functions and files within RStudio. RStudio already has a very useful View function and a Go To Function / File feature with F2 as the default keyboard shortcut. And yes, I know I promised automatic generation of @importFrom roxygen tags in the previous post. Unfortunately we will have to wait a bit longer for that one, but I believe this one more than makes up for it in usefulness.

The addin we will create in this article will let us use RStudio to View and inspect a wide range of objects, functions and files with 1 keypress.

The addins in action

Retrieving objects from sys.frames

As a first step, we need to be able to retrieve the value of the object we are looking for based on a character string from a frame within the currently present sys.frames() for our session. This may get tricky, as it is not sufficient to only look at parent frames, because we may easily have multiple sets of “parallel” call stacks, especially when executing addins.

An example can be seen in the following screenshot, where we have a browser() call executed during the Addin execution itself. We can see that our current frame is 18 and browsing through its parent would get us to frames 17 -> 16 -> 15 -> 14 -> 0 (0 being the .GlobalEnv). The object we are looking for is however most likely in one of the other frames (9 in this particular case):

Example of sys.frames

getFromSysframes <- function(x) {
  if (!(is.character(x) && length(x) == 1 && nchar(x) > 0)) {
    warning("Expecting a non-empty character of length 1. Returning NULL.")
    return(invisible(NULL))
  }
  validframes <- c(sys.frames()[-sys.nframe()], .GlobalEnv)
  res <- NULL
  for (i in validframes) {
    inherits <- identical(i, .GlobalEnv)
    res <- get0(x, i, inherits = inherits)
    if (!is.null(res)) {
      return(res)
    }
  }
  return(invisible(res))
}
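The function relies on get0 returning NULL instead of throwing an error when nothing is found. A small sketch of that behaviour, using a throwaway environment e and made-up object names, also shows one caveat: an object that exists but has the value NULL cannot be told apart from a missing one this way, so getFromSysframes would keep searching past it.

```r
e <- new.env()
assign("found", 42, envir = e)
assign("empty", NULL, envir = e)

# An existing object: its value is returned
r_found <- get0("found", envir = e, inherits = FALSE)

# A missing object: NULL is returned, no error is thrown
r_missing <- get0("not_there_xyz", envir = e, inherits = FALSE)

# An existing object with value NULL looks just like a missing one
r_null <- get0("empty", envir = e, inherits = FALSE)

# exists() can distinguish the two cases if needed
exists("empty", envir = e, inherits = FALSE)
exists("not_there_xyz", envir = e, inherits = FALSE)
```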

Viewing files, objects, functions and more efficiently

As a second step, we write a function to actually view our object in RStudio. We have quite some flexibility here, so as a first shot we can do the following:

Open a file if the selection (or the selection with quotes added) is a path to an existing file. This is useful for viewing our scripts, data files, etc. even if they are not quoted, such as the links in your Rmd files
Attempt to retrieve the object by the name and if found, try to use View to view it
If we did not find the object, we can optionally still try to retrieve the value by evaluating the provided character string. This carries some pitfalls, but is very useful for example for
- viewing elements of lists, vectors, etc. where we need to evaluate [, [[ or $ to do so.
- viewing operation results directly in the viewer, as opposed to writing them out into the console, useful for example for wide matrices that (subjectively) look better in the RStudio viewer, compared to the console output
If the View fails, we can still show useful information by trying to View its structure, enabling us to inspect objects that cannot be coerced to a data.frame and therefore would fail to be viewed.

viewObject <- function(chr,
                       tryEval = getOption("jhaddins_view_tryeval",
                                           default = TRUE)
                       ) {

  if (!(is.character(chr) && length(chr) == 1 && nchar(chr) > 0)) {
    message("Invalid input, expecting a non-empty character of length 1")
    return(invisible(1L))
  }

  ViewWrap <- get("View", envir = as.environment("package:utils"))

  # maybe it is an unquoted filename - if so, open it
  if (file.exists(chr)) {
    rstudioapi::navigateToFile(chr)
    return(invisible(0L))
  }
  # or maybe it is a quoted filename - if so, open it
  if (file.exists(gsub("\"", "", chr, fixed = TRUE))) {
    rstudioapi::navigateToFile(gsub("\"", "", chr, fixed = TRUE))
    return(invisible(0L))
  }

  obj <- getFromSysframes(chr)

  if (is.null(obj)) {
    if (isTRUE(tryEval)) {
      # object not found, try evaluating
      try(obj <- eval(parse(text = chr)), silent = TRUE)
    }
    if (is.null(obj)) {
      message(sprintf("Object %s not found", chr))
      return(invisible(1L))
    }
  }

  # try to View capturing output for potential errors
  Viewout <- utils::capture.output(ViewWrap(obj, title = chr))
  if (length(Viewout) > 0 && any(grepl("Error", Viewout))) {
    # could not view, try to at least View the str of the object
    strcmd <- sprintf("str(%s)", chr)
    message(paste(Viewout,"| trying to View", strcmd))
    ViewWrap(utils::capture.output(utils::str(obj)), title = strcmd)
  }

  return(invisible(0L))
}

This function can of course be improved and updated in many ways, for example using the summary method instead of str for selected object classes, or showing contents of .csv (or other data) files already read into a data.frame.
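As a rough sketch of the first suggested improvement, a helper could pick between summary and str based on the class of the object. The function describeObject and its summaryClasses argument are hypothetical and not part of the jhaddins package:

```r
# Hypothetical helper: returns a text description of obj as a character
# vector, using summary() for the classes listed in summaryClasses and
# str() for everything else
describeObject <- function(obj, summaryClasses = c("lm", "table")) {
  if (inherits(obj, summaryClasses)) {
    utils::capture.output(print(summary(obj)))
  } else {
    utils::capture.output(utils::str(obj))
  }
}

desc_lm   <- describeObject(lm(mpg ~ wt, data = mtcars))  # uses summary()
desc_list <- describeObject(list(a = 1, b = "x"))         # uses str()
```

The returned character vector could then be passed to View in the same way viewObject does with the captured output of str.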

The addin function, updating the .dcf file and key binding

If you followed the previous posts in the series, you most likely already know what is coming up next. First, we need a function serving as a binding for the addin that will execute our viewObject function on the active document’s selections:

viewSelection <- function() {
  context <- rstudioapi::getActiveDocumentContext()
  lapply(X = context[["selection"]]
         , FUN = function(thisSel) {
           viewObject(thisSel[["text"]])
         }
  )
  return(invisible(NULL))
}

Secondly, we update the inst/rstudio/addins.dcf file by adding the binding for the newly created addin:

Name: viewSelection
Description: Tries to use View to View the object defined by a text selected in RStudio
Binding: viewSelection
Interactive: false

Finally, we re-install the package and assign the keyboard shortcut in the Tools -> Addins -> Browse Addins... -> Keyboard Shortcuts... menu. Personally I assigned a single F4 keystroke for this, as I use it very often:

Assigning a keyboard shortcut to use the Addin

The addin in action

Now, let’s view a few files, a data.frame, a function and a try-error class object by just pressing F4.

TL;DR - Just give me the package

get the status of the package after this article
or use git clone from https://gitlab.com/jozefhajnala/jhaddins.git

References

Environments chapter of Advanced R
Using RStudio’s Data Viewer

RStudio:addins part 2 - roxygen documentation formatting made easy

Sat, 12 May 2018 14:00:00 +0000

Introduction

Code documentation is extremely important if you want to share the code with anyone else, future you included. In this second post in the RStudio:addins series we will pay a part of our technical debt from the previous article and document our R functions conveniently using a new addin we will build for this purpose.

The addin we will create in this article will let us create well formatted roxygen documentation easily by using keyboard shortcuts to add useful tags such as \code{} or \link{} around selected text in RStudio.

Quick intro to documentation with roxygen2

1. Documenting your first function

To help us generate documentation easily we will be using the roxygen2 package. You can install it using install.packages("roxygen2"). Roxygen2 works with in-code tags and will generate R’s documentation format .Rd files, create a NAMESPACE, and manage the Collate field in DESCRIPTION (not relevant to us at this point) automatically for our package.

Documenting a function works in 2 simple steps:

Documenting a function

Inserting a skeleton - Do this by placing your cursor anywhere in the function you want to document and click Code Tools -> Insert Roxygen Skeleton (default keyboard shortcut Ctrl+Shift+Alt+R).
Populating the skeleton with relevant information. A few important tags are:

#' @param - describing the arguments of the function
#' @return - describing what the function returns
#' @importFrom package function - in case your function uses a function from a different package; roxygen2 will automatically add the import to the NAMESPACE
#' @export - in case you want the function to be exported (mainly for use by other packages)
#' @examples - showing how to use the function in practice

2. Generating and viewing the documentation

Generating and viewing the documentation

We generate the documentation files using roxygen2::roxygenise() or devtools::document() (default keyboard shortcut Ctrl+Shift+D)
Re-installing the package (default keyboard shortcut Ctrl+Shift+B)
Viewing the documentation for a function using ?functionname e.g. ?mean, or placing the cursor on a function name and pressing F1 in RStudio - this will open the Viewer pane with the help for that function

3. A real-life example

Let us now document runCurrentRscript a little bit:

#' runCurrentRscript
#' @description Wrapper around executeCmd with default arguments for easy use as an RStudio addin
#' @param path character(1) string, specifying the path of the file to be used as Rscript argument (ideally a path to an R script)
#' @param outputFile character(1) string, specifying the name of the file, into which the output produced by running the Rscript will be written
#' @param suffix character(1) string, specifying additional suffix to pass to the command
#' @importFrom rstudioapi getActiveDocumentContext
#' @importFrom rstudioapi navigateToFile
#' @seealso executeCmd
#' @return side-effects
runCurrentRscript <- function(
  path = replaceTilde(rstudioapi::getActiveDocumentContext()[["path"]])
, outputFile = "output.txt"
, suffix = "2>&1") {
  cmd <- makeCmd(path, outputFile = outputFile, suffix = suffix)
  executeCmd(cmd)
  if (!is.null(outputFile) && file.exists(outputFile)) {
    rstudioapi::navigateToFile(outputFile)
  }
}

As we can see by looking at ?runCurrentRscript versus ?mean, our documentation does not quite look up to par with documentation for other functions:

What is missing, if we set aside the richness of the content, is the use of markup commands (tags) for formatting and linking our documentation. Some very useful tags are, for example:

\code{}, \strong{}, \emph{} for font style
\link{}, \href{}, \url{} for linking to other parts of the documentation or external resources
\enumerate{}, \itemize{}, \tabular{} for using lists and tables
\eqn{}, \deqn{} for mathematical expressions such as equations etc.
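
To illustrate how these tags read inside roxygen comments, here is a small sketch using a hypothetical function rangeWidth. The function is an assumption made up for this example and is not part of the package.

```r
#' Width of the range of a numeric vector (hypothetical example function)
#'
#' @description Computes \code{max(x) - min(x)}. For the underlying
#'   extremes see \link[base]{range}.
#' @details
#' \itemize{
#'   \item \strong{Missing values} are removed when \code{na.rm = TRUE}.
#'   \item The result is \eqn{x_{max} - x_{min}}.
#' }
#' @param x numeric vector
#' @param na.rm logical(1), should missing values be removed?
#' @return numeric(1), the width of the range of \code{x}
rangeWidth <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

print(rangeWidth(c(1, 5, 3)))
```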

For the full list of options regarding text formatting, linking and more see Writing R Extensions’ Rd format chapter

Our addins to make documenting a breeze

As you can imagine, typing the markup commands in full all the time is quite tedious. The goal of our new addin will therefore be to make this process efficient using keyboard shortcuts - just select some text and our addin will place the desired tags around it. For now, we will be satisfied with simple one-line tags.

1. Add a selected tag around a character string

roxyfy <- function(str, tag = NULL, splitLines = TRUE) {
  if (is.null(tag)) {
    return(str)
  }
  if (!isTRUE(splitLines)) {
    return(paste0("\\", tag, "{", str, "}"))
  }
  str <- unlist(strsplit(str, "\n"))
  str <- paste0("\\", tag, "{", str, "}")
  paste(str, collapse = "\n")
}
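
A quick way to check the behaviour is to call roxyfy on a few strings. The definition is repeated below only so that the snippet runs on its own; the input strings are arbitrary examples.

```r
# Copy of roxyfy from above, repeated so the snippet is self-contained
roxyfy <- function(str, tag = NULL, splitLines = TRUE) {
  if (is.null(tag)) {
    return(str)
  }
  if (!isTRUE(splitLines)) {
    return(paste0("\\", tag, "{", str, "}"))
  }
  str <- unlist(strsplit(str, "\n"))
  str <- paste0("\\", tag, "{", str, "}")
  paste(str, collapse = "\n")
}

# Single line: the tag is wrapped around the whole string
cat(roxyfy("mean", tag = "code"), "\n")

# Two lines: by default each line gets its own tag
cat(roxyfy("first\nsecond", tag = "strong"), "\n")

# With splitLines = FALSE one tag spans both lines
cat(roxyfy("first\nsecond", tag = "strong", splitLines = FALSE), "\n")
```

Splitting by line is what keeps the result valid for the simple one-line tags we are targeting here.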

2. Apply the tag on a selection in an active document in RStudio

We will make the functionality available for multi-selections as well by lapply-ing over the selection elements retrieved from the active document in RStudio.

addRoxytag <- function(tag = NULL) {
  context <- rstudioapi::getActiveDocumentContext()
  lapply(X = context[["selection"]]
       , FUN = function(thisSel, contextid) {
           rstudioapi::modifyRange(location = thisSel[["range"]]
                                 , roxyfy(thisSel[["text"]], tag)
                                 , id = contextid)
         }
       , contextid = context[["id"]]
       )
  return(invisible(NULL))
}

3. Wrappers around `addRoxytag` to be used as addin for some useful tags

addRoxytagCode <- function() {
  addRoxytag(tag = "code")
}

addRoxytagLink <- function() {
  addRoxytag(tag = "link")
}

addRoxytagEqn <- function() {
  addRoxytag(tag = "eqn")
}

4. Add the addin bindings into `addins.dcf` and assign keyboard shortcuts

As the final step, we need to add the bindings for our new addins to the inst/rstudio/addins.dcf file and re-install the package.

Name: addRoxytagCode
Description: Adds roxygen tag code to current selections in the active RStudio document
Binding: addRoxytagCode
Interactive: false

Name: addRoxytagLink
Description: Adds roxygen tag link to current selections in the active RStudio document
Binding: addRoxytagLink
Interactive: false

Name: addRoxytagEqn
Description: Adds roxygen tag eqn to current selections in the active RStudio document
Binding: addRoxytagEqn
Interactive: false

assigning keyboard shortcuts to addins

The addins in action

And now, let’s just select the text we want to format and watch our addins do the work for us! Then document the package, re-install it and view the improved help for our functions:


What is next - even more automated documentation

Next time we will try to enrich our addins for generating documentation by adding the following functionality:

automatic generation of @importFrom tags by inspecting the function code
allowing for more complex tags such as itemize

TL;DR - Just give me the package

Get the status of the package after this article
or use git clone from https://gitlab.com/jozefhajnala/jhaddins.git

References

RStudio:addins part 1 - code reproducibility testing

Sat, 05 May 2018 14:00:00 +0000

Introduction

This is the first post in the RStudio:addins series. The aim of the series is to walk the readers through creating an R package that will contain functionality for integrating useful addins into the RStudio IDE. At the end of this first article, your RStudio will be 1 useful addin richer.

The addin we will create in this article will let us run a script open in RStudio in R vanilla mode via a keyboard shortcut and open a file with the script’s output in RStudio.

This is useful for testing whether your script is reproducible by users that do not have the same start-up options as you (e.g. preloaded environment, site file, etc.), making it a good tool to test your scripts before sharing them.

If you want to get straight to the code, you can find it at https://gitlab.com/jozefhajnala/jhaddins.git

Prerequisites and recommendations

To make the most use of the series, you will need the following:

R, ideally version 3.4.3 or more recent, 64bit
RStudio IDE, ideally version 1.1.383 or more recent
Also recommended

git, for version control
TortoiseGit, convenient shell interface to git for those using Windows, with pretty icons and all

Recommended R packages (install with install.packages("packagename"), or via RStudio’s Packages tab):

devtools - makes your development life easier
testthat - provides a framework for unit testing integrated into RStudio
roxygen2 - makes code documentation easy

Step 1 - Creating a package

Use devtools::create to create a package (note that we will update more DESCRIPTION fields later and you can also choose any path you like and it will be reflected in the name of the package)

devtools::create(
  path = "jhaddins"
, description = list("License" = "GPL-3")
)

In RStudio or elsewhere navigate to the jhaddins folder and open the project jhaddins.Rproj (or the name of your project if you chose a different path)
Run the first check and install the package

devtools::check()   # Ctrl+Shift+E or Check button on RStudio's build tab
devtools::install() # Ctrl+Shift+B or Install button on RStudio's build tab

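Optionally, initialize git for version control (with devtools 2.0.0 or newer, the use_* helpers live in the usethis package, so call usethis::use_git() instead)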

devtools::use_git()

Step 2 - Writing the first functions

We will now write some functions into a file called makeCmd.R that will let us run the desired functionality:

makeCmd to create a command executable via system or shell, with defaults set up for executing an R file specified by path

makeCmd <- function(path
                  , command = "Rscript"
                  , opts = "--vanilla"
                  , outputFile = NULL
                  , suffix = NULL
                  , addRhome = TRUE) {
  if (Sys.info()["sysname"] == "Windows") {
    qType <- "cmd2"
  } else {
    qType <- "sh"
  }
  if (isTRUE(addRhome)) {
    command <- file.path(R.home("bin"), command)
  }
  cmd <- paste(
    shQuote(command, type = qType)
  , shQuote(opts, type = qType)
  , shQuote(path, type = qType)
  )
  if (!is.null(outputFile)) {
    cmd <- paste(cmd, ">", shQuote(outputFile))
  }
  if (!is.null(suffix)) {
    cmd <- paste(cmd, suffix)
  }
  cmd
}
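
The reason makeCmd passes every part of the command through shQuote can be seen in a small sketch. The path below is a made-up example containing a space, and type = "sh" is set explicitly so that the result does not depend on the platform.

```r
# A hypothetical script path that contains a space
path <- "my scripts/analysis.R"

# Naive pasting: the shell would see "my" and "scripts/analysis.R" as two arguments
unquoted <- paste("Rscript", "--vanilla", path)

# Quoting each part keeps the path together as a single argument
quoted <- paste(
  shQuote("Rscript", type = "sh"),
  shQuote("--vanilla", type = "sh"),
  shQuote(path, type = "sh")
)

cat(unquoted, quoted, sep = "\n")
```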

executeCmd to execute a command

executeCmd <- function(cmd, intern = FALSE) {
  sysName <- Sys.info()["sysname"]
  stopifnot(
    is.character(cmd)
  , length(cmd) == 1
  , sysName %in% c("Windows", "Linux")
  )

  if (sysName == "Windows") {
    shell(cmd, intern = intern)
  } else {
    system(cmd, intern = intern)
  }
}

replaceTilde for Linux purposes

replaceTilde <- function(path) {
  if (substr(path, 1, 1) == "~") {
    path <- sub("~", Sys.getenv("HOME"), path, fixed = TRUE)
  }
  file.path(path)
}
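
As a small sanity check, the sketch below applies replaceTilde to a made-up path and, for comparison, also calls base R's path.expand(), which performs a similar tilde expansion. The two may differ on some systems (for example on Windows), so treat the comparison as illustrative only.

```r
# Copy of replaceTilde from above, repeated so the snippet is self-contained
replaceTilde <- function(path) {
  if (substr(path, 1, 1) == "~") {
    path <- sub("~", Sys.getenv("HOME"), path, fixed = TRUE)
  }
  file.path(path)
}

home_based <- replaceTilde("~/scripts/test.R")   # tilde replaced by HOME
base_based <- path.expand("~/scripts/test.R")    # base R's own expansion
absolute   <- replaceTilde("/tmp/test.R")        # no tilde, returned unchanged

print(home_based)
print(base_based)
print(absolute)
```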

And finally the function which will be used for the addin execution - runCurrentRscript to retrieve the path to the currently active file in RStudio, run it, write the output to a file output.txt and open the file with output.

runCurrentRscript <- function(
  path = replaceTilde(rstudioapi::getActiveDocumentContext()[["path"]])
, outputFile = "output.txt") {
  cmd <- makeCmd(path, outputFile = outputFile)
  executeCmd(cmd)
  if (!is.null(outputFile) && file.exists(outputFile)) {
    rstudioapi::navigateToFile(outputFile)
  }
}

Step 3 - Setting up an addin

Now that we have all our functions ready, all we have to do is create a file addins.dcf under the inst/rstudio folder of our package. We specify the Name of the addin, write a nice Description of what it does and most importantly specify the Binding to the function we want to call:

creating addins.dcf under inst/rstudio

Name: runCurrentRscript
Description: Executes the currently open R script file via Rscript with --vanilla option
Binding: runCurrentRscript
Interactive: false

Now we can rebuild and install our package and in RStudio’s menu navigate to Tools -> Addins -> Browse Addins..., and there it is - our first addin. For the best experience, we can click the Keyboard Shortcuts... button and assign a keyboard shortcut to our addin for easy use.

setting an RStudio addin keyboard shortcut

Now just open an R script, hit our shortcut and voilà, our script gets executed via Rscript in vanilla mode.

Step 4 - Updating our `DESCRIPTION` and `NAMESPACE`

As our last steps, we should

Update our DESCRIPTION file with rstudioapi as Imports, as we will be needing it before using our package:

Package: jhaddins
Title: JH's RStudio Addins
Version: 0.0.0.9000
Authors@R: person("Jozef", "Hajnala", email = "jozef.hajnala@gmail.com", role = c("aut", "cre"))
Description: Useful addins to make RStudio even better.
Depends: R (>= 3.0.1)
Imports: rstudioapi (>= 0.7)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1

Update our NAMESPACE by importing the functions from other packages that we are using, namely:

importFrom(rstudioapi, navigateToFile)
importFrom(rstudioapi, getActiveDocumentContext)

Now we can finally rebuild and install our package again and run a CHECK to see that we have no errors, warnings or notes telling us something is wrong. Make sure to use document = FALSE for now.

devtools::install() # Ctrl+Shift+B or Install button on RStudio's build tab
devtools::check(document = FALSE)   # Ctrl+Shift+E or Check button on RStudio's build tab

What is next - Always paying our (technical) debts

In the next post of the series, we will pay our debt of

missing documentation for our functions, which will help us generate updates to our NAMESPACE automatically and give us nice documentation, so that we can read about our functions using ?
and unit tests to help us sleep better knowing that our functions get tested!

Wrapping up

We can quickly create an RStudio addin by:

Creating an R package
Writing a function in that package
Creating an addins.dcf file in the inst/rstudio folder of our package

TL;DR - Just give me the package

Get the status of the package after this article
or use git clone from https://gitlab.com/jozefhajnala/jhaddins.git

References

RStudio IDE cheat sheet (4.4MB, pdf)
RStudio IDE tricks you might have missed
Understanding Addins - A fantastic webinar, where you can learn how to write and setup addins step-by-step

R:case4base - data subsetting and manipulation with base R

Sat, 21 Apr 2018 00:00:00 +0000

Introduction

In the previous article we discussed and learned how to reshape data with base R to a form that is practical for our use. In this one, we will look at basic data manipulation techniques, namely obtaining relevant subsets of our data. The key will be safety and avoiding complication and confusion as much as possible. This is why we:

try to avoid using subset, as this function is implemented via non-standard evaluation.
also skip $, as it uses partial matching and cannot easily be used when column names are stored in variables.
do not go into more detail on the list properties of data.frames here, as the topic could get confusing. If you would like to go into more detail, we strongly recommend a thorough read of the subsetting chapter of Hadley Wickham’s Advanced R
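
The point about $ can be seen on a tiny made-up data.frame. This is only a sketch with hypothetical column names, not data used elsewhere in the article.

```r
df <- data.frame(value = 1:3, other = c("a", "b", "c"), stringsAsFactors = FALSE)

# $ silently matches the partial name "val" to the column "value"
partial <- df$val

# [[ ]] uses exact matching by default, so the same partial name finds nothing
exact <- df[["val"]]

# $ looks for a column literally named "col", not for the value stored in col
col <- "value"
via_dollar <- df$col

# [ ] works with the column name stored in a variable
via_bracket <- df[, col, drop = FALSE]

print(partial)
print(exact)
print(via_dollar)
print(via_bracket)
```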

How to use this article

This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a case4base after all!
If you have no time for reading, you can click here to get just the code with commentary

First, let’s read in yearly data on gross disposable income of households in the EU countries into R (click here to download):

gdi <- read.csv(
  stringsAsFactors = FALSE
, url("https://jozef.io/post/data/ESA2010_GDI.csv")
              )
head(gdi[, 1:6, drop = FALSE])

##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0

The goal of the article is not to present these concrete results, but to focus on the technical aspects and usefulness of the presented methods.

Selecting (subsetting) relevant data from a `data.frame`

In this paragraph, we will try to show how to subset with as little hassle as possible while preserving the maximum safety in your operations. We shall go into more detail later in the article. The standard approach to subsetting data.frames can be summarised:

dataframe_name[row_subset, col_subset, drop = FALSE]

Where:

dataframe_name is the name of the data.frame we are subsetting
row_subset is a vector specifying the subset of rows
col_subset is a vector specifying the subset of columns
drop = FALSE is to make sure the result does not get simplified when not intended. This should always be used, unless we specifically want to simplify the result (e.g. to a vector for indexing)
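
The effect of drop = FALSE is easiest to see when selecting a single column. The sketch below uses the built-in mtcars dataset rather than the article's data, purely so that it runs without any download.

```r
cars_df <- mtcars[1:3, c("mpg", "cyl")]

# Without drop = FALSE a single column is simplified to a plain vector
simplified <- cars_df[, "mpg"]

# With drop = FALSE the result stays a data.frame with one column
kept <- cars_df[, "mpg", drop = FALSE]

print(class(simplified))
print(class(kept))
print(dim(kept))
```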

Constructing meaningful subsets simply and safely

In practice, we will most of the time not select rows and/or columns with positions known a priori, but based on more variable conditions. For this purpose, the advised way is to construct logical vectors.

Let us now subset the rows of our data to get the countries that have a known (not NA) value in the year 2016 that is less than 1 million:

rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000
gdi[rowidx, c(1, 23), drop = FALSE]

##        country    Y.2016
## 3      Belgium 243825.50
## 4     Bulgaria  60237.00
## 8      Estonia  12548.30
## 9      Ireland  97318.90
## 11       Spain 698701.00
## 13     Croatia      0.00
## 16      Latvia  15737.79
## 17   Lithuania  24743.49
## 18  Luxembourg  20155.80
## 21 Netherlands 357383.00
## 22     Austria 214980.60
## 24    Portugal 128789.39
## 26    Slovenia  24756.63
## 27    Slovakia  48882.91
## 28     Finland 126590.00
## 33 Switzerland 458641.00

Note that when creating the rowidx we omitted the drop = FALSE despite the aforementioned best practice. This is because in this particular case we consciously welcome the result being simplified to a vector, as its use is only as an index for subsetting.

More ways to provide subset indices

Subsetting can be done in a few ways. We will now use each of them to select the first two and the 27th rows and the first, 22nd and 23rd columns, giving us the GDI for EU28, Euro Area 19 and Slovakia in the years 2015 and 2016:

Logical vectors TRUE for rows/columns to subset, FALSE for those to omit

st1 <- gdi[c(TRUE, TRUE, rep(FALSE, 24), TRUE, rep(FALSE, 8))
         , c(TRUE, rep(FALSE, 20), rep(TRUE, 2))
         , drop = FALSE
         ]

Numeric vectors of row/column numbers to subset

st2 <- gdi[c(1:2, 27) 
         , c(1, 22:23)
         , drop = FALSE
         ]

Negative numeric vectors of row/column numbers to omit

st3 <- gdi[c(-3:-26, -28:-35)
         , c(-2:-21)
         , drop = FALSE
         ]

Character vectors of row/column names to subset

st4 <- gdi[c("1", "2", "27") # we do not have very meaningful rownames
         , c("country", "Y.2015", "Y.2016")
         , drop = FALSE
         ]
st4

##         country     Y.2015     Y.2016
## 1         EU 28 9439578.39 9454683.60
## 2  Euro area 19 6598231.27 6736686.43
## 27     Slovakia   47464.71   48882.91

All of the above give identical results

identical(st1, st2) && identical(st2, st3) && identical(st3, st4)

## [1] TRUE

Tips

The above methods also work and are safe for matrices

Negative and positive numeric vectors cannot be combined
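
Both tips can be checked on a small made-up matrix. The sketch catches the error with tryCatch so that the script keeps running; the exact wording of the error message may differ between R versions, so it is not reproduced here.

```r
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

# drop = FALSE works for matrices too: the result stays a 2 x 1 matrix
kept <- m[, "a", drop = FALSE]
print(dim(kept))

# Mixing negative and positive indices is an error
res <- tryCatch(
  m[c(-1, 2), , drop = FALSE],
  error = function(e) e
)
print(inherits(res, "error"))
```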

Alternatives to base R

dplyr::select and dplyr::filter
Using data.table

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

Exercises

What is the difference between gdi[3, 3] and gdi[3, 3, drop = FALSE] ?
What is the difference between gdi[-3, 3] and gdi[3, -3] ? What about gdi[-3, 3, drop = FALSE] ?
Why can we not omit the first part of the & in rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000 ? What would happen if we just did rowidx <- gdi[, "Y.2016"] < 1000000 ?
Bonus question 1: Why is identical(gdi[, "Y.2016", drop = FALSE], gdi["Y.2016"]) TRUE ?
Bonus question 2: Why is identical(gdi[, "Y.2016"], gdi[["Y.2016"]]) TRUE ?

References

Advanced R’s chapter on subsetting
and on data types
original eurostat data source

Exercise answers

At the bottom of the code for the article

R:case4base - reshape data with base R

Sat, 07 Apr 2018 00:00:00 +0000

Introduction

This is the first post in the R:case4base series. The aim of the series is to elaborate on very useful features of base R that are lesser known and many times substituted with custom functionality of external packages.

The simplest, yet probably one of the most common use cases would be to change the data from what is called “wide” shape to “long” shape. Base R offers a very good function for this very purpose. Meet stats::reshape.

How to use this article

This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a case4base after all!
If you have no time for reading, you can click here to get just the code with commentary

Basic wide to long reshape

First, let’s read in yearly data on gross disposable income of households in the EU countries into R (click here to download):

gdi <- read.csv(
  stringsAsFactors = FALSE
, url("https://jozef.io/post/data/ESA2010_GDI.csv")
              )
head(gdi[, 1:7])

##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0
##      Y.2000
## 1 6425313.4
## 2 4598956.1
## 3  161753.6
## 4   17676.4
## 5 1359309.0
## 6  639955.0

The goal of the article is not to present these concrete results, but to focus on the technical aspects and usefulness of the presented methods.

To reshape our data.frame from wide to long, all we have to do is:

gdi_long <- reshape(data = gdi         # data.frame in wide format to be reshaped
                  , direction = "long" # we are going from wide to long
                  , varying = 2:23     # columns that will be stacked into 1
                  )

head(gdi_long)

##               country time        Y id
## 1.1995          EU 28 1995       NA  1
## 2.1995   Euro area 19 1995       NA  2
## 3.1995        Belgium 1995 140734.1  3
## 4.1995       Bulgaria 1995   1036.0  4
## 5.1995 Czech Republic 1995 894042.0  5
## 6.1995        Denmark 1995 566363.0  6

Before we get into clean-up such that the output data.frame is nice and tidy, let us first take a look at the arguments of the function that we have used already:

data - almost obviously, this is the data.frame we want to reshape
varying - names or indices of columns which we want to stack on each other into a single column

Tip

We can see that R automatically recognizes the Y and the years that get translated into the time column. This is because the column names are in a format that reshape can guess automatically: [string].[integer], in our case "Y.1996", "Y.1997", etc. It has a lot of benefits to keep this naming convention for your column names before reshaping. If your names have a different character between the [string] and the [integer] (for example "something_1996", "something_1997"), you can specify this character with the sep argument (e.g. sep = "_").
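
As a small sketch of the sep argument, the example below reshapes a made-up data.frame whose column names use an underscore instead of a dot. The data and column names are hypothetical and only serve to show the mechanism.

```r
wide <- data.frame(
  country = c("A", "B"),
  something_1996 = c(1.5, 2.5),
  something_1997 = c(3.5, 4.5),
  stringsAsFactors = FALSE
)

long <- reshape(data = wide
              , direction = "long"
              , varying = c("something_1996", "something_1997")
              , sep = "_"   # character between [string] and [integer]
              )
print(long)
```

With sep = "_", reshape can split the names into the value column something and the time values on its own, just as it does for the dot-separated names in the article's data.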

Now looking back at the reshaped gdi_long, we see that the reshape worked, however there are 4 improvements we can make by providing the function with more arguments:

the id column, which is not particularly useful this way
the Y column, which does have the correct data, however we would perhaps like to call it something a bit more descriptive
the time column, which could be named differently
we may want to update the values in the time column to something custom

gdi_long_full <- reshape(data = gdi         # data.frame in wide format to be reshaped
                       , direction = "long" # still going from wide to long
                       , varying = 2:23     # columns that will be stacked into 1
                       , idvar = "country"  # what identifies the rows?
                       , v.names = "GDI"    # how will the column with values be called
                       , timevar = "year"   # how will the time column be called
                       , times = 1995:2016  # what are the values for the timevar column
                       )
head(gdi_long_full)

##                            country year      GDI
## EU 28.1995                   EU 28 1995       NA
## Euro area 19.1995     Euro area 19 1995       NA
## Belgium.1995               Belgium 1995 140734.1
## Bulgaria.1995             Bulgaria 1995   1036.0
## Czech Republic.1995 Czech Republic 1995 894042.0
## Denmark.1995               Denmark 1995 566363.0

We easily see the solution to our 4 improvements:

specify idvar = "country" argument, as this column identifies the subjects in the rows
specify v.names = "GDI" argument, as this will rename the column with values (our values are gross disposable income)
specify timevar = "year" argument, as our time is actually years (the data is measured on a yearly basis)
specify times = 1995:2016 argument, this is shown just for completeness, we could for example do times = -21:0 if we want the years to be measured relative to 2016 instead of as actual years

Basic long to wide reshape

Now that we have the wide to long reshape done, the reshape from long to wide format is a formality. It works exactly the same way, we just switch the arguments around a bit:

gdi_wide <- reshape(gdi_long_full      # data.frame in long format to be reshaped  
                  , direction = "wide" # going from long to wide this time
                  , idvar = "country"  # identifying the subject in rows
                  , timevar = "year"   # column with values that will change to columns
                  , v.names = "GDI"    # column with the values
                  )
head(gdi_wide[, 1:7, drop = FALSE])

##                            country GDI.1995  GDI.1996  GDI.1997  GDI.1998
## EU 28.1995                   EU 28       NA        NA        NA        NA
## Euro area 19.1995     Euro area 19       NA        NA        NA        NA
## Belgium.1995               Belgium 140734.1  141599.4  145023.2  149705.2
## Bulgaria.1995             Bulgaria   1036.0    1468.1   12367.4   14921.1
## Czech Republic.1995 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0
## Denmark.1995               Denmark 566363.0  578102.0  591416.0  621236.0
##                      GDI.1999  GDI.2000
## EU 28.1995          5982392.8 6425313.4
## Euro area 19.1995   4393727.3 4598956.1
## Belgium.1995         153804.0  161753.6
## Bulgaria.1995         16052.8   17676.4
## Czech Republic.1995 1280040.0 1359309.0
## Denmark.1995         614893.0  639955.0

Advanced reshape

Let us now examine a bit more advanced reshape with some more data. First, we will look at the generic setup. We now have data not just for the GDI, but for 3 measurements in the columns:

ConspC - in columns X1995ConspC .. X2016ConspC
AGDIpC - in columns X1995AGDIpC .. X2016AGDIpC
GrossSaving - in columns X1995GrossSaving .. X2016GrossSaving

more_notpretty <- read.csv(
  stringsAsFactors = FALSE
, file = "https://jozef.io/post/data/ESA2010_not_pretty.csv"
)
head(more_notpretty[, 1:5, drop = FALSE])

##          country X1995ConspC X1996ConspC X1997ConspC X1998ConspC
## 1          EU 28          NA          NA          NA          NA
## 2   Euro area 19          NA          NA          NA          NA
## 3        Belgium    18168.83    18634.68    18867.78    19334.14
## 4       Bulgaria          NA     3777.06     3163.05     3326.24
## 5 Czech Republic   148721.29   159428.17   162742.83   161855.85
## 6        Denmark   176096.32   179576.05   182940.60   187630.27

Since these data do not have column names that R would be able to guess automatically, we will have to provide quite a few arguments:

varying as a list of vectors, each specifying the columns for one varying variable
v.names as a vector of names for those variables

more_notpretty_long <- reshape(data = more_notpretty
                             , direction = "long"
                             , varying = list(2:23
                                            , 24:45
                                            , 46:67
                                            )
                             , timevar = "year"
                             , times = 1995:2016
                             , idvar = "country"
                             , v.names = c("ConspC"
                                         , "AGDIpC"
                                         , "GrossSaving"
                                         )
                             )
head(more_notpretty_long)

##                            country year    ConspC    AGDIpC GrossSaving
## EU 28.1995                   EU 28 1995        NA        NA          NA
## Euro area 19.1995     Euro area 19 1995        NA        NA          NA
## Belgium.1995               Belgium 1995  18168.83  21577.92     27350.1
## Bulgaria.1995             Bulgaria 1995        NA        NA       448.4
## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46    116646.0
## Denmark.1995               Denmark 1995 176096.32 179741.30     42398.0

Now let us showcase how easy the reshape is if we adhere to R’s favourite column naming with the same data:

more_pretty <- read.csv(
  stringsAsFactors = FALSE
, file = "https://jozef.io/post/data/ESA2010_pretty.csv"
)
head(more_pretty[, 1:5, drop = FALSE])

##          country ConspC.1995 ConspC.1996 ConspC.1997 ConspC.1998
## 1          EU 28          NA          NA          NA          NA
## 2   Euro area 19          NA          NA          NA          NA
## 3        Belgium    18168.83    18634.68    18867.78    19334.14
## 4       Bulgaria          NA     3777.06     3163.05     3326.24
## 5 Czech Republic   148721.29   159428.17   162742.83   161855.85
## 6        Denmark   176096.32   179576.05   182940.60   187630.27

We tell R only the information it necessarily needs, same as with the simple reshape:

more_pretty_long <- reshape(data = more_pretty
                           , direction = "long"
                           , varying = 2:67
                           , idvar = "country"
                           )
head(more_pretty_long)

##                            country time    ConspC    AGDIpC GrossSaving
## EU 28.1995                   EU 28 1995        NA        NA          NA
## Euro area 19.1995     Euro area 19 1995        NA        NA          NA
## Belgium.1995               Belgium 1995  18168.83  21577.92     27350.1
## Bulgaria.1995             Bulgaria 1995        NA        NA       448.4
## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46    116646.0
## Denmark.1995               Denmark 1995 176096.32 179741.30     42398.0

That was really easy and we got the desired result!

Now as the very last example, we may want to get the data into an even longer form, if we also

consider the actual variables we are measuring as varying
their names will turn into times
with measurement being the name for timevar

more_longer <- reshape(data = more_pretty_long
                    , direction = "long"
                    , varying = 3:5
                    , timevar = "measurement"
                    , times = names(more_pretty_long[, 3:5])
                    , v.names = "Value"
                    )
head(more_longer)

##                 country time measurement     Value id
## 1.ConspC          EU 28 1995      ConspC        NA  1
## 2.ConspC   Euro area 19 1995      ConspC        NA  2
## 3.ConspC        Belgium 1995      ConspC  18168.83  3
## 4.ConspC       Bulgaria 1995      ConspC        NA  4
## 5.ConspC Czech Republic 1995      ConspC 148721.29  5
## 6.ConspC        Denmark 1995      ConspC 176096.32  6

Alternatives to base R

There are many alternatives to the base functionality, each with their own pros and cons, here is a selection of three in no particular order:

melt and dcast from the reshape2 package
gather and spread from the tidyr package
melt and dcast from the data.table package

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

Exercises

At the beginning of the article, our data had countries in rows and yearly data as columns. Reshape the data such that the countries will be in columns and the years are in rows.
reshape(reshape(gdi_long_full)) gives us a data.frame equivalent to gdi_long_full, even though we call the function twice with no extra arguments, just the data. What kind of sorcery is this? Why don’t we need to provide at least the direction, or the varying arguments?

References

Exercise answers

At the bottom of the code for the article

R:case4base - about the series

Sat, 24 Mar 2018 00:00:00 +0000

What does this series offer?

This is the introduction to the R:case4base series. The aim of the series is to elaborate on very useful features of base R that are lesser known and many times substituted with custom functionality of external packages. The motivation behind the series is to provide useful and easy to read information on the usage of these functionalities from the basic to the advanced topics related to them.

Usually one article in the series will

contain content on 1 such functionality
follow a learning system starting from the basics and continuing with more advanced topics, with examples and simple explanations, at the cost of rigorousness
come with an accompanying piece of fully portable R code that can be downloaded and played with, no additional setup or packages needed
come with a few exercises for those wanting to examine the code a bit more
provide a list of references for further reading
provide a list of alternatives to the base functionality in no particular order

What is considered base R

The list of packages considered as base can be retrieved, along with some basic info, by calling the following:

installed.packages(priority = "base")[, c(5, 6)]

##           Depends Imports                     
## base      NA      NA                          
## compiler  NA      NA                          
## datasets  NA      NA                          
## graphics  NA      "grDevices"                 
## grDevices NA      NA                          
## grid      NA      "grDevices, utils"          
## methods   NA      "utils, stats"              
## parallel  NA      "tools, compiler"           
## splines   NA      "graphics, stats"           
## stats     NA      "utils, grDevices, graphics"
## stats4    NA      "graphics, methods, stats"  
## tcltk     NA      "utils"                     
## tools     NA      NA                          
## utils     NA      NA
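As a small follow-up, the same call can be used to check whether a given function comes from one of the base packages. This is only a minimal sketch: the helper name `is_base_function` is made up for this illustration and is not part of R or of the series' code. It assumes that a function's enclosing environment is the namespace of the package that defines it, which holds for ordinary package functions.

```r
# Names of the packages with priority "base" in this R installation
base_pkgs <- rownames(installed.packages(priority = "base"))

# Hypothetical helper (not part of R itself): TRUE if `fn` is defined
# in one of the base packages listed above
is_base_function <- function(fn) {
  # Primitives such as sum() have no enclosing environment,
  # but they all live in the base package
  if (is.primitive(fn)) return(TRUE)
  # For closures, the enclosing environment is the package namespace
  environmentName(environment(fn)) %in% base_pkgs
}

is_base_function(reshape)        # reshape() is defined in stats
is_base_function(sum)            # a primitive from base
is_base_function(function(x) x)  # a user-defined function
```

The first two calls should report TRUE and the last one FALSE, since a function defined at the prompt is enclosed by the global environment rather than by a package namespace.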
