<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Jozef&#39;s Rblog</title>
    <link>https://jozef.io/</link>
    <description>Recent content on Jozef&#39;s Rblog</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>Jozef</copyright>
    <lastBuildDate>Sat, 26 Dec 2020 12:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://jozef.io/index.xml" rel="self" type="application/rss+xml" />
    
    
    
    <item>
      <title>Optimizing partitioning for Apache Spark database loads via JDBC for performance</title>
      <link>https://jozef.io/r926-spark-jdbc-partitioning/</link>
      <pubDate>Sat, 26 Dec 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r926-spark-jdbc-partitioning/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Apache Spark is a popular open-source analytics engine for big data processing, and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. Apart from working with HDFS-based data storage, a very common task with Spark is interfacing with traditional relational database management systems (RDBMS) such as Oracle, MS SQL Server, and others. A lot of performance can be gained by efficiently partitioning data for these types of data loads.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will explore the partitioning options that are available for Spark’s JDBC reading capabilities and investigate how partitioning is implemented in Spark itself to choose the options such that we get the best performing results. We will also show how to use those options from R using the sparklyr package.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#getting-test-data-into-a-mysql-database&#34;&gt;Getting test data into a MySQL database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#partitioning-columns-with-sparks-jdbc-reading-capabilities&#34;&gt;Partitioning columns with Spark’s JDBC reading capabilities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#partitioning-options&#34;&gt;Partitioning options&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#partitioning-examples-using-the-interactive-spark-shell&#34;&gt;Partitioning examples using the interactive Spark shell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#comparing-the-performance-of-different-partitioning-options&#34;&gt;Comparing the performance of different partitioning options&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#understanding-the-partitioning-implementation&#34;&gt;Understanding the partitioning implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#setting-up-partitioning-for-jdbc-via-spark-from-r-with-sparklyr&#34;&gt;Setting up partitioning for JDBC via Spark from R with sparklyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr-just-tell-me-roughly-how-to-partition&#34;&gt;TL;DR, just tell me roughly how to partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#running-the-code-in-this-article&#34;&gt;Running the code in this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-test-data-into-a-mysql-database&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting test data into a MySQL database&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are interested only in partitioning content, feel free to &lt;a href=&#34;#partitioning-columns-with-sparks-jdbc-reading-capabilities&#34;&gt;skip this paragraph&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For a fully reproducible example, we will use a local MySQL server instance, which is very accessible thanks to its open-source nature. Let’s populate a database table with some randomly generated data that will be useful to show different partitioning strategies and their impact on performance. We will write this data frame into the MySQL database using R’s &lt;code&gt;{DBI}&lt;/code&gt; package and call the newly created table &lt;code&gt;test_table&lt;/code&gt;. For the timings below we used a table with 10 million records.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Set this to 1e7L for timings similar to those in the pictures
rows &amp;lt;- 1e5L
groups &amp;lt;- 8L
set.seed(1)

mkNum &amp;lt;- function(x) vapply(x, function(s) sum(utf8ToInt(s)), numeric(1))
mkStr &amp;lt;- function() paste(sample(labels(eurodist), 3L), collapse = &amp;quot;&amp;quot;)

unif &amp;lt;- floor(runif(rows, min = 0L, max = groups))
state_name &amp;lt;- sample(state.name, rows, replace = TRUE)
state_str  &amp;lt;- replicate(rows, mkStr())

test_df &amp;lt;- data.frame(
  id = seq_len(rows),
  grp_unif = unif,
  grp_skwd = pmin(floor(rexp(rows)), groups - 1L),
  grp_unif_range = (unif + 1L) ^ (unif + 1L),
  state_name = state_name,
  state_value = mkNum(state_name) * (1 + runif(rows)),
  state_srt_1 = state_str,
  state_srt_2 = sample(state_str),
  state_srt_3 = sample(state_str),
  state_srt_4 = sample(state_str),
  state_srt_5 = sample(state_str),
  stringsAsFactors = FALSE
)

con &amp;lt;- DBI::dbConnect(drv = RMySQL::MySQL(), db = &amp;quot;testdb&amp;quot;, password = &amp;quot;pass&amp;quot;)
DBI::dbWriteTable(con, &amp;quot;test_table&amp;quot;, test_df, overwrite = TRUE)
DBI::dbDisconnect(con)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;partitioning-columns-with-sparks-jdbc-reading-capabilities&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Partitioning columns with Spark’s JDBC reading capabilities&lt;/h1&gt;
&lt;p&gt;For this paragraph, we assume that the reader has some knowledge of Spark’s JDBC reading capabilities. We discussed the topic in more detail in the &lt;a href=&#34;https://jozef.io/r925-spark-jdbc-loading-data&#34;&gt;related previous article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The partitioning options are provided to the &lt;a href=&#34;https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html&#34;&gt;DataFrameReader&lt;/a&gt; similarly to other options. We will focus on the 4 key options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;partitionColumn&lt;/code&gt; - The name of the column used for partitioning. It must be a numeric, date, or timestamp column from the table in question.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;numPartitions&lt;/code&gt; - The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lowerBound&lt;/code&gt; and &lt;code&gt;upperBound&lt;/code&gt; - The bounds used to decide the partition stride. We will talk more about the &lt;code&gt;stride&lt;/code&gt; a bit later in the article.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A few important notes need to be made:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If no partitioning options are specified, Spark will use a single executor and create a single non-empty partition. Reading the data will be neither distributed nor parallelized. This can cause significant performance loss in cases where parallelized reading is preferable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;lowerBound&lt;/code&gt; and &lt;code&gt;upperBound&lt;/code&gt; options are only used to define &lt;em&gt;how&lt;/em&gt; the data is partitioned, not &lt;em&gt;which&lt;/em&gt; data is read in. There is a common misconception that using the wrong bounds will filter the data which is &lt;em&gt;not&lt;/em&gt; the case.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;partitioning-options&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Partitioning options&lt;/h1&gt;
&lt;p&gt;With that in mind and the testing table prepared, let us investigate 2 columns that are relevant for partitioning and how their values are distributed. We will then see how using each of the columns for partitioning can impact the performance of the reading process.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The green histogram shows the distribution of values in the &lt;code&gt;grp_unif&lt;/code&gt; column, in which the values are evenly distributed between the values 0 to 7&lt;/li&gt;
&lt;li&gt;The blue histogram shows the distribution of values in the &lt;code&gt;grp_skwd&lt;/code&gt; column, in which the values are heavily skewed towards the smaller values, 0 being by far the most prevalent and 7 very rare&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r926-01-record-counts.png&#34; alt=&#34;Distribution of record counts for the 2 partitioning columns&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Distribution of record counts for the 2 partitioning columns&lt;/p&gt;
&lt;/div&gt;
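&lt;p&gt;The same distributions can be approximated without a database. The following is a minimal sketch that regenerates only the two grouping columns with the same expressions as in the data preparation code. Because the other columns are not generated here, the random draws, and therefore the exact counts, will differ from those in &lt;code&gt;test_table&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Regenerate only the two grouping columns (illustrative, not the exact table data)
set.seed(1)
rows &amp;lt;- 1e5L
groups &amp;lt;- 8L
grp_unif &amp;lt;- floor(runif(rows, min = 0, max = groups))
grp_skwd &amp;lt;- pmin(floor(rexp(rows)), groups - 1L)

# Count the records per value 0..7 for each column
counts &amp;lt;- rbind(
  grp_unif = tabulate(grp_unif + 1L, nbins = groups),
  grp_skwd = tabulate(grp_skwd + 1L, nbins = groups)
)
colnames(counts) &amp;lt;- 0:(groups - 1L)

# Share of records per value
round(counts / rows, 3)&lt;/code&gt;&lt;/pre&gt;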
&lt;/div&gt;
&lt;div id=&#34;partitioning-examples-using-the-interactive-spark-shell&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Partitioning examples using the interactive Spark shell&lt;/h1&gt;
&lt;p&gt;To show the partitioning and make example timings, we will use the interactive local Spark shell. We can run the Spark shell and provide it the needed jars using the &lt;code&gt;--jars&lt;/code&gt; option and allocate the memory needed for our driver:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;/usr/local/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-shell \
  --jars /home/$USER/jars/mysql-connector-java-8.0.21/mysql-connector-java-8.0.21.jar \
  --driver-memory 7g&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now within the Spark shell, we can execute Scala expressions for three scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;no partitioning options provided (baseline)&lt;/li&gt;
&lt;li&gt;partitioning using the skewed column&lt;/li&gt;
&lt;li&gt;partitioning using the uniformly distributed column&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After running these, we can compare the speed and see the benefit we gained by the different partitioning approaches versus the baseline.&lt;/p&gt;
&lt;pre class=&#34;scala&#34;&gt;&lt;code&gt;// First, setup the data frame without partitioning
val reader_no_partitioning = spark.read.
  format(&amp;quot;jdbc&amp;quot;).
  option(&amp;quot;url&amp;quot;, &amp;quot;jdbc:mysql://localhost:3306/testdb&amp;quot;).
  option(&amp;quot;user&amp;quot;, &amp;quot;rstudio&amp;quot;).
  option(&amp;quot;password&amp;quot;, &amp;quot;pass&amp;quot;).
  option(&amp;quot;driver&amp;quot;, &amp;quot;com.mysql.cj.jdbc.Driver&amp;quot;).
  option(&amp;quot;dbtable&amp;quot;, &amp;quot;test_table&amp;quot;)

val df_no_partitioning = reader_no_partitioning.load()
df_no_partitioning.cache().count()
df_no_partitioning.unpersist()
  
// Now use the skewed column to partition
val reader_partitioning_skewed = reader_no_partitioning.
  option(&amp;quot;partitionColumn&amp;quot;, &amp;quot;grp_skwd&amp;quot;).
  option(&amp;quot;numPartitions&amp;quot;, 8).
  option(&amp;quot;lowerBound&amp;quot;, 0).
  option(&amp;quot;upperBound&amp;quot;, 4)
val df_partitioning_skewed = reader_partitioning_skewed.load()
df_partitioning_skewed.cache().count()
df_partitioning_skewed.unpersist()

// Now use the uniform column to partition
val reader_partitioning_unif = reader_no_partitioning.
  option(&amp;quot;partitionColumn&amp;quot;, &amp;quot;grp_unif&amp;quot;).
  option(&amp;quot;numPartitions&amp;quot;, 8).
  option(&amp;quot;lowerBound&amp;quot;, 0).
  option(&amp;quot;upperBound&amp;quot;, 8)
val df_partitioning_unif = reader_partitioning_unif.load()
df_partitioning_unif.cache().count()
df_partitioning_unif.unpersist()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;comparing-the-performance-of-different-partitioning-options&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Comparing the performance of different partitioning options&lt;/h1&gt;
&lt;p&gt;Now let us look at how fast each of the read operations was. This is by no means a representative benchmark for real-life data loads, but it can provide some insight into optimizing the partitioning. In our experience, the benefits of proper partitioning can be substantial, especially in real-life use cases where the databases sit on external servers and support many concurrent connections.&lt;/p&gt;
&lt;p&gt;First, let’s see the total time for the 3 options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;JobId 0 - no partitioning - total time of 2.9 minutes&lt;/li&gt;
&lt;li&gt;JobId 1 - partitioning using the &lt;code&gt;grp_skwd&lt;/code&gt; column and 8 partitions - 2.1 minutes&lt;/li&gt;
&lt;li&gt;JobId 2 - partitioning using the &lt;code&gt;grp_unif&lt;/code&gt; column and 8 partitions - 59 seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r926-02-spark-partitioning-timing.png&#34; alt=&#34;Timing of reading using different partitioning options&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Timing of reading using different partitioning options&lt;/p&gt;
&lt;/div&gt;
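&lt;p&gt;To put the timings above into perspective, the relative speedups against the baseline can be computed directly from the reported totals. This is plain arithmetic on the numbers listed above, with the minute values converted to seconds:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Total times reported above, in seconds
timings_sec &amp;lt;- c(none = 2.9 * 60, grp_skwd = 2.1 * 60, grp_unif = 59)

# Speedup relative to the run without partitioning
speedup &amp;lt;- timings_sec[[&amp;quot;none&amp;quot;]] / timings_sec
round(speedup, 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Partitioning on the skewed column was roughly 1.4 times faster than the baseline, while partitioning on the uniform column was roughly 2.9 times faster.&lt;/p&gt;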
&lt;p&gt;To better understand why the partitioning using the &lt;code&gt;grp_unif&lt;/code&gt; column was so much faster, let us look at the performance per partition, with the partitioning using &lt;code&gt;grp_skwd&lt;/code&gt; on the left and &lt;code&gt;grp_unif&lt;/code&gt; on the right:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r926-03-spark-partitioning-stages.png&#34; alt=&#34;Investigating timing for each partition&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Investigating timing for each partition&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;We can see that the durations of the individual partitions for &lt;code&gt;grp_unif&lt;/code&gt; are almost identical, whereas for &lt;code&gt;grp_skwd&lt;/code&gt; the longest duration is much larger than the shortest one. This is heavily correlated with the sizes of the partitions, which points us toward our conclusion when looking at the actual implementation.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;understanding-the-partitioning-implementation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Understanding the partitioning implementation&lt;/h1&gt;
&lt;p&gt;The implementation of the partitioning within Apache Spark can be found &lt;a href=&#34;https://github.com/apache/spark/blob/7bbcbb84c266b6ff418cd2c3361aa7350299d0ae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L129&#34;&gt;in this piece of source code&lt;/a&gt;. The single line that is key to understanding the partitioning process and its performance implications is the following:&lt;/p&gt;
&lt;pre class=&#34;scala&#34;&gt;&lt;code&gt;val stride: Long = upperBound / numPartitions - lowerBound / numPartitions&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In combination with the &lt;code&gt;while&lt;/code&gt; loop:&lt;/p&gt;
&lt;pre class=&#34;scala&#34;&gt;&lt;code&gt;while (i &amp;lt; numPartitions) {
  val lBoundValue = boundValueToString(currentValue)
  val lBound = if (i != 0) s&amp;quot;$column &amp;gt;= $lBoundValue&amp;quot; else null
  currentValue += stride
  val uBoundValue = boundValueToString(currentValue)
  val uBound = if (i != numPartitions - 1) s&amp;quot;$column &amp;lt; $uBoundValue&amp;quot; else null
  val whereClause =
    if (uBound == null) {
      lBound
    } else if (lBound == null) {
      s&amp;quot;$uBound or $column is null&amp;quot;
    } else {
      s&amp;quot;$lBound AND $uBound&amp;quot;
    }
  ans += JDBCPartition(whereClause, i)
  i = i + 1
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see that the data to be read is partitioned by splitting the values in the &lt;code&gt;partitionColumn&lt;/code&gt; into &lt;code&gt;numPartitions&lt;/code&gt; groups using the &lt;code&gt;stride&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Based on this information, we can optimize the column that we choose for the partitioning as well as the values for &lt;code&gt;upperBound&lt;/code&gt; and &lt;code&gt;lowerBound&lt;/code&gt; such that the intervals for the values of &lt;code&gt;partitionColumn&lt;/code&gt; will end up with roughly the same size.&lt;/p&gt;
&lt;p&gt;In our example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;grp_unif&lt;/code&gt; column was purposefully generated such that this is the case with the most basic partitioning options, each partition having around 1.25 million records&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;grp_skwd&lt;/code&gt; column had partitions with very different sizes, the biggest one with more than 6.3 million records, whereas the smallest one had only around 9 thousand records&lt;/li&gt;
&lt;/ul&gt;
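&lt;p&gt;To make the generated predicates more tangible, the loop quoted above can be re-implemented in a few lines of R. This is only a sketch of the quoted excerpt, not Spark’s actual code: the function name &lt;code&gt;jdbc_where_clauses()&lt;/code&gt; is hypothetical, it assumes integer bounds, at least 2 partitions, and that the running value starts at &lt;code&gt;lowerBound&lt;/code&gt;, and it ignores any validation or adjustment of the options that Spark may perform elsewhere:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper mirroring the quoted Scala loop (a sketch, not Spark code)
jdbc_where_clauses &amp;lt;- function(column, lowerBound, upperBound, numPartitions) {
  stopifnot(numPartitions &amp;gt;= 2L)
  # Integer division mirrors the Long arithmetic of the stride calculation
  stride &amp;lt;- upperBound %/% numPartitions - lowerBound %/% numPartitions
  # Assumption: the running value starts at the lower bound
  currentValue &amp;lt;- lowerBound
  clauses &amp;lt;- character(numPartitions)
  for (i in seq_len(numPartitions)) {
    # The first partition has no lower limit
    lBound &amp;lt;- if (i != 1L) paste(column, &amp;quot;&amp;gt;=&amp;quot;, currentValue) else NULL
    currentValue &amp;lt;- currentValue + stride
    # The last partition has no upper limit
    uBound &amp;lt;- if (i != numPartitions) paste(column, &amp;quot;&amp;lt;&amp;quot;, currentValue) else NULL
    clauses[i] &amp;lt;- if (is.null(uBound)) {
      lBound
    } else if (is.null(lBound)) {
      paste(uBound, &amp;quot;or&amp;quot;, column, &amp;quot;is null&amp;quot;)
    } else {
      paste(lBound, &amp;quot;AND&amp;quot;, uBound)
    }
  }
  clauses
}

# Predicates for the uniform column with the options used in the Spark shell
jdbc_where_clauses(&amp;quot;grp_unif&amp;quot;, 0L, 8L, 8L)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first predicate has no lower limit and the last one has no upper limit, which is why values outside of the provided bounds are still read, in line with the earlier note on &lt;code&gt;lowerBound&lt;/code&gt; and &lt;code&gt;upperBound&lt;/code&gt;.&lt;/p&gt;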
&lt;/div&gt;
&lt;div id=&#34;setting-up-partitioning-for-jdbc-via-spark-from-r-with-sparklyr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Setting up partitioning for JDBC via Spark from R with sparklyr&lt;/h1&gt;
&lt;p&gt;As we have shown in detail in the &lt;a href=&#34;https://jozef.io/r925-spark-jdbc-loading-data/&#34;&gt;previous article&lt;/a&gt;, we can use sparklyr’s function &lt;code&gt;spark_read_jdbc()&lt;/code&gt; to perform the data loads using JDBC within Spark from R. The key to using partitioning is to correctly adjust the &lt;code&gt;options&lt;/code&gt; argument with elements named:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;numPartitions&lt;/li&gt;
&lt;li&gt;partitionColumn&lt;/li&gt;
&lt;li&gt;lowerBound&lt;/li&gt;
&lt;li&gt;upperBound&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are mapped one-to-one to the options as described above. Once we have done that, we pass the created options to the call to &lt;code&gt;spark_read_jdbc()&lt;/code&gt; along with the other connection options in the &lt;code&gt;options&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;An oversimplified example of a full load could look like so:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Setup jars and connect to Spark ----
jars &amp;lt;- dir(&amp;quot;~/jars&amp;quot;, pattern = &amp;quot;jar$&amp;quot;, recursive = TRUE, full.names = TRUE)
config &amp;lt;- sparklyr::spark_config()
config$sparklyr.jars.default &amp;lt;- jars
config[[&amp;quot;sparklyr.shell.driver-memory&amp;quot;]] &amp;lt;- &amp;quot;6G&amp;quot;
sc &amp;lt;- sparklyr::spark_connect(&amp;quot;local&amp;quot;, config = config)

# Create basic JDBC connection options ----
jdbcOpts &amp;lt;- list(
  user = &amp;quot;rstudio&amp;quot;,
  password = &amp;quot;pass&amp;quot;,
  server = &amp;quot;localhost&amp;quot;,
  driver = &amp;quot;com.mysql.cj.jdbc.Driver&amp;quot;,
  fetchsize = &amp;quot;100000&amp;quot;,
  dbtable = &amp;quot;test_table&amp;quot;,
  url = &amp;quot;jdbc:mysql://localhost:3306/testdb&amp;quot;
)

# Create the partitioning options ----
partitioningOpts &amp;lt;- list(
  numPartitions = 8L,
  partitionColumn = &amp;quot;grp_unif&amp;quot;,
  lowerBound = 0L,
  upperBound = 8L
)

# Use the options combined to read a table ----
test_tbl &amp;lt;- sparklyr::spark_read_jdbc(
  sc,
  &amp;quot;test_table&amp;quot;,
  options = c(jdbcOpts, partitioningOpts),
  memory = FALSE
)

# Print a few records ----
test_tbl

# Disconnect ----
sparklyr::spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr-just-tell-me-roughly-how-to-partition&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR, just tell me roughly how to partition&lt;/h1&gt;
&lt;p&gt;At the risk of oversimplifying and omitting some corner cases, to partition reading from Spark via JDBC, we can provide our &lt;code&gt;DataFrameReader&lt;/code&gt; with the options:&lt;/p&gt;
&lt;pre class=&#34;scala&#34;&gt;&lt;code&gt;option(&amp;quot;partitionColumn&amp;quot;, column_to_partition)
option(&amp;quot;numPartitions&amp;quot;, n)
option(&amp;quot;lowerBound&amp;quot;, x)
option(&amp;quot;upperBound&amp;quot;, y)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;stride&lt;/code&gt; is then calculated as &lt;code&gt;stride = y/n - x/n&lt;/code&gt; and the partitions are created by splitting the values of &lt;code&gt;partitionColumn&lt;/code&gt; roughly like so:&lt;/p&gt;
&lt;pre class=&#34;scala&#34;&gt;&lt;code&gt;Partition 1: Rows where column_to_partition ∈ &amp;lt;x, x+stride)
Partition 2: Rows where column_to_partition ∈ &amp;lt;x+stride, x+2*stride)
...
Partition n: Rows where column_to_partition ∈ &amp;lt;y-stride, y)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We try to set up the values of &lt;code&gt;column_to_partition, n, x, y&lt;/code&gt; such that each of the created partitions is of roughly the same size.&lt;/p&gt;
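&lt;p&gt;One way to check a candidate setup before running the load is to count how many records would fall into each partition. The sketch below does this in R on a vector of column values. Note that &lt;code&gt;partition_sizes()&lt;/code&gt; is a hypothetical helper that assumes integer bounds, no missing values, and the stride logic described above, and that the two columns are regenerated with the expressions from the test data rather than read from the database:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: number of records per partition for given options
partition_sizes &amp;lt;- function(values, lowerBound, upperBound, numPartitions) {
  stride &amp;lt;- upperBound %/% numPartitions - lowerBound %/% numPartitions
  # Inner boundaries; the first and last partitions are unbounded
  breaks &amp;lt;- lowerBound + stride * seq_len(numPartitions - 1L)
  # findInterval() counts the boundaries that are not above each value
  idx &amp;lt;- findInterval(values, breaks) + 1L
  tabulate(idx, nbins = numPartitions)
}

# Regenerate the two grouping columns (illustrative, not the exact table data)
set.seed(1)
rows &amp;lt;- 1e5L
groups &amp;lt;- 8L
grp_unif &amp;lt;- floor(runif(rows, min = 0, max = groups))
grp_skwd &amp;lt;- pmin(floor(rexp(rows)), groups - 1L)

# Compare how balanced the partitions would be for each column
sizes_unif &amp;lt;- partition_sizes(grp_unif, 0L, 8L, 8L)
sizes_skwd &amp;lt;- partition_sizes(grp_skwd, 0L, 8L, 8L)
rbind(grp_unif = sizes_unif, grp_skwd = sizes_skwd)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A large gap between the biggest and the smallest count suggests trying a different column or different bounds.&lt;/p&gt;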
&lt;/div&gt;
&lt;div id=&#34;running-the-code-in-this-article&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Running the code in this article&lt;/h1&gt;
&lt;p&gt;If you have Docker available, running the following should yield a Docker container with RStudio Server exposed on port 8787, so you can open your web browser at &lt;code&gt;http://localhost:8787&lt;/code&gt; to access it and experiment with the code. The user name is &lt;code&gt;rstudio&lt;/code&gt; and the password is as you choose below:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run -d -p 8787:8787 -e PASSWORD=pass jozefhajnala/jozefio&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/r925-spark-jdbc-loading-data/&#34;&gt;A guide to retrieval and processing of data&lt;/a&gt; from relational database systems using Apache Spark and JDBC with R and sparklyr&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html&#34;&gt;JDBC To Other Databases&lt;/a&gt; in Spark documentation&lt;/li&gt;
&lt;li&gt;Discussion on the JDBC partitioning topic at &lt;a href=&#34;https://stackoverflow.com/questions/43150694/partitioning-in-spark-while-reading-from-rdbms-via-jdbc&#34;&gt;StackOverflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://medium.com/@radek.strnad/tips-for-using-jdbc-in-apache-spark-sql-396ea7b2e3d3&#34;&gt;Tips for using JDBC&lt;/a&gt; in Apache Spark SQL by Radek Strnad&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html&#34;&gt;Class DataFrameReader&lt;/a&gt; as Spark’s Scala API Doc&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala&#34;&gt;DataFrameReader implementation&lt;/a&gt; at GitHub&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>A guide to retrieval and processing of data from relational database systems using Apache Spark and JDBC with R and sparklyr</title>
      <link>https://jozef.io/r925-spark-jdbc-loading-data/</link>
      <pubDate>Sat, 15 Aug 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r925-spark-jdbc-loading-data/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;{sparklyr}&lt;/code&gt; package lets us connect to and use Apache Spark for high-performance, highly parallelized, and distributed computations. We can also use Spark’s capabilities to improve and streamline our data processing pipelines, as Spark supports reading from and writing to many popular sources such as Parquet and ORC, as well as most database systems via JDBC drivers.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will explore using R to perform data loads to Spark and optionally R from relational database management systems such as MySQL, Oracle, and MS SQL Server and show how such processes can be simplified. We will also provide reproducible code via a Docker image, such that interested readers can experiment with it easily.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#getting-test-data-into-a-mysql-database&#34;&gt;Getting test data into a MySQL database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-jdbc-to-connect-to-database-systems-from-spark&#34;&gt;Using JDBC to connect to database systems from Spark&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#getting-a-jdbc-driver-and-using-it-with-spark-and-sparklyr&#34;&gt;Getting a JDBC driver and using it with Spark and sparklyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#retrieving-data-from-a-database-with-sparklyr&#34;&gt;Retrieving data from a database with sparklyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#setting-the-options-argument-of-spark_read_jdbc&#34;&gt;Setting the options argument of spark_read_jdbc()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#loading-a-specific-database-table&#34;&gt;Loading a specific database table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#executing-a-query-instead&#34;&gt;Executing a query instead&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-rdbm-systems&#34;&gt;Other RDBM Systems&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#oracle&#34;&gt;Oracle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ms-sql-server&#34;&gt;MS SQL Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#even-more-rdbm-systems&#34;&gt;Even more RDBM Systems&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#some-notes-on-performance&#34;&gt;Some notes on performance&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#the-memory-argument&#34;&gt;The memory argument&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#partitioning&#34;&gt;Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#running-the-code-in-this-article&#34;&gt;Running the code in this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-test-data-into-a-mysql-database&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting test data into a MySQL database&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are interested only in the Spark loading part, feel free to &lt;a href=&#34;#using-jdbc-to-connect-to-database-systems-from-spark&#34;&gt;skip this paragraph&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For a fully reproducible example, we will use a local MySQL server instance, which is very accessible thanks to its open-source nature. We will use the &lt;code&gt;{DBI}&lt;/code&gt; and &lt;code&gt;{RMySQL}&lt;/code&gt; packages to connect to the server directly from R and populate a database with data provided by the &lt;code&gt;{nycflights13}&lt;/code&gt; package that we will later use for our Spark loads.&lt;/p&gt;
&lt;p&gt;Let us write the &lt;code&gt;flights&lt;/code&gt; data frame into the MySQL database using &lt;code&gt;{DBI}&lt;/code&gt; and call the newly created table &lt;code&gt;test_table&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_df &amp;lt;- nycflights13::flights

# Create a connection to database `testdb`
con &amp;lt;- DBI::dbConnect(
  drv = RMySQL::MySQL(),
  host = &amp;quot;localhost&amp;quot;,
  dbname = &amp;quot;testdb&amp;quot;,
  user = &amp;quot;rstudio&amp;quot;,
  password = &amp;quot;pass&amp;quot;
)

# Write our `test_df` into a table called `test_table`
DBI::dbWriteTable(con, &amp;quot;test_table&amp;quot;, test_df, overwrite = TRUE)

# Close the connection
DBI::dbDisconnect(con)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have our table available and we can focus on the main part of the article.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-jdbc-to-connect-to-database-systems-from-spark&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using JDBC to connect to database systems from Spark&lt;/h1&gt;
&lt;div id=&#34;getting-a-jdbc-driver-and-using-it-with-spark-and-sparklyr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting a JDBC driver and using it with Spark and sparklyr&lt;/h2&gt;
&lt;p&gt;Since Spark runs via a JVM, the natural way to establish connections to database systems is using Java Database Connectivity (JDBC). To do that, we will need a JDBC driver which will enable us to interact with the database system of our choice. For this example, we are using MySQL, but we provide details on other RDBMS later in the article.&lt;/p&gt;
&lt;div id=&#34;downloading-and-extracting-the-connector-jar&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Downloading and extracting the connector jar&lt;/h3&gt;
&lt;p&gt;With a bit of online search, we can download the driver and extract the contents of the zip file:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;mkdir -p $HOME/jars
wget -q -t 3 \
  -O $HOME/jars/mysql-connector.zip \
  https://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-8.0.21.zip 
unzip -q -o \
  -d $HOME/jars \
  $HOME/jars/mysql-connector.zip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The file we are most interested in for our use case is the &lt;code&gt;.jar&lt;/code&gt; file that contains the classes necessary to establish the connection. Using R, we can locate the extracted jar file(s), for example using the &lt;code&gt;dir()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;jars &amp;lt;- dir(&amp;quot;~/jars&amp;quot;, pattern = &amp;quot;jar$&amp;quot;, recursive = TRUE, full.names = TRUE)
basename(jars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;mysql-connector-java-8.0.21.jar&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;connecting-using-the-jar&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Connecting using the jar&lt;/h3&gt;
&lt;p&gt;Next, we need to tell &lt;code&gt;{sparklyr}&lt;/code&gt; to use that resource when establishing a Spark connection. One way is to add a &lt;code&gt;sparklyr.jars.default&lt;/code&gt; element with the paths to the necessary jar files to the &lt;code&gt;config&lt;/code&gt; list and then establish the Spark connection using that &lt;code&gt;config&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;config &amp;lt;- list(sparklyr.jars.default = jars)
sc &amp;lt;- sparklyr::spark_connect(&amp;quot;local&amp;quot;, config = config)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;retrieving-data-from-a-database-with-sparklyr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Retrieving data from a database with sparklyr&lt;/h2&gt;
&lt;p&gt;With the Spark connection established, we can connect to our MySQL database from Spark and retrieve the data. &lt;code&gt;{sparklyr}&lt;/code&gt; provides a handy &lt;code&gt;spark_read_jdbc()&lt;/code&gt; function for this exact purpose. The API maps closely to the Scala API, but it is not very explicit in how to set up the connection. The key here is the &lt;code&gt;options&lt;/code&gt; argument to &lt;code&gt;spark_read_jdbc()&lt;/code&gt;, which will specify all the connection details we need.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;setting-the-options-argument-of-spark_read_jdbc&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Setting the &lt;code&gt;options&lt;/code&gt; argument of &lt;code&gt;spark_read_jdbc()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;First, let us create a &lt;code&gt;jdbcConnectionOpts&lt;/code&gt; list with the basic connection properties. These are the connection URL and the driver. Below we also explicitly specify the &lt;code&gt;user&lt;/code&gt; and &lt;code&gt;password&lt;/code&gt;, but these can usually also be provided as part of the URL:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Connection options
jdbcConnectionOpts &amp;lt;- list(
  url = &amp;quot;jdbc:mysql://localhost:3306/testdb&amp;quot;,
  driver = &amp;quot;com.mysql.cj.jdbc.Driver&amp;quot;,
  user = &amp;quot;rstudio&amp;quot;, 
  password = &amp;quot;pass&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last bit of information we need to provide is the identification of the data we want to extract once the connection is established. For this, we can use one of two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dbtable&lt;/code&gt; - in case we want to create a Spark DataFrame by extracting contents of a specific table&lt;/li&gt;
&lt;li&gt;&lt;code&gt;query&lt;/code&gt; - in case we want to create a Spark DataFrame by executing a SQL query&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;loading-a-specific-database-table&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading a specific database table&lt;/h2&gt;
&lt;p&gt;First, let us load the database table that we populated with the flights data earlier and named &lt;code&gt;test_table&lt;/code&gt;. Putting it all together, we load the data using &lt;code&gt;spark_read_jdbc()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Other options specific to the action
jdbcDataOpts &amp;lt;- list(dbtable = &amp;quot;test_table&amp;quot;)

# Use spark_read_jdbc() to load the data
test_tbl &amp;lt;- sparklyr::spark_read_jdbc(
  sc = sc,
  name = &amp;quot;test_table&amp;quot;,
  options = append(jdbcConnectionOpts, jdbcDataOpts),
  memory = FALSE
)

# Print some records
test_tbl&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;test_table&amp;gt; [?? x 20]
##    row_names  year month   day dep_time sched_dep_time dep_delay arr_time
##    &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1 1          2013     1     1      517            515         2      830
##  2 2          2013     1     1      533            529         4      850
##  3 3          2013     1     1      542            540         2      923
##  4 4          2013     1     1      544            545        -1     1004
##  5 5          2013     1     1      554            600        -6      812
##  6 6          2013     1     1      554            558        -4      740
##  7 7          2013     1     1      555            600        -5      913
##  8 8          2013     1     1      557            600        -3      709
##  9 9          2013     1     1      557            600        -3      838
## 10 10         2013     1     1      558            600        -2      753
## # … with more rows, and 12 more variables: sched_arr_time &amp;lt;dbl&amp;gt;,
## #   arr_delay &amp;lt;dbl&amp;gt;, carrier &amp;lt;chr&amp;gt;, flight &amp;lt;dbl&amp;gt;, tailnum &amp;lt;chr&amp;gt;,
## #   origin &amp;lt;chr&amp;gt;, dest &amp;lt;chr&amp;gt;, air_time &amp;lt;dbl&amp;gt;, distance &amp;lt;dbl&amp;gt;, hour &amp;lt;dbl&amp;gt;,
## #   minute &amp;lt;dbl&amp;gt;, time_hour &amp;lt;chr&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We provided the following arguments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sc&lt;/code&gt; is the Spark connection that we established using the config that includes necessary jars&lt;/li&gt;
&lt;li&gt;&lt;code&gt;name&lt;/code&gt; is a character string with the name to be assigned to the newly generated table within Spark SQL, &lt;em&gt;not&lt;/em&gt; the name of the source table we want to read from our database&lt;/li&gt;
&lt;li&gt;&lt;code&gt;options&lt;/code&gt; is a list with both the connection options and the data-related options, so we use &lt;code&gt;append()&lt;/code&gt; to combine the &lt;code&gt;jdbcConnectionOpts&lt;/code&gt; and &lt;code&gt;jdbcDataOpts&lt;/code&gt; lists into one&lt;/li&gt;
&lt;li&gt;&lt;code&gt;memory&lt;/code&gt; is a logical that tells Spark whether we want to cache the table into memory. A bit more on that and some performance implications below&lt;/li&gt;
&lt;/ul&gt;
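&lt;p&gt;Because &lt;code&gt;name&lt;/code&gt; registers the data under that name in Spark SQL, the table can also be referenced by name later on. A minimal sketch, assuming the &lt;code&gt;sc&lt;/code&gt; connection and the &lt;code&gt;test_table&lt;/code&gt; registration from the code above are still active; the object name &lt;code&gt;test_tbl_again&lt;/code&gt; is only illustrative:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Reference the registered table by the name given in the `name` argument
test_tbl_again &amp;lt;- dplyr::tbl(sc, &amp;quot;test_table&amp;quot;)

# Or query it with Spark SQL through the DBI interface
DBI::dbGetQuery(sc, &amp;quot;SELECT count(*) AS n FROM test_table&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the table was registered with &lt;code&gt;memory = FALSE&lt;/code&gt;, it is not cached, so such queries read from the database again.&lt;/p&gt;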
&lt;/div&gt;
&lt;div id=&#34;executing-a-query-instead&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Executing a query instead&lt;/h2&gt;
&lt;p&gt;We mentioned above that apart from just loading a table, we can also choose to execute a SQL query and use its result as the source for our Spark DataFrame. Here is a simple example of that:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Use `query` instead of `dbtable`
jdbcDataOpts &amp;lt;- list(
  query = &amp;quot;SELECT * FROM test_table WHERE tailnum = &amp;#39;N14228&amp;#39;&amp;quot;
)

# Use spark_read_jdbc() to load the data
test_qry &amp;lt;- sparklyr::spark_read_jdbc(
  sc = sc,
  name = &amp;quot;test_table&amp;quot;,
  options = append(jdbcConnectionOpts, jdbcDataOpts),
  memory = FALSE
)

# Print some records
test_qry&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;test_table&amp;gt; [?? x 20]
##    row_names  year month   day dep_time sched_dep_time dep_delay arr_time
##    &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1 1          2013     1     1      517            515         2      830
##  2 6570       2013     1     8     1435           1440        -5     1717
##  3 7111       2013     1     9      717            700        17      812
##  4 7349       2013     1     9     1143           1144        -1     1425
##  5 10593      2013     1    13      835            824        11     1030
##  6 13775      2013     1    16     1829           1730        59     2117
##  7 18967      2013     1    22     1902           1808        54     2214
##  8 19417      2013     1    23     1050           1056        -6     1143
##  9 19648      2013     1    23     1533           1529         4     1641
## 10 21046      2013     1    25      724            720         4     1000
## # … with more rows, and 12 more variables: sched_arr_time &amp;lt;dbl&amp;gt;,
## #   arr_delay &amp;lt;dbl&amp;gt;, carrier &amp;lt;chr&amp;gt;, flight &amp;lt;dbl&amp;gt;, tailnum &amp;lt;chr&amp;gt;,
## #   origin &amp;lt;chr&amp;gt;, dest &amp;lt;chr&amp;gt;, air_time &amp;lt;dbl&amp;gt;, distance &amp;lt;dbl&amp;gt;, hour &amp;lt;dbl&amp;gt;,
## #   minute &amp;lt;dbl&amp;gt;, time_hour &amp;lt;chr&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that the only element that changed is the &lt;code&gt;jdbcDataOpts&lt;/code&gt; list, which now contains a &lt;code&gt;query&lt;/code&gt; element instead of a &lt;code&gt;dbtable&lt;/code&gt; element.&lt;/p&gt;
&lt;/blockquote&gt;
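&lt;p&gt;Spark’s JDBC data source documentation states that &lt;code&gt;dbtable&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt; cannot be specified at the same time, and that &lt;code&gt;dbtable&lt;/code&gt; accepts anything valid in a &lt;code&gt;FROM&lt;/code&gt; clause, including a subquery in parentheses. Below is a small sketch of a helper that guards against mixing the two; the function name &lt;code&gt;jdbc_data_opts()&lt;/code&gt; is hypothetical and not part of sparklyr:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Build the data-related options, allowing exactly one of dbtable / query
# (hypothetical helper, not a sparklyr function)
jdbc_data_opts &amp;lt;- function(dbtable = NULL, query = NULL) {
  if (is.null(dbtable) == is.null(query)) {
    stop(&amp;quot;Provide exactly one of `dbtable` or `query`.&amp;quot;)
  }
  if (!is.null(dbtable)) list(dbtable = dbtable) else list(query = query)
}

# A subquery in parentheses with an alias can stand in for a table name
jdbcDataOpts &amp;lt;- jdbc_data_opts(
  dbtable = &amp;quot;(SELECT tailnum, distance FROM test_table) AS t&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;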
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;other-rdbm-systems&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Other RDBM Systems&lt;/h1&gt;
&lt;p&gt;Our toy example with MySQL worked fine, but in practice, we might need to access data in other popular RDBM systems, such as Oracle, MS SQL Server, and others. The pattern shown above still applies, however, as the API design is the same regardless of the system in question.&lt;/p&gt;
&lt;p&gt;In general, we will need 3 elements to successfully connect:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;A JDBC driver specified and the resources provided to &lt;code&gt;{sparklyr}&lt;/code&gt; in the &lt;code&gt;config&lt;/code&gt; argument of &lt;code&gt;spark_connect()&lt;/code&gt;, usually in the form of paths to .jar files containing the necessary resources&lt;/li&gt;
&lt;li&gt;A connection URL that will depend on the system and other setup specifics&lt;/li&gt;
&lt;li&gt;Last but not least, all the technical and infrastructural prerequisites such as credentials with the proper access rights, the host being accessible from the Spark cluster, etc.&lt;/li&gt;
&lt;/ol&gt;
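&lt;p&gt;The first element can be sketched as a small check that fails early when a driver jar is missing. The helper name &lt;code&gt;jdbc_spark_config()&lt;/code&gt; and the jar path in the usage comment are hypothetical; &lt;code&gt;sparklyr.jars.default&lt;/code&gt; is the config entry used earlier in this article:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Build a sparklyr config from driver jar paths, failing early if any is missing
# (hypothetical helper, not a sparklyr function)
jdbc_spark_config &amp;lt;- function(jars) {
  not_found &amp;lt;- jars[!file.exists(jars)]
  if (length(not_found) &amp;gt; 0) {
    stop(&amp;quot;Jar file(s) not found: &amp;quot;, paste(not_found, collapse = &amp;quot;, &amp;quot;))
  }
  list(sparklyr.jars.default = normalizePath(jars))
}

# Usage, with a hypothetical path to the driver jar:
# config &amp;lt;- jdbc_spark_config(&amp;quot;/path/to/driver.jar&amp;quot;)
# sc &amp;lt;- sparklyr::spark_connect(&amp;quot;local&amp;quot;, config = config)&lt;/code&gt;&lt;/pre&gt;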
&lt;p&gt;Now for some examples that we have worked with in the past and had success with.&lt;/p&gt;
&lt;div id=&#34;oracle&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Oracle&lt;/h2&gt;
&lt;div id=&#34;oracle-jdbc-driver&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Oracle JDBC Driver&lt;/h3&gt;
&lt;p&gt;The drivers can be downloaded (after login) from &lt;a href=&#34;https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html&#34;&gt;Oracle’s website&lt;/a&gt; and the driver name usually is &lt;code&gt;&amp;quot;oracle.jdbc.driver.OracleDriver&amp;quot;&lt;/code&gt;. Make sure you use the driver version that matches your database release and Java version.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-fully-qualified-host-identification&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Using fully qualified host identification&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hostName &amp;lt;- &amp;quot;0.0.0.0&amp;quot;
portNumber &amp;lt;- &amp;quot;1521&amp;quot;
serviceName &amp;lt;- &amp;quot;service_name&amp;quot;

jdbcConnectionOpts &amp;lt;- list(
  user = &amp;quot;username&amp;quot;,
  password = &amp;quot;password&amp;quot;,
  driver = &amp;quot;oracle.jdbc.driver.OracleDriver&amp;quot;,
  fetchsize = &amp;quot;100000&amp;quot;,
  url = paste0(
    &amp;quot;jdbc:oracle:thin:@//&amp;quot;,
    hostName, &amp;quot;:&amp;quot;, portNumber,
    &amp;quot;/&amp;quot;, serviceName
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-tnsnames.ora&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Using tnsnames.ora&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;tnsnames.ora&lt;/code&gt; file is a configuration file that contains network service names mapped to connect descriptors for the local naming method, or net service names mapped to listener protocol addresses. With this in place, we can use just the service name instead of fully qualified host, port, and service identification, for example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;serviceName &amp;lt;- &amp;quot;service_name&amp;quot;

jdbcConnectionOpts &amp;lt;- list(
  user = &amp;quot;username&amp;quot;,
  password = &amp;quot;password&amp;quot;,
  driver = &amp;quot;oracle.jdbc.driver.OracleDriver&amp;quot;,
  fetchsize = &amp;quot;100000&amp;quot;,
  url = paste0(&amp;quot;jdbc:oracle:thin:@&amp;quot;, serviceName)
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;parsing-special-data-types&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Parsing special data types&lt;/h3&gt;
&lt;p&gt;Note that the JDBC driver on its own may not be enough to parse all data types in an Oracle database. For instance, parsing the &lt;code&gt;XMLType&lt;/code&gt; will very likely require &lt;code&gt;xmlparserv2.jar&lt;/code&gt; and &lt;code&gt;xdb.jar&lt;/code&gt; along with the proper &lt;code&gt;ojdbc*.jar&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;ms-sql-server&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;MS SQL Server&lt;/h2&gt;
&lt;div id=&#34;ms-sql-server-jdbc-driver&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;MS SQL Server JDBC Driver&lt;/h3&gt;
&lt;p&gt;The drivers for different JRE versions can be downloaded from the &lt;a href=&#34;https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-ver15&#34;&gt;Download Microsoft JDBC Driver for SQL Server&lt;/a&gt; website. Again, make sure that the JRE version matches the one you use in your environments.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ms-sql-server-connection-options&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;MS SQL Server connection options&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;serverName &amp;lt;- &amp;quot;0.0.0.0&amp;quot;
portNumber  &amp;lt;- &amp;quot;1433&amp;quot;
databaseName &amp;lt;- &amp;quot;db_name&amp;quot;

jdbcConnectionOpts &amp;lt;- list(
  user = &amp;quot;username&amp;quot;,
  password = &amp;quot;password&amp;quot;,
  driver = &amp;quot;com.microsoft.sqlserver.jdbc.SQLServerDriver&amp;quot;,
  fetchsize = &amp;quot;100000&amp;quot;,
  url = paste0(
    &amp;quot;jdbc:sqlserver://&amp;quot;,
    serverName, &amp;quot;:&amp;quot;, portNumber,
    &amp;quot;;databaseName=&amp;quot;, databaseName
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;even-more-rdbm-systems&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Even more RDBM Systems&lt;/h2&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r925-01-r-spark-jdbc-rdbms-logos.png&#34; alt=&#34;Logos of R, sparklyr, Spark and selected RDBMS systems&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Logos of R, sparklyr, Spark and selected RDBMS systems&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Vlad Mihalcea wrote a very useful article on &lt;a href=&#34;https://vladmihalcea.com/jdbc-driver-connection-url-strings/&#34;&gt;JDBC Driver Connection URL strings&lt;/a&gt; which has the connection URL details for several other common database systems.&lt;/p&gt;
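&lt;p&gt;The URL patterns used in this article can be collected into one helper. This is only a sketch covering the three formats shown above; the function name &lt;code&gt;jdbc_url()&lt;/code&gt; is hypothetical, and other systems need their own patterns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Assemble a JDBC connection URL for the systems covered in this article
# (hypothetical helper; `db` is the database or service name)
jdbc_url &amp;lt;- function(system, host, port, db) {
  switch(
    system,
    mysql     = paste0(&amp;quot;jdbc:mysql://&amp;quot;, host, &amp;quot;:&amp;quot;, port, &amp;quot;/&amp;quot;, db),
    oracle    = paste0(&amp;quot;jdbc:oracle:thin:@//&amp;quot;, host, &amp;quot;:&amp;quot;, port, &amp;quot;/&amp;quot;, db),
    sqlserver = paste0(&amp;quot;jdbc:sqlserver://&amp;quot;, host, &amp;quot;:&amp;quot;, port, &amp;quot;;databaseName=&amp;quot;, db),
    stop(&amp;quot;No URL pattern defined for system: &amp;quot;, system)
  )
}

jdbc_url(&amp;quot;mysql&amp;quot;, &amp;quot;localhost&amp;quot;, &amp;quot;3306&amp;quot;, &amp;quot;testdb&amp;quot;)
## [1] &amp;quot;jdbc:mysql://localhost:3306/testdb&amp;quot;&lt;/code&gt;&lt;/pre&gt;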
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;some-notes-on-performance&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Some notes on performance&lt;/h1&gt;
&lt;div id=&#34;the-memory-argument&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The &lt;code&gt;memory&lt;/code&gt; argument&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;memory&lt;/code&gt; argument to &lt;code&gt;spark_read_jdbc()&lt;/code&gt; can prove very important when performance is of interest. What happens when using the default &lt;code&gt;memory = TRUE&lt;/code&gt; is that the table in the Spark SQL context is cached using &lt;code&gt;CACHE TABLE&lt;/code&gt; and a &lt;code&gt;SELECT count(*) FROM&lt;/code&gt; query is executed on the cached table. This forces Spark to perform the action of loading the entire table into memory.&lt;/p&gt;
&lt;p&gt;Depending on our use case, it might be much more beneficial to use &lt;code&gt;memory = FALSE&lt;/code&gt; and only cache into Spark memory the parts of the table (or processed results) that we need, as the most time-costly operations usually are data transfers over the network. Transferring as little data as possible from the database into Spark memory may bring significant performance benefits.&lt;/p&gt;
&lt;p&gt;This is a bit difficult to show with our toy example, as everything is physically happening inside the same container (and therefore the same file system), but differences can be observed even with this setup and our small dataset:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;microbenchmark::microbenchmark(
  times = 10,
  setup = {
    library(dplyr)
    library(sparklyr)
    sparklyr::spark_disconnect_all()
    sc &amp;lt;- sparklyr::spark_connect(&amp;quot;local&amp;quot;, config = config)
  },
  
  # with memory=TRUE (the default)
  eager = {
    one &amp;lt;- sparklyr::spark_read_jdbc(
      sc = sc,
      name = &amp;quot;test&amp;quot;,
      options = append(jdbcConnectionOpts, list(dbtable = &amp;quot;test_table&amp;quot;))
    ) %&amp;gt;%
      filter(tailnum == &amp;quot;N14228&amp;quot;) %&amp;gt;%
      select(tailnum, distance) %&amp;gt;%
      compute(&amp;quot;test&amp;quot;)
  },

  # with memory=FALSE
  lazy = {
    two &amp;lt;- sparklyr::spark_read_jdbc(
      sc = sc,
      name = &amp;quot;test&amp;quot;,
      options = append(jdbcConnectionOpts, list(dbtable = &amp;quot;test_table&amp;quot;)),
      memory = FALSE
    ) %&amp;gt;% 
      filter(tailnum == &amp;quot;N14228&amp;quot;) %&amp;gt;%
      select(tailnum, distance) %&amp;gt;%
      compute(&amp;quot;test&amp;quot;)
  }
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Unit: seconds
#  expr       min       lq     mean   median       uq      max neval
# eager 15.460844 16.24838 17.07560 17.03592 17.88299 18.73005    10
#  lazy  9.821039 10.12435 10.40718 10.42766 10.70024 10.97283    10&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the “lazy” approach, which does not cache the entire table into memory, yielded the result in roughly 39% less time, judging by the medians (about 10.4 versus 17.0 seconds). This is of course by no means a representative benchmark for real-life data loads, but it can provide some insight into optimizing the loads.&lt;/p&gt;
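&lt;p&gt;The exact figure depends on which summary statistic is compared. For the medians printed above, it can be recomputed in base R as follows, using only those two numbers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Median timings in seconds, copied from the benchmark output
eager_median &amp;lt;- 17.03592
lazy_median  &amp;lt;- 10.42766

# Share of time saved by the lazy approach
round((eager_median - lazy_median) / eager_median, 3)
## [1] 0.388

# Speed-up factor of lazy over eager
round(eager_median / lazy_median, 2)
## [1] 1.63&lt;/code&gt;&lt;/pre&gt;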
&lt;/div&gt;
&lt;div id=&#34;partitioning&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Partitioning&lt;/h2&gt;
&lt;p&gt;Partitioning the data can bring a very significant performance boost and we will look into setting it up and optimizing it in detail in a separate article.&lt;/p&gt;
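&lt;p&gt;As a rough preview only, Spark’s JDBC data source exposes the options &lt;code&gt;partitionColumn&lt;/code&gt;, &lt;code&gt;lowerBound&lt;/code&gt;, &lt;code&gt;upperBound&lt;/code&gt;, and &lt;code&gt;numPartitions&lt;/code&gt; for parallel reads. The sketch below assumes that the numeric &lt;code&gt;month&lt;/code&gt; column of our &lt;code&gt;test_table&lt;/code&gt; is a sensible column to split on and that four partitions are appropriate; both choices, as well as the object names, are illustrative and not tuned recommendations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Partitioning options for a parallel JDBC read (illustrative values)
jdbcPartitionOpts &amp;lt;- list(
  partitionColumn = &amp;quot;month&amp;quot;,
  lowerBound = &amp;quot;1&amp;quot;,
  upperBound = &amp;quot;12&amp;quot;,
  numPartitions = &amp;quot;4&amp;quot;
)

# Combine with the connection options and the table to read
partitionedOpts &amp;lt;- c(
  list(
    url = &amp;quot;jdbc:mysql://localhost:3306/testdb&amp;quot;,
    driver = &amp;quot;com.mysql.cj.jdbc.Driver&amp;quot;,
    user = &amp;quot;rstudio&amp;quot;,
    password = &amp;quot;pass&amp;quot;,
    dbtable = &amp;quot;test_table&amp;quot;
  ),
  jdbcPartitionOpts
)

# Usage, assuming an active connection `sc`:
# sparklyr::spark_read_jdbc(
#   sc, name = &amp;quot;test_table_part&amp;quot;, options = partitionedOpts, memory = FALSE
# )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;According to the Spark documentation, the two bounds only determine the partition stride and do not filter rows, and the partitioning options are meant to be used with &lt;code&gt;dbtable&lt;/code&gt; rather than &lt;code&gt;query&lt;/code&gt;.&lt;/p&gt;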
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;running-the-code-in-this-article&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Running the code in this article&lt;/h1&gt;
&lt;p&gt;If you have Docker available, running the following should yield a Docker container with RStudio Server exposed on port 8787, so you can open your web browser at &lt;code&gt;http://localhost:8787&lt;/code&gt; to access it and experiment with the code. The user name is &lt;code&gt;rstudio&lt;/code&gt; and the password is as you choose below:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run -d -p 8787:8787 -e PASSWORD=pass jozefhajnala/jozefio&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;JDBC Driver &lt;a href=&#34;https://vladmihalcea.com/jdbc-driver-connection-url-strings/&#34;&gt;Connection URL strings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;MS SQL Server: Programming Guide for JDBC - &lt;a href=&#34;https://docs.microsoft.com/en-us/sql/connect/jdbc/building-the-connection-url?view=sql-server-ver15&#34;&gt;Building the Connection URL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Oracle: Database JDBC Developer’s Guide and Reference - &lt;a href=&#34;https://docs.oracle.com/cd/B19306_01/java.102/b14355/urls.htm#JJDBC20000&#34;&gt;Data Sources and URLs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>A review of my experience with the Big Data Analysis with Scala and Spark course</title>
      <link>https://jozef.io/r924-big-data-spark-scala-review/</link>
      <pubDate>Sat, 25 Jul 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r924-big-data-spark-scala-review/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Apache Spark is an open-source distributed cluster-computing framework implemented in Scala. Its 1.0 release came out in 2014, and it has since become popular for many computing applications, including machine learning, thanks in part to its user-friendly APIs. The popularity also gave rise to many online courses of varied quality.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, I share my personal experience with completing the &lt;a href=&#34;https://www.coursera.org/learn/scala-spark-big-data&#34;&gt;Big Data Analysis with Scala and Spark&lt;/a&gt; course on Coursera in May 2020, briefly walk through the content and write about the course assignments. I wrote down each of the paragraphs as I went through the course, so it is not a retrospective evaluation but more of a “review-style diary” of the process of completing the course.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#contents&#34;&gt;Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#disclaimer-what-to-expect&#34;&gt;Disclaimer, what to expect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#course-organization-pre-course-preparatory-work&#34;&gt;Course organization, pre-course preparatory work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-1&#34;&gt;Week 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-2&#34;&gt;Week 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-3&#34;&gt;Week 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-4&#34;&gt;Week 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-give-me-the-overview&#34;&gt;TL;DR - Just give me the overview&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;disclaimer-what-to-expect&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Disclaimer, what to expect&lt;/h1&gt;
&lt;p&gt;First off, this post &lt;em&gt;does not mean to be an objective review&lt;/em&gt; as your experience will most likely be very different from mine. Before this course, I also completed one of the prerequisites - the &lt;a href=&#34;https://www.coursera.org/learn/progfun1?specialization=scala&#34;&gt;Functional Programming Principles in Scala course&lt;/a&gt;, which I &lt;a href=&#34;https://jozef.io/r923-fun-prog-in-scala-review/&#34;&gt;reviewed here&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is not a paid review and I have no affiliation nor any benefit whatsoever from Coursera or other parties from writing this review.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;course-organization-pre-course-preparatory-work&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Course organization, pre-course preparatory work&lt;/h1&gt;
&lt;div id=&#34;organization&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Organization&lt;/h2&gt;
&lt;p&gt;The course is organized into video sessions split into 4 weeks, but since it is fully online you can choose your own pace. I completed the course in one week while being on a standard working schedule. Each week apart from week 3 has a programming assignment that is submitted to Coursera and automatically graded.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I found the assignments executed very well from a technical perspective and had no issues at all with downloading, compiling, running, and submitting them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Similarly to the other courses in the specialization, you can submit each assignment as many times as you want, so there is no pressure to get the submission right on the first try. Once the course is completed, you get a certificate.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;pre-course-setup&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Pre-course setup&lt;/h2&gt;
&lt;p&gt;Since I have prior experience with Scala and sbt and I already completed a previous course in the specialization, there was no extra setup overhead.&lt;/p&gt;
&lt;p&gt;If you are an R user used to conveniently opening RStudio and easily installing packages, you may be surprised by the difficulty of the whole setup. The course does provide setup videos for major platforms, so with a bit of patience, you should be good to go.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-1&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 1&lt;/h1&gt;
&lt;div id=&#34;content&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;The first week very practically introduces Spark, the motivation behind it, and a comparison to Hadoop, especially for data science type applications and workflows. It presents the main collection class that Spark works with, the RDD, and provides a very useful comparison between the RDD API and the Scala collections API. This builds upon the topics covered in the previous courses, mainly the Functional Programming Principles in Scala course. It also very nicely covers the differences between transformations and actions on RDDs and how that relates to the differences in expression evaluation between the sequential collections and the lazy evaluation of transformations on RDDs.&lt;/p&gt;
&lt;p&gt;The content also covers cluster topology, how the driver and worker nodes are related, and what gets executed where. The importance of having data parallelized in such a way that there is little shuffling between the nodes is also highlighted.&lt;/p&gt;
&lt;p&gt;I found the video session on latency especially useful. It compares the speeds of different operations, e.g. referencing memory, reading from disk, and sending packets over the network, in very understandable terms, which motivates good practices in partitioning data and designing processes to minimize the time-expensive operations.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This was fantastic content and I binged it in one evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;assignment&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;The assignment is quite fun and practical. The goal is to use full-text data from Wikipedia to produce a very simple metric of how popular a programming language is.&lt;/p&gt;
&lt;p&gt;The only issue I had was that a lot of methods that should have been used were only introduced in the content of Week 2, so I had to study their documentation myself to implement the assignment. Had I known that they are introduced in detail in Week 2, I would have watched those sessions first before working on this assignment.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-2&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 2&lt;/h1&gt;
&lt;div id=&#34;content-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;Starts with explaining &lt;code&gt;foldLeft&lt;/code&gt;, &lt;code&gt;fold&lt;/code&gt;, and &lt;code&gt;aggregate&lt;/code&gt;. The explanations are very good, and it would have been great to have them for the first assignment. A structure similar to the first assignment is even mentioned, along with distributed key-value pairs (pair RDDs), which support &lt;code&gt;reduceByKey&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The later sessions introduce different available joins on pair RDDs, again showing examples, so the concepts are easy to understand. The explanations are very clear and detailed.&lt;/p&gt;
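&lt;p&gt;For R users, the difference between a &lt;code&gt;foldLeft&lt;/code&gt;-style and an &lt;code&gt;aggregate&lt;/code&gt;-style computation can be approximated with base R’s &lt;code&gt;Reduce()&lt;/code&gt;. This is only a local analogy, with a list of chunks standing in for RDD partitions, and not Spark code; the data and all names are made up for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Four words split into two chunks, standing in for two RDD partitions
partitions &amp;lt;- list(c(&amp;quot;scala&amp;quot;, &amp;quot;spark&amp;quot;), c(&amp;quot;r&amp;quot;, &amp;quot;sparklyr&amp;quot;))

# Sequential operator: the accumulator (integer) has a different type
# than the elements (character)
seq_op &amp;lt;- function(acc, word) acc + nchar(word)

# foldLeft-style: one strictly sequential pass over all elements
fold_left &amp;lt;- Reduce(seq_op, unlist(partitions), 0L)

# aggregate-style: apply seq_op within each partition, then merge the
# partial results with a separate combining operator
comb_op &amp;lt;- function(a, b) a + b
partials &amp;lt;- lapply(partitions, function(p) Reduce(seq_op, p, 0L))
aggregate_result &amp;lt;- Reduce(comb_op, partials, 0L)

c(fold_left, aggregate_result)
## [1] 19 19&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The combining operator is what allows the per-partition results to be computed independently and merged afterwards.&lt;/p&gt;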
&lt;/div&gt;
&lt;div id=&#34;assignment-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;This time the goal is to look at StackOverflow questions and answers data and apply k-means to cluster the content by languages. This was a very interesting and fun assignment.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Implementing it let me appreciate how R is amazing for exploratory and interactive data science work. Compared to R, debugging the Scala code was challenging, and writing data wrangling code to get the data into proper format took me hours.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For a comparison, here is the Scala code I wrote to get the data in requested formats:&lt;/p&gt;
&lt;pre class=&#34;scala&#34;&gt;&lt;code&gt;val langs = List(
  &amp;quot;JavaScript&amp;quot;, &amp;quot;Java&amp;quot;, &amp;quot;PHP&amp;quot;, &amp;quot;Python&amp;quot;, &amp;quot;C#&amp;quot;, &amp;quot;C++&amp;quot;, &amp;quot;Ruby&amp;quot;, &amp;quot;CSS&amp;quot;,
  &amp;quot;Objective-C&amp;quot;, &amp;quot;Perl&amp;quot;, &amp;quot;Scala&amp;quot;, &amp;quot;Haskell&amp;quot;, &amp;quot;MATLAB&amp;quot;, &amp;quot;Clojure&amp;quot;, &amp;quot;Groovy&amp;quot;
)
def langSpread = 50000

val lines = sc.textFile(&amp;quot;src/main/resources/stackoverflow/stackoverflow.csv&amp;quot;)
val raw   = rawPostings(lines)

/** Parse lines into proper structure */
def rawPostings(lines: RDD[String]): RDD[Posting] =
  lines.map(line =&amp;gt; {
    val arr = line.split(&amp;quot;,&amp;quot;)
    Posting(
      postingType =    arr(0).toInt,
      id =             arr(1).toInt,
      acceptedAnswer = if (arr(2) == &amp;quot;&amp;quot;) None else Some(arr(2).toInt),
      parentId =       if (arr(3) == &amp;quot;&amp;quot;) None else Some(arr(3).toInt),
      score =          arr(4).toInt,
      tags =           if (arr.length &amp;gt;= 6) Some(arr(5).intern()) else None
    )
})


/** Group the questions and answers together */
def groupedPostings(
  postings: RDD[Posting]
): RDD[(QID, Iterable[(Question, Answer)])] = {
  val questions = postings.
    filter(thisPosting =&amp;gt; thisPosting.postingType == 1).
    map(thisQuestion =&amp;gt; (thisQuestion.id, thisQuestion))
  val answers = postings.
    filter(thisPosting =&amp;gt; thisPosting.postingType == 2).
    map(thisAnswer =&amp;gt; (thisAnswer.parentId.get, thisAnswer))
  questions.join(answers).groupByKey()
}

/** Compute the maximum score for each posting */
def scoredPostings(
  grouped: RDD[(QID, Iterable[(Question, Answer)])]
): RDD[(Question, HighScore)] = {

  def answerHighScore(as: Array[Answer]): HighScore = {
    var highScore = 0
    var i = 0
    while (i &amp;lt; as.length) {
      val score = as(i).score
      if (score &amp;gt; highScore) highScore = score
      i += 1
    }
    highScore
  }

  grouped.map{
    case (_, qaList) =&amp;gt; (
        qaList.head._1,
        answerHighScore(qaList.map(x =&amp;gt; x._2).toArray)
    )
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r924-01-stackoverflow.png&#34; alt=&#34;Editing Scala in VS Code&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Editing Scala in VS Code&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;And here is data.table code that can reach very similar results:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

# Read Data -----
so &amp;lt;- fread(&amp;quot;http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv&amp;quot;)
colNames &amp;lt;- c(&amp;quot;postTypeId&amp;quot;, &amp;quot;id&amp;quot;, &amp;quot;acceptedAnswer&amp;quot;, &amp;quot;parentId&amp;quot;, &amp;quot;score&amp;quot;, &amp;quot;tag&amp;quot;)
setnames(so, colNames)

# Select questions and answers -----
que &amp;lt;- so[postTypeId == 1, .(queId = id, queTag = tag)]
ans &amp;lt;- so[postTypeId == 2, .(ansId = id, queId = parentId, ansScore = score)]
langSpread &amp;lt;- 50000L

langs = data.frame(
  index = (0:14) * langSpread,
  queTag = c(
    &amp;quot;JavaScript&amp;quot;, &amp;quot;Java&amp;quot;, &amp;quot;PHP&amp;quot;, &amp;quot;Python&amp;quot;, &amp;quot;C#&amp;quot;, &amp;quot;C++&amp;quot;, &amp;quot;Ruby&amp;quot;, &amp;quot;CSS&amp;quot;,
    &amp;quot;Objective-C&amp;quot;, &amp;quot;Perl&amp;quot;, &amp;quot;Scala&amp;quot;, &amp;quot;Haskell&amp;quot;, &amp;quot;MATLAB&amp;quot;, &amp;quot;Clojure&amp;quot;, &amp;quot;Groovy&amp;quot;
  )
)

# Merge into final object -----
mg &amp;lt;- merge(que, ans,  by = &amp;quot;queId&amp;quot;)
mg &amp;lt;- mg[, .(maxAnsScore = max(ansScore)), by = .(queId, queTag)]
mg &amp;lt;- merge(mg, langs)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some tweaks were also needed to make the grader happy, and since the grader output is not that detailed and there were no local unit tests provided, it took me quite a few submissions to get this right. All in all, it was a fun assignment and it highlighted how much simpler R is for this type of usage.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-3&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 3&lt;/h1&gt;
&lt;div id=&#34;content-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;This week focuses on partitioning and shuffling. The video lectures explain the concepts very well and even provide a practical hands-on example of how preventing shuffles can significantly improve the performance of operations on RDDs.&lt;/p&gt;
&lt;p&gt;It also looks at optimizing Spark operations with partitioners and at key differences between wide and narrow dependencies in the context of fault tolerance. Again a concrete example is provided along with the explanations, which I find very helpful.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;assignment-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;There is no assignment in Week 3.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-4&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 4&lt;/h1&gt;
&lt;div id=&#34;content-3&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;Once again an extremely useful set of sessions that introduce the &lt;code&gt;DataFrame&lt;/code&gt;, &lt;code&gt;Dataset&lt;/code&gt;, and Spark SQL APIs. Especially for R and Python users, this week’s content is great as the untyped APIs are those that pyspark and SparkR (and sparklyr) users will interact with the vast majority of the time. The sessions explain how these more high-level APIs relate to the typed RDD API and how the 2 main optimization tools, Catalyst and Tungsten, work to optimize the code that users send via the high-level APIs.&lt;/p&gt;
&lt;p&gt;There is also a benchmarking comparison that shows how RDD-based approaches, which are not optimized automatically, lose performance against the Spark SQL API, which optimizes the query such that even a query written inefficiently by the user executes very fast.&lt;/p&gt;
&lt;p&gt;Once again, a fantastic content session to wrap up the course.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;assignment-3&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;The final assignment of the course focuses on comparing the DataSet API with the DataFrame and Spark SQL APIs in a very practical manner. Based on data on how people spend their time split across categories such as primary needs, work, and spare time activities, we compute some aggregated statistics using the untyped DataFrame and SQL APIs and the typed DataSet API. I feel this assignment really shows the differences between the APIs well in a practical sense and also allows the student to implement each of the tasks more freely.&lt;/p&gt;
&lt;p&gt;Since I had previous experience with the DataFrame and Spark SQL APIs from working with them, I found this assignment much less challenging, but still seeing the three APIs in comparison was useful.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-overview&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the overview&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The course introduces Apache Spark and the key concepts in a very understandable and practical way&lt;/li&gt;
&lt;li&gt;The feel of the course was very hands-on and well-executed, the explanations very clear, making use of practical examples&lt;/li&gt;
&lt;li&gt;The assignments are fun, each of them working with a real-life set of data and exploring different Spark concepts and APIs&lt;/li&gt;
&lt;li&gt;Overall I was very happy with the course and would love to see a more in-depth sequel&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Exploring and plotting positional ice hockey data on goals, penalties and more from R with the {nhlapi} package</title>
      <link>https://jozef.io/r400-nhlapi-positional-data/</link>
      <pubDate>Sat, 04 Jul 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r400-nhlapi-positional-data/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The National Hockey League (NHL), founded in 1917, is considered to be the premier professional ice hockey league in the world. As in many other sports, the data about teams, players, games, and more are a great resource to dive into and analyze using modern software tools. Thanks to the open NHL API, the data is accessible to everyone and the &lt;code&gt;{nhlapi}&lt;/code&gt; R package aims to make that data readily available for analysis to R users.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will use the &lt;code&gt;{nhlapi}&lt;/code&gt; R package to explore the positional data on in-game events, which will provide us with information on the plays that happened in matches and where they happened in terms of the position on the rink. We will also show ways to plot that information using 2D density charts with &lt;code&gt;{ggplot2}&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#installing-the-nhlapi-package&#34;&gt;Installing the {nhlapi} package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#retrieving-basic-game-information&#34;&gt;Retrieving basic game information&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#getting-detailed-events-data-for-a-game&#34;&gt;Getting detailed events data for a game&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#more-involved-data-retrieval---many-games-in-parallel&#34;&gt;More involved data retrieval - many games in parallel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#processing-and-plotting-positional-data&#34;&gt;Processing and plotting positional data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#some-examples-of-rendered-images&#34;&gt;Some examples of rendered images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;installing-the-nhlapi-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Installing the {nhlapi} package&lt;/h1&gt;
&lt;p&gt;We can install &lt;code&gt;{nhlapi}&lt;/code&gt; from CRAN. It has only 1 recursive dependency, so the installation is very light and swift. Alternatively, we can also install the latest development version from the master branch on GitHub using the &lt;code&gt;{remotes}&lt;/code&gt; or &lt;code&gt;{devtools}&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Current CRAN version:
install.packages(&amp;quot;nhlapi&amp;quot;)

# Development version from GitHub
#  devtools::install_github(&amp;quot;jozefhajnala/nhlapi&amp;quot;)
#  remotes::install_github(&amp;quot;jozefhajnala/nhlapi&amp;quot;)

library(nhlapi)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we attach the package using &lt;code&gt;library()&lt;/code&gt; or &lt;code&gt;require()&lt;/code&gt; and can start exploring the data. All the relevant functions start with the &lt;code&gt;nhl_&lt;/code&gt; prefix so they are easy to find and are well documented, so we can get help by using the &lt;code&gt;help()&lt;/code&gt; function in R. For example, in this post we will look at the detailed games’ data, so running &lt;code&gt;help(nhl_games)&lt;/code&gt; will provide us with detailed information on the available functions.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;retrieving-basic-game-information&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Retrieving basic game information&lt;/h1&gt;
&lt;p&gt;To look at a quick example, we will explore the very first game in the regular season 2017/2018, in which the Toronto Maple Leafs played against the Winnipeg Jets. First, let’s look at the very basic game results using the &lt;code&gt;nhl_games_linescore()&lt;/code&gt; function which retrieves a very limited amount of high-level information:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;linescore &amp;lt;- nhlapi::nhl_games_linescore(gameIds = 2017020001)[[1]]

# Look at quick info on periods
linescore$periods&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  periodType            startTime              endTime num ordinalNum
1    REGULAR 2017-10-04T23:17:19Z 2017-10-04T23:58:23Z   1        1st
2    REGULAR 2017-10-05T00:16:56Z 2017-10-05T00:54:10Z   2        2nd
3    REGULAR 2017-10-05T01:12:37Z 2017-10-05T01:50:38Z   3        3rd
  home.goals home.shotsOnGoal home.rinkSide away.goals away.shotsOnGoal
1          0               17         right          3               11
2          0               10          left          1                8
3          2               10         right          3               12
  away.rinkSide
1          left
2         right
3          left&lt;/code&gt;&lt;/pre&gt;
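&lt;p&gt;As a small illustration of working with this output, the sketch below rebuilds only the goal columns printed above by hand (the names &lt;code&gt;periods&lt;/code&gt; and &lt;code&gt;finalScore&lt;/code&gt; are just for this example) and sums them per team to get the final score. With live data, we would use &lt;code&gt;linescore$periods&lt;/code&gt; directly instead.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Goal counts per period, copied from the printed output above
periods &amp;lt;- data.frame(
  num = 1:3,
  home.goals = c(0, 0, 2),
  away.goals = c(3, 1, 3)
)

# The final score is the sum of goals over the three periods
finalScore &amp;lt;- colSums(periods[, c(&amp;quot;home.goals&amp;quot;, &amp;quot;away.goals&amp;quot;)])
finalScore&lt;/code&gt;&lt;/pre&gt;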
&lt;/div&gt;
&lt;div id=&#34;getting-detailed-events-data-for-a-game&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting detailed events data for a game&lt;/h1&gt;
&lt;p&gt;Now to something more interesting: let’s investigate what plays were made during the game and where on the ice they happened. We can use &lt;code&gt;nhl_games_feed()&lt;/code&gt; to get the most detailed game data available in the API. To get a picture of the amount of detail, we can print the structure of the retrieved object limited to 3 levels of depth:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gameIds &amp;lt;- 2017020001
gameFeed &amp;lt;- nhlapi::nhl_games_feed(gameIds = gameIds)[[1]]
str(gameFeed, max.level = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;List of 6
 $ copyright: chr &amp;quot;NHL and the NHL Shield are registered trademarks of the National Hockey League. NHL and NHL team marks are the &amp;quot;| __truncated__
 $ gamePk   : int 2017020001
 $ link     : chr &amp;quot;/api/v1/game/2017020001/feed/live&amp;quot;
 $ metaData :List of 2
  ..$ wait     : int 10
  ..$ timeStamp: chr &amp;quot;20171006_173713&amp;quot;
 $ gameData :List of 6
  ..$ game    :List of 3
  .. ..$ pk    : int 2017020001
  .. ..$ season: chr &amp;quot;20172018&amp;quot;
  .. ..$ type  : chr &amp;quot;R&amp;quot;
  ..$ datetime:List of 2
  .. ..$ dateTime   : chr &amp;quot;2017-10-04T23:00:00Z&amp;quot;
  .. ..$ endDateTime: chr &amp;quot;2017-10-05T01:50:41Z&amp;quot;
  ..$ status  :List of 5
  .. ..$ abstractGameState: chr &amp;quot;Final&amp;quot;
  .. ..$ codedGameState   : chr &amp;quot;7&amp;quot;
  .. ..$ detailedState    : chr &amp;quot;Final&amp;quot;
  .. ..$ statusCode       : chr &amp;quot;7&amp;quot;
  .. ..$ startTimeTBD     : logi FALSE
  ..$ teams   :List of 2
  .. ..$ away:List of 16
  .. ..$ home:List of 16
  ..$ players :List of 45
  .. ..$ ID8474709:List of 22
  .. ..$ ID8473618:List of 22
  .. ..$ ID8471218:List of 22
  .. ..$ ID8470828:List of 21
  .. ..$ ID8477939:List of 22
  .. ..$ ID8476945:List of 22
  .. ..$ ID8473412:List of 22
  .. ..$ ID8475716:List of 21
  .. ..$ ID8476941:List of 22
  .. ..$ ID8476469:List of 21
  .. ..$ ID8477359:List of 22
  .. ..$ ID8479339:List of 21
  .. ..$ ID8479318:List of 22
  .. ..$ ID8476410:List of 22
  .. ..$ ID8475883:List of 21
  .. ..$ ID8474574:List of 22
  .. ..$ ID8477940:List of 21
  .. ..$ ID8473463:List of 21
  .. ..$ ID8477464:List of 22
  .. ..$ ID8473461:List of 22
  .. ..$ ID8476392:List of 22
  .. ..$ ID8466139:List of 22
  .. ..$ ID8470834:List of 22
  .. ..$ ID8468575:List of 22
  .. ..$ ID8477429:List of 22
  .. ..$ ID8468493:List of 22
  .. ..$ ID8474037:List of 22
  .. ..$ ID8475786:List of 22
  .. ..$ ID8470611:List of 22
  .. ..$ ID8476853:List of 22
  .. ..$ ID8477448:List of 21
  .. ..$ ID8477504:List of 22
  .. ..$ ID8479458:List of 21
  .. ..$ ID8477015:List of 22
  .. ..$ ID8475179:List of 21
  .. ..$ ID8476885:List of 22
  .. ..$ ID8475279:List of 22
  .. ..$ ID8473574:List of 22
  .. ..$ ID8476460:List of 22
  .. ..$ ID8475098:List of 22
  .. ..$ ID8474581:List of 22
  .. ..$ ID8478483:List of 22
  .. ..$ ID8475172:List of 22
  .. ..$ ID8480158:List of 21
  .. ..$ ID8479293:List of 22
  ..$ venue   :List of 3
  .. ..$ id  : int 5058
  .. ..$ name: chr &amp;quot;Bell MTS Place&amp;quot;
  .. ..$ link: chr &amp;quot;/api/v1/venues/5058&amp;quot;
 $ liveData :List of 4
  ..$ plays    :List of 5
  .. ..$ allPlays     :&amp;#39;data.frame&amp;#39;:    312 obs. of  28 variables:
  .. ..$ scoringPlays : int [1:9] 93 108 112 157 225 269 284 286 290
  .. ..$ penaltyPlays : int [1:12] 21 43 66 86 117 148 167 183 247 253 ...
  .. ..$ playsByPeriod:&amp;#39;data.frame&amp;#39;:    3 obs. of  3 variables:
  .. ..$ currentPlay  :List of 3
  ..$ linescore:List of 10
  .. ..$ currentPeriod             : int 3
  .. ..$ currentPeriodOrdinal      : chr &amp;quot;3rd&amp;quot;
  .. ..$ currentPeriodTimeRemaining: chr &amp;quot;Final&amp;quot;
  .. ..$ periods                   :&amp;#39;data.frame&amp;#39;:   3 obs. of  11 variables:
  .. ..$ shootoutInfo              :List of 2
  .. ..$ teams                     :List of 2
  .. ..$ powerPlayStrength         : chr &amp;quot;Even&amp;quot;
  .. ..$ hasShootout               : logi FALSE
  .. ..$ intermissionInfo          :List of 3
  .. ..$ powerPlayInfo             :List of 3
  ..$ boxscore :List of 2
  .. ..$ teams    :List of 2
  .. ..$ officials:&amp;#39;data.frame&amp;#39;:    4 obs. of  4 variables:
  ..$ decisions:List of 5
  .. ..$ winner    :List of 3
  .. ..$ loser     :List of 3
  .. ..$ firstStar :List of 3
  .. ..$ secondStar:List of 3
  .. ..$ thirdStar :List of 3
 - attr(*, &amp;quot;url&amp;quot;)= chr &amp;quot;https://statsapi.web.nhl.com/api/v1/game/2017020001/feed/live&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s finally look at the data on plays. We can access those via the &lt;code&gt;allPlays&lt;/code&gt; data.frame inside the element &lt;code&gt;plays&lt;/code&gt; of &lt;code&gt;liveData&lt;/code&gt;. The code chunk below will store those in a separate data.frame called &lt;code&gt;plays&lt;/code&gt;. We can then filter based on &lt;code&gt;result.event&lt;/code&gt; to look, for instance, only at goals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plays &amp;lt;- gameFeed$liveData$plays$allPlays
goals &amp;lt;- plays[plays$result.event == &amp;quot;Goal&amp;quot;, ]

# Selecting limited columns to keep the print reasonable
goals[, c(2, 5, 6, 12, 15, 18, 26, 23, 24)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     result.event
## 94          Goal
## 109         Goal
## 113         Goal
## 158         Goal
## 226         Goal
## 270         Goal
## 285         Goal
## 287         Goal
## 291         Goal
##                                                                     result.description
## 94        Nazem Kadri (1) Wrist Shot, assists: James van Riemsdyk (1), Tyler Bozak (1)
## 109                        James van Riemsdyk (1) Wrist Shot, assists: Tyler Bozak (2)
## 113   William Nylander (1) Wrist Shot, assists: Jake Gardiner (1), Auston Matthews (1)
## 158    Patrick Marleau (1) Backhand, assists: Auston Matthews (2), Mitchell Marner (1)
## 226          Patrick Marleau (2) Wrist Shot, assists: Nazem Kadri (1), Leo Komarov (1)
## 270 Mitchell Marner (1) Wrist Shot, assists: James van Riemsdyk (2), Morgan Rielly (1)
## 285      Mark Scheifele (1) Snap Shot, assists: Patrik Laine (1), Dustin Byfuglien (1)
## 287       Auston Matthews (1) Tip-In, assists: Connor Carrick (1), Andreas Borgman (1)
## 291                        Mathieu Perreault (1) Wrist Shot, assists: Bryan Little (1)
##     result.secondaryType result.strength.name about.period
## 94            Wrist Shot           Power Play            1
## 109           Wrist Shot                 Even            1
## 113           Wrist Shot                 Even            1
## 158             Backhand                 Even            2
## 226           Wrist Shot                 Even            3
## 270           Wrist Shot           Power Play            3
## 285            Snap Shot                 Even            3
## 287               Tip-In                 Even            3
## 291           Wrist Shot                 Even            3
##     about.periodTime           team.name coordinates.x coordinates.y
## 94             15:45 Toronto Maple Leafs            84            -6
## 109            17:40 Toronto Maple Leafs            62             5
## 113            18:23 Toronto Maple Leafs            84           -22
## 158            08:32 Toronto Maple Leafs           -82             2
## 226            00:36 Toronto Maple Leafs            68            12
## 270            08:07 Toronto Maple Leafs            85            -6
## 285            11:31       Winnipeg Jets           -82             8
## 287            11:57 Toronto Maple Leafs            84            -3
## 291            12:57       Winnipeg Jets           -80             1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can see that there are many columns, among them &lt;code&gt;coordinates.x&lt;/code&gt; and &lt;code&gt;coordinates.y&lt;/code&gt; which tell us the location of the play on the rink, where &lt;code&gt;[0, 0]&lt;/code&gt; is the center of the rink.&lt;/p&gt;
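&lt;p&gt;To give a feel for what the coordinates allow, here is a rough sketch that estimates how far from the net each of the goals above was scored. It copies the coordinates from the printed output and relies on two assumptions that do not come from the API: that the goal lines sit 89 feet from the center of the rink, and that each goal was scored on the net at the end of the rink where the shot was taken. The names &lt;code&gt;goalCoords&lt;/code&gt; and &lt;code&gt;goalLineX&lt;/code&gt; are made up for this example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Coordinates of the 9 goals, copied from the output above
goalCoords &amp;lt;- data.frame(
  x = c(84, 62, 84, -82, 68, 85, -82, 84, -80),
  y = c(-6, 5, -22, 2, 12, -6, 8, -3, 1)
)

# Assumption: goal lines are 89 feet from the center of the rink
goalLineX &amp;lt;- 89

# Mirror all shots to one end of the rink and measure the distance to the net
goalCoords$distance &amp;lt;- sqrt((goalLineX - abs(goalCoords$x))^2 + goalCoords$y^2)
round(goalCoords$distance, 1)&lt;/code&gt;&lt;/pre&gt;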
&lt;/div&gt;
&lt;div id=&#34;more-involved-data-retrieval---many-games-in-parallel&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;More involved data retrieval - many games in parallel&lt;/h1&gt;
&lt;p&gt;Now that we know how to look at the positional data for a single match, an interesting next question is where plays happen overall. We will now investigate and plot where different plays happened in the 2017/2018 regular season. Looking at &lt;code&gt;?nhl_games&lt;/code&gt; we see that for regular seasons we can usually get all the gameIds in the interval &lt;code&gt;2017020001:2017021271&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Define the game ids
gameIds &amp;lt;- 2017020001:2017021271

# Retrieve the data
gameFeeds &amp;lt;- nhlapi::nhl_games_feed(gameIds)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To retrieve the data a bit faster, we can also use the &lt;code&gt;parallel&lt;/code&gt; package which is part of the base R installation to retrieve the data in parallel, for example, like so.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Define the game ids
gameIds &amp;lt;- 2017020001:2017021271

# Create a local cluster 
cl &amp;lt;- parallel::makeCluster(parallel::detectCores() / 2)

# Retrieve the data using nhlapi::nhl_games_feed()
gameFeeds &amp;lt;- parallel::parLapplyLB(cl, gameIds, nhlapi::nhl_games_feed)

# Stop the cluster
parallel::stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
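&lt;p&gt;When retrieving more than a thousand games, a single failed request should ideally not abort the whole run. One possible way to guard against that, sketched here with a stand-in function instead of real API calls, is to wrap the retrieval in &lt;code&gt;tryCatch()&lt;/code&gt;. The names &lt;code&gt;safeFetch&lt;/code&gt; and &lt;code&gt;fakeFetch&lt;/code&gt; are made up for this example and are not part of &lt;code&gt;{nhlapi}&lt;/code&gt;. Whether such a wrapper is needed at all depends on how the retrieval function handles errors itself.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical wrapper: return NULL instead of failing when a fetch errors
safeFetch &amp;lt;- function(id, fetch) {
  tryCatch(fetch(id), error = function(e) NULL)
}

# Stand-in for a retrieval function that fails for one particular id
fakeFetch &amp;lt;- function(id) {
  if (id == 2) stop(&amp;quot;request failed&amp;quot;)
  list(gamePk = id)
}

results &amp;lt;- lapply(1:3, safeFetch, fetch = fakeFetch)

# Ids for which the retrieval failed and could be retried later
failedIds &amp;lt;- which(vapply(results, is.null, logical(1)))
failedIds&lt;/code&gt;&lt;/pre&gt;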
&lt;p&gt;Now we have the data retrieved in a list called &lt;code&gt;gameFeeds&lt;/code&gt;. It might be wise to store it on disk such that we do not have to do the long retrieval all the time, for example using &lt;code&gt;saveRDS()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;saveRDS(gameFeeds, file.path(&amp;quot;~&amp;quot;, &amp;quot;gamefeeds_regular_2017.rds&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
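&lt;p&gt;To avoid repeating the retrieval, the stored file can be read back with &lt;code&gt;readRDS()&lt;/code&gt;. A minimal sketch of a read-or-compute pattern follows. The helper &lt;code&gt;cachedResult&lt;/code&gt; is hypothetical and the example uses a temporary file with a cheap computation in place of the real retrieval.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: read the result from disk if present,
# otherwise compute it and save it for next time
cachedResult &amp;lt;- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  res &amp;lt;- compute()
  saveRDS(res, path)
  res
}

# Example with a temporary file and a cheap computation
cachePath &amp;lt;- tempfile(fileext = &amp;quot;.rds&amp;quot;)
firstRun &amp;lt;- cachedResult(cachePath, function() list(a = 1))

# The second call reads from disk, so compute() is never called
secondRun &amp;lt;- cachedResult(cachePath, function() stop(&amp;quot;not called&amp;quot;))&lt;/code&gt;&lt;/pre&gt;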
&lt;/div&gt;
&lt;div id=&#34;processing-and-plotting-positional-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Processing and plotting positional data&lt;/h1&gt;
&lt;p&gt;Now that the data is safely retrieved, we can process and prepare the data on plays for plotting.&lt;/p&gt;
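&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Retrieve the data frames with plays from the data
# Note: gm[[1L]] assumes each element of gameFeeds is itself a one-element
# list, as returned by the parLapplyLB() call above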
getPlaysDf &amp;lt;- function(gm) {
  playsRes &amp;lt;- try(gm[[1L]][[&amp;quot;liveData&amp;quot;]][[&amp;quot;plays&amp;quot;]][[&amp;quot;allPlays&amp;quot;]])
  if (inherits(playsRes, &amp;quot;try-error&amp;quot;)) data.frame() else playsRes
}
plays &amp;lt;- lapply(gameFeeds, getPlaysDf)

# Bind the list into a single data frame
plays &amp;lt;- nhlapi:::util_rbindlist(plays)

# Keep only the records that have coordinates
plays &amp;lt;- plays[!is.na(plays$coordinates.x), ]

# Move the coordinates to non-negative values before plotting
plays$coordx &amp;lt;- plays$coordinates.x + abs(min(plays$coordinates.x))
plays$coordy &amp;lt;- plays$coordinates.y + abs(min(plays$coordinates.y))&lt;/code&gt;&lt;/pre&gt;
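&lt;p&gt;The binding step above relies on an internal helper of the package. For readers who would rather not call internal functions, a rough base R sketch of binding data frames with differing columns could look as follows. The helper &lt;code&gt;bindFill&lt;/code&gt; is made up for this example and is not the package’s implementation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: drop empty data frames, add missing columns as NA,
# then bind the rows
bindFill &amp;lt;- function(dfs) {
  dfs &amp;lt;- Filter(function(df) nrow(df) &amp;gt; 0, dfs)
  allCols &amp;lt;- unique(unlist(lapply(dfs, names)))
  filled &amp;lt;- lapply(dfs, function(df) {
    for (col in setdiff(allCols, names(df))) df[[col]] &amp;lt;- NA
    df[, allCols, drop = FALSE]
  })
  do.call(rbind, filled)
}

bound &amp;lt;- bindFill(list(
  data.frame(event = &amp;quot;Goal&amp;quot;, x = 84),
  data.frame(event = &amp;quot;Faceoff&amp;quot;),
  data.frame()
))
bound&lt;/code&gt;&lt;/pre&gt;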
&lt;p&gt;Now that we have the data ready in a &lt;code&gt;plays&lt;/code&gt; data.frame, we can finally create some cool plots. As an example, the following chunk uses the popular &lt;code&gt;ggplot2&lt;/code&gt; package to plot densities and events, which would yield results similar to the ones shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)

# Look at goals only
goals &amp;lt;- plays[plays$result.event == &amp;quot;Goal&amp;quot;, ]

ggplot(goals, aes(x = coordx, y = coordy)) +
  labs(title = &amp;quot;Where are goals scored from&amp;quot;) +
  geom_point(alpha = 0.1, size = 0.2) +
  xlim(0, 198) + ylim(0, 84) +
  geom_density_2d_filled(alpha = 0.35, show.legend = FALSE) +
  theme_void()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;some-examples-of-rendered-images&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Some examples of rendered images&lt;/h1&gt;
&lt;p&gt;With a bit of effort, we can also add a background image of the ice hockey rink to make the density plots more relatable and arrive at some quite informative plots:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;../img/r400-01-nhl-penalties-2016-2018.png&#34; alt=&#34;Penalties - NHL Regular Seasons 2016/2017 and 2017/2018&#34; /&gt;
&lt;img src=&#34;../img/r400-02-nhl-shots-2016-2018.png&#34; alt=&#34;Shots - NHL Regular Seasons 2016/2017 and 2017/2018&#34; /&gt;
&lt;img src=&#34;../img/r400-03-nhl-goals-2016-2018.png&#34; alt=&#34;Goals - NHL Regular Seasons 2016/2017 and 2017/2018&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Happy exploring!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://cran.r-project.org/package=nhlapi&#34;&gt;&lt;code&gt;{nhlapi}&lt;/code&gt; package&lt;/a&gt; on CRAN&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/jozefhajnala/nhlapi&#34;&gt;&lt;code&gt;{nhlapi}&lt;/code&gt; package&lt;/a&gt; on GitHub&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>A review of my experience with the Functional Programming Principles in Scala course</title>
      <link>https://jozef.io/r923-fun-prog-in-scala-review/</link>
      <pubDate>Sat, 13 Jun 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r923-fun-prog-in-scala-review/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Functional programming is a programming paradigm where programs are constructed by applying and composing functions. It is quite popular in data science applications because of some of its useful properties that can help, for example, with scaling computations. One well-known resource to get into functional programming is the Functional Programming Principles in Scala course by École Polytechnique Fédérale de Lausanne.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, I share my personal experience with completing the Functional programming in Scala course on Coursera in May 2020, briefly walk through the content and write about the course assignments. I wrote down each of the paragraphs as I went through the course, so it is not a retrospective evaluation but more of a “review-style diary” of the process of completing the course. Since this blog is oriented towards R, I will also try to make parallels with the R environment that can be relatable to R users and developers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#disclaimer-what-to-expect&#34;&gt;Disclaimer, what to expect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#course-organization-pre-course-preparatory-work&#34;&gt;Course organization, pre-course preparatory work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-1&#34;&gt;Week 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-2&#34;&gt;Week 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-3&#34;&gt;Week 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#week-4&#34;&gt;Week 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#weeks-5-and-6&#34;&gt;Weeks 5 and 6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#course-execution-and-technical-notes&#34;&gt;Course execution and technical notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-give-me-the-overview&#34;&gt;TL;DR - Just give me the overview&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;disclaimer-what-to-expect&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Disclaimer, what to expect&lt;/h1&gt;
&lt;p&gt;First off, this post &lt;em&gt;does not mean to be an objective review&lt;/em&gt; as your experience will most likely be very different to mine, based on&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;your expectations coming into the course,&lt;/li&gt;
&lt;li&gt;your prior background, and&lt;/li&gt;
&lt;li&gt;your prior experience with Scala and sbt&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I expected to get deeper and more structured knowledge for practical use of Scala in relation to data science and functional programming, as my prior exposure to Scala was mostly maintaining and fixing code in an already established code base and creating Spark extensions to work with Spark from R.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I wrote these comments as I went through the content and assignments instead of after finishing, so you might find the tone of the entire article change as the weeks change.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;course-organization-pre-course-preparatory-work&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Course organization, pre-course preparatory work&lt;/h1&gt;
&lt;div id=&#34;organization&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Organization&lt;/h2&gt;
&lt;p&gt;The course is organized into video sessions split into 6 weeks, but since it is fully online you can choose your own pace. I completed the course in two weeks while being on a standard working schedule. Each week apart from week 5 has a programming assignment that is submitted to Coursera and automatically graded.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I found the assignments executed very well from a technical perspective and had no issues at all with downloading, compiling, running, and submitting them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can submit each assignment as many times as you want, so there is no pressure to get the submission right on the first try. Once the course is completed, you get a certificate.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;pre-course-setup&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Pre-course setup&lt;/h2&gt;
&lt;p&gt;Since I have prior experience with Scala and sbt, the preparation was not difficult and I was able to get going quickly.&lt;/p&gt;
&lt;p&gt;If you are an R user used to conveniently opening RStudio and easily installing packages, you may be surprised by the difficulty of the whole setup. The course does provide setup videos for major platforms, so with a bit of patience, you should be good to go.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-1&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 1&lt;/h1&gt;
&lt;div id=&#34;content&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;Week 1 briefly presents basic programming paradigms and concepts, the model of evaluation of expressions, and the call by name and call by value strategies. It then focuses on recursion, also introducing tail recursion.&lt;/p&gt;
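&lt;p&gt;For R users, call by name may feel familiar: R evaluates function arguments lazily, only when they are first used, and then caches the value (which is closer to what is usually called call by need). A tiny sketch of the difference, with function names made up for this example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The second argument is never used, so it is never evaluated
firstArg &amp;lt;- function(x, y) x
lazyResult &amp;lt;- firstArg(1, stop(&amp;quot;this is never evaluated&amp;quot;))

# force() evaluates the argument right away, mimicking call by value
firstArgStrict &amp;lt;- function(x, y) {
  force(y)
  x
}
strictResult &amp;lt;- tryCatch(
  firstArgStrict(1, stop(&amp;quot;evaluated eagerly&amp;quot;)),
  error = function(e) &amp;quot;error raised&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;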
&lt;/div&gt;
&lt;div id=&#34;assignment&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;The assignment is purely recursion oriented - &lt;a href=&#34;https://en.wikipedia.org/wiki/Pascal%27s_triangle&#34;&gt;Pascal’s triangle&lt;/a&gt; and such. I was able to complete the assignment easily, even though I found the final exercise challenging. My issue with the content was that this felt more like school homework and I was coming into the course looking to improve and gain practical skill with Scala.&lt;/p&gt;
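&lt;p&gt;To give R users an idea of the flavor of these exercises, here is a sketch of the Pascal’s triangle recursion written in R. It is not the course’s reference solution, and the function and argument names are my own for this example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Value at column `col` and row `row` of Pascal&#39;s triangle, both counted from 0
pascal &amp;lt;- function(col, row) {
  if (col == 0 || col == row) {
    1
  } else {
    pascal(col - 1, row - 1) + pascal(col, row - 1)
  }
}

# All values in the row with index 4
fifthRow &amp;lt;- vapply(0:4, pascal, numeric(1), row = 4)
fifthRow&lt;/code&gt;&lt;/pre&gt;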
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-2&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 2&lt;/h1&gt;
&lt;div id=&#34;content-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;The lectures won me over with constructing a custom class for working with rational numbers. This immediately clicked with me and also was very useful, because it walked through creating classes, defining methods and operators, constructors, requirements, and assertions in a very concise and practical way.&lt;/p&gt;
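&lt;p&gt;As a parallel for R users, a rough S3 sketch of the same idea could look like the following. All names here (&lt;code&gt;gcd&lt;/code&gt;, &lt;code&gt;rational&lt;/code&gt;) are made up for this example, and it is not a translation of the course code. &lt;code&gt;stopifnot()&lt;/code&gt; stands in for the requirement on the denominator and the &lt;code&gt;+&lt;/code&gt; method for the operator definition.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Greatest common divisor, computed recursively
gcd &amp;lt;- function(a, b) if (b == 0) abs(a) else gcd(b, a %% b)

# Constructor: checks the requirement and stores the reduced fraction
rational &amp;lt;- function(numer, denom = 1) {
  stopifnot(denom != 0)
  g &amp;lt;- gcd(numer, denom)
  structure(list(numer = numer / g, denom = denom / g), class = &amp;quot;rational&amp;quot;)
}

# Addition operator for objects of class rational
`+.rational` &amp;lt;- function(e1, e2) {
  rational(e1$numer * e2$denom + e2$numer * e1$denom, e1$denom * e2$denom)
}

format.rational &amp;lt;- function(x, ...) paste0(x$numer, &amp;quot;/&amp;quot;, x$denom)

total &amp;lt;- rational(1, 2) + rational(1, 3)
format(total)&lt;/code&gt;&lt;/pre&gt;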
&lt;/div&gt;
&lt;div id=&#34;assignment-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;A different story. The introduction to the problems goes something like&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“We represent a set of integers by its characteristic function and define a type alias for this representation.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had a bit of intuition around this (and if you come from a CS background this may come as second nature), but if you have neither of those, you might have some terminology to study before you even understand what the assignment is about. This is fine if it is in line with your expectations of the course, but if you came for something practical, thinking about programming recursive transformations on integer sets represented via &lt;a href=&#34;https://en.wikipedia.org/wiki/Indicator_function&#34;&gt;characteristic functions&lt;/a&gt; may not seem like the best investment of your time.&lt;/p&gt;
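&lt;p&gt;To make the terminology a bit more concrete for R users: representing a set by its characteristic function simply means that the set &lt;em&gt;is&lt;/em&gt; a function that returns &lt;code&gt;TRUE&lt;/code&gt; for members and &lt;code&gt;FALSE&lt;/code&gt; otherwise. A minimal sketch in R, with names of my own choosing rather than those of the assignment:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A set of integers is a function answering: does x belong to the set?
singletonSet &amp;lt;- function(elem) function(x) x == elem
unionSet &amp;lt;- function(s, t) function(x) s(x) || t(x)
intersectSet &amp;lt;- function(s, t) function(x) s(x) &amp;amp;&amp;amp; t(x)
contains &amp;lt;- function(s, x) s(x)

oneOrTwo &amp;lt;- unionSet(singletonSet(1), singletonSet(2))
evens &amp;lt;- function(x) x %% 2 == 0
twoOnly &amp;lt;- intersectSet(oneOrTwo, evens)

contains(oneOrTwo, 1)
contains(twoOnly, 1)&lt;/code&gt;&lt;/pre&gt;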
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-3&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 3&lt;/h1&gt;
&lt;div id=&#34;content-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;Has a nice explanation of singleton objects and finally, we look at organizing classes, traits, and objects into packages, very nice. Until we are back to recursion again, this time on binary trees.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;assignment-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;Writing recursive methods on &lt;a href=&#34;https://en.wikipedia.org/wiki/Tree_traversal#Depth-first_search_of_binary_tree&#34;&gt;binary trees&lt;/a&gt;. I spent way more time thinking about how to do it than programming. The methods are very short once done, but complex to think about, especially if you are not used to thinking about recursive traversal of binary trees.&lt;/p&gt;
&lt;p&gt;Also, it got frustrating. The assignment tests were failing because of some predefined timeout that is hard to reproduce locally (you don’t know what tests the grader runs), so you only find out when you submit. The fix was to make a recursive method more efficient by placing brackets better around infix operators, which I honestly would not have figured out without reading through the course’s discussion board.&lt;/p&gt;
&lt;p&gt;Especially frustrating about this is that the video content only covered trivial cases and the assignment asked for far more complex problems. I was close to just quitting the course at this point.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;week-4&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Week 4&lt;/h1&gt;
&lt;div id=&#34;content-3&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;Starts by rewriting Boolean and Integer types as abstract classes - &lt;a href=&#34;https://en.wikipedia.org/wiki/Peano_axioms#Arithmetic&#34;&gt;Peano numbers&lt;/a&gt; in case of non-negative integers. Quickly flies over subtyping, generics, and pattern matching and shows only trivial examples. The more interesting example, well, go do it yourself! The final video shows a bit of practice with lists programming a recursive &lt;a href=&#34;https://en.wikipedia.org/wiki/Insertion_sort&#34;&gt;insertion sort&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;assignment-3&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;Implement &lt;a href=&#34;https://en.wikipedia.org/wiki/Huffman_coding&#34;&gt;Huffman coding&lt;/a&gt; methods via binary trees using pattern matching. I had no idea what Huffman coding is, so I had to first research a topic I had no interest in to even understand the assignment. To give a taste, this is one of the exercises:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Define the function &lt;code&gt;decode&lt;/code&gt; which decodes a list of bits which were already encoded using a Huffman tree, given the corresponding coding tree.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Also, one of the hints, &lt;em&gt;“hint: very simple”&lt;/em&gt; was simply priceless.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;weeks-5-and-6&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Weeks 5 and 6&lt;/h1&gt;
&lt;div id=&#34;content-4&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;Week 5 looks at methods available for lists and gives more details behind the intuition. The later sessions even show how we can prove some properties of the recursive methods, which I found interesting. In week 6 we go deeper into the collections in Scala and look at for expressions. We solve the &lt;a href=&#34;https://en.wikipedia.org/wiki/Eight_queens_puzzle&#34;&gt;n queens problem&lt;/a&gt; with Sets and for expressions. One session is dedicated to Maps, Options, and methods such as &lt;code&gt;groupBy&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As an example, we implement addition of polynomials using these concepts. The conclusion session nicely brings things together by walking through one implementation of the conversion of telephone numbers to sentences. The session looking at Map, Filter, and Reduce methods was very relatable to R’s &lt;code&gt;Map()&lt;/code&gt;, &lt;code&gt;Filter()&lt;/code&gt;, &lt;code&gt;Reduce()&lt;/code&gt; and &lt;code&gt;Position()&lt;/code&gt; functions as their design is similar to the corresponding methods in Scala (look at &lt;code&gt;?Reduce&lt;/code&gt; in R for more).&lt;/p&gt;
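&lt;p&gt;For readers less familiar with these base R functions, a few one-liners show what each of them does. The variable names are only for this example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Map applies a function to each element and returns a list
squares &amp;lt;- Map(function(x) x^2, 1:3)

# Filter keeps the elements for which the predicate is TRUE
evenNumbers &amp;lt;- Filter(function(x) x %% 2 == 0, 1:10)

# Reduce combines the elements with a binary function
total &amp;lt;- Reduce(`+`, 1:5)

# Position returns the index of the first element matching the predicate
firstBig &amp;lt;- Position(function(x) x &amp;gt; 3, c(1, 5, 2, 7))&lt;/code&gt;&lt;/pre&gt;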
&lt;/div&gt;
&lt;div id=&#34;assignment-4&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Assignment&lt;/h2&gt;
&lt;p&gt;There is no assignment in week 5. Week 6 assignment is to compute anagrams of sentences. Compared to the previous assignments I found this one much more fun and interesting, so it felt like a positive ending to the course. Apart from the very first one, this is the only assignment I worked to get a 100% score as I found it motivating. If more of the course had at least this level of practicality, I would have enjoyed it much more.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r923-01-anagrams.png&#34; alt=&#34;Scala in VS Code&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Scala in VS Code&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;course-execution-and-technical-notes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Course execution and technical notes&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Each assignment comes with a pre-prepared sbt project that can be compiled and partially tested, so it is easy to start working on the assignment&lt;/li&gt;
&lt;li&gt;The submission and grading process work conveniently and automatically&lt;/li&gt;
&lt;li&gt;The reading materials themselves often refer to Java constructs to explain Scala constructs, which may tell you nothing if you do not have prior experience with Java. For instance: “Traits are like interfaces in Java, but they can also contain concrete members, i.e. method implementations or field definitions.”&lt;/li&gt;
&lt;li&gt;Some video lectures are placed in the wrong weeks (the narration says Week 3 but they are actually in Week 2) so it can get a bit confusing&lt;/li&gt;
&lt;li&gt;During the first 4 weeks the videos stop for the viewer to fill in the examples, which I guess was meant to be interactive but I found it distracting and always skipped. In the final weeks, this feature was not there, which I found nice.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-overview&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the overview&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The course introduces key concepts of functional programming in Scala with a strong focus on recursion and walks the students through methods on immutable Scala objects. It also introduces pattern matching, for expressions, subtyping, and generics&lt;/li&gt;
&lt;li&gt;The feel of the course was school-like as opposed to more practice-oriented courses&lt;/li&gt;
&lt;li&gt;The assignments are challenging and I found them school-like, which I did not prefer as I was looking for more of a practical course&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Automating R package checks across platforms with GitHub Actions and Docker in a portable way</title>
      <link>https://jozef.io/r922-github-actions-r-packages/</link>
      <pubDate>Sat, 18 Apr 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r922-github-actions-r-packages/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Automating the execution, testing and deployment of R code is a powerful tool to ensure the reproducibility, quality and overall robustness of the code that we are building. A relatively recent feature in GitHub - GitHub actions - allows us to do just that without using additional tools such as Travis or Jenkins for our repositories stored on GitHub.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will examine using GitHub actions and Docker to test our R packages across platforms in a portable way and show how this setup works for the CRAN package languageserversetup.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#many-different-tools-many-different-syntaxes.-and-low-portability&#34;&gt;Many different tools, many different syntaxes. And low portability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#containerizing-and-shell-scripting-our-way-to-portable-setups&#34;&gt;Containerizing and shell scripting our way to portable setups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#continuous-integration-for-r-based-applications-with-github-actions&#34;&gt;Continuous integration for R-based applications with GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-concrete-example---checking-an-r-package-automatically-using-r-hub-in-4-steps&#34;&gt;A concrete example - Checking an R package automatically using R Hub in 4 steps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-show-me-the-code&#34;&gt;TL;DR - just show me the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;many-different-tools-many-different-syntaxes.-and-low-portability&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Many different tools, many different syntaxes. And low portability&lt;/h1&gt;
&lt;p&gt;The motivation behind this post stems mostly from my experience with many different automation tools which we could in simplified terms refer to as &lt;a href=&#34;https://en.wikipedia.org/wiki/Comparison_of_continuous_integration_software&#34;&gt;CI/CD tools&lt;/a&gt;. Some of them offer a wide variety of features such as Jenkins, Bamboo or Travis, others, such as GitLab CI and GitHub Actions are perhaps less feature-rich but offer simplicity and very good out-of-the-box integration with the repository hosting.&lt;/p&gt;
&lt;p&gt;What all these tools share apart from the usefulness of the features is however a bit less appealing for teams trying to build portable CI/CD pipelines - their own syntax.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One good example is the amazing work done to &lt;a href=&#34;https://docs.travis-ci.com/user/languages/r/&#34;&gt;integrate R with Travis&lt;/a&gt;. Thanks to this integration, we can work with R relatively well with Travis. It would likely require a similar effort to enable such integration on the other CI/CD tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What this means for development teams thinking about CI/CD pipelines is that building portable setups using tool-native syntax can quickly become an endeavor on its own - we have written about some examples of Jenkins-based solutions &lt;a href=&#34;https://jozef.io/r918-jenkins-pipelines/&#34;&gt;with regards to environments here&lt;/a&gt; and &lt;a href=&#34;https://jozef.io/r919-jenkins-pipelines-parallel/&#34;&gt;with regards to parallelization here&lt;/a&gt;. Porting such a setup built using a specific tool to another tool becomes increasingly difficult.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;containerizing-and-shell-scripting-our-way-to-portable-setups&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Containerizing and shell scripting our way to portable setups&lt;/h1&gt;
&lt;p&gt;Because of the experience described above, when setting up CI pipelines for R packages I find it beneficial and efficient to choose a route of portability instead. When setting up with GitLab CI a few years ago, the approach was:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;create a Docker image in which R-related commands will run&lt;/li&gt;
&lt;li&gt;write a simple shell script that wraps around it&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This process is described in detail in these 2 posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/r106-r-package-gitlab-ci/&#34;&gt;How to easily automate R analysis, modeling and development work using CI/CD, with working examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/r107-multiplatform-gitlabci-rhub/&#34;&gt;Setting up continuous multi-platform R package building, checking and testing with R-Hub, Docker and GitLab CI/CD for free, with a working example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Perhaps the biggest advantage of such an approach is that we can simply pick that shell script up and place it into a different tool and, assuming that the new tool supports Docker, everything will run just fine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The exceptions are a few details that still stay tool-based, such as working with environment variables and authentication secrets.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;continuous-integration-for-r-based-applications-with-github-actions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Continuous integration for R-based applications with GitHub Actions&lt;/h1&gt;
&lt;p&gt;When creating the &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup&#34;&gt;&lt;code&gt;languageserversetup&lt;/code&gt;&lt;/a&gt; package, it was very important to test each change across many platforms automatically and since I opted to host the open-source code on GitHub instead of GitLab this time, GitHub Actions seemed like a natural choice for a CI/CD setup.&lt;/p&gt;
&lt;p&gt;The current GitHub action for CRAN-like checks looks as follows:&lt;/p&gt;
&lt;pre class=&#34;yaml&#34;&gt;&lt;code&gt;name: check_cran
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v1
    - name: Check for CRAN
      env:
        DOCKER_LOGIN_TOKEN: ${{ secrets.DOCKER_LOGIN_TOKEN }}
        LANGSERVERSETUP_RUN_DEPLOY: false
      run: sh ci/docker_stage.sh ci/check_rhub.R &amp;quot;cran&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, apart from the skeleton that you get for free, the only line that does some work is the very last one. It tells the GitHub Actions executor to run the shell script &lt;code&gt;docker_stage.sh&lt;/code&gt; with 2 arguments:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;sh ci/docker_stage.sh ci/check_rhub.R &amp;quot;cran&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;This setup is very portable. You could take it almost verbatim and use it within Jenkins, GitLab CI and probably most other CI/CD tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;the-docker-wrapping-shell-script&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Docker-wrapping shell script&lt;/h2&gt;
&lt;p&gt;What is the &lt;code&gt;docker_stage.sh&lt;/code&gt; script used for? In our case, it is 3-fold:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Run CRAN-like checks automatically&lt;/li&gt;
&lt;li&gt;Run containerized deployments&lt;/li&gt;
&lt;li&gt;Run and report test coverage&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What they have in common is that they all happen in a Docker container and are described with an R script that can be executed via &lt;code&gt;Rscript&lt;/code&gt;. That means that this shell script is just a helper that will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pull the needed Docker image&lt;/li&gt;
&lt;li&gt;Create a container from that image&lt;/li&gt;
&lt;li&gt;Copy the code checked-out by the (GitHub Actions) runner into the container&lt;/li&gt;
&lt;li&gt;Execute the R script provided as the first command-line argument (&lt;code&gt;ci/check_rhub.R&lt;/code&gt; above) and other arguments if needed&lt;/li&gt;
&lt;li&gt;Stop and remove the container when done&lt;/li&gt;
&lt;/ul&gt;
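&lt;p&gt;The steps above can be sketched roughly as follows. This is a minimal, hypothetical sketch and not the actual &lt;code&gt;ci/docker_stage.sh&lt;/code&gt;; the image &lt;code&gt;rocker/r-ver:latest&lt;/code&gt;, the container name, the &lt;code&gt;/pkg&lt;/code&gt; path and the &lt;code&gt;CI_IMAGE&lt;/code&gt;, &lt;code&gt;CI_CONTAINER&lt;/code&gt; and &lt;code&gt;DOCKER&lt;/code&gt; variables are all assumptions made for illustration:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;#!/bin/sh
# Hypothetical sketch only; this is not the actual ci/docker_stage.sh.
set -eu

# Assumed defaults, overridable through the environment.
IMAGE=&amp;quot;${CI_IMAGE:-rocker/r-ver:latest}&amp;quot;
CONTAINER=&amp;quot;${CI_CONTAINER:-ci_stage}&amp;quot;
# Set DOCKER=echo for a dry run that only prints the docker commands.
DOCKER=&amp;quot;${DOCKER:-docker}&amp;quot;

run_stage() {
  # Pull the image and create a container that will run Rscript in /pkg.
  &amp;quot;$DOCKER&amp;quot; pull &amp;quot;$IMAGE&amp;quot;
  &amp;quot;$DOCKER&amp;quot; create --name &amp;quot;$CONTAINER&amp;quot; -w /pkg &amp;quot;$IMAGE&amp;quot; Rscript &amp;quot;$@&amp;quot;
  # Copy the checked-out code into the container.
  &amp;quot;$DOCKER&amp;quot; cp ./. &amp;quot;$CONTAINER:/pkg&amp;quot;
  # Run the R script, remembering its exit status for the CI tool.
  status=0
  &amp;quot;$DOCKER&amp;quot; start -a &amp;quot;$CONTAINER&amp;quot; || status=$?
  # Remove the container, then report the status of the R script.
  &amp;quot;$DOCKER&amp;quot; rm -f &amp;quot;$CONTAINER&amp;quot;
  return &amp;quot;$status&amp;quot;
}

# Run only when arguments are given, e.g.: sh stage.sh ci/check_rhub.R cran
if [ &amp;quot;$#&amp;quot; -gt 0 ]; then
  run_stage &amp;quot;$@&amp;quot;
fi&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Returning the exit status of the R script is what lets the CI tool mark the job as failed when a check fails. Running the sketch with &lt;code&gt;DOCKER=echo&lt;/code&gt; only prints the Docker commands, which is a cheap way to inspect what would be executed.&lt;/p&gt;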
&lt;/div&gt;
&lt;div id=&#34;the-r-scripts-executed-within-the-docker-container&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The R scripts executed within the Docker container&lt;/h2&gt;
&lt;p&gt;Now the R scripts that are executed within the container can do almost any actions that you require, from checking the package, running unit tests to the execution of your data science models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One fully automated example using this exact approach is how the &lt;a href=&#34;https://www.sparkfromr.com&#34;&gt;sparkfromr.com&lt;/a&gt; book is deployed. The repositories are open-sourced, you can read more in &lt;a href=&#34;https://jozef.io/r206-spark-r-releasing-bookdown/&#34;&gt;this post&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The only important condition is that your Docker container can run that script successfully. In the R world, that mostly entails having R, all the R packages and their system dependencies installed. This is made amazingly easy by the &lt;a href=&#34;https://www.rocker-project.org/&#34;&gt;Rocker Project&lt;/a&gt;, which provides versioned base R images, but also images with RStudio. For tidyverse fans, they even have an image with the entire tidyverse ready for use.&lt;/p&gt;
&lt;p&gt;This is however very easily testable, as the setup using &lt;code&gt;sh ci/docker_stage.sh ci/check_rhub.R &amp;quot;cran&amp;quot;&lt;/code&gt; will not only run via the CI/CD tools, but also on your development machine.
&lt;small&gt;Note that on Windows, you might need to enable the &lt;a href=&#34;https://docs.microsoft.com/en-us/windows/wsl/install-win10&#34;&gt;Windows Subsystem for Linux&lt;/a&gt; for that to be fully true.&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Setting up the process this way may nudge you to a containerized development process, where you develop the project within a container. In that case, the fact that everything works is just an automatic consequence of the development process and the containerization has no overhead, because we can use that very same image for CI/CD purposes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;the-github-actions-yaml-environment-variables-and-secrets&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The GitHub actions yaml, environment variables and secrets&lt;/h2&gt;
&lt;p&gt;Of the few elements of the setup that are not fully portable, the most notable are environment variables and secrets. For GitHub Actions, we can define them with the &lt;code&gt;env:&lt;/code&gt; clause, for example:&lt;/p&gt;
&lt;pre class=&#34;yaml&#34;&gt;&lt;code&gt;      env:
        DOCKER_LOGIN_TOKEN: ${{ secrets.DOCKER_LOGIN_TOKEN }}
        LANGSERVERSETUP_RUN_DEPLOY: false&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above will set the &lt;code&gt;LANGSERVERSETUP_RUN_DEPLOY&lt;/code&gt; environment variable to &lt;code&gt;false&lt;/code&gt; and will expose the encrypted secret named &lt;code&gt;DOCKER_LOGIN_TOKEN&lt;/code&gt; as an environment variable of the same name. The secrets can be created via your repository’s Settings -&amp;gt; Secrets menu on GitHub.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;a-concrete-example---checking-an-r-package-automatically-using-r-hub-in-4-steps&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A concrete example - Checking an R package automatically using R Hub in 4 steps&lt;/h1&gt;
&lt;p&gt;Now with all the information above, let us look at a quick walk-through of a setup that will let us check your R package on multiple platforms using R Hub. We need:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;An R script that will run and evaluate the check via R Hub - For the package languageserversetup, this looks as follows: &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/ci/check_rhub.R&#34;&gt;ci/check_rhub.R&lt;/a&gt;. Note that this script is years old and quite possibly needlessly long and complicated.&lt;/li&gt;
&lt;li&gt;A shell script that will run the R script, such as &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/ci/docker_stage.sh&#34;&gt;ci/docker_stage.sh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A docker container in which the R script can run. We have covered this in some detail in &lt;a href=&#34;https://jozef.io/r107-multiplatform-gitlabci-rhub/#preparing-a-private-docker-image-to-use-with-r-hub&#34;&gt;Preparing a private docker image to use with R-hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A .yaml file in the &lt;code&gt;.github/workflows&lt;/code&gt; directory of your repository, for example, &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/.github/workflows/check_cran.yml&#34;&gt;.github/workflows/check_cran.yml&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And that is it. Now we will have our package checked each time we push a commit to our repository:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r922-01-github-action-log.png&#34; alt=&#34;GitHub Action log for package check via R Hub&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;GitHub Action log for package check via R Hub&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-uses---test-coverage-reporting-and-script-based-deployments&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Other uses - Test coverage reporting and script-based deployments&lt;/h2&gt;
&lt;p&gt;Since the languageserversetup repository is completely open, you can also look at the other GitHub actions set up for that repository. Note that all of the GitHub actions use the very same &lt;code&gt;docker_stage.sh&lt;/code&gt; script, the only thing that changes is the R script for each purpose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Test coverage reporting with covr and codecov.io
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/ci/test_coverage.R&#34;&gt;R script&lt;/a&gt; running the coverage computation with covr and publishing it to codecov.io&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/.github/workflows/coverage.yml&#34;&gt;GitHub Action&lt;/a&gt; definition&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Debian-based script deployments
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/ci/test_deploy.R&#34;&gt;R script&lt;/a&gt; running an example deployment and some tests&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/.github/workflows/test_debian_cran.yml&#34;&gt;GitHub Action&lt;/a&gt; definition&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-show-me-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - just show me the code&lt;/h1&gt;
&lt;p&gt;An example implementation of package testing with the CRAN package languageserversetup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GitHub Actions &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/tree/master/.github/workflows&#34;&gt;workflows for the languageserversetup&lt;/a&gt; package&lt;/li&gt;
&lt;li&gt;Docker-based &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/ci/docker_stage.sh&#34;&gt;shell script&lt;/a&gt; to execute R scripts&lt;/li&gt;
&lt;li&gt;R script for &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/ci/check_rhub.R&#34;&gt;package checks with R Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R script for reporting &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/ci/test_coverage.R&#34;&gt;test coverage&lt;/a&gt; using Codecov.io and covr&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An example implementation of bookdown publication publishing with &lt;a href=&#34;https://www.sparkfromr.com&#34;&gt;sparkfromr.com&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GitHub Actions &lt;a href=&#34;https://github.com/jozefhajnala/sparkfromr/blob/master/.github/workflows/main.yml&#34;&gt;workflows for the book deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Docker-based &lt;a href=&#34;https://github.com/jozefhajnala/sparkfromr/blob/master/build/sparkfromr_auto_deploy.sh&#34;&gt;shell script&lt;/a&gt; to deploy the book. Note that there is no need for a separate R script because the action to be done is trivial.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rocker-project.org/images/&#34;&gt;Docker images for R&lt;/a&gt; on the Rocker Project&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.docker.com/get-started/&#34;&gt;Get started with Docker&lt;/a&gt; official documentation&lt;/li&gt;
&lt;li&gt;GitHub Actions: &lt;a href=&#34;https://help.github.com/en/actions/configuring-and-managing-workflows/creating-and-storing-encrypted-secrets&#34;&gt;Creating and storing encrypted secrets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;GitHub Actions: &lt;a href=&#34;https://help.github.com/en/actions&#34;&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Setting up R with Visual Studio Code quickly and easily with the languageserversetup package</title>
      <link>https://jozef.io/r300-language-server-setup/</link>
      <pubDate>Sat, 21 Mar 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r300-language-server-setup/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Over the past years, R has been gaining popularity, bringing to life new tools with it. Thanks to the amazing work by contributors implementing the Language Server Protocol for R and writing Visual Studio Code Extensions for R, the most popular development environment amongst developers across the world now has very strong support for R as well.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will look at the &lt;code&gt;languageserversetup&lt;/code&gt; package that aims to make the setup of the R Language Server robust and easy to use by installing it into a separate, independent library and adjusting R startup in a way that initializes the language server when relevant.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#visual-studio-code-and-r&#34;&gt;Visual Studio Code and R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#setup-considerations-issues-and-tweaks-creating-the-languageserversetup-package&#34;&gt;Setup considerations, issues, and tweaks: creating the languageserversetup package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#solving-it-with-2-r-commands---the-languageserversetup-package&#34;&gt;Solving it with 2 R commands - the languageserversetup package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#in-action-with-vs-code&#34;&gt;In action with VS Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#technical-details&#34;&gt;Technical details&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;visual-studio-code-and-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Visual Studio Code and R&lt;/h1&gt;
&lt;p&gt;According to &lt;a href=&#34;https://insights.stackoverflow.com/survey/2019#development-environments-and-tools&#34;&gt;the 2019 StackOverflow developer survey&lt;/a&gt;, Visual Studio Code is the most popular development environment across the board, with amazing support for many languages and extensions ranging from improved code editing to advanced version control support and Docker integration.&lt;/p&gt;
&lt;p&gt;Until recently the support for R in Visual Studio Code was in my view not comprehensive enough to justify switching from other tools such as RStudio (Server) to using VS Code exclusively. This has changed with the work done by the team implementing the following 3 tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://marketplace.visualstudio.com/items?itemName=Ikuyadeu.r&#34;&gt;R extension&lt;/a&gt; for VS Code&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp&#34;&gt;R LSP Client extension&lt;/a&gt; for VS Code&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/REditorSupport/languageserver&#34;&gt;languageserver package&lt;/a&gt;: An implementation of the Language Server Protocol for R&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;The features now include all that we need to work efficiently, including auto-complete, definition provider, code formatting, code linting, information on functions on hover, color provider, code sections and more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you are interested in more steps around the setup and the overview of features I recommend the &lt;a href=&#34;https://renkun.me/2019/12/11/writing-r-in-vscode-a-fresh-start/&#34;&gt;Writing R in VSCode: A Fresh Start&lt;/a&gt; blogpost by Kun Ren. I also recommend that you &lt;a href=&#34;https://twitter.com/renkun_ken&#34;&gt;follow Kun&lt;/a&gt; on Twitter if you are interested in the latest developments.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;setup-considerations-issues-and-tweaks-creating-the-languageserversetup-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Setup considerations, issues, and tweaks: creating the &lt;code&gt;languageserversetup&lt;/code&gt; package&lt;/h1&gt;
&lt;p&gt;With my current team, we have almost fully embraced Visual Studio Code as an IDE for our work in R, which is especially great as the work is multi-language and multi-environment in nature and we can do our development in Scala, R and more, including implementing and testing Jenkins pipelines and designing Docker images without leaving VS Code.&lt;/p&gt;
&lt;p&gt;Setting up for the team on multiple systems and platforms we have found the following interesting points which were my motivation to write a small R package, &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup&#34;&gt;languageserversetup&lt;/a&gt;, that should make the installation and setup of the R language server as easy and painless as possible.&lt;/p&gt;
&lt;div id=&#34;managing-package-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Managing package libraries&lt;/h2&gt;
&lt;p&gt;One of the specifics of R is that all extensions (packages) are installed into package libraries, be it the packages we develop and use for our applications or the tools we use mostly as means to make our development life easier. We can therefore often end in a situation where we need to use different versions of R packages for different purposes. For example, the &lt;code&gt;languageserver&lt;/code&gt; package currently needs &lt;code&gt;R6 (&amp;gt;= 2.4.1)&lt;/code&gt;, &lt;code&gt;stringr (&amp;gt;= 1.4.0)&lt;/code&gt; and more, in total it recursively requires 75 other R packages to be installed. When installing and running the package we can run into conflicting versions of what our current applications need versus what the &lt;code&gt;languageserver&lt;/code&gt; package requires to function properly.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;managing-library-paths&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Managing library paths&lt;/h2&gt;
&lt;p&gt;The second consideration, related to the first one is that if we simply install the language server into the default library with for instance &lt;code&gt;install.packages&lt;/code&gt; it will change the library to a state that is possibly not desired. We can also run into unexpected crashes, where the &lt;code&gt;languageserver&lt;/code&gt; will function properly for a time until one of the non-triggered dependencies with a hidden conflict gets triggered.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-solution---complete-library-separation-and-smart-initialization&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A solution - Complete library separation and smart initialization&lt;/h2&gt;
&lt;p&gt;One possible solution to the above issues is to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Keep the package libraries of the &lt;code&gt;languageserver&lt;/code&gt; and the other libraries that the user uses (perhaps apart from the main system library containing the base and recommended packages that come with the R installation itself) completely separated, including all non-base dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Initialize that library only when the R process in question is triggered by the language server, otherwise, keep the process untouched and use the user libraries as usual&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;solving-it-with-2-r-commands---the-languageserversetup-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Solving it with 2 R commands - the &lt;code&gt;languageserversetup&lt;/code&gt; package&lt;/h1&gt;
&lt;p&gt;To make the above solution easily accessible, I have created a small R package called &lt;code&gt;languageserversetup&lt;/code&gt; that will do all the work for you. It can be installed from CRAN and it has no dependencies on other R packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;languageserversetup&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the entire setup has only 2 steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Install the &lt;code&gt;languageserver&lt;/code&gt; package and all of its dependencies into a separate independent library (Will ask for confirmation before taking action) using:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;languageserversetup::languageserver_install()&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Add code to &lt;code&gt;.Rprofile&lt;/code&gt; to automatically align the library paths for the language server functionality if the process is an instance of the &lt;code&gt;languageserver&lt;/code&gt;, otherwise, the R session will run as usual with library paths unaffected. This is achieved by running (will also ask for confirmation):&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;languageserversetup::languageserver_add_to_rprofile()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it. Now you can enjoy the functionality without caring about the setup of libraries or any package version conflicts. Thanks to the full separation of libraries, the removal is as trivial as deleting the library directory.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;in-action-with-vs-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;In action with VS Code&lt;/h1&gt;
&lt;div id=&#34;installing-languageserversetup-and-using-languageserver_install&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Installing languageserversetup and using &lt;code&gt;languageserver_install()&lt;/code&gt;&lt;/h2&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://user-images.githubusercontent.com/23148397/75627074-5888b900-5bcd-11ea-8abf-8008ef0719df.gif&#34; alt=&#34;Installing the language server&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Installing the language server&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;initializing-the-functionality-with-languageserver_add_to_rprofile&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Initializing the functionality with &lt;code&gt;languageserver_add_to_rprofile()&lt;/code&gt;&lt;/h2&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://user-images.githubusercontent.com/23148397/75627078-5aeb1300-5bcd-11ea-9752-448f842ac29d.gif&#34; alt=&#34;Adding the language server to startup&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Adding the language server to startup&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;All done, now enjoy the awesomeness!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;technical-details&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Technical details&lt;/h1&gt;
&lt;p&gt;If you are interested in more technical details,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;please visit the package’s openly accessible &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup&#34;&gt;GitHub repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;the &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/blob/master/README.md&#34;&gt;README.md&lt;/a&gt; has information on options configuration, installation, uninstallation, platforms and more&lt;/li&gt;
&lt;li&gt;the help files for the functions can be accessed from R with &lt;code&gt;?languageserver_install&lt;/code&gt;, &lt;code&gt;?languageserver_startup&lt;/code&gt;, &lt;code&gt;?languageserver_add_to_rprofile&lt;/code&gt; and &lt;code&gt;?languageserver_remove_from_rprofile&lt;/code&gt; for more details on their arguments and customization&lt;/li&gt;
&lt;li&gt;for testing, &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup/actions&#34;&gt;GitHub actions are set up&lt;/a&gt; for multiple platforms and to run all CRAN checks on the package on each commit&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://marketplace.visualstudio.com/items?itemName=Ikuyadeu.r&#34;&gt;R extension&lt;/a&gt; on VS Code Marketplace&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp&#34;&gt;R LSP Client extension&lt;/a&gt; on VS Code Marketplace&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/REditorSupport/languageserver&#34;&gt;languageserver package&lt;/a&gt; on GitHub&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/jozefhajnala/languageserversetup&#34;&gt;languageserversetup package&lt;/a&gt; on GitHub&lt;/li&gt;
&lt;li&gt;Kun Ren’s &lt;a href=&#34;https://renkun.me/2019/12/11/writing-r-in-vscode-a-fresh-start/&#34;&gt;Writing R in VSCode: A Fresh Start&lt;/a&gt; blogpost&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>R is turning 20 years old next Saturday. Here is how much bigger, stronger and faster it got over the years</title>
      <link>https://jozef.io/r921-happy-birthday-r/</link>
      <pubDate>Sat, 22 Feb 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r921-happy-birthday-r/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;It is almost the 29th of February 2020! That day is a special one for R, because it marks 20 years since the release of R v1.0.0, the first official public release of the R programming language.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will look back on the 20 years of R with a bit of history and 3 interesting perspectives: how much faster R got over the years, how many R packages have been released since 2000, and how the number of package downloads grew.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#the-first-release-of-r-29th-february-2000&#34;&gt;The first release of R, 29th February 2000&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-down-in-history-to-1977&#34;&gt;Further down in history, to 1977&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#faster---how-performant-is-r-today-versus-20-years-ago&#34;&gt;Faster - How performant is R today versus 20 years ago?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#stronger---how-many-packages-were-released-over-the-years&#34;&gt;Stronger - How many packages were released over the years?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bigger---how-did-downloads-of-r-packages-grow&#34;&gt;Bigger - How did downloads of R packages grow?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#thank-you-for-the-20-years&#34;&gt;Thank you for the 20 years&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#resources&#34;&gt;Resources&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;the-first-release-of-r-29th-february-2000&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The first release of R, 29th February 2000&lt;/h1&gt;
&lt;p&gt;The first official public release of R happened on the 29th of February, 2000. In the release announcement, Peter Dalgaard notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The release of a current major version indicates that we believe that R has reached a level of stability and maturity that makes it suitable for production use. Also, the release of 1.0.0 marks that the base language and the API for extension writers will remain stable for the foreseeable future. In addition we have taken the opportunity to tie up as many loose ends as we could.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Today, 20 years later, it is quite amazing how true the statement about the API remaining stable has proven to be. The original release &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-announce/2000/000127.html&#34;&gt;announcement&lt;/a&gt; and &lt;a href=&#34;http://developer.r-project.org/R-release-1.0.0.txt&#34;&gt;full release statement&lt;/a&gt; are still available online.&lt;/p&gt;
&lt;p&gt;You can also still download the very first public version of R. For instance, for Windows you can find it on the &lt;a href=&#34;https://cran.r-project.org/bin/windows/base/old/&#34;&gt;Previous Releases of R for Windows&lt;/a&gt; page. And it is quite runnable, even under Windows 10.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;further-down-in-history-to-1977&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Further down in history, to 1977&lt;/h1&gt;
&lt;p&gt;Now, to do R justice in terms of age, we need to go even further back in history. In the &lt;a href=&#34;http://developer.r-project.org/R-release-1.0.0.txt&#34;&gt;full release statement of R v1.0.0&lt;/a&gt;, we can find that&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;R implements a dialect of the award-winning language S, developed at Bell Laboratories by John Chambers et al.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With some digging, we can use the Internet Archive’s Wayback Machine to find interesting &lt;a href=&#34;https://web.archive.org/web/20150626130902/http://ect.bell-labs.com/sl/S/version1.html&#34;&gt;notes on Version 1 of S&lt;/a&gt; itself, written by John Chambers, where he writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Over the summer of 1976, some actual implementation began. The paper record has a gap over this period (maybe we were too busy coding to write things down). My recollection is that by early autumn, a language was available for local use on the Honeywell system in use at Murray Hill. Certainly by early 1977 there was software and a first version of a user’s manual.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As we can see, the ideas and principles behind R are actually much older than 20 years, and even older than 40. If you are interested in the history, I recommend watching the very interesting &lt;a href=&#34;https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/Forty-years-of-S/player&#34;&gt;40 years of S talk&lt;/a&gt; from useR! 2016.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;faster---how-performant-is-r-today-versus-20-years-ago&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Faster - How performant is R today versus 20 years ago?&lt;/h1&gt;
&lt;p&gt;With the 20th birthday of R approaching, I was curious how much faster the implementation of R got from version to version. I wrote some very simple benchmarking code that solves the &lt;a href=&#34;https://projecteuler.net/problem=14&#34;&gt;Longest Collatz sequence&lt;/a&gt; problem for the first 1 million numbers with a brute-force-ish algorithm.&lt;/p&gt;
&lt;p&gt;I then executed it on the same hardware using 20 different versions of R, starting with the very original 1.0, through 2.0 and 3.0, all the way to today’s development version.&lt;/p&gt;
&lt;div id=&#34;benchmarking-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Benchmarking code&lt;/h2&gt;
&lt;p&gt;Below is the code snippet with the implementation to be benchmarked:&lt;/p&gt;
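&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Number of Collatz steps needed to get from n down to 1
col_len &amp;lt;- function(n) {
  len &amp;lt;- 0
  while (n &amp;gt; 1) {
    len &amp;lt;- len + 1
    if ((n %% 2) == 0)
      n &amp;lt;- n / 2
    else {
      # n is odd, so 3n + 1 is even: do the halving right away
      # and count it as a second step
      n &amp;lt;- (n * 3 + 1) / 2
      len &amp;lt;- len + 1
    }
  }
  len
}

# Run the benchmark 10 times, collecting garbage before each timed run
res &amp;lt;- lapply(
  1:10,
  function(i) {
    gc()
    system.time(
      # Longest chain (in steps) among starting numbers below 1 million
      max(sapply(seq(from = 1, to = 999999), col_len))
    )
  }
)&lt;/code&gt;&lt;/pre&gt;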
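&lt;p&gt;As a minimal sketch of how timings like these could be checked and summarised, the snippet below repeats the step counter, verifies it against a few known Collatz step counts and runs a scaled-down benchmark. The reduced range (first 9999 numbers), the 3 repetitions and the names &lt;code&gt;res_small&lt;/code&gt; and &lt;code&gt;elapsed&lt;/code&gt; are assumptions made for illustration, not the setup used for the chart in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Same step counter as above, repeated so the sketch is self-contained
col_len &amp;lt;- function(n) {
  len &amp;lt;- 0
  while (n &amp;gt; 1) {
    len &amp;lt;- len + 1
    if ((n %% 2) == 0)
      n &amp;lt;- n / 2
    else {
      n &amp;lt;- (n * 3 + 1) / 2
      len &amp;lt;- len + 1
    }
  }
  len
}

# Sanity checks against known Collatz step counts
stopifnot(col_len(1) == 0, col_len(6) == 8, col_len(27) == 111)

# Scaled-down benchmark: 3 repetitions over the first 9999 numbers
# (assumed sizes, chosen only to keep the run short)
res_small &amp;lt;- lapply(1:3, function(i) {
  gc()
  system.time(max(sapply(seq(from = 1, to = 9999), col_len)))
})

# Elapsed seconds per repetition and their five-number summary
elapsed &amp;lt;- sapply(res_small, function(x) x[[&#34;elapsed&#34;]])
fivenum(elapsed)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here &lt;code&gt;fivenum()&lt;/code&gt; returns the minimum, lower hinge, median, upper hinge and maximum, which is roughly the information a boxplot displays for each R version.&lt;/p&gt;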
&lt;/div&gt;
&lt;div id=&#34;results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;Now to the interesting part, the results. The chart below shows boxplots of the time in seconds required to execute the code, with R versions on the horizontal axis.&lt;/p&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#r920-01-r-speed-boxplot&#39;).highcharts({
  title: {     
    text: &#34;Execution time by R version&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;time (seconds)&#34;     
    },     
    min: 0,     
    max: 1200     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    boxplot: {     
      fillColor: &#34;#C9E4FF&#34;,     
      lineWidth: 1,     
      medianWidth: 1,     
      stemDashStyle: &#34;dot&#34;,     
      stemWidth: 1,     
      whiskerLength: &#34;40%&#34;,     
      whiskerWidth: 1.5     
    }     
  },     
  chart: {     
    type: &#34;column&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      g2: null,     
      data: [     
        {     
          name: &#34;1.0.0&#34;,     
          low: 1057.54,     
          q1: 1058.49,     
          median: 1060.3,     
          q3: 1063.62,     
          high: 1066.54     
        },     
        {     
          name: &#34;1.4.1&#34;,     
          low: 276.57,     
          q1: 277.43,     
          median: 278.17,     
          q3: 279.14,     
          high: 279.14     
        },     
        {     
          name: &#34;2.0.0&#34;,     
          low: 167.56,     
          q1: 167.72,     
          median: 168.185,     
          q3: 169.51,     
          high: 172.1     
        },     
        {     
          name: &#34;2.10.0&#34;,     
          low: 282.6,     
          q1: 285.23,     
          median: 286.465,     
          q3: 287.39,     
          high: 288.29     
        },     
        {     
          name: &#34;2.11.0&#34;,     
          low: 272.91,     
          q1: 273.69,     
          median: 275.43,     
          q3: 277.07,     
          high: 278.01     
        },     
        {     
          name: &#34;2.12.0&#34;,     
          low: 270.52,     
          q1: 271.06,     
          median: 271.38,     
          q3: 272.2,     
          high: 272.6     
        },     
        {     
          name: &#34;2.13.0&#34;,     
          low: 270.94,     
          q1: 271.49,     
          median: 271.735,     
          q3: 272.25,     
          high: 272.25     
        },     
        {     
          name: &#34;2.14.0&#34;,     
          low: 265.25,     
          q1: 266.28,     
          median: 266.52,     
          q3: 267.09,     
          high: 268.12     
        },     
        {     
          name: &#34;2.15.0&#34;,     
          low: 248.18,     
          q1: 249.2,     
          median: 249.51,     
          q3: 249.94,     
          high: 250.01     
        },     
        {     
          name: &#34;2.4.0&#34;,     
          low: 262.86,     
          q1: 263.15,     
          median: 263.585,     
          q3: 264.29,     
          high: 265.2     
        },     
        {     
          name: &#34;2.6.0&#34;,     
          low: 284,     
          q1: 284.28,     
          median: 284.42,     
          q3: 285.17,     
          high: 285.32     
        },     
        {     
          name: &#34;2.8.0&#34;,     
          low: 291.02,     
          q1: 291.05,     
          median: 292.065,     
          q3: 292.61,     
          high: 292.67     
        },     
        {     
          name: &#34;3.0.0&#34;,     
          low: 250.21,     
          q1: 250.54,     
          median: 251.19,     
          q3: 251.84,     
          high: 252.13     
        },     
        {     
          name: &#34;3.1.0&#34;,     
          low: 144.16,     
          q1: 144.61,     
          median: 145.05,     
          q3: 146.02,     
          high: 147.46     
        },     
        {     
          name: &#34;3.2.0&#34;,     
          low: 138.53,     
          q1: 141.06,     
          median: 142.75,     
          q3: 143.67,     
          high: 145.13     
        },     
        {     
          name: &#34;3.3.0&#34;,     
          low: 136.84,     
          q1: 137.22,     
          median: 137.785,     
          q3: 138.95,     
          high: 139.23     
        },     
        {     
          name: &#34;3.4.0&#34;,     
          low: 27.05,     
          q1: 27.33,     
          median: 27.52,     
          q3: 28.21,     
          high: 28.92     
        },     
        {     
          name: &#34;3.5.0&#34;,     
          low: 29.53,     
          q1: 29.75,     
          median: 30.21,     
          q3: 30.95,     
          high: 31.3     
        },     
        {     
          name: &#34;3.6.0&#34;,     
          low: 27.97,     
          q1: 28.27,     
          median: 28.37,     
          q3: 29.31,     
          high: 29.69     
        },     
        {     
          name: &#34;devel&#34;,     
          low: 25.01,     
          q1: 25.15,     
          median: 25.28,     
          q3: 25.51,     
          high: 25.51     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null,     
      name: &#34;Execution time of collatz code chunk by R version&#34;,     
      groupPadding: 0     
    }     
  ]     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;r920-01-r-speed-boxplot&#34;&gt;

&lt;/div&gt;
&lt;p&gt;We can see that the median time to execute the above code to find the longest Collatz sequence amongst the first million numbers was:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;February 2000: More than 17 minutes with the first R version, 1.0.0&lt;/li&gt;
&lt;li&gt;January 2002: A large performance boost came already with the 1.4.1 release, decreasing the time almost 4x, to around 4.5 minutes&lt;/li&gt;
&lt;li&gt;October 2004: Even more interestingly, my measurements showed another big improvement with version 2.0.0, to just 168 seconds, less than 3 minutes. I was, however, not able to get such good results with any of the later 2.x versions&lt;/li&gt;
&lt;li&gt;April 2014: Another speed improvement came 10 years later, with version 3.1 decreasing the time to around 145 seconds&lt;/li&gt;
&lt;li&gt;April 2017: Finally, the 3.4 release brought another significant performance boost. From this version on, the calculation takes roughly 30 seconds or less&lt;/li&gt;
&lt;/ul&gt;
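&lt;p&gt;The ratios quoted above can be recomputed from the medians in the chart. This is just a small arithmetic sketch; the vector name &lt;code&gt;medians&lt;/code&gt; and the selection of versions are choices made for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Median execution times in seconds, copied from the chart above
medians &amp;lt;- c(
  &#34;1.0.0&#34; = 1060.3, &#34;1.4.1&#34; = 278.17, &#34;2.0.0&#34; = 168.185,
  &#34;3.1.0&#34; = 145.05, &#34;3.4.0&#34; = 27.52, &#34;devel&#34; = 25.28
)

# Medians expressed in minutes
round(medians / 60, 1)

# Speed-up factor of each version relative to R 1.0.0
speedup &amp;lt;- round(medians[[&#34;1.0.0&#34;]] / medians, 1)
speedup&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these medians, version 1.4.1 comes out roughly 3.8 times faster than 1.0.0, and the development version roughly 42 times faster.&lt;/p&gt;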
&lt;/div&gt;
&lt;div id=&#34;some-details-and-notes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some details and notes&lt;/h2&gt;
&lt;p&gt;The above is by no means a proper benchmarking solution and was run purely out of interest. The benchmarks were run&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;on a Windows-based PC with an Intel Core i5-4590 processor and 8 GB of DDR3 1600 MHz RAM&lt;/li&gt;
&lt;li&gt;using 32-bit versions of R, with no additional packages installed&lt;/li&gt;
&lt;li&gt;with the following options for R 1.0.0: &lt;code&gt;--vsize=900M --nsize=20000k&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some interesting notes on running the same code with a 20-year-old version of R:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There was no &lt;code&gt;message()&lt;/code&gt; function available&lt;/li&gt;
&lt;li&gt;Integer literals using the &lt;code&gt;L&lt;/code&gt; suffix were not accepted&lt;/li&gt;
&lt;li&gt;The function &lt;code&gt;do.call()&lt;/code&gt; needed a character function name as the first argument&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;=&lt;/code&gt; operator was not accepted for assignment. &lt;code&gt;_&lt;/code&gt; was accepted, though ;-)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other than that, the code ran with no issues across all the tested versions.&lt;/p&gt;
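&lt;p&gt;To illustrate the notes above, here is a minimal sketch of code written in a style that avoids the listed features. It assumes that &lt;code&gt;cat()&lt;/code&gt;, &lt;code&gt;as.integer()&lt;/code&gt; and &lt;code&gt;do.call()&lt;/code&gt; with a character name behave in old versions as they do today, and it is not part of the benchmarked code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Character function name in do.call(), instead of passing the function itself
vals &amp;lt;- list(3, 1, 2)
biggest &amp;lt;- do.call(&#34;max&#34;, vals)

# as.integer() instead of the 10L integer literal
n &amp;lt;- as.integer(10)

# cat() instead of message() for progress output
cat(&#34;largest value:&#34;, biggest, &#34;\n&#34;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All assignments use &lt;code&gt;&amp;lt;-&lt;/code&gt;, so the missing &lt;code&gt;=&lt;/code&gt; assignment is not an issue either.&lt;/p&gt;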
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;stronger---how-many-packages-were-released-over-the-years&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Stronger - How many packages were released over the years?&lt;/h1&gt;
&lt;p&gt;The power of R comes in no small part from the fact that it is easily extensible and the extensions are easily accessible using The Comprehensive R Archive Network, known to most simply as &lt;a href=&#34;https://cran.r-project.org/&#34;&gt;CRAN&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Next on the list of interesting numbers is how CRAN has grown into the powerhouse it is today, with more than 15 000 available packages. Namely, I looked at the number of new packages (first releases to CRAN) and total releases (including newer versions of existing packages) over the years using the &lt;code&gt;pkgsearch&lt;/code&gt; package.&lt;/p&gt;
&lt;div id=&#34;results-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#r920-02-r-package_releases&#39;).highcharts({
  title: {     
    text: &#34;Package releases to CRAN over the years&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;Number of packages&#34;     
    },     
    min: 0,     
    softMax: 11000,     
    tickInterval: 2000     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    }     
  },     
  xAxis: {     
    categories: [1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]     
  },     
  series: [     
    {     
      data: [5, 68, 128, 298, 417, 545, 824, 1135, 1581, 1954, 2055, 3074, 3279, 4264, 5238, 5512, 6309, 7134, 8602, 9819, 10831, 10620],     
      name: &#34;All Releases (Including updates)&#34;,     
      type: &#34;area&#34;     
    },     
    {     
      data: [5, 39, 24, 37, 51, 63, 98, 145, 185, 227, 231, 359, 403, 544, 892, 881, 1161, 1432, 1872, 2089, 1791, 1656],     
      name: &#34;First Releases (New Package)&#34;,     
      type: &#34;column&#34;,     
      pointPadding: 0.01,     
      groupPadding: 0.01     
    }     
  ]     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;r920-02-r-package_releases&#34;&gt;

&lt;/div&gt;
&lt;p&gt;Once again, the numbers speak for themselves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In each of the years 2000-2004, fewer than 100 new packages were released&lt;/li&gt;
&lt;li&gt;In 2010 CRAN saw more than 400 new packages&lt;/li&gt;
&lt;li&gt;In 2014 more than 1000 packages had their first release&lt;/li&gt;
&lt;li&gt;In 2017 over 2000 new packages were added to CRAN&lt;/li&gt;
&lt;li&gt;In 2018 and 2019, the number of total CRAN releases was more than 10 000&lt;/li&gt;
&lt;/ul&gt;
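&lt;p&gt;The milestones in the list above can be read directly off the chart data. The following is a small sketch using the two series from the chart; the helper &lt;code&gt;first_year_above()&lt;/code&gt; and the vector names are hypothetical and exist only for this illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Yearly counts copied from the chart above
years &amp;lt;- 1998:2019
all_releases &amp;lt;- c(5, 68, 128, 298, 417, 545, 824, 1135, 1581, 1954, 2055,
                  3074, 3279, 4264, 5238, 5512, 6309, 7134, 8602, 9819, 10831, 10620)
first_releases &amp;lt;- c(5, 39, 24, 37, 51, 63, 98, 145, 185, 227, 231,
                    359, 403, 544, 892, 881, 1161, 1432, 1872, 2089, 1791, 1656)

# First year in which the number of new packages exceeded a threshold
first_year_above &amp;lt;- function(threshold) {
  min(years[first_releases &amp;gt; threshold])
}
milestones &amp;lt;- sapply(c(400, 1000, 2000), first_year_above)
milestones

# Years with more than 10 000 total releases
busy_years &amp;lt;- years[all_releases &amp;gt; 10000]
busy_years&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives 2010, 2014 and 2017 for the three thresholds, and 2018 and 2019 as the years with more than 10 000 total releases.&lt;/p&gt;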
&lt;p&gt;I would like to take this opportunity to thank the team behind CRAN for making this amazing growth possible.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;bigger---how-did-downloads-of-r-packages-grow&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Bigger - How did downloads of R packages grow?&lt;/h1&gt;
&lt;p&gt;The size of the user and developer bases of programming languages is difficult to estimate, but we can use a simple proxy to get a picture in terms of growth. RStudio’s CRAN mirror provides a &lt;a href=&#34;https://github.com/metacran/cranlogs.app/blob/master/README.md&#34;&gt;REST API&lt;/a&gt; from which we can look at and visualize the number of monthly downloads of R packages in the past 7 years:&lt;/p&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#r920-01-monthly-r-package-downloads&#39;).highcharts({
  title: {     
    text: &#34;Monthly R package downloads (RStudio&#39;s CRAN mirror)&#34;     
  },     
  yAxis: {     
    title: {     
      text: null     
    }     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    }     
  },     
  series: [     
    {     
      data: [1121929, 1254236, 1585446, 1836591, 1748652, 1696828, 1835270, 1926914, 2615003, 2771729, 2822179, 2378941, 3135423, 3253037, 4130467, 4907718, 4029389, 5060976, 4823936, 5588692, 5396871, 6534368, 9818519, 5984781, 7695022, 7738873, 9937860, 11274461, 10511679, 11046909, 10525599, 11823243, 13925401, 15168714, 16994726, 14767617, 14946786, 16313612, 16110482, 15034673, 18135754, 16964338, 16100541, 16989975, 21212512, 23259892, 27638156, 21852630, 26957904, 28609044, 32522448, 32597694, 32389075, 30214297, 28431161, 28865282, 35438878, 37808497, 39820735, 31812256, 39303078, 38353540, 45966646, 51566293, 52283540, 44977277, 45182577, 43520781, 59734508, 87387892, 74764832, 58582203, 74357130, 73179786, 85358517, 83888879, 93303036, 77304734, 90505852, 89384184, 112344135, 125132842, 120728107, 109839361, 128265394],     
      name: &#34;Monthly R package downloads (RStudio&#39;s CRAN mirror)&#34;,     
      marker: {     
        enabled: false     
      },     
      lineWidth: 1,     
      type: &#34;area&#34;     
    }     
  ],     
  xAxis: {     
    categories: [&#34;2013-01&#34;, &#34;2013-02&#34;, &#34;2013-03&#34;, &#34;2013-04&#34;, &#34;2013-05&#34;, &#34;2013-06&#34;, &#34;2013-07&#34;, &#34;2013-08&#34;, &#34;2013-09&#34;, &#34;2013-10&#34;, &#34;2013-11&#34;, &#34;2013-12&#34;, &#34;2014-01&#34;, &#34;2014-02&#34;, &#34;2014-03&#34;, &#34;2014-04&#34;, &#34;2014-05&#34;, &#34;2014-06&#34;, &#34;2014-07&#34;, &#34;2014-08&#34;, &#34;2014-09&#34;, &#34;2014-10&#34;, &#34;2014-11&#34;, &#34;2014-12&#34;, &#34;2015-01&#34;, &#34;2015-02&#34;, &#34;2015-03&#34;, &#34;2015-04&#34;, &#34;2015-05&#34;, &#34;2015-06&#34;, &#34;2015-07&#34;, &#34;2015-08&#34;, &#34;2015-09&#34;, &#34;2015-10&#34;, &#34;2015-11&#34;, &#34;2015-12&#34;, &#34;2016-01&#34;, &#34;2016-02&#34;, &#34;2016-03&#34;, &#34;2016-04&#34;, &#34;2016-05&#34;, &#34;2016-06&#34;, &#34;2016-07&#34;, &#34;2016-08&#34;, &#34;2016-09&#34;, &#34;2016-10&#34;, &#34;2016-11&#34;, &#34;2016-12&#34;, &#34;2017-01&#34;, &#34;2017-02&#34;, &#34;2017-03&#34;, &#34;2017-04&#34;, &#34;2017-05&#34;, &#34;2017-06&#34;, &#34;2017-07&#34;, &#34;2017-08&#34;, &#34;2017-09&#34;, &#34;2017-10&#34;, &#34;2017-11&#34;, &#34;2017-12&#34;, &#34;2018-01&#34;, &#34;2018-02&#34;, &#34;2018-03&#34;, &#34;2018-04&#34;, &#34;2018-05&#34;, &#34;2018-06&#34;, &#34;2018-07&#34;, &#34;2018-08&#34;, &#34;2018-09&#34;, &#34;2018-10&#34;, &#34;2018-11&#34;, &#34;2018-12&#34;, &#34;2019-01&#34;, &#34;2019-02&#34;, &#34;2019-03&#34;, &#34;2019-04&#34;, &#34;2019-05&#34;, &#34;2019-06&#34;, &#34;2019-07&#34;, &#34;2019-08&#34;, &#34;2019-09&#34;, &#34;2019-10&#34;, &#34;2019-11&#34;, &#34;2019-12&#34;, &#34;2020-01&#34;]     
  }     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;r920-01-monthly-r-package-downloads&#34;&gt;

&lt;/div&gt;
&lt;p&gt;Note that the numbers above represent just one of many CRAN mirrors, so the true number of package downloads is much higher. The informational value of the chart is mostly in the growth, which is quite impressive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;January 2013 saw around 1.1 million downloads&lt;/li&gt;
&lt;li&gt;January 2015 saw 7.7 million&lt;/li&gt;
&lt;li&gt;January 2017 saw almost 27 million&lt;/li&gt;
&lt;li&gt;January 2020 saw more than 128 million&lt;/li&gt;
&lt;/ul&gt;
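&lt;p&gt;To put the growth into a single number, here is a short arithmetic sketch based on the January values from the chart. The variable names are illustrative, and the yearly factor assumes smooth exponential growth, which is a simplification.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# January download counts from RStudio&#39;s CRAN mirror, copied from the chart data
jan &amp;lt;- c(&#34;2013&#34; = 1121929, &#34;2015&#34; = 7695022,
         &#34;2017&#34; = 26957904, &#34;2020&#34; = 128265394)

# Overall growth factor between January 2013 and January 2020
growth &amp;lt;- jan[[&#34;2020&#34;]] / jan[[&#34;2013&#34;]]

# Implied average yearly growth factor over those 7 years
yearly &amp;lt;- growth^(1 / 7)

round(c(growth = growth, yearly = yearly), 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The monthly downloads grew roughly 114-fold over the 7 years, which corresponds to almost doubling every year on average.&lt;/p&gt;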
&lt;/div&gt;
&lt;div id=&#34;thank-you-for-the-20-years&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Thank you for the 20 years&lt;/h1&gt;
&lt;p&gt;And here is to 20 more.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://i.giphy.com/media/yziuK6WtDFMly/giphy.gif&#34; alt=&#34;Cheers!&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Cheers!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;resources&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The release announcement on &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-announce/2000/000127.html&#34;&gt;stat.ethz.ch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The full release statement at &lt;a href=&#34;http://developer.r-project.org/R-release-1.0.0.txt&#34;&gt;developer.r-project.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The older version &lt;a href=&#34;https://cran.r-project.org/bin/windows/base/old/&#34;&gt;R Installers for Windows&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Releasing and open-sourcing the Using Spark from R for performance with arbitrary code series</title>
      <link>https://jozef.io/r206-spark-r-releasing-bookdown/</link>
      <pubDate>Sat, 04 Jan 2020 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r206-spark-r-releasing-bookdown/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Over the past months, we published and refined a series of posts on Using Spark from R for performance with arbitrary code. Since the posts have grown in size and scope, blog posts were no longer the best medium for sharing the content in the way most useful to readers. We decided to compile a publication instead and open-source it for all readers to use freely.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we present &lt;a href=&#34;https://sparkfromr.com/&#34;&gt;Using Spark from R for performance&lt;/a&gt;, an open-source online publication that will serve as a medium to communicate the current and future installments of the series comprehensively, including instructions on how to use it and a Docker image with all the prerequisites needed to run the code examples.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#who-is-this-book-for&#34;&gt;Who is this book for?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-are-the-main-topics-currently-covered&#34;&gt;What are the main topics currently covered?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#are-the-sources-also-available&#34;&gt;Are the sources also available?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#where-can-issues-be-raised&#34;&gt;Where can issues be raised?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#acknowledgments-and-thank-yous&#34;&gt;Acknowledgments and thank yous&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;who-is-this-book-for&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Who is this book for?&lt;/h1&gt;
&lt;p&gt;The book is published at &lt;a href=&#34;https://sparkfromr.com/&#34;&gt;sparkfromr.com&lt;/a&gt; and it focuses on users who are interested in practical insights into using the &lt;code&gt;sparklyr&lt;/code&gt; interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages. This publication focuses on exploring the &lt;strong&gt;different interfaces&lt;/strong&gt; available for communication between R and Spark using the sparklyr package.&lt;/p&gt;
&lt;p&gt;We have also created a Docker image that lets you use the code in the book without having to set up all the necessary software requirements such as Java, Spark, and all the necessary R packages. A guide to using the book with that image is &lt;a href=&#34;https://sparkfromr.com/using-a-ready-made-docker-image.html&#34;&gt;included as a separate chapter&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-main-topics-currently-covered&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;What are the main topics currently covered?&lt;/h1&gt;
&lt;p&gt;The main topics are summarized in the following chapters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://sparkfromr.com/communication-between-spark-and-sparklyr.html&#34;&gt;Communication between Spark and sparklyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://sparkfromr.com/non-translated-functions-with-spark-apply.html&#34;&gt;Non-translated functions with spark_apply&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://sparkfromr.com/constructing-functions-by-piping-dplyr-verbs.html&#34;&gt;Constructing functions by piping dplyr verbs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://sparkfromr.com/constructing-sql-and-executing-it-with-spark.html&#34;&gt;Constructing SQL and executing it with Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://sparkfromr.com/using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html&#34;&gt;Using the lower-level invoke API to manipulate Spark’s Java objects from R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://sparkfromr.com/exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html&#34;&gt;Exploring the invoke API from R with Java reflection and examining invokes with logs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;are-the-sources-also-available&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Are the sources also available?&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Yes&lt;/strong&gt;. The content is rendered and published automatically from publicly accessible git repositories. You can find the&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Content sources in the &lt;a href=&#34;https://github.com/jozefhajnala/sparkfromr&#34;&gt;sparkfromr GitHub repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Rendered version in the &lt;a href=&#34;https://github.com/jozefhajnala/sparkfromr_deployed&#34;&gt;sparkfromr_deployed GitHub repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Automatically built Docker image used to render the book on &lt;a href=&#34;https://hub.docker.com/repository/docker/jozefhajnala/sparkfromr&#34;&gt;DockerHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sources used to build the Docker images in the &lt;a href=&#34;https://github.com/jozefhajnala/sparkfromr_docker&#34;&gt;sparkfromr_docker GitHub repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All contributions to the above are of course most welcome.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;where-can-issues-be-raised&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Where can issues be raised?&lt;/h1&gt;
&lt;p&gt;If you find any errors or other issues with the book, or simply have requests for improvements or more content, the ideal place to raise them is directly in the GitHub repositories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For issues in the content of the book, please &lt;a href=&#34;https://github.com/jozefhajnala/sparkfromr/issues&#34;&gt;raise an issue here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;For issues related to the Docker image, please &lt;a href=&#34;https://github.com/jozefhajnala/sparkfromr_docker&#34;&gt;raise an issue here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments-and-thank-yous&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Acknowledgments and thank yous&lt;/h1&gt;
&lt;p&gt;Creation of this book would not be possible without many openly available resources such as the&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;R packages around the &lt;em&gt;rmarkdown&lt;/em&gt; ecosystem created by &lt;a href=&#34;https://yihui.org/en/&#34;&gt;Yihui Xie&lt;/a&gt;, namely the &lt;a href=&#34;https://bookdown.org/&#34;&gt;&lt;em&gt;bookdown&lt;/em&gt;&lt;/a&gt; package via which this publication is rendered&lt;/li&gt;
&lt;li&gt;the project also heavily relies on &lt;a href=&#34;https://www.rocker-project.org/&#34;&gt;the Rocker Project&lt;/a&gt; which provides Docker images for the R environment thanks to &lt;a href=&#34;https://www.carlboettiger.info/&#34;&gt;Carl Boettiger&lt;/a&gt;, &lt;a href=&#34;http://dirk.eddelbuettel.com/&#34;&gt;Dirk Eddelbuettel&lt;/a&gt;, and &lt;a href=&#34;https://www.noamross.net/&#34;&gt;Noam Ross&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;last but not least there would be nothing to write about in this short book if the &lt;a href=&#34;https://cran.r-project.org/web/packages/sparklyr/index.html&#34;&gt;&lt;em&gt;sparklyr&lt;/em&gt;&lt;/a&gt; package was not written by &lt;a href=&#34;https://github.com/javierluraschi/&#34;&gt;Javier Luraschi&lt;/a&gt; et al., the R programming language itself maintained by the &lt;a href=&#34;https://www.r-project.org/contributors.html&#34;&gt;R core&lt;/a&gt; group and the &lt;a href=&#34;https://spark.apache.org/&#34;&gt;Apache Spark&lt;/a&gt; creators and maintainers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My thanks go to the creators and maintainers of all these amazing open-source tools.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r206-01-bookdown-spark-and-r.png&#34; alt=&#34;Logos of bookdown, Apache Spark and R&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Logos of bookdown, Apache Spark and R&lt;/p&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Happy reading!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>4 great free tools that can make your R work more efficient, reproducible and robust</title>
      <link>https://jozef.io/r920-christmas-praise-2019/</link>
      <pubDate>Sat, 21 Dec 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r920-christmas-praise-2019/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;It is Christmas time again! And &lt;a href=&#34;https://jozef.io/r907-christmas-praise/&#34;&gt;just like last year&lt;/a&gt;, what better time than this to write about the great tools that are available to all interested in working with R. This post is meant as praise for a few selected tools and packages that helped me to be more efficient and productive with R in 2019.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will praise free tools that can help your work become more efficient, reproducible and productive, namely the data.table package, the Rocker project for R-based Docker images, the base package parallel, and the R-Hub service for package checking.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#data.table---rs-unsung-powerhouse&#34;&gt;data.table - R’s unsung powerhouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-rocker-project-for-r-based-docker-images&#34;&gt;The Rocker project for R-based Docker images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#base-package-parallel&#34;&gt;Base package parallel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rhub-for-fast-and-automated-multi-platform-r-package-testing&#34;&gt;Rhub for fast and automated multi-platform R package testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#resources&#34;&gt;Resources&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;data.table---rs-unsung-powerhouse&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;data.table - R’s unsung powerhouse&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;../img/r917-01-datatable-logo.png&#34; alt=&#34;Logo of data.table&#34; class=&#34;leftsmall&#34;&gt; One of the packages I find most under-marketed and under-appreciated in the R package ecosystem is &lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/index.html&#34;&gt;data.table&lt;/a&gt;. If it is mentioned, it is mostly for its speed and memory efficiency, which is certainly well deserved, but I feel dismissing the other benefits and features is not doing it justice. Here are a few points that I like about data.table that do not get that much exposure.&lt;/p&gt;
&lt;div id=&#34;the-concise-and-generic-syntax&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The concise and generic syntax&lt;/h2&gt;
&lt;p&gt;I enjoy the fact that data.table’s syntax is very concise and principle-driven. In effect, all you need for most common use cases is to learn using the &lt;code&gt;[]&lt;/code&gt; brackets and an amazing world of opportunities will follow. Just one small example on taking 2 data tables, joining them on their common columns, filtering on rows, summarizing a variable grouped by an evaluated expression on 1 line:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Prepare the packages and data
library(data.table)
flts &amp;lt;- as.data.table(nycflights13::flights)
wthr &amp;lt;- as.data.table(nycflights13::weather)
byCols &amp;lt;- intersect(names(flts), names(wthr))

# Join, filter, group by and aggregate
wthr[flts, on = byCols][origin == &amp;quot;JFK&amp;quot;, mean(dep_delay, na.rm = TRUE), precip &amp;gt; 0]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    precip       V1
## 1:  FALSE 10.92661
## 2:     NA 13.66543
## 3:   TRUE 29.70753&lt;/code&gt;&lt;/pre&gt;
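&lt;p&gt;To make the general form &lt;code&gt;DT[i, j, by]&lt;/code&gt; behind the one-liner above easier to see, here is a minimal sketch on a hand-made table. The toy values and the names &lt;code&gt;toy&lt;/code&gt;, &lt;code&gt;mean_delay&lt;/code&gt; and &lt;code&gt;rainy&lt;/code&gt; are made up for illustration and are not part of the nycflights13 data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

# A tiny, made-up table with the same column names as used above
toy &amp;lt;- data.table(
  origin = c(&amp;quot;JFK&amp;quot;, &amp;quot;JFK&amp;quot;, &amp;quot;LGA&amp;quot;, &amp;quot;JFK&amp;quot;),
  dep_delay = c(10, 30, 5, NA),
  precip = c(0, 0.2, 0, 0.1)
)

# i: filter rows, j: compute a summary, by: group by an evaluated expression
res &amp;lt;- toy[
  origin == &amp;quot;JFK&amp;quot;,
  .(mean_delay = mean(dep_delay, na.rm = TRUE)),
  by = .(rainy = precip &amp;gt; 0)
]
# rainy = FALSE has mean_delay 10, rainy = TRUE has mean_delay 30
res&lt;/code&gt;&lt;/pre&gt;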
&lt;/div&gt;
&lt;div id=&#34;fully-featured-data-wrangling-toolbox&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Fully featured data wrangling toolbox&lt;/h2&gt;
&lt;p&gt;And this is just scratching the surface as data.table also provides functions such as&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dcast()&lt;/code&gt; and &lt;code&gt;melt()&lt;/code&gt; for efficient data reshaping&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rbindlist()&lt;/code&gt; for fast replacement of &lt;code&gt;do.call(&amp;quot;rbind&amp;quot;, l)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fsetdiff()&lt;/code&gt;, &lt;code&gt;fintersect()&lt;/code&gt;, &lt;code&gt;funion()&lt;/code&gt; and &lt;code&gt;fsetequal()&lt;/code&gt; for fast and easy to use operations on data.tables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rollup()&lt;/code&gt;, &lt;code&gt;cube()&lt;/code&gt; and &lt;code&gt;groupingsets()&lt;/code&gt; to create pivot tables, more on that &lt;a href=&#34;https://jozef.io/r912-datatable-grouping-sets/&#34;&gt;in a dedicated article&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
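&lt;p&gt;A short sketch of a few of the functions listed above, run on a small made-up table. The object names &lt;code&gt;wide&lt;/code&gt;, &lt;code&gt;long&lt;/code&gt;, &lt;code&gt;back&lt;/code&gt;, &lt;code&gt;stacked&lt;/code&gt; and &lt;code&gt;remaining&lt;/code&gt; are only illustrative:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

wide &amp;lt;- data.table(id = 1:2, a = c(10, 20), b = c(30, 40))

# Reshape from wide to long and back again
long &amp;lt;- melt(wide, id.vars = &amp;quot;id&amp;quot;)
back &amp;lt;- dcast(long, id ~ variable, value.var = &amp;quot;value&amp;quot;)

# Bind a list of tables by rows
stacked &amp;lt;- rbindlist(list(wide, wide[id == 1]))

# Rows of the first table that are not present in the second one
remaining &amp;lt;- fsetdiff(wide, wide[id == 1])&lt;/code&gt;&lt;/pre&gt;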
&lt;/div&gt;
&lt;div id=&#34;no-dependencies&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;No dependencies&lt;/h2&gt;
&lt;p&gt;All in all, I consider data.table to be a single package that brings speed, efficiency and conciseness to all data wrangling operations. Another benefit that also often stays unmentioned is the fact that data.table has no dependencies on other non-base R packages, which is beneficial for maintenance, stability, reproducibility, size and deployment speeds.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fast-reading-and-writing-of-compressed-csvs&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Fast reading and writing of (compressed) csvs&lt;/h2&gt;
&lt;p&gt;One additional feature of data.table that I use regularly is the ability to read and write data to and from text files with amazing speeds using the &lt;code&gt;fread()&lt;/code&gt; and &lt;code&gt;fwrite()&lt;/code&gt; functions. On one project, it gave the team I was a part of such a benefit I &lt;a href=&#34;https://jozef.io/r917-fread-comparisons/&#34;&gt;wrote an article&lt;/a&gt; on it.&lt;/p&gt;
&lt;p&gt;Not only is it very fast and convenient, but thanks to a recently added feature, data.table now supports &lt;code&gt;fwrite()&lt;/code&gt; directly to gzipped csvs, which saves significant space when writing large amounts of data.&lt;/p&gt;
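&lt;p&gt;As a minimal sketch, assuming a data.table version that includes the gzip support mentioned above: with the default &lt;code&gt;compress = &amp;quot;auto&amp;quot;&lt;/code&gt;, &lt;code&gt;fwrite()&lt;/code&gt; compresses the output when the file name ends with &lt;code&gt;.gz&lt;/code&gt;. The table &lt;code&gt;dt&lt;/code&gt; and the temporary file are made up for illustration, and the file is read back with base R only to check the content:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

dt &amp;lt;- data.table(id = 1:3, value = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;))
tmp &amp;lt;- tempfile(fileext = &amp;quot;.csv.gz&amp;quot;)

# The .gz extension triggers gzip compression with compress = &amp;quot;auto&amp;quot;
fwrite(dt, tmp)

# Read the compressed file back line by line using a base R connection
lines &amp;lt;- readLines(gzfile(tmp))
lines&lt;/code&gt;&lt;/pre&gt;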
&lt;blockquote&gt;
&lt;p&gt;For getting started with data.table, I recommend the &lt;a href=&#34;https://cloud.r-project.org/web/packages/data.table/vignettes/datatable-intro.html&#34;&gt;Introduction to data.table vignette&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-rocker-project-for-r-based-docker-images&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Rocker project for R-based Docker images&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;../img/r920-01-docker-logo.png&#34; alt=&#34;Logo of Docker&#34; class=&#34;leftsmall&#34;&gt;Containerization is a powerful and useful tool for many purposes, one of them being reproducibility. In the R world, ensuring that our R library contains the exact versions of packages we need can be achieved by using tools such as &lt;a href=&#34;https://cran.r-project.org/package=packrat&#34;&gt;packrat&lt;/a&gt; or its successor &lt;a href=&#34;https://cran.r-project.org/package=renv&#34;&gt;renv&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Managing the R package versions can however only get us so far, especially when relying on other system dependencies such as pandoc for rendering our R Markdown documents or Java. And when we need to test our R applications against multiple versions of R itself, things can get very tedious and messy very quickly using just one environment, especially on UNIX-based platforms.&lt;/p&gt;
&lt;p&gt;In comes the &lt;a href=&#34;https://www.rocker-project.org/&#34;&gt;Rocker project - Docker Containers for the R Environment&lt;/a&gt;. Thanks to the efforts of Carl Boettiger, Dirk Eddelbuettel, and Noam Ross, spinning a container with a specific version of R, RStudio or even the tidyverse packages is as easy as launching a terminal and running&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run --rm -ti rocker/r-base&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Want to test your R code using an older version of R, say some &lt;em&gt;Very, Very Secure Dishes&lt;/em&gt; from 2016? As easy as&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run --rm -ti rocker/r-ver:3.2.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even more usefully, all the sources to build the Docker images are also &lt;a href=&#34;https://github.com/rocker-org/rocker&#34;&gt;available on GitHub&lt;/a&gt;, so we can adapt the images for our own usage. For instance&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the series of articles on &lt;a href=&#34;https://jozef.io/tags/spark/&#34;&gt;Using Spark from R for performance with arbitrary code&lt;/a&gt; on this blog uses a setup adapted from the &lt;code&gt;rocker/r-ver:3.6.1&lt;/code&gt; image&lt;/li&gt;
&lt;li&gt;we have also used the images provided by the Rocker project when setting up continuous &lt;a href=&#34;https://jozef.io/r107-multiplatform-gitlabci-rhub/#preparing-a-private-docker-image-to-use-with-r-hub&#34;&gt;multi-platform R package building, checking and testing&lt;/a&gt; with R-Hub&lt;/li&gt;
&lt;li&gt;even to keep the building of this very website stable and reproducible, a Docker image based on the Rocker project is used&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;On a more generic note, learning Docker is beneficial to R users also when working outside R and there are many great learning resources to do so. For learning Docker I recommend the &lt;a href=&#34;https://docs.docker.com/get-started/&#34;&gt;Get started documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;base-package-parallel&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Base package parallel&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;../img/r920-02-r-logo.svg&#34; alt=&#34;Logo of R&#34; class=&#34;leftsmall&#34;&gt;The internals of the R language are single-threaded. This means that unless the code we call is optimized for multi-threaded computation under the hood, as is the case with data.table, our R code will only utilize 1 thread. This can pose a challenge to performance even in common daily tasks, especially now that even very portable ultrabooks come with processors with 4 or more cores and 8 or more threads.&lt;/p&gt;
&lt;p&gt;The R ecosystem provides many ways to take advantage of the multiple threads available. In this post I would like to give more visibility to the parallelization options that come with the base R installation itself, not requiring any extra external dependencies or packages - via the package &lt;code&gt;parallel&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In a very small showcase, let’s look at how much faster we can execute a brute-force-ish solution to the &lt;a href=&#34;https://projecteuler.net/problem=14&#34;&gt;Longest Collatz sequence&lt;/a&gt; problem for the first 10 million numbers. First, define the function that will compute the sequence length for a given integer &lt;code&gt;n&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;col_len &amp;lt;- function(n) {
  len &amp;lt;- 0L
  while (n &amp;gt; 1) {
    len &amp;lt;- len + 1L
    if ((n %% 2) == 0)
      n &amp;lt;- n / 2
    else {
      n &amp;lt;- (n * 3 + 1) / 2
      len &amp;lt;- len + 1L
    }
  }
  len
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the function for numbers from 1 to 9,999,999 using &lt;code&gt;sapply()&lt;/code&gt;, picking the starting number with the longest sequence using &lt;code&gt;which.max()&lt;/code&gt;, and measuring the time on this particular laptop showed that the process finished in around 580 seconds - almost 10 minutes:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;which.max(sapply(seq(from = 1, to = 9999999), col_len))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 8400511&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we will create a simple cluster on the local machine using all available threads and send the function definition to all the created worker processes:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Attach the parallel package
library(parallel)
# Create a cluster using all available threads
cl &amp;lt;- makeCluster(detectCores(), methods = FALSE)
# Send the definition of the col_len function to the workers
clusterExport(cl, &amp;quot;col_len&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we execute the function in parallel using the cluster. It is as simple as just using &lt;code&gt;parSapply()&lt;/code&gt; instead of &lt;code&gt;sapply()&lt;/code&gt; and providing the cluster definition &lt;code&gt;cl&lt;/code&gt; as the first argument:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Execute in parallel using cluster cl
which.max(parSapply(cl, seq(from = 1, to = 9999999), col_len))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 8400511&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the process is done, it is good practice to stop the cluster:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Stopping the cluster
stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using all 8 available threads, the time needed to execute the code and get the same results went down to around 90 seconds, or 1.5 minutes. We can therefore gain significant time savings by executing some of our code in parallel using only base R, adjusting the code very minimally and using very familiar syntax.&lt;/p&gt;
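&lt;p&gt;To try the comparison without waiting for minutes, the following self-contained sketch runs both variants on a much smaller range and measures them with &lt;code&gt;system.time()&lt;/code&gt;. The range of 20,000 numbers and the 2 workers are arbitrary choices for illustration, and the timings will differ between machines. On such a small input the overhead of starting the workers may well outweigh the gain:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(parallel)

# Same function as defined earlier in this post
col_len &amp;lt;- function(n) {
  len &amp;lt;- 0L
  while (n &amp;gt; 1) {
    len &amp;lt;- len + 1L
    if ((n %% 2) == 0)
      n &amp;lt;- n / 2
    else {
      n &amp;lt;- (n * 3 + 1) / 2
      len &amp;lt;- len + 1L
    }
  }
  len
}

nums &amp;lt;- seq(from = 1, to = 20000)

# Sequential run
t_seq &amp;lt;- system.time(res_seq &amp;lt;- sapply(nums, col_len))

# Parallel run on 2 local workers
cl &amp;lt;- makeCluster(2L)
clusterExport(cl, &amp;quot;col_len&amp;quot;, envir = environment())
t_par &amp;lt;- system.time(res_par &amp;lt;- parSapply(cl, nums, col_len))
stopCluster(cl)

# Both approaches should return the same sequence lengths
identical(res_seq, res_par)

# Elapsed times in seconds
c(sequential = t_seq[[&amp;quot;elapsed&amp;quot;]], parallel = t_par[[&amp;quot;elapsed&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;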
&lt;blockquote&gt;
&lt;p&gt;For more information on using the parallel package, I recommend reading the package’s vignette by running &lt;code&gt;vignette(&amp;quot;parallel&amp;quot;)&lt;/code&gt; or reading &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-patched/library/parallel/doc/parallel.pdf&#34;&gt;online&lt;/a&gt;. For more information on High-Performance and Parallel Computing with R, there is a dedicated &lt;a href=&#34;https://cran.r-project.org/web/views/HighPerformanceComputing.html&#34;&gt;CRAN Task View&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;rhub-for-fast-and-automated-multi-platform-r-package-testing&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Rhub for fast and automated multi-platform R package testing&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;../img/r920-03-rhub-logo.png&#34; alt=&#34;Logo of rhub&#34; class=&#34;leftsmall&#34;&gt;&lt;a href=&#34;https://r-hub.github.io/rhub/index.html&#34;&gt;R-hub&lt;/a&gt; offers free R CMD check as a service on different platforms. This enables R developers to quickly and efficiently check their R packages to make sure they pass all necessary checks on several platforms. As a bonus, the checks seem to run in a very short time, which means we can have our results at hand in a few minutes.&lt;/p&gt;
&lt;p&gt;Using R-hub interactively is as simple as installing the &lt;a href=&#34;https://cran.r-project.org/package=rhub&#34;&gt;rhub package&lt;/a&gt; from CRAN, validating your e-mail by running &lt;code&gt;rhub::validate_email()&lt;/code&gt; and running:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cr &amp;lt;- rhub::check()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In an interactive session, this will offer a list of platforms to choose from and check our package against them.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r107-02-gitlab-rhub-run.gif&#34; alt=&#34;CI/CD running checks on multiple platforms with R-hub&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;CI/CD running checks on multiple platforms with R-hub&lt;/p&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;For more introductory information, we recommend the &lt;a href=&#34;https://r-hub.github.io/rhub/articles/rhub.html&#34;&gt;Get started with rhub&lt;/a&gt; article. We have written about automating and continuously executing multiplatform checks using GitLab CI/CD integration and Docker images in a &lt;a href=&#34;https://jozef.io/r107-multiplatform-gitlabci-rhub/&#34;&gt;separate blog post&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;resources&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r907-christmas-praise/&#34;&gt;Christmas praise&lt;/a&gt; post for 2018&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://cloud.r-project.org/web/packages/data.table/vignettes/datatable-intro.html&#34;&gt;Introduction to data.table&lt;/a&gt; vignette&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://docs.docker.com/get-started/&#34;&gt;Get started&lt;/a&gt; Docker documentation&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-patched/library/parallel/doc/parallel.pdf&#34;&gt;Parallel package&lt;/a&gt; vignette&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://r-hub.github.io/rhub/articles/rhub.html&#34;&gt;Get started with rhub&lt;/a&gt; article&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote class=&#34;xmas&#34;&gt;
Thank you for reading and&lt;br /&gt; have a very merry Christmas :o)
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Using Spark from R for performance with arbitrary code - Part 5 - Exploring the invoke API from R with Java reflection and examining invokes with logs</title>
      <link>https://jozef.io/r205-spark-r-invoke-scala-2/</link>
      <pubDate>Sat, 23 Nov 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r205-spark-r-invoke-scala-2/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the previous parts of this series, we have shown how to write functions as &lt;a href=&#34;https://jozef.io/r202-spark-r-dplyr-verbs/&#34;&gt;combinations of dplyr verbs&lt;/a&gt; and as &lt;a href=&#34;https://jozef.io/r203-spark-r-sql/&#34;&gt;SQL query generators&lt;/a&gt; that can be executed by Spark, and &lt;a href=&#34;https://jozef.io/r204-spark-r-invoke-scala/&#34;&gt;how to use the lower-level API&lt;/a&gt; to invoke methods on Java object references from R.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this fifth part, we will look into more details around sparklyr’s &lt;code&gt;invoke()&lt;/code&gt; API, investigate available methods for different classes of objects using the Java reflection API and look under the hood of the sparklyr interface mechanism with invoke logging.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#preparation&#34;&gt;Preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#examining-available-methods-from-r&#34;&gt;Examining available methods from R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-the-java-reflection-api-to-list-the-available-methods&#34;&gt;Using the Java reflection API to list the available methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#investigating-dataset-and-sparkcontext-class-methods&#34;&gt;Investigating DataSet and SparkContext class methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#how-sparklyr-communicates-with-spark-invoke-logging&#34;&gt;How sparklyr communicates with Spark, invoke logging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;preparation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Preparation&lt;/h1&gt;
&lt;p&gt;The full setup of Spark and sparklyr is not in the scope of this post. Please check the &lt;a href=&#34;https://jozef.io/r201-spark-r-1/#setting-up-spark-with-r-and-sparklyr&#34;&gt;first part&lt;/a&gt; of the series for some setup instructions and a ready-made Docker image.&lt;/p&gt;
&lt;p&gt;If you have docker available, running&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;should make RStudio available by navigating to &lt;a href=&#34;http://localhost:8787&#34;&gt;http://localhost:8787&lt;/a&gt; in your browser. You can then use the user name &lt;code&gt;rstudio&lt;/code&gt; and password &lt;code&gt;pass&lt;/code&gt; to log in and continue experimenting with the code in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Connect and copy the flights dataset to the instance
sc &amp;lt;- sparklyr::spark_connect(master = &amp;quot;local&amp;quot;)
tbl_flights &amp;lt;- dplyr::copy_to(sc, nycflights13::flights, &amp;quot;flights&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;examining-available-methods-from-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Examining available methods from R&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;If you have not done so yet, it is recommended to read the &lt;a href=&#34;https://jozef.io/r204-spark-r-invoke-scala/&#34;&gt;previous part&lt;/a&gt; of this series before this one to get a quick overview of the &lt;code&gt;invoke()&lt;/code&gt; API.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;using-the-java-reflection-api-to-list-the-available-methods&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using the Java reflection API to list the available methods&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;invoke()&lt;/code&gt; interface is powerful, but also a bit hidden from the eyes as we do not immediately know what methods are available for which object classes. We can circumvent that using the &lt;code&gt;getMethods&lt;/code&gt; method which (in short) returns an array of Method objects reflecting public member methods of the class.&lt;/p&gt;
&lt;p&gt;For instance, retrieving a list of methods for the &lt;code&gt;org.apache.spark.SparkContext&lt;/code&gt; class:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mthds &amp;lt;- sc %&amp;gt;% spark_context() %&amp;gt;%
  invoke(&amp;quot;getClass&amp;quot;) %&amp;gt;%
  invoke(&amp;quot;getMethods&amp;quot;)
head(mthds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## &amp;lt;jobj[55]&amp;gt;
##   java.lang.reflect.Method
##   public org.apache.spark.util.CallSite org.apache.spark.SparkContext.org$apache$spark$SparkContext$$creationSite()
## 
## [[2]]
## &amp;lt;jobj[56]&amp;gt;
##   java.lang.reflect.Method
##   public org.apache.spark.SparkConf org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_conf()
## 
## [[3]]
## &amp;lt;jobj[57]&amp;gt;
##   java.lang.reflect.Method
##   public org.apache.spark.SparkEnv org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_env()
## 
## [[4]]
## &amp;lt;jobj[58]&amp;gt;
##   java.lang.reflect.Method
##   public scala.Option org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_progressBar()
## 
## [[5]]
## &amp;lt;jobj[59]&amp;gt;
##   java.lang.reflect.Method
##   public scala.Option org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_ui()
## 
## [[6]]
## &amp;lt;jobj[60]&amp;gt;
##   java.lang.reflect.Method
##   public org.apache.spark.rpc.RpcEndpointRef org.apache.spark.SparkContext.org$apache$spark$SparkContext$$_heartbeatReceiver()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see that the &lt;code&gt;invoke()&lt;/code&gt; chain has returned a list of Java object references, each of them of class &lt;code&gt;java.lang.reflect.Method&lt;/code&gt;. This is a good result, but the output is not very user-friendly from the R user perspective. Let us write a small wrapper that will return some of the method’s details in a more readable fashion, for instance the return type and an overview of parameters:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;getMethodDetails &amp;lt;- function(mthd) {
  returnType &amp;lt;- mthd %&amp;gt;% invoke(&amp;quot;getReturnType&amp;quot;) %&amp;gt;% invoke(&amp;quot;toString&amp;quot;)
  params &amp;lt;- mthd %&amp;gt;% invoke(&amp;quot;getParameters&amp;quot;)
  params &amp;lt;- vapply(params, invoke, &amp;quot;toString&amp;quot;, FUN.VALUE = character(1))
  c(returnType = returnType, params = paste(params, collapse = &amp;quot;, &amp;quot;))
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, to get a nice overview, we can make another helper function that will return a named list of methods for an object’s class, including their return types and overview of parameters:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;getAvailableMethods &amp;lt;- function(jobj) {
  mthds &amp;lt;- jobj %&amp;gt;% invoke(&amp;quot;getClass&amp;quot;) %&amp;gt;% invoke(&amp;quot;getMethods&amp;quot;)
  nms &amp;lt;- vapply(mthds, invoke, &amp;quot;getName&amp;quot;, FUN.VALUE = character(1))
  res &amp;lt;- lapply(mthds, getMethodDetails)
  names(res) &amp;lt;- nms
  res
}&lt;/code&gt;&lt;/pre&gt;
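&lt;p&gt;The named list returned by &lt;code&gt;getAvailableMethods()&lt;/code&gt; can be a bit unwieldy to browse. One option is to flatten it into a data frame. The sketch below is not part of sparklyr. The helper &lt;code&gt;methodsToDf()&lt;/code&gt; is hypothetical, and &lt;code&gt;mockMethods&lt;/code&gt; is a hand-written stand-in with the same shape as the real output, so that the code runs without a Spark connection. The values in it are illustrative only:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hand-written stand-in for the output of getAvailableMethods(),
# the values are illustrative and not retrieved from Spark
mockMethods &amp;lt;- list(
  count = c(returnType = &amp;quot;long&amp;quot;, params = &amp;quot;&amp;quot;),
  agg = c(returnType = &amp;quot;class A&amp;quot;, params = &amp;quot;java.lang.String arg0&amp;quot;),
  agg = c(returnType = &amp;quot;class A&amp;quot;, params = &amp;quot;java.lang.Object arg0&amp;quot;)
)

# Hypothetical helper: one row per method signature
methodsToDf &amp;lt;- function(mthds) {
  getField &amp;lt;- function(field) {
    vapply(mthds, `[[`, field, FUN.VALUE = character(1), USE.NAMES = FALSE)
  }
  data.frame(
    name = names(mthds),
    returnType = getField(&amp;quot;returnType&amp;quot;),
    params = getField(&amp;quot;params&amp;quot;),
    stringsAsFactors = FALSE
  )
}

methodsDf &amp;lt;- methodsToDf(mockMethods)

# Show all signatures recorded for a method name
methodsDf[methodsDf$name == &amp;quot;agg&amp;quot;, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With a live connection, the same helper could be applied to the real result, for instance &lt;code&gt;methodsToDf(getAvailableMethods(spark_context(sc)))&lt;/code&gt;.&lt;/p&gt;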
&lt;/div&gt;
&lt;div id=&#34;investigating-dataset-and-sparkcontext-class-methods&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Investigating DataSet and SparkContext class methods&lt;/h1&gt;
&lt;p&gt;Using the function defined above, we can explore the methods available to a DataFrame reference, showing a few of the names first:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dfMethods &amp;lt;- tbl_flights %&amp;gt;% spark_dataframe() %&amp;gt;%
  getAvailableMethods()

# Show some method names:
dfMethodNames &amp;lt;- sort(unique(names(dfMethods)))
head(dfMethodNames, 20)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;agg&amp;quot;                           &amp;quot;alias&amp;quot;                        
##  [3] &amp;quot;apply&amp;quot;                         &amp;quot;as&amp;quot;                           
##  [5] &amp;quot;cache&amp;quot;                         &amp;quot;checkpoint&amp;quot;                   
##  [7] &amp;quot;coalesce&amp;quot;                      &amp;quot;col&amp;quot;                          
##  [9] &amp;quot;collect&amp;quot;                       &amp;quot;collectAsArrowToPython&amp;quot;       
## [11] &amp;quot;collectAsList&amp;quot;                 &amp;quot;collectToPython&amp;quot;              
## [13] &amp;quot;colRegex&amp;quot;                      &amp;quot;columns&amp;quot;                      
## [15] &amp;quot;count&amp;quot;                         &amp;quot;createGlobalTempView&amp;quot;         
## [17] &amp;quot;createOrReplaceGlobalTempView&amp;quot; &amp;quot;createOrReplaceTempView&amp;quot;      
## [19] &amp;quot;createTempView&amp;quot;                &amp;quot;crossJoin&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we would like to see more details, we can now investigate further, for instance by showing the different parameter interfaces of the &lt;code&gt;agg&lt;/code&gt; method:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sort(vapply(
  dfMethods[names(dfMethods) == &amp;quot;agg&amp;quot;], 
  `[[`, &amp;quot;params&amp;quot;,
  FUN.VALUE = character(1)
))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                                                                                                                                  agg 
##                                                                             &amp;quot;java.util.Map&amp;lt;java.lang.String, java.lang.String&amp;gt; arg0&amp;quot; 
##                                                                                                                                  agg 
##                                                              &amp;quot;org.apache.spark.sql.Column arg0, org.apache.spark.sql.Column... arg1&amp;quot; 
##                                                                                                                                  agg 
##                                           &amp;quot;org.apache.spark.sql.Column arg0, scala.collection.Seq&amp;lt;org.apache.spark.sql.Column&amp;gt; arg1&amp;quot; 
##                                                                                                                                  agg 
##                                                            &amp;quot;scala.collection.immutable.Map&amp;lt;java.lang.String, java.lang.String&amp;gt; arg0&amp;quot; 
##                                                                                                                                  agg 
## &amp;quot;scala.Tuple2&amp;lt;java.lang.String, java.lang.String&amp;gt; arg0, scala.collection.Seq&amp;lt;scala.Tuple2&amp;lt;java.lang.String, java.lang.String&amp;gt;&amp;gt; arg1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, we can look at a &lt;code&gt;SparkContext&lt;/code&gt; class and show some available methods that can be invoked:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scMethods &amp;lt;- sc %&amp;gt;% spark_context() %&amp;gt;%
  getAvailableMethods()
scMethodNames &amp;lt;- sort(unique(names(scMethods)))
head(scMethodNames, 60)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;$lessinit$greater$default$3&amp;quot; &amp;quot;$lessinit$greater$default$4&amp;quot;
##  [3] &amp;quot;$lessinit$greater$default$5&amp;quot; &amp;quot;accumulable&amp;quot;                
##  [5] &amp;quot;accumulableCollection&amp;quot;       &amp;quot;accumulator&amp;quot;                
##  [7] &amp;quot;addedFiles&amp;quot;                  &amp;quot;addedJars&amp;quot;                  
##  [9] &amp;quot;addFile&amp;quot;                     &amp;quot;addJar&amp;quot;                     
## [11] &amp;quot;addSparkListener&amp;quot;            &amp;quot;applicationAttemptId&amp;quot;       
## [13] &amp;quot;applicationId&amp;quot;               &amp;quot;appName&amp;quot;                    
## [15] &amp;quot;assertNotStopped&amp;quot;            &amp;quot;binaryFiles&amp;quot;                
## [17] &amp;quot;binaryFiles$default$2&amp;quot;       &amp;quot;binaryRecords&amp;quot;              
## [19] &amp;quot;binaryRecords$default$3&amp;quot;     &amp;quot;broadcast&amp;quot;                  
## [21] &amp;quot;cancelAllJobs&amp;quot;               &amp;quot;cancelJob&amp;quot;                  
## [23] &amp;quot;cancelJobGroup&amp;quot;              &amp;quot;cancelStage&amp;quot;                
## [25] &amp;quot;checkpointDir&amp;quot;               &amp;quot;checkpointDir_$eq&amp;quot;          
## [27] &amp;quot;checkpointFile&amp;quot;              &amp;quot;clean&amp;quot;                      
## [29] &amp;quot;clean$default$2&amp;quot;             &amp;quot;cleaner&amp;quot;                    
## [31] &amp;quot;clearCallSite&amp;quot;               &amp;quot;clearJobGroup&amp;quot;              
## [33] &amp;quot;collectionAccumulator&amp;quot;       &amp;quot;conf&amp;quot;                       
## [35] &amp;quot;createSparkEnv&amp;quot;              &amp;quot;dagScheduler&amp;quot;               
## [37] &amp;quot;dagScheduler_$eq&amp;quot;            &amp;quot;defaultMinPartitions&amp;quot;       
## [39] &amp;quot;defaultParallelism&amp;quot;          &amp;quot;deployMode&amp;quot;                 
## [41] &amp;quot;doubleAccumulator&amp;quot;           &amp;quot;emptyRDD&amp;quot;                   
## [43] &amp;quot;env&amp;quot;                         &amp;quot;equals&amp;quot;                     
## [45] &amp;quot;eventLogCodec&amp;quot;               &amp;quot;eventLogDir&amp;quot;                
## [47] &amp;quot;eventLogger&amp;quot;                 &amp;quot;executorAllocationManager&amp;quot;  
## [49] &amp;quot;executorEnvs&amp;quot;                &amp;quot;executorMemory&amp;quot;             
## [51] &amp;quot;files&amp;quot;                       &amp;quot;getAllPools&amp;quot;                
## [53] &amp;quot;getCallSite&amp;quot;                 &amp;quot;getCheckpointDir&amp;quot;           
## [55] &amp;quot;getClass&amp;quot;                    &amp;quot;getConf&amp;quot;                    
## [57] &amp;quot;getExecutorIds&amp;quot;              &amp;quot;getExecutorMemoryStatus&amp;quot;    
## [59] &amp;quot;getExecutorThreadDump&amp;quot;       &amp;quot;getLocalProperties&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;using-helpers-to-explore-the-methods&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using helpers to explore the methods&lt;/h2&gt;
&lt;p&gt;We can also use the helper functions to investigate more. For instance, we see that there is a &lt;code&gt;getConf&lt;/code&gt; method available to us. Looking at the object reference, however, does not provide useful information, so we can list the methods for that class and look for &lt;code&gt;&amp;quot;get&amp;quot;&lt;/code&gt; methods that would show us the configuration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spark_conf &amp;lt;- sc %&amp;gt;% spark_context() %&amp;gt;% invoke(&amp;quot;conf&amp;quot;)
spark_conf_methods &amp;lt;- spark_conf %&amp;gt;% getAvailableMethods() 
spark_conf_get_methods &amp;lt;- spark_conf_methods %&amp;gt;%
  names() %&amp;gt;%
  grep(pattern = &amp;quot;get&amp;quot;, ., value = TRUE) %&amp;gt;%
  sort()
spark_conf_get_methods&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;get&amp;quot;                 &amp;quot;get&amp;quot;                 &amp;quot;get&amp;quot;                
##  [4] &amp;quot;getAll&amp;quot;              &amp;quot;getAllWithPrefix&amp;quot;    &amp;quot;getAppId&amp;quot;           
##  [7] &amp;quot;getAvroSchema&amp;quot;       &amp;quot;getBoolean&amp;quot;          &amp;quot;getClass&amp;quot;           
## [10] &amp;quot;getDeprecatedConfig&amp;quot; &amp;quot;getDouble&amp;quot;           &amp;quot;getenv&amp;quot;             
## [13] &amp;quot;getExecutorEnv&amp;quot;      &amp;quot;getInt&amp;quot;              &amp;quot;getLong&amp;quot;            
## [16] &amp;quot;getOption&amp;quot;           &amp;quot;getSizeAsBytes&amp;quot;      &amp;quot;getSizeAsBytes&amp;quot;     
## [19] &amp;quot;getSizeAsBytes&amp;quot;      &amp;quot;getSizeAsGb&amp;quot;         &amp;quot;getSizeAsGb&amp;quot;        
## [22] &amp;quot;getSizeAsKb&amp;quot;         &amp;quot;getSizeAsKb&amp;quot;         &amp;quot;getSizeAsMb&amp;quot;        
## [25] &amp;quot;getSizeAsMb&amp;quot;         &amp;quot;getTimeAsMs&amp;quot;         &amp;quot;getTimeAsMs&amp;quot;        
## [28] &amp;quot;getTimeAsSeconds&amp;quot;    &amp;quot;getTimeAsSeconds&amp;quot;    &amp;quot;getWithSubstitution&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that there is a &lt;code&gt;getAll&lt;/code&gt; method that could prove useful, returning a list of tuples and taking no arguments as input:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Returns a list of tuples, takes no arguments:
spark_conf_methods[[&amp;quot;getAll&amp;quot;]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              returnType                  params 
## &amp;quot;class [Lscala.Tuple2;&amp;quot;                      &amp;quot;&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Invoke the `getAll` method and look at part of the result
spark_confs &amp;lt;- spark_conf %&amp;gt;% invoke(&amp;quot;getAll&amp;quot;)
spark_confs &amp;lt;- vapply(spark_confs, invoke, &amp;quot;toString&amp;quot;, FUN.VALUE = character(1))
sort(spark_confs)[c(2, 3, 12, 14)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;(spark.app.name,sparklyr)&amp;quot;         &amp;quot;(spark.driver.host,localhost)&amp;quot;    
## [3] &amp;quot;(spark.spark.port.maxRetries,128)&amp;quot; &amp;quot;(spark.sql.shuffle.partitions,2)&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Looking at &lt;a href=&#34;https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/SparkConf.html#getAll()&#34;&gt;the Scala documentation for the &lt;code&gt;getAll&lt;/code&gt; method&lt;/a&gt;, we actually see that our helper output is missing some information: the classes of the objects in the tuple, which in this case is &lt;code&gt;scala.Tuple2&amp;lt;java.lang.String,java.lang.String&amp;gt;[]&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We could therefore improve our helper to be more detailed in the return value information.&lt;/p&gt;
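&lt;p&gt;One possible direction, shown here only as a sketch: the &lt;code&gt;toGenericString&lt;/code&gt; method of &lt;code&gt;java.lang.reflect.Method&lt;/code&gt; describes a method including its generic type parameters. The helper name &lt;code&gt;getGenericSignatures&lt;/code&gt; is hypothetical and not part of sparklyr, and the code assumes the &lt;code&gt;spark_conf&lt;/code&gt; object created above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: generic signatures of all public methods of a Java object
getGenericSignatures &amp;lt;- function(jobj) {
  # Retrieve the java.lang.reflect.Method objects via reflection
  mthds &amp;lt;- jobj %&amp;gt;% invoke(&amp;quot;getClass&amp;quot;) %&amp;gt;% invoke(&amp;quot;getMethods&amp;quot;)
  # toGenericString() includes generic type information in the description
  res &amp;lt;- vapply(mthds, invoke, &amp;quot;toGenericString&amp;quot;, FUN.VALUE = character(1))
  names(res) &amp;lt;- vapply(mthds, invoke, &amp;quot;getName&amp;quot;, FUN.VALUE = character(1))
  res
}

# Look at the generic signature of getAll only
spark_conf_signatures &amp;lt;- getGenericSignatures(spark_conf)
spark_conf_signatures[names(spark_conf_signatures) == &amp;quot;getAll&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returns one string per method rather than separate return type and parameter elements, so it complements the helper above rather than replacing it.&lt;/p&gt;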
&lt;/div&gt;
&lt;div id=&#34;unexported-helpers-provided-by-sparklyr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Unexported helpers provided by sparklyr&lt;/h2&gt;
&lt;p&gt;The sparklyr package itself provides facilities similar in nature to those above. Let us look at some of them, even though they are not exported:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sparklyr:::jobj_class(spark_conf)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;SparkConf&amp;quot; &amp;quot;Object&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sparklyr:::jobj_info(spark_conf)$class&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;org.apache.spark.SparkConf&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;capture.output(sparklyr:::jobj_inspect(spark_conf)) %&amp;gt;% head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;&amp;lt;jobj[1645]&amp;gt;&amp;quot;                                                                                                                   
##  [2] &amp;quot;  org.apache.spark.SparkConf&amp;quot;                                                                                                   
##  [3] &amp;quot;  org.apache.spark.SparkConf@7ec389e7&amp;quot;                                                                                          
##  [4] &amp;quot;Fields:&amp;quot;                                                                                                                        
##  [5] &amp;quot;&amp;lt;jobj[2490]&amp;gt;&amp;quot;                                                                                                                   
##  [6] &amp;quot;  java.lang.reflect.Field&amp;quot;                                                                                                      
##  [7] &amp;quot;  private final java.util.concurrent.ConcurrentHashMap org.apache.spark.SparkConf.org$apache$spark$SparkConf$$settings&amp;quot;         
##  [8] &amp;quot;&amp;lt;jobj[2491]&amp;gt;&amp;quot;                                                                                                                   
##  [9] &amp;quot;  java.lang.reflect.Field&amp;quot;                                                                                                      
## [10] &amp;quot;  private transient org.apache.spark.internal.config.ConfigReader org.apache.spark.SparkConf.org$apache$spark$SparkConf$$reader&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;how-sparklyr-communicates-with-spark-invoke-logging&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How sparklyr communicates with Spark, invoke logging&lt;/h1&gt;
&lt;p&gt;Now that we have an overview of the &lt;code&gt;invoke()&lt;/code&gt; interface, we can take a look under the hood of sparklyr and see how it actually communicates with the Spark instance. In fact, the communication is a set of invocations that can be very different depending on which of the approaches we choose for our purposes.&lt;/p&gt;
&lt;p&gt;To obtain the information, we use the &lt;code&gt;sparklyr.log.invoke&lt;/code&gt; property. We can choose one of the following 3 values based on our preferences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;TRUE&lt;/code&gt; will use &lt;code&gt;message()&lt;/code&gt; to communicate short info on what is being invoked&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;quot;cat&amp;quot;&lt;/code&gt; will use &lt;code&gt;cat()&lt;/code&gt; to communicate short info on what is being invoked&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;quot;callstack&amp;quot;&lt;/code&gt; will use &lt;code&gt;message()&lt;/code&gt; to communicate short info on what is being invoked and the callstack&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will use &lt;code&gt;TRUE&lt;/code&gt; in our article to keep the output short and easily manageable. First, we will close the previous connection and create a new one with a configuration that has &lt;code&gt;sparklyr.log.invoke&lt;/code&gt; set to &lt;code&gt;TRUE&lt;/code&gt;, and copy in the flights dataset:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sparklyr::spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NULL&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;config &amp;lt;- sparklyr::spark_config()
config$sparklyr.log.invoke &amp;lt;- TRUE
suppressMessages({
  sc &amp;lt;- sparklyr::spark_connect(master = &amp;quot;local&amp;quot;, config = config)
  tbl_flights &amp;lt;- dplyr::copy_to(sc, nycflights13::flights, &amp;quot;flights&amp;quot;)
})&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;using-dplyr-verbs-translated-with-dbplyr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using dplyr verbs translated with dbplyr&lt;/h2&gt;
&lt;p&gt;Now that the setup is complete, we use the dplyr verb approach to retrieve the count of rows and look at the invocations that this entails:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights %&amp;gt;% dplyr::count()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sql
## Invoking sql&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking columns&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking isStreaming&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sql&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking isStreaming&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sql&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sparklyr.Utils collect&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking columns&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking schema&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking fields&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking dataType&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking toString&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking name&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sql&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking columns&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##        n
##    &amp;lt;dbl&amp;gt;
## 1 336776&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see multiple invocations of the &lt;code&gt;sql&lt;/code&gt; method and also of the &lt;code&gt;columns&lt;/code&gt; method. This makes sense, since the dplyr verb approach works by translating the commands into Spark SQL via dbplyr and then sending those translated commands to Spark via that interface.&lt;/p&gt;
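&lt;p&gt;To inspect the SQL that this translation produces, one option is &lt;code&gt;show_query()&lt;/code&gt; from dplyr. A minimal sketch, assuming the &lt;code&gt;tbl_flights&lt;/code&gt; object created above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Print the Spark SQL that dbplyr generates for the same pipeline
tbl_flights %&amp;gt;%
  dplyr::count() %&amp;gt;%
  dplyr::show_query()&lt;/code&gt;&lt;/pre&gt;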
&lt;/div&gt;
&lt;div id=&#34;using-dbi-to-send-queries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using DBI to send queries&lt;/h2&gt;
&lt;p&gt;Similarly, we can investigate the invocations that happen when we try to retrieve the same results via the DBI interface:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;DBI::dbGetQuery(sc, &amp;quot;SELECT count(1) AS n FROM flights&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sql&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking isStreaming&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sparklyr.Utils collect&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking columns&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking schema&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking fields&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking dataType&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking toString&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking name&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        n
## 1 336776&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see slightly fewer invocations compared to the above dplyr approach, but the output is also less processed.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-the-invoke-interface&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the invoke interface&lt;/h2&gt;
&lt;p&gt;Looking at the invocations that get executed using the &lt;code&gt;invoke()&lt;/code&gt; interface:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights %&amp;gt;% spark_dataframe() %&amp;gt;% invoke(&amp;quot;count&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking sql&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Invoking count&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 336776&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the number of invocations is much lower: the first invocation, &lt;code&gt;sql&lt;/code&gt;, comes from the first part of the pipe, and the &lt;code&gt;invoke(&amp;quot;count&amp;quot;)&lt;/code&gt; part translates to exactly one invocation of the &lt;code&gt;count&lt;/code&gt; method. We therefore see that the &lt;code&gt;invoke()&lt;/code&gt; interface is indeed a lower-level interface that invokes methods as we request them, with little to no overhead related to translations and other effects.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;redirecting-the-invoke-logs&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Redirecting the invoke logs&lt;/h2&gt;
&lt;p&gt;When running R applications that use Spark as a calculation engine, it is useful to get detailed invoke logs for debugging and diagnostic purposes. When implementing such mechanisms, we need to consider how R handles the invoke logs produced by sparklyr. In simple terms, the invoke logs produced when using&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;TRUE&lt;/code&gt; and &lt;code&gt;&amp;quot;callstack&amp;quot;&lt;/code&gt; are created using &lt;code&gt;message()&lt;/code&gt;, which means they get sent to the &lt;code&gt;stderr()&lt;/code&gt; connection by default&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;quot;cat&amp;quot;&lt;/code&gt; are created using &lt;code&gt;cat()&lt;/code&gt;, so they get sent to the &lt;code&gt;stdout()&lt;/code&gt; connection by default&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This info can prove useful when redirecting the log information from standard output and standard error to different logging targets.&lt;/p&gt;
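&lt;p&gt;As an illustration of the &lt;code&gt;message()&lt;/code&gt; case, the sketch below appends all messages raised while evaluating an expression to a file instead of printing them. The helper name &lt;code&gt;with_messages_to_file&lt;/code&gt; and the log path are hypothetical, and this is only one of several ways to redirect messages in R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: evaluate expr and append any message() output to a file
with_messages_to_file &amp;lt;- function(expr, path) {
  withCallingHandlers(
    expr,
    message = function(m) {
      # conditionMessage() of a message already ends with a newline
      cat(conditionMessage(m), file = path, append = TRUE)
      # Prevent the message from also being printed to stderr()
      invokeRestart(&amp;quot;muffleMessage&amp;quot;)
    }
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;code&gt;sparklyr.log.invoke&lt;/code&gt; set to &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;&amp;quot;callstack&amp;quot;&lt;/code&gt;, a call such as &lt;code&gt;with_messages_to_file(tbl_flights %&amp;gt;% spark_dataframe() %&amp;gt;% invoke(&amp;quot;count&amp;quot;), &amp;quot;invoke.log&amp;quot;)&lt;/code&gt; should then write the invoke logs to the file while still returning the result of the expression.&lt;/p&gt;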
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r201-01-spark-and-r.png&#34; alt=&#34;Apache Spark and R logos&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Apache Spark and R logos&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In this part of the series, we have looked at using the Java reflection API with sparklyr’s &lt;code&gt;invoke()&lt;/code&gt; interface to get useful insight on available methods for different object types that can be used in the context of Spark, but also in other contexts. Using invoke logging, we have also shown how the different sparklyr interfacing methods communicate with Spark under the hood.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r201-spark-r-1/&#34;&gt;first part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r202-spark-r-dplyr-verbs/&#34;&gt;second part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r203-spark-r-sql/&#34;&gt;third part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r204-spark-r-invoke-scala/&#34;&gt;fourth part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;A &lt;a href=&#34;https://hub.docker.com/r/jozefhajnala/sparkly&#34;&gt;Docker image&lt;/a&gt; with R, Spark, sparklyr and Arrow available and &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/sparkly/Dockerfile&#34;&gt;its Dockerfile&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Stackoverflow &lt;a href=&#34;https://stackoverflow.com/questions/37628/what-is-reflection-and-why-is-it-useful&#34;&gt;discussion of reflection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Using Spark from R for performance with arbitrary code - Part 4 - Using the lower-level invoke API to manipulate Spark&#39;s Java objects from R</title>
      <link>https://jozef.io/r204-spark-r-invoke-scala/</link>
      <pubDate>Sat, 09 Nov 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r204-spark-r-invoke-scala/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the previous parts of this series, we have shown how to write functions as both &lt;a href=&#34;https://jozef.io/r202-spark-r-dplyr-verbs/&#34;&gt;combinations of dplyr verbs&lt;/a&gt; and &lt;a href=&#34;https://jozef.io/r203-spark-r-sql/&#34;&gt;SQL query generators&lt;/a&gt; that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this fourth part, we will look at how to write R functions that interface with Spark via a lower-level invocation API that lets us use all the functionality that is exposed by the Scala Spark APIs. We will also show how such R calls relate to Scala code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#contents&#34;&gt;Contents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#preparation&#34;&gt;Preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-invoke-api-of-sparklyr&#34;&gt;The invoke() API of sparklyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#getting-started-with-the-invoke-api&#34;&gt;Getting started with the invoke API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#grouping-and-aggregation-with-invoke-chains&#34;&gt;Grouping and aggregation with invoke chains&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#wrapping-the-invocations-into-r-functions&#34;&gt;Wrapping the invocations into R functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#reconstructing-variable-normalization&#34;&gt;Reconstructing variable normalization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#where-invoke-can-be-better-than-dplyr-translation-or-sql&#34;&gt;Where invoke can be better than dplyr translation or SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;preparation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Preparation&lt;/h1&gt;
&lt;p&gt;The full setup of Spark and sparklyr is out of scope for this post; please check the &lt;a href=&#34;https://jozef.io/r201-spark-r-1/#setting-up-spark-with-r-and-sparklyr&#34;&gt;first one&lt;/a&gt; for some setup instructions and a ready-made Docker image.&lt;/p&gt;
&lt;p&gt;If you have docker available, running&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;should make RStudio available by navigating to &lt;a href=&#34;http://localhost:8787&#34;&gt;http://localhost:8787&lt;/a&gt; in your browser. You can then use the user name &lt;code&gt;rstudio&lt;/code&gt; and password &lt;code&gt;pass&lt;/code&gt; to log in and continue experimenting with the code in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather &amp;lt;- nycflights13::weather %&amp;gt;%
  mutate(id = 1L:nrow(nycflights13::weather)) %&amp;gt;% 
  select(id, everything())

# Connect
sc &amp;lt;- sparklyr::spark_connect(master = &amp;quot;local&amp;quot;)

# Copy the weather dataset to the instance
tbl_weather &amp;lt;- dplyr::copy_to(
  dest = sc, 
  df = weather,
  name = &amp;quot;weather&amp;quot;,
  overwrite = TRUE
)
# Copy the flights dataset to the instance
tbl_flights &amp;lt;- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::flights,
  name = &amp;quot;flights&amp;quot;,
  overwrite = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-invoke-api-of-sparklyr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The invoke() API of sparklyr&lt;/h1&gt;
&lt;p&gt;So far when interfacing with Spark from R, we have used the sparklyr package in three ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing combinations of dplyr verbs that would be translated to Spark SQL via the dbplyr package and the SQL executed by Spark when requested&lt;/li&gt;
&lt;li&gt;Generating Spark SQL code directly and sending it for execution in multiple ways&lt;/li&gt;
&lt;li&gt;Combinations of the above two methods&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What these methods have in common is that they translate operations written in R to Spark SQL and that SQL code is then sent for execution by our Spark instance.&lt;/p&gt;
&lt;p&gt;There is however another approach that we can use with sparklyr, which will be more familiar to users or developers who have worked with &lt;a href=&#34;https://jozef.io/r901-primer-java-from-r-1/&#34;&gt;packages like rJava&lt;/a&gt; or rscala before. Even though arguably less convenient than the APIs provided by the 2 aforementioned packages, sparklyr provides an invocation API that exposes 3 functions:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;invoke(jobj, method, ...)&lt;/code&gt; to execute a method on a Java object reference&lt;/li&gt;
&lt;li&gt;&lt;code&gt;invoke_static(sc, class, method, ...)&lt;/code&gt; to execute a static method associated with a Java class&lt;/li&gt;
&lt;li&gt;&lt;code&gt;invoke_new(sc, class, ...)&lt;/code&gt; to invoke a constructor associated with a Java class&lt;/li&gt;
&lt;/ol&gt;
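&lt;p&gt;As a minimal sketch of the three functions side by side, assuming an open connection &lt;code&gt;sc&lt;/code&gt; and using plain JDK classes chosen purely for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# invoke_new(): call a constructor, here java.math.BigInteger(String)
big_int &amp;lt;- invoke_new(sc, &amp;quot;java.math.BigInteger&amp;quot;, &amp;quot;1000000000&amp;quot;)

# invoke(): call a method on the returned Java object reference
big_int %&amp;gt;% invoke(&amp;quot;longValue&amp;quot;)

# invoke_static(): call a static method, here java.lang.Math.hypot(double, double)
invoke_static(sc, &amp;quot;java.lang.Math&amp;quot;, &amp;quot;hypot&amp;quot;, 10, 20)&lt;/code&gt;&lt;/pre&gt;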
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r201-01-spark-and-r.png&#34; alt=&#34;Apache Spark and R logos&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Apache Spark and R logos&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Let us have a look at how we can use those functions in practice to efficiently work with Spark from R.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-started-with-the-invoke-api&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting started with the invoke API&lt;/h1&gt;
&lt;p&gt;We can start with a few very simple examples of &lt;code&gt;invoke()&lt;/code&gt; usage, for instance getting the number of rows of the &lt;code&gt;tbl_flights&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the count of rows
tbl_flights %&amp;gt;% spark_dataframe() %&amp;gt;%
  invoke(&amp;quot;count&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 336776&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see one extra operation before invoking the count: &lt;code&gt;spark_dataframe()&lt;/code&gt;. This is because the &lt;code&gt;invoke()&lt;/code&gt; interface works with Java object references and not &lt;code&gt;tbl&lt;/code&gt; objects in remote sources such as &lt;code&gt;tbl_flights&lt;/code&gt;. We, therefore, need to convert &lt;code&gt;tbl_flights&lt;/code&gt; to a Java object reference, for which we use the &lt;code&gt;spark_dataframe()&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;Now, for something more exciting, let us compute a summary of the variables in &lt;code&gt;tbl_flights&lt;/code&gt; using the &lt;code&gt;describe&lt;/code&gt; method:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights_summary &amp;lt;- tbl_flights %&amp;gt;% spark_dataframe() %&amp;gt;%
  invoke(&amp;quot;describe&amp;quot;, as.list(colnames(tbl_flights))) %&amp;gt;%
  sdf_register()
tbl_flights_summary&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 19]
##   summary year  month day   dep_time sched_dep_time dep_delay arr_time
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;   
## 1 count   3367… 3367… 3367… 328521   336776         328521    328063  
## 2 mean    2013… 6.54… 15.7… 1349.10… 1344.25484001… 12.63907… 1502.05…
## 3 stddev  0.0   3.41… 8.76… 488.281… 467.335755734… 40.21006… 533.264…
## 4 min     2013  1     1     1        106            -43.0     1       
## 5 max     2013  12    31    2400     2359           1301.0    2400    
## # … with 11 more variables: sched_arr_time &amp;lt;chr&amp;gt;, arr_delay &amp;lt;chr&amp;gt;,
## #   carrier &amp;lt;chr&amp;gt;, flight &amp;lt;chr&amp;gt;, tailnum &amp;lt;chr&amp;gt;, origin &amp;lt;chr&amp;gt;, dest &amp;lt;chr&amp;gt;,
## #   air_time &amp;lt;chr&amp;gt;, distance &amp;lt;chr&amp;gt;, hour &amp;lt;chr&amp;gt;, minute &amp;lt;chr&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also see one extra operation after invoking the describe method: &lt;code&gt;sdf_register()&lt;/code&gt;. This is because the &lt;code&gt;invoke()&lt;/code&gt; interface also &lt;em&gt;returns&lt;/em&gt; Java object references and we may like to see a more user-friendly &lt;code&gt;tbl&lt;/code&gt; object instead. This is where &lt;code&gt;sdf_register()&lt;/code&gt; comes in to register a Spark DataFrame and return a &lt;code&gt;tbl_spark&lt;/code&gt; object back to us.&lt;/p&gt;
&lt;p&gt;And indeed, we can see that the wrapper &lt;code&gt;sdf_describe()&lt;/code&gt; provided by the sparklyr package itself works in a very similar fashion:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sparklyr::sdf_describe&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## function (x, cols = colnames(x)) 
## {
##     in_df &amp;lt;- cols %in% colnames(x)
##     if (any(!in_df)) {
##         msg &amp;lt;- paste0(&amp;quot;The following columns are not in the data frame: &amp;quot;, 
##             paste0(cols[which(!in_df)], collapse = &amp;quot;, &amp;quot;))
##         stop(msg)
##     }
##     cols &amp;lt;- cast_character_list(cols)
##     x %&amp;gt;% spark_dataframe() %&amp;gt;% invoke(&amp;quot;describe&amp;quot;, cols) %&amp;gt;% 
##         sdf_register()
## }
## &amp;lt;environment: namespace:sparklyr&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we so wish, for DataFrame related object references, we can also call &lt;code&gt;collect()&lt;/code&gt; to retrieve the results directly, without using &lt;code&gt;sdf_register()&lt;/code&gt; first, for instance retrieving the full content of the &lt;code&gt;origin&lt;/code&gt; column:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights %&amp;gt;% spark_dataframe() %&amp;gt;%
  invoke(&amp;quot;select&amp;quot;, &amp;quot;origin&amp;quot;, list()) %&amp;gt;%
  collect()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 336,776 x 1
##    origin
##    &amp;lt;chr&amp;gt; 
##  1 EWR   
##  2 LGA   
##  3 JFK   
##  4 JFK   
##  5 LGA   
##  6 EWR   
##  7 EWR   
##  8 LGA   
##  9 JFK   
## 10 LGA   
## # … with 336,766 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It can also be helpful to investigate the schema of our flights DataFrame:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights %&amp;gt;% spark_dataframe() %&amp;gt;%
  invoke(&amp;quot;schema&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;jobj[143]&amp;gt;
##   org.apache.spark.sql.types.StructType
##   StructType(StructField(year,IntegerType,true), StructField(month,IntegerType,true), StructField(day,IntegerType,true), StructField(dep_time,IntegerType,true), StructField(sched_dep_time,IntegerType,true), StructField(dep_delay,DoubleType,true), StructField(arr_time,IntegerType,true), StructField(sched_arr_time,IntegerType,true), StructField(arr_delay,DoubleType,true), StructField(carrier,StringType,true), StructField(flight,IntegerType,true), StructField(tailnum,StringType,true), StructField(origin,StringType,true), StructField(dest,StringType,true), StructField(air_time,DoubleType,true), StructField(distance,DoubleType,true), StructField(hour,DoubleType,true), StructField(minute,DoubleType,true), StructField(time_hour,TimestampType,true))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also use the invoke interface on other objects, for instance the &lt;code&gt;SparkContext&lt;/code&gt;. Let’s for instance retrieve the &lt;code&gt;uiWebUrl&lt;/code&gt; of our context:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sc %&amp;gt;% spark_context() %&amp;gt;%
  invoke(&amp;quot;uiWebUrl&amp;quot;) %&amp;gt;%
  invoke(&amp;quot;toString&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;Some(http://localhost:4040)&amp;quot;&lt;/code&gt;&lt;/pre&gt;
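&lt;p&gt;The &lt;code&gt;Some(...)&lt;/code&gt; wrapper appears because &lt;code&gt;uiWebUrl&lt;/code&gt; returns a Scala &lt;code&gt;Option&lt;/code&gt;. As a hedged sketch, assuming the option is defined (which the output above suggests), we could unwrap it with the &lt;code&gt;get&lt;/code&gt; method of &lt;code&gt;scala.Option&lt;/code&gt; instead of converting it to a string:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Unwrap the scala.Option; calling get on an empty option (None) would throw an error
sc %&amp;gt;% spark_context() %&amp;gt;%
  invoke(&amp;quot;uiWebUrl&amp;quot;) %&amp;gt;%
  invoke(&amp;quot;get&amp;quot;)&lt;/code&gt;&lt;/pre&gt;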
&lt;/div&gt;
&lt;div id=&#34;grouping-and-aggregation-with-invoke-chains&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Grouping and aggregation with invoke chains&lt;/h1&gt;
&lt;p&gt;Imagine we would like to do simple aggregations of a Spark DataFrame, such as an average of a column grouped by another column. For reference, we can do this very simply using the dplyr approach. Let’s compute the average departure delay by origin of the flight:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights %&amp;gt;%
  group_by(origin) %&amp;gt;%
  summarise(avg(dep_delay))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##   origin `avg(dep_delay)`
##   &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;
## 1 EWR                15.1
## 2 JFK                12.1
## 3 LGA                10.3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we will show how to do the same aggregation via the lower level API. Using the Spark shell we would simply do:&lt;/p&gt;
&lt;pre class=&#34;scala&#34;&gt;&lt;code&gt;flights.
  groupBy(&amp;quot;origin&amp;quot;).
  agg(avg(&amp;quot;dep_delay&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Translating that into the lower level &lt;code&gt;invoke()&lt;/code&gt; API provided by sparklyr looks something like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights %&amp;gt;%
  spark_dataframe() %&amp;gt;%
  invoke(&amp;quot;groupBy&amp;quot;, &amp;quot;origin&amp;quot;, list()) %&amp;gt;%
  invoke(&amp;quot;agg&amp;quot;, invoke_static(sc, &amp;quot;org.apache.spark.sql.functions&amp;quot;, &amp;quot;expr&amp;quot;, &amp;quot;avg(dep_delay)&amp;quot;), list()) %&amp;gt;%
  sdf_register()&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;what-is-all-that-extra-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is all that extra code?&lt;/h2&gt;
&lt;p&gt;Now, compared to the two very simple operations in the Scala version, we have some gotchas to examine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;one of the &lt;code&gt;invoke()&lt;/code&gt; calls is quite long. Instead of just &lt;code&gt;avg(&amp;quot;dep_delay&amp;quot;)&lt;/code&gt; like in the Scala example, we use &lt;code&gt;invoke_static(sc, &amp;quot;org.apache.spark.sql.functions&amp;quot;, &amp;quot;expr&amp;quot;, &amp;quot;avg(dep_delay)&amp;quot;)&lt;/code&gt;. This is because in the Spark shell, &lt;code&gt;avg()&lt;/code&gt; is a method of the &lt;code&gt;org.apache.spark.sql.functions&lt;/code&gt; object, which the shell imports for us, and it returns a &lt;code&gt;Column&lt;/code&gt; object. When calling from R, that convenience is not available, so we need to create the object reference explicitly with a static method call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the empty &lt;code&gt;list()&lt;/code&gt; at the end of the &lt;code&gt;&amp;quot;groupBy&amp;quot;&lt;/code&gt; and &lt;code&gt;&amp;quot;agg&amp;quot;&lt;/code&gt; invokes. This is needed as a workaround because some Scala methods &lt;a href=&#34;https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@groupBy(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDataset&#34;&gt;take String, String*&lt;/a&gt; as arguments and sparklyr currently does not support variable parameters. We can pass &lt;code&gt;list()&lt;/code&gt; to represent an empty &lt;code&gt;String[]&lt;/code&gt; in Scala as the needed second argument.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-the-invocations-into-r-functions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Wrapping the invocations into R functions&lt;/h1&gt;
&lt;p&gt;Seeing the above example, we can quickly write a useful wrapper to ease the pain a little. First, we can create a small function that will generate the aggregation expression we can use with &lt;code&gt;invoke(&amp;quot;agg&amp;quot;, ...)&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;agg_expr &amp;lt;- function(tbl, exprs) {
  sparklyr::invoke_static(
    tbl[[&amp;quot;src&amp;quot;]][[&amp;quot;con&amp;quot;]],
    &amp;quot;org.apache.spark.sql.functions&amp;quot;,
    &amp;quot;expr&amp;quot;,
    exprs
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can wrap around the entire process to make a more generic aggregation function, using the fact that a remote tibble has the details on &lt;code&gt;sc&lt;/code&gt; within its &lt;code&gt;tbl[[&amp;quot;src&amp;quot;]][[&amp;quot;con&amp;quot;]]&lt;/code&gt; element:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grpagg_invoke &amp;lt;- function(tbl, colName, groupColName, aggOperation) {
  aggColumn &amp;lt;- tbl %&amp;gt;% agg_expr(paste0(aggOperation, &amp;quot;(&amp;quot;, colName, &amp;quot;)&amp;quot;))
  tbl %&amp;gt;% spark_dataframe() %&amp;gt;%
    invoke(&amp;quot;groupBy&amp;quot;, groupColName, list()) %&amp;gt;%
    invoke(&amp;quot;agg&amp;quot;, aggColumn, list()) %&amp;gt;%
    sdf_register()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And finally, we can use our wrapper to run the same kind of aggregation in a more user-friendly way, this time averaging the arrival delay:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_flights %&amp;gt;% 
  grpagg_invoke(&amp;quot;arr_delay&amp;quot;, groupColName = &amp;quot;origin&amp;quot;, aggOperation = &amp;quot;avg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##   origin `avg(arr_delay)`
##   &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;
## 1 EWR                9.11
## 2 JFK                5.55
## 3 LGA                5.78&lt;/code&gt;&lt;/pre&gt;
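&lt;p&gt;Since the wrapper pastes its arguments straight into an expression string, it may be worth checking the requested operation first. The following is only a minimal sketch; the &lt;code&gt;build_agg_string()&lt;/code&gt; helper and its list of allowed operations are hypothetical and not part of sparklyr:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: builds the string that grpagg_invoke() passes
# to agg_expr(), after checking the operation against an allowed list
build_agg_string &amp;lt;- function(colName, aggOperation) {
  allowed &amp;lt;- c(&amp;quot;avg&amp;quot;, &amp;quot;min&amp;quot;, &amp;quot;max&amp;quot;, &amp;quot;sum&amp;quot;, &amp;quot;count&amp;quot;)
  # match.arg() throws an error for anything outside of the allowed list
  aggOperation &amp;lt;- match.arg(aggOperation, choices = allowed)
  paste0(aggOperation, &amp;quot;(&amp;quot;, colName, &amp;quot;)&amp;quot;)
}

build_agg_string(&amp;quot;arr_delay&amp;quot;, &amp;quot;avg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;avg(arr_delay)&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside &lt;code&gt;grpagg_invoke()&lt;/code&gt;, the &lt;code&gt;paste0()&lt;/code&gt; call could then be swapped for this helper, so that a mistyped operation fails in R before anything is sent to Spark.&lt;/p&gt;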
&lt;/div&gt;
&lt;div id=&#34;reconstructing-variable-normalization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Reconstructing variable normalization&lt;/h1&gt;
&lt;p&gt;Now we will attempt to reconstruct the variable normalization that we have shown in the previous parts with dplyr verbs and SQL generation. We will normalize the values of a column by first subtracting the mean value and then dividing by the standard deviation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;normalize_invoke &amp;lt;- function(tbl, colName) {
  sdf &amp;lt;- tbl %&amp;gt;% spark_dataframe()
  stdCol &amp;lt;- agg_expr(tbl, paste0(&amp;quot;stddev_samp(&amp;quot;, colName, &amp;quot;)&amp;quot;))
  avgCol &amp;lt;- agg_expr(tbl, paste0(&amp;quot;avg(&amp;quot;, colName, &amp;quot;)&amp;quot;))
  avgTemp &amp;lt;- sdf %&amp;gt;% invoke(&amp;quot;agg&amp;quot;, avgCol, list()) %&amp;gt;% invoke(&amp;quot;first&amp;quot;)
  stdTemp &amp;lt;- sdf %&amp;gt;% invoke(&amp;quot;agg&amp;quot;, stdCol, list()) %&amp;gt;% invoke(&amp;quot;first&amp;quot;)
  newCol &amp;lt;- sdf %&amp;gt;%
    invoke(&amp;quot;col&amp;quot;, colName) %&amp;gt;%
    invoke(&amp;quot;minus&amp;quot;, as.numeric(avgTemp)) %&amp;gt;%
    invoke(&amp;quot;divide&amp;quot;, as.numeric(stdTemp))
  sdf %&amp;gt;%
    invoke(&amp;quot;withColumn&amp;quot;, colName, newCol) %&amp;gt;%
    sdf_register()
}

tbl_weather %&amp;gt;% normalize_invoke(&amp;quot;temp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 16]
##       id origin  year month   day  hour   temp  dewp humid wind_dir
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1 EWR     2013     1     1     1 -0.913  26.1  59.4      270
##  2     2 EWR     2013     1     1     2 -0.913  27.0  61.6      250
##  3     3 EWR     2013     1     1     3 -0.913  28.0  64.4      240
##  4     4 EWR     2013     1     1     4 -0.862  28.0  62.2      250
##  5     5 EWR     2013     1     1     5 -0.913  28.0  64.4      260
##  6     6 EWR     2013     1     1     6 -0.974  28.0  67.2      240
##  7     7 EWR     2013     1     1     7 -0.913  28.0  64.4      240
##  8     8 EWR     2013     1     1     8 -0.862  28.0  62.2      250
##  9     9 EWR     2013     1     1     9 -0.862  28.0  62.2      260
## 10    10 EWR     2013     1     1    10 -0.802  28.0  59.6      260
## # … with more rows, and 6 more variables: wind_speed &amp;lt;dbl&amp;gt;,
## #   wind_gust &amp;lt;dbl&amp;gt;, precip &amp;lt;dbl&amp;gt;, pressure &amp;lt;dbl&amp;gt;, visib &amp;lt;dbl&amp;gt;,
## #   time_hour &amp;lt;dttm&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above implementation is just an example and far from optimal, but it also has a few interesting points about it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;invoke(&amp;quot;first&amp;quot;)&lt;/code&gt; will actually compute and collect the value into the R session&lt;/li&gt;
&lt;li&gt;Those collected values are then sent back during the &lt;code&gt;invoke(&amp;quot;minus&amp;quot;, as.numeric(avgTemp))&lt;/code&gt; and &lt;code&gt;invoke(&amp;quot;divide&amp;quot;, as.numeric(stdTemp))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means that there is unnecessary overhead when sending those values from the Spark instance into R and back, which will have slight performance penalties.&lt;/p&gt;
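&lt;p&gt;One possible way to avoid that round trip is to let Spark compute the mean and standard deviation inside a single column expression. The sketch below assumes that window functions are accepted by &lt;code&gt;expr&lt;/code&gt; and reuses the &lt;code&gt;agg_expr()&lt;/code&gt; function defined above; the names &lt;code&gt;window_norm_expr()&lt;/code&gt; and &lt;code&gt;normalize_invoke_window()&lt;/code&gt; are hypothetical. Note that an empty window specification makes Spark move all rows into a single partition, so this is an illustration rather than a recommendation for large datasets:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: build the normalization expression as a string
window_norm_expr &amp;lt;- function(colName) {
  paste0(
    &amp;quot;(&amp;quot;, colName, &amp;quot; - avg(&amp;quot;, colName, &amp;quot;) OVER ()) / &amp;quot;,
    &amp;quot;stddev_samp(&amp;quot;, colName, &amp;quot;) OVER ()&amp;quot;
  )
}

# Hypothetical variant of normalize_invoke() that does not call
# invoke(&amp;quot;first&amp;quot;), so no values are collected into the R session
normalize_invoke_window &amp;lt;- function(tbl, colName) {
  newCol &amp;lt;- agg_expr(tbl, window_norm_expr(colName))
  tbl %&amp;gt;%
    spark_dataframe() %&amp;gt;%
    invoke(&amp;quot;withColumn&amp;quot;, colName, newCol) %&amp;gt;%
    sdf_register()
}

window_norm_expr(&amp;quot;temp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;(temp - avg(temp) OVER ()) / stddev_samp(temp) OVER ()&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function could then be used in the same way as the original, for instance &lt;code&gt;tbl_weather %&amp;gt;% normalize_invoke_window(&amp;quot;temp&amp;quot;)&lt;/code&gt;.&lt;/p&gt;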
&lt;/div&gt;
&lt;div id=&#34;where-invoke-can-be-better-than-dplyr-translation-or-sql&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Where invoke can be better than dplyr translation or SQL&lt;/h1&gt;
&lt;p&gt;As we have seen in the above examples, working with the &lt;code&gt;invoke()&lt;/code&gt; API can prove more difficult than using the intuitive syntax of dplyr or SQL queries. In some use cases, the trade-off may still be worth it. In our practice, these are some examples of such situations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When Scala’s Spark API is more flexible, powerful or suitable for a particular task and the translation is not as good&lt;/li&gt;
&lt;li&gt;When performance is crucial and we can produce more optimal solutions using the invocations&lt;/li&gt;
&lt;li&gt;When we know the Scala API well and do not want to invest time to learn the dplyr syntax, so it is easier to translate the Scala calls into a series of &lt;code&gt;invoke()&lt;/code&gt; calls&lt;/li&gt;
&lt;li&gt;When we need to interact with and manipulate other Java objects apart from the standard Spark DataFrames&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In this part of the series, we have looked at how to use the lower-level invoke interface provided by sparklyr to manipulate Spark objects and other Java object references. In the following part, we will dig a bit deeper and look into using Java’s reflection API to make the invoke interface more accessible from R, getting detailed invocation logs and more.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r201-spark-r-1/&#34;&gt;first part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r202-spark-r-dplyr-verbs/&#34;&gt;second part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r203-spark-r-sql/&#34;&gt;third part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;A &lt;a href=&#34;https://hub.docker.com/r/jozefhajnala/sparkly&#34;&gt;Docker image&lt;/a&gt; with R, Spark, sparklyr and Arrow available and &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/sparkly/Dockerfile&#34;&gt;its Dockerfile&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Wikipedia’s article on &lt;a href=&#34;https://en.wikipedia.org/wiki/Method_chaining&#34;&gt;Method Chaining&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Using Spark from R for performance with arbitrary code - Part 3 - Using R to construct SQL queries and let Spark execute them</title>
      <link>https://jozef.io/r203-spark-r-sql/</link>
      <pubDate>Sat, 12 Oct 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r203-spark-r-sql/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the &lt;a href=&#34;https://jozef.io/r202-spark-r-dplyr-verbs/&#34;&gt;previous part&lt;/a&gt; of this series, we looked at writing R functions that can be executed directly by Spark without serialization overhead with a focus on writing functions as combinations of dplyr verbs and investigated how the SQL is generated and Spark plans created.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this third part, we will look at how to write R functions that generate SQL queries that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed. We also briefly present wrapping these approaches into functions that can be combined with other Spark operations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#preparation&#34;&gt;Preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#r-functions-as-spark-sql-generators&#34;&gt;R functions as Spark SQL generators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#executing-the-generated-queries-via-spark&#34;&gt;Executing the generated queries via Spark&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#using-dbi-as-the-interface&#34;&gt;Using DBI as the interface&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#invoking-sql-on-a-spark-session-object&#34;&gt;Invoking sql on a Spark session object&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-tbl-with-dbplyrs-sql&#34;&gt;Using tbl with dbplyr’s sql&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#wrapping-the-tbl-approach-into-functions&#34;&gt;Wrapping the tbl approach into functions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#combining-multiple-approaches-and-functions-into-lazy-datasets&#34;&gt;Combining multiple approaches and functions into lazy datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#where-sql-can-be-better-than-dbplyr-translation&#34;&gt;Where SQL can be better than dbplyr translation&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#when-translation-is-not-there&#34;&gt;When translation is not there&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#when-translation-does-not-provide-expected-results&#34;&gt;When translation does not provide expected results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#when-portability-is-important&#34;&gt;When portability is important&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;preparation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Preparation&lt;/h1&gt;
&lt;p&gt;The full setup of Spark and sparklyr is beyond the scope of this post; please check the &lt;a href=&#34;https://jozef.io/r201-spark-r-1/#setting-up-spark-with-r-and-sparklyr&#34;&gt;previous one&lt;/a&gt; for some setup instructions and a ready-made Docker image.&lt;/p&gt;
&lt;p&gt;If you have docker available, running&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;should make RStudio available at &lt;a href=&#34;http://localhost:8787&#34;&gt;http://localhost:8787&lt;/a&gt; in your browser. You can then use the user name &lt;code&gt;rstudio&lt;/code&gt; and password &lt;code&gt;pass&lt;/code&gt; to log in and continue experimenting with the code in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather &amp;lt;- nycflights13::weather %&amp;gt;%
  mutate(id = seq_len(nrow(nycflights13::weather))) %&amp;gt;%
  select(id, everything())

# Connect
sc &amp;lt;- sparklyr::spark_connect(master = &amp;quot;local&amp;quot;)

# Copy the weather dataset to the instance
tbl_weather &amp;lt;- dplyr::copy_to(
  dest = sc, 
  df = weather,
  name = &amp;quot;weather&amp;quot;,
  overwrite = TRUE
)
# Copy the flights dataset to the instance
tbl_flights &amp;lt;- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::flights,
  name = &amp;quot;flights&amp;quot;,
  overwrite = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;r-functions-as-spark-sql-generators&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R functions as Spark SQL generators&lt;/h1&gt;
&lt;p&gt;There are use cases where it is desirable to express the operations directly with SQL instead of combining dplyr verbs, for example when working within multi-language environments where re-usability is important. We can then send the SQL query directly to Spark to be executed. To create such queries, one option is to write R functions that work as query constructors.&lt;/p&gt;
&lt;p&gt;Again using a very simple example, a naive implementation of column normalization could look as follows. Note that the use of &lt;code&gt;SELECT *&lt;/code&gt; is discouraged and only here for illustration purposes:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;normalize_sql &amp;lt;- function(df, colName, newColName) {
  paste0(
    &amp;quot;SELECT&amp;quot;,
    &amp;quot;\n  &amp;quot;, df, &amp;quot;.*&amp;quot;, &amp;quot;,&amp;quot;,
    &amp;quot;\n  (&amp;quot;, colName, &amp;quot; - (SELECT avg(&amp;quot;, colName, &amp;quot;) FROM &amp;quot;, df, &amp;quot;))&amp;quot;,
    &amp;quot; / &amp;quot;,
    &amp;quot;(SELECT stddev_samp(&amp;quot;, colName,&amp;quot;) FROM &amp;quot;, df, &amp;quot;) as &amp;quot;, newColName,
    &amp;quot;\n&amp;quot;, &amp;quot;FROM &amp;quot;, df
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the &lt;code&gt;weather&lt;/code&gt; dataset would then yield the following SQL query when normalizing the &lt;code&gt;temp&lt;/code&gt; column:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;normalize_temp_query &amp;lt;- normalize_sql(&amp;quot;weather&amp;quot;, &amp;quot;temp&amp;quot;, &amp;quot;normTemp&amp;quot;)
cat(normalize_temp_query)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## SELECT
##   weather.*,
##   (temp - (SELECT avg(temp) FROM weather)) / (SELECT stddev_samp(temp) FROM weather) as normTemp
## FROM weather&lt;/code&gt;&lt;/pre&gt;
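&lt;p&gt;Because such a constructor pastes its arguments directly into the query, it can be useful to verify that they are plain identifiers before the SQL is built. The following is a minimal sketch under that assumption; &lt;code&gt;is_sql_identifier()&lt;/code&gt; and &lt;code&gt;check_sql_identifiers()&lt;/code&gt; are hypothetical helpers and the accepted pattern is deliberately strict:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: TRUE only for a single string made of letters,
# digits and underscores that does not start with a digit
is_sql_identifier &amp;lt;- function(x) {
  if (!is.character(x)) return(FALSE)
  if (length(x) != 1L) return(FALSE)
  grepl(&amp;quot;^[A-Za-z_][A-Za-z0-9_]*$&amp;quot;, x)
}

# Hypothetical helper: stop if any of the arguments is not a plain identifier
check_sql_identifiers &amp;lt;- function(...) {
  ids &amp;lt;- list(...)
  valid &amp;lt;- vapply(ids, is_sql_identifier, logical(1))
  if (!all(valid)) {
    stop(&amp;quot;Not a plain SQL identifier: &amp;quot;, toString(unlist(ids[!valid])), call. = FALSE)
  }
  invisible(TRUE)
}

is_sql_identifier(&amp;quot;temp&amp;quot;)
is_sql_identifier(&amp;quot;temp; DROP TABLE weather&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE
## [1] FALSE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A call such as &lt;code&gt;check_sql_identifiers(df, colName, newColName)&lt;/code&gt; could then be placed at the top of &lt;code&gt;normalize_sql()&lt;/code&gt;.&lt;/p&gt;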
&lt;p&gt;Now that we have the query created, we can look at how to send it to Spark for execution.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r201-01-spark-and-r.png&#34; alt=&#34;Apache Spark and R logos&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Apache Spark and R logos&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;executing-the-generated-queries-via-spark&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Executing the generated queries via Spark&lt;/h1&gt;
&lt;div id=&#34;using-dbi-as-the-interface&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using DBI as the interface&lt;/h2&gt;
&lt;p&gt;The R package DBI provides an interface for communication between R and relational database management systems. We can simply use the &lt;code&gt;dbGetQuery()&lt;/code&gt; function to execute our query, for instance:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;res &amp;lt;- DBI::dbGetQuery(sc, statement = normalize_temp_query)
head(res)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   id origin year month day hour  temp  dewp humid wind_dir wind_speed
## 1  1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702
## 2  2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546
## 3  3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780
## 4  4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858
## 5  5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858
## 6  6    EWR 2013     1   1    6 37.94 28.04 67.21      240   11.50780
##   wind_gust precip pressure visib           time_hour   normTemp
## 1       NaN      0   1012.0    10 2013-01-01 06:00:00 -0.9130047
## 2       NaN      0   1012.3    10 2013-01-01 07:00:00 -0.9130047
## 3       NaN      0   1012.5    10 2013-01-01 08:00:00 -0.9130047
## 4       NaN      0   1012.2    10 2013-01-01 09:00:00 -0.8624083
## 5       NaN      0   1011.9    10 2013-01-01 10:00:00 -0.9130047
## 6       NaN      0   1012.4    10 2013-01-01 11:00:00 -0.9737203&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we might have noticed thanks to the way the result is printed, a standard data frame is returned, as opposed to tibbles returned by most sparklyr operations.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is important to note that using &lt;code&gt;dbGetQuery()&lt;/code&gt; &lt;em&gt;automatically computes and collects&lt;/em&gt; the results to the R session. This is in contrast with the dplyr approach which constructs the query and only collects the results to the R session when &lt;code&gt;collect()&lt;/code&gt; is called, or computes them when &lt;code&gt;compute()&lt;/code&gt; is called.&lt;/p&gt;
&lt;/blockquote&gt;
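&lt;p&gt;If we only want to inspect a few rows with &lt;code&gt;dbGetQuery()&lt;/code&gt;, one simple option is to limit the query itself, so that less data is collected. The sketch below shows this with a hypothetical &lt;code&gt;limit_query()&lt;/code&gt; helper that wraps the original statement into a subquery; the alias &lt;code&gt;q&lt;/code&gt; is an arbitrary choice:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: wrap a query so that at most n rows are returned
limit_query &amp;lt;- function(qry, n = 10L) {
  paste0(&amp;quot;SELECT * FROM (\n&amp;quot;, qry, &amp;quot;\n) AS q LIMIT &amp;quot;, as.integer(n))
}

cat(limit_query(&amp;quot;SELECT temp FROM weather&amp;quot;, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## SELECT * FROM (
## SELECT temp FROM weather
## ) AS q LIMIT 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With our query, this would be used as &lt;code&gt;DBI::dbGetQuery(sc, statement = limit_query(normalize_temp_query, 5))&lt;/code&gt;. The result is still collected into the R session right away, only with fewer rows.&lt;/p&gt;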
&lt;p&gt;We will now examine two options to use the prepared query lazily and without collecting the results into the R session.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;invoking-sql-on-a-spark-session-object&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Invoking sql on a Spark session object&lt;/h2&gt;
&lt;p&gt;We will focus on the &lt;code&gt;invoke()&lt;/code&gt; functionality of sparklyr in the fourth installment of the series, so we will not go into further details here. If the desire is to have a “lazy” SQL that does not get automatically computed and collected when called from R, we can invoke the &lt;a href=&#34;https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession@sql(sqlText:String):org.apache.spark.sql.DataFrame&#34;&gt;&lt;code&gt;sql&lt;/code&gt; method&lt;/a&gt; on a SparkSession class object.&lt;/p&gt;
&lt;p&gt;The method takes a string SQL query as input and processes it using Spark, returning the result as a Spark DataFrame. This gives us the ability to only compute and collect the results when desired:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Use the query &amp;quot;lazily&amp;quot; without execution:
normalized_lazy_ds &amp;lt;- sc %&amp;gt;%
  spark_session() %&amp;gt;%
  invoke(&amp;quot;sql&amp;quot;,  normalize_temp_query)
normalized_lazy_ds&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;jobj[124]&amp;gt;
##   org.apache.spark.sql.Dataset
##   [id: int, origin: string ... 15 more fields]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Collect when needed:
normalized_lazy_ds %&amp;gt;% collect()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed &amp;lt;dbl&amp;gt;,
## #   wind_gust &amp;lt;dbl&amp;gt;, precip &amp;lt;dbl&amp;gt;, pressure &amp;lt;dbl&amp;gt;, visib &amp;lt;dbl&amp;gt;,
## #   time_hour &amp;lt;dttm&amp;gt;, normTemp &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-tbl-with-dbplyrs-sql&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using tbl with dbplyr’s sql&lt;/h2&gt;
&lt;p&gt;The above method gives us a reference to a Java object as a result, which might be less intuitive to work with for R users. We can also opt to use dbplyr’s &lt;code&gt;sql()&lt;/code&gt; function in combination with &lt;code&gt;tbl()&lt;/code&gt; to get a more familiar result.&lt;/p&gt;
&lt;p&gt;Note that when printing the &lt;code&gt;normalized_lazy_tbl&lt;/code&gt; below, the query gets partially executed to provide the first few rows. Only when &lt;code&gt;collect()&lt;/code&gt; is called is the entire set retrieved to the R session:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Nothing is executed yet
normalized_lazy_tbl &amp;lt;- normalize_temp_query %&amp;gt;%
  dbplyr::sql() %&amp;gt;%
  tbl(sc, .)

# Print the first few rows
normalized_lazy_tbl&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;SELECT weather.*, (temp - (SELECT avg(temp) FROM weather))
## #   / (SELECT stddev_samp(temp) FROM weather) as normTemp FROM weather&amp;gt;
## #   [?? x 17]
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with more rows, and 7 more variables: wind_speed &amp;lt;dbl&amp;gt;,
## #   wind_gust &amp;lt;dbl&amp;gt;, precip &amp;lt;dbl&amp;gt;, pressure &amp;lt;dbl&amp;gt;, visib &amp;lt;dbl&amp;gt;,
## #   time_hour &amp;lt;dttm&amp;gt;, normTemp &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Collect the entire result to the R session and print
normalized_lazy_tbl %&amp;gt;% collect()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed &amp;lt;dbl&amp;gt;,
## #   wind_gust &amp;lt;dbl&amp;gt;, precip &amp;lt;dbl&amp;gt;, pressure &amp;lt;dbl&amp;gt;, visib &amp;lt;dbl&amp;gt;,
## #   time_hour &amp;lt;dttm&amp;gt;, normTemp &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-the-tbl-approach-into-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping the tbl approach into functions&lt;/h2&gt;
&lt;p&gt;In the approach above we provided &lt;code&gt;sc&lt;/code&gt; in the call to &lt;code&gt;tbl()&lt;/code&gt;. When wrapping such processes into a function, it might however be useful to take the specific DataFrame reference as an input instead of the generic Spark connection reference.&lt;/p&gt;
&lt;p&gt;In that case, we can use the fact that the connection reference is also stored in the DataFrame reference, in the &lt;code&gt;con&lt;/code&gt; sub-element of the &lt;code&gt;src&lt;/code&gt; element. For instance, looking at our &lt;code&gt;tbl_weather&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(tbl_weather[[&amp;quot;src&amp;quot;]][[&amp;quot;con&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;spark_connection&amp;quot;       &amp;quot;spark_shell_connection&amp;quot;
## [3] &amp;quot;DBIConnection&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Putting this together, we can create a simple wrapper function that lazily sends a SQL query to be processed on a particular Spark DataFrame reference:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lazy_spark_query &amp;lt;- function(tbl, qry) {
  qry %&amp;gt;%
    dbplyr::sql() %&amp;gt;%
    dplyr::tbl(tbl[[&amp;quot;src&amp;quot;]][[&amp;quot;con&amp;quot;]], .)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And use it to do the same as we did above with a single function call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lazy_spark_query(tbl_weather, normalize_temp_query) %&amp;gt;% 
  collect()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed &amp;lt;dbl&amp;gt;,
## #   wind_gust &amp;lt;dbl&amp;gt;, precip &amp;lt;dbl&amp;gt;, pressure &amp;lt;dbl&amp;gt;, visib &amp;lt;dbl&amp;gt;,
## #   time_hour &amp;lt;dttm&amp;gt;, normTemp &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;combining-multiple-approaches-and-functions-into-lazy-datasets&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Combining multiple approaches and functions into lazy datasets&lt;/h1&gt;
&lt;p&gt;The power of Spark partly comes from the lazy execution and we can take advantage of this in ways that are not immediately obvious. Consider the following function we have shown previously:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lazy_spark_query&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## function(tbl, qry) {
##   qry %&amp;gt;%
##     dbplyr::sql() %&amp;gt;%
##     dplyr::tbl(tbl[[&amp;quot;src&amp;quot;]][[&amp;quot;con&amp;quot;]], .)
## }&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the output of this function without collection is actually only a translated SQL statement, we can take that output and keep combining it with other operations, for instance:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;qry &amp;lt;- normalize_sql(&amp;quot;flights&amp;quot;, &amp;quot;dep_delay&amp;quot;, &amp;quot;dep_delay_norm&amp;quot;)
lazy_spark_query(tbl_flights, qry) %&amp;gt;%
  group_by(origin) %&amp;gt;%
  summarise(mean(dep_delay_norm)) %&amp;gt;%
  collect()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 3 x 2
##   origin `mean(dep_delay_norm)`
##   &amp;lt;chr&amp;gt;                   &amp;lt;dbl&amp;gt;
## 1 EWR                    0.0614
## 2 JFK                   -0.0131
## 3 LGA                   -0.0570&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The crucial advantage is that even though the &lt;code&gt;lazy_spark_query&lt;/code&gt; would return the entire updated flights dataset when collected stand-alone, in combination with other operations Spark first figures out how to execute all the operations together efficiently. Only then does it physically execute them, returning only the grouped and aggregated data to the R session.&lt;/p&gt;
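&lt;p&gt;To see what is actually sent to Spark, we can replace the final &lt;code&gt;collect()&lt;/code&gt; with dplyr’s &lt;code&gt;show_query()&lt;/code&gt;, which prints the generated SQL instead of executing it. The sketch below assumes the &lt;code&gt;tbl_flights&lt;/code&gt; and &lt;code&gt;qry&lt;/code&gt; objects and the &lt;code&gt;lazy_spark_query()&lt;/code&gt; function defined above; the exact SQL that is printed will depend on the installed versions of dbplyr and sparklyr:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Print the combined SQL statement without executing it;
# our hand-written query should appear as a subquery of the aggregation
lazy_spark_query(tbl_flights, qry) %&amp;gt;%
  group_by(origin) %&amp;gt;%
  summarise(mean(dep_delay_norm, na.rm = TRUE)) %&amp;gt;%
  show_query()&lt;/code&gt;&lt;/pre&gt;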
&lt;p&gt;We can therefore effectively combine multiple approaches to interfacing with Spark while still keeping the benefit of retrieving only very small, aggregated amounts of data to the R session. The effect is quite significant even with a dataset as small as &lt;code&gt;flights&lt;/code&gt; (336,776 rows of 19 columns) and with a local Spark instance. The chart below compares executing a query lazily, aggregating within Spark and only retrieving the aggregated data, versus retrieving first and aggregating locally. The third boxplot shows the cost of pure collection on the query itself:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bench &amp;lt;- microbenchmark::microbenchmark(
  times = 20,
  collect_late = lazy_spark_query(tbl_flights, qry) %&amp;gt;%
    group_by(origin) %&amp;gt;%
    summarise(mean(dep_delay_norm)) %&amp;gt;%
    collect(),
  collect_first = lazy_spark_query(tbl_flights, qry) %&amp;gt;%
    collect() %&amp;gt;% 
    group_by(origin) %&amp;gt;%
    summarise(mean(dep_delay_norm)),
  collect_only = lazy_spark_query(tbl_flights, qry) %&amp;gt;%
    collect()
)&lt;/code&gt;&lt;/pre&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#r203-01-bench-late-collect&#39;).highcharts({
  title: {     
    text: &#34;Combine and collect late and small vs. early and bigger&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;time (milliseconds)&#34;     
    },     
    min: 0     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    boxplot: {     
      fillColor: &#34;#C9E4FF&#34;,     
      lineWidth: 0.5,     
      medianWidth: 1,     
      stemDashStyle: &#34;dot&#34;,     
      stemWidth: 1,     
      whiskerLength: &#34;40%&#34;,     
      whiskerWidth: 1     
    }     
  },     
  chart: {     
    type: &#34;column&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      g2: null,     
      data: [     
        {     
          name: &#34;collect_late&#34;,     
          low: 949,     
          q1: 982,     
          median: 1048,     
          q3: 1113.5,     
          high: 1231     
        },     
        {     
          name: &#34;collect_first&#34;,     
          low: 3196,     
          q1: 3273.5,     
          median: 3419.5,     
          q3: 3810.5,     
          high: 4088     
        },     
        {     
          name: &#34;collect_only&#34;,     
          low: 3015,     
          q1: 3245.5,     
          median: 3403,     
          q3: 3530,     
          high: 3891     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null,     
      color: &#34;blue&#34;,     
      name: &#34;Combine and collect late and small vs. early and bigger&#34;     
    }     
  ]     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;r203-01-bench-late-collect&#34;&gt;

&lt;/div&gt;
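&lt;p&gt;To put the chart into numbers, the following minimal sketch takes the median timings reported in the chart above (in milliseconds, from this particular run on a local Spark instance) and expresses them relative to the late-collecting variant. The vector name &lt;code&gt;medians_ms&lt;/code&gt; is only an illustrative choice, and timings on other machines will differ:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Median timings in milliseconds, copied from the chart above
medians_ms &amp;lt;- c(collect_late = 1048, collect_first = 3419.5, collect_only = 3403)

# Timings relative to collecting late (aggregating within Spark first)
round(medians_ms / medians_ms[[&amp;quot;collect_late&amp;quot;]], 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this run, collecting first took roughly three times as long as collecting late, and almost all of that time is the cost of the collection itself.&lt;/p&gt;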
&lt;/div&gt;
&lt;div id=&#34;where-sql-can-be-better-than-dbplyr-translation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Where SQL can be better than dbplyr translation&lt;/h1&gt;
&lt;div id=&#34;when-a-translation-is-not-there&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;When a translation is not there&lt;/h2&gt;
&lt;p&gt;We have discussed in the &lt;a href=&#34;https://jozef.io/r201-spark-r-1/#an-r-function-not-translated-to-spark-sql&#34;&gt;first part&lt;/a&gt; that the set of operations translated to Spark SQL via dbplyr may not cover all possible use cases. In such a case, the option to write SQL directly is very useful.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;when-translation-does-not-provide-expected-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;When translation does not provide expected results&lt;/h2&gt;
&lt;p&gt;In some instances using dbplyr to translate R operations to Spark SQL can lead to unexpected results. As one example, consider the following integer division on a column of a local data frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# id_div_5 is as expected
weather %&amp;gt;%
  mutate(id_div_5 = id %/% 5L) %&amp;gt;%
  select(id, id_div_5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 2
##       id id_div_5
##    &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt;
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with 26,105 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, we get the result of integer division in the &lt;code&gt;id_div_5&lt;/code&gt; column. However, applying the very same operation on a Spark DataFrame yields unexpected results:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# id_div_5 is normal division, not integer division
tbl_weather %&amp;gt;%
  mutate(id_div_5 = id %/% 5L) %&amp;gt;%
  select(id, id_div_5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##       id id_div_5
##    &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1      0.2
##  2     2      0.4
##  3     3      0.6
##  4     4      0.8
##  5     5      1  
##  6     6      1.2
##  7     7      1.4
##  8     8      1.6
##  9     9      1.8
## 10    10      2  
## # … with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is due to the fact that translation to integer division is quite difficult to implement: &lt;a href=&#34;https://github.com/tidyverse/dbplyr/issues/108&#34; class=&#34;uri&#34;&gt;https://github.com/tidyverse/dbplyr/issues/108&lt;/a&gt;. We could certainly figure out a way to fix this particular issue, but the workarounds may prove inefficient:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_weather %&amp;gt;%
  mutate(id_div_5 = as.integer(id %/% 5L)) %&amp;gt;%
  select(id, id_div_5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##       id id_div_5
##    &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt;
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not too efficient:
tbl_weather %&amp;gt;%
  mutate(id_div_5 = as.integer(id %/% 5L)) %&amp;gt;%
  select(id, id_div_5) %&amp;gt;%
  explain()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;SQL&amp;gt;
## SELECT `id`, CAST(`id` / 5 AS INT) AS `id_div_5`
## FROM `weather`
## 
## &amp;lt;PLAN&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## == Physical Plan ==
## *(1) Project [id#24, cast((cast(id#24 as double) / 5.0) as int) AS id_div_5#4273]
## +- InMemoryTableScan [id#24]
##       +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##             +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]&lt;/code&gt;&lt;/pre&gt;
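&lt;p&gt;One more thing to keep in mind with a cast-based workaround is that truncating a regular division is only the same as integer division for non-negative values. The following plain R sketch illustrates this locally; it makes no claim about how Spark itself rounds when casting, which is worth verifying separately if the column can contain negative numbers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# For non-negative values, truncation and integer division agree
id &amp;lt;- 1:10
identical(id %/% 5L, as.integer(id / 5))

# For negative values they differ in R:
# %/% rounds towards minus infinity, as.integer() truncates towards zero
-7L %/% 5L
as.integer(-7 / 5)&lt;/code&gt;&lt;/pre&gt;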
&lt;p&gt;Using SQL and the knowledge that Hive does provide a built-in &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ArithmeticOperators&#34;&gt;&lt;code&gt;DIV&lt;/code&gt; arithmetic operator&lt;/a&gt;, we can get the desired results very simply and efficiently by writing SQL directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;&amp;quot;SELECT `id`, `id` DIV 5 `id_div_5` FROM `weather`&amp;quot; %&amp;gt;%
  dbplyr::sql() %&amp;gt;%
  tbl(sc, .)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;SELECT `id`, `id` DIV 5 `id_div_5` FROM `weather`&amp;gt; [?? x
## #   2]
##       id id_div_5
##    &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even though the numeric value of the results is correct here, we may still notice that the class of the returned &lt;code&gt;id_div_5&lt;/code&gt; column is actually numeric instead of integer. Such is the life of developers using data processing interfaces.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;when-portability-is-important&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;When portability is important&lt;/h2&gt;
&lt;p&gt;Since the languages that provide interfaces to Spark are not limited to R and multi-language setups are quite common, another reason to use SQL statements directly is the portability of such solutions. A SQL statement can be executed by interfaces provided for all languages - Scala, Java, and Python, without the need to rely on R-specific packages such as dbplyr.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r201-spark-r-1/&#34;&gt;first part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r202-spark-r-dplyr-verbs/&#34;&gt;second part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;Documentation on &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF&#34;&gt;Hive Operators and User-Defined Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A &lt;a href=&#34;https://hub.docker.com/r/jozefhajnala/sparkly&#34;&gt;Docker image&lt;/a&gt; with R, Spark, sparklyr and Arrow available and &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/sparkly/Dockerfile&#34;&gt;its Dockerfile&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://cran.r-project.org/package=DBI&#34;&gt;DBI package&lt;/a&gt; on CRAN&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Using Spark from R for performance with arbitrary code - Part 2 - Constructing functions by piping dplyr verbs</title>
      <link>https://jozef.io/r202-spark-r-dplyr-verbs/</link>
      <pubDate>Sat, 21 Sep 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r202-spark-r-dplyr-verbs/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the &lt;a href=&#34;https://jozef.io/r201-spark-r-1/&#34;&gt;first part&lt;/a&gt; of this series, we looked at how the sparklyr interface communicates with the Spark instance and what this means for performance with regards to arbitrarily defined R functions. We also examined how Apache Arrow can increase the performance of data transfers between the R session and the Spark instance.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this second part, we will look at how to write R functions that can be executed directly by Spark without the serialization overhead that we have shown in the previous installment. We will focus on writing functions as combinations of dplyr verbs that can be translated using dbplyr and investigate how the SQL is generated and how the Spark plans are created.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#preparation&#34;&gt;Preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#r-functions-as-combinations-of-dplyr-verbs-and-spark&#34;&gt;R functions as combinations of dplyr verbs and Spark&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#trying-it-with-base-r-functions&#34;&gt;Trying it with base R functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-a-combination-of-supported-dplyr-verbs-and-operations&#34;&gt;Using a combination of supported dplyr verbs and operations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#investigating-the-sql-translation-and-its-spark-plan&#34;&gt;Investigating the SQL translation and its Spark plan&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-more-complex-use-case---joins-group-bys-and-aggregations&#34;&gt;A more complex use case - Joins, group bys, and aggregations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-the-functions-with-local-versus-remote-datasets&#34;&gt;Using the functions with local versus remote datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-take-home-message&#34;&gt;The take-home message&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;preparation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Preparation&lt;/h1&gt;
&lt;p&gt;The full setup of Spark and sparklyr is not in the scope of this post. Please check the &lt;a href=&#34;https://jozef.io/r201-spark-r-1/#setting-up-spark-with-r-and-sparklyr&#34;&gt;previous one&lt;/a&gt; for setup instructions and a ready-made Docker image.&lt;/p&gt;
&lt;p&gt;If you have docker available, running&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;should make RStudio available by navigating to &lt;a href=&#34;http://localhost:8787&#34;&gt;http://localhost:8787&lt;/a&gt; in your browser. You can then use the user name &lt;code&gt;rstudio&lt;/code&gt; and password &lt;code&gt;pass&lt;/code&gt; to log in and continue experimenting with the code in this post.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r201-01-spark-and-r.png&#34; alt=&#34;Apache Spark and R logos&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Apache Spark and R logos&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;First, we will attach the needed packages and copy some test data from the nycflights13 package into our local Spark instance:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather &amp;lt;- nycflights13::weather %&amp;gt;%
  mutate(id = 1L:nrow(nycflights13::weather)) %&amp;gt;% 
  select(id, everything())

# Connect
sc &amp;lt;- sparklyr::spark_connect(master = &amp;quot;local&amp;quot;)

# Copy the weather dataset to the instance
tbl_weather &amp;lt;- dplyr::copy_to(
  dest = sc, 
  df = weather,
  name = &amp;quot;weather&amp;quot;,
  overwrite = TRUE
)

# Copy the flights dataset to the instance
tbl_flights &amp;lt;- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::flights,
  name = &amp;quot;flights&amp;quot;,
  overwrite = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;r-functions-as-combinations-of-dplyr-verbs-and-spark&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R functions as combinations of dplyr verbs and Spark&lt;/h1&gt;
&lt;p&gt;One approach to retaining the performance of Spark with arbitrary R functionality is to carefully design our functions such that, when used with sparklyr, the entire function call can be translated directly to Spark SQL using dbplyr.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This allows us to write, package, test, and document the functions as we normally would, while still getting the performance benefits of Apache Spark.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let’s look at an example where we would like to do simple transformations of data stored in a column of a data frame, such as normalization of one of the columns. For illustration purposes, we will normalize the values of a column by first subtracting the mean value and then dividing the values by the standard deviation.&lt;/p&gt;
&lt;div id=&#34;trying-it-with-base-r-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trying it with base R functions&lt;/h2&gt;
&lt;p&gt;A first attempt could be quite simple. We could try to take advantage of R’s base function &lt;code&gt;scale()&lt;/code&gt; to do the work for us:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;normalize_dplyr_scale &amp;lt;- function(df, col, newColName) {
  df %&amp;gt;% mutate(!!newColName := scale({{col}}))
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function would work fine with a local data frame such as &lt;code&gt;weather&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;weather %&amp;gt;%
  normalize_dplyr_scale(temp, &amp;quot;normTemp&amp;quot;) %&amp;gt;%
  select(id, temp, normTemp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 3
##       id  temp normTemp[,1]
##    &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
##  1     1  39.0       -0.913
##  2     2  39.0       -0.913
##  3     3  39.0       -0.913
##  4     4  39.9       -0.862
##  5     5  39.0       -0.913
##  6     6  37.9       -0.974
##  7     7  39.0       -0.913
##  8     8  39.9       -0.862
##  9     9  39.9       -0.862
## 10    10  41         -0.802
## # … with 26,105 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, for a Spark DataFrame this would throw an error. This is because the base R function &lt;code&gt;scale()&lt;/code&gt; is not translated by dbplyr at the moment and it is not a Hive built-in function either:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_weather %&amp;gt;%
  normalize_dplyr_scale(temp, &amp;quot;normTemp&amp;quot;) %&amp;gt;%
  select(id, temp, normTemp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Error: org.apache.spark.sql.AnalysisException: Undefined function: &amp;#39;scale&amp;#39;. &lt;/code&gt;&lt;/pre&gt;
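&lt;p&gt;Whether a given expression will be translated can often be checked without a running Spark instance by rendering the SQL against a simulated connection. The sketch below assumes that dbplyr’s &lt;code&gt;lazy_frame()&lt;/code&gt; and &lt;code&gt;simulate_hive()&lt;/code&gt; are a reasonable stand-in for the translation that sparklyr uses; the actual Spark translation may differ in details, and the lazy table &lt;code&gt;lf&lt;/code&gt; with its made-up values is purely hypothetical:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

# A lazy table backed by a simulated Hive connection, no Spark needed
lf &amp;lt;- dbplyr::lazy_frame(
  temp = c(39.02, 39.92, 41),
  con = dbplyr::simulate_hive()
)

# Render the SQL that would be sent to the backend
lf %&amp;gt;%
  mutate(normTemp = scale(temp)) %&amp;gt;%
  dbplyr::sql_render()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Functions that dbplyr does not know are passed through to the SQL unchanged, so a call to &lt;code&gt;scale&lt;/code&gt; appearing verbatim in the rendered query is a hint that the backend would need to provide such a function itself.&lt;/p&gt;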
&lt;/div&gt;
&lt;div id=&#34;using-a-combination-of-supported-dplyr-verbs-and-operations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using a combination of supported dplyr verbs and operations&lt;/h2&gt;
&lt;p&gt;To run the function successfully, we will need to rewrite it as a combination of functions and operations that are supported by the dbplyr translation to Spark SQL. One example implementation is as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;normalize_dplyr &amp;lt;- function(df, col, newColName) {
  df %&amp;gt;% mutate(
    !!newColName := ({{col}} - mean({{col}}, na.rm = TRUE)) /
        sd({{col}}, na.rm = TRUE)
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this function yields the desired results for both local and Spark data frames:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Local data frame
weather %&amp;gt;%
  normalize_dplyr(temp, &amp;quot;normTemp&amp;quot;) %&amp;gt;%
  select(id, temp, normTemp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 3
##       id  temp normTemp
##    &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1  39.0   -0.913
##  2     2  39.0   -0.913
##  3     3  39.0   -0.913
##  4     4  39.9   -0.862
##  5     5  39.0   -0.913
##  6     6  37.9   -0.974
##  7     7  39.0   -0.913
##  8     8  39.9   -0.862
##  9     9  39.9   -0.862
## 10    10  41     -0.802
## # … with 26,105 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Spark DataFrame
tbl_weather %&amp;gt;%
  normalize_dplyr(temp, &amp;quot;normTemp&amp;quot;) %&amp;gt;%
  select(id, temp, normTemp) %&amp;gt;% 
  collect()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 3
##       id  temp normTemp
##    &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1     1  39.0   -0.913
##  2     2  39.0   -0.913
##  3     3  39.0   -0.913
##  4     4  39.9   -0.862
##  5     5  39.0   -0.913
##  6     6  37.9   -0.974
##  7     7  39.0   -0.913
##  8     8  39.9   -0.862
##  9     9  39.9   -0.862
## 10    10  41     -0.802
## # … with 26,105 more rows&lt;/code&gt;&lt;/pre&gt;
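&lt;p&gt;As a local sanity check of the arithmetic, the expression used in &lt;code&gt;normalize_dplyr()&lt;/code&gt; can be compared against &lt;code&gt;scale()&lt;/code&gt; on a plain numeric vector. This is a minimal base R sketch with made-up values, including a missing one; the names &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;manual&lt;/code&gt; are only illustrative:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Made-up temperatures with one missing value
x &amp;lt;- c(39.02, 39.92, NA, 41, 37.94)

# The same arithmetic as in normalize_dplyr()
manual &amp;lt;- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() returns a one-column matrix, so we convert it to a vector first
all.equal(manual, as.numeric(scale(x)))&lt;/code&gt;&lt;/pre&gt;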
&lt;/div&gt;
&lt;div id=&#34;investigating-the-sql-translation-and-its-spark-plan&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Investigating the SQL translation and its Spark plan&lt;/h2&gt;
&lt;p&gt;Another advantage of this approach is that we can investigate the plan by which the actions will be executed by Spark using the &lt;code&gt;explain()&lt;/code&gt; function from the dplyr package. This will print both the SQL query constructed by dbplyr and the plan generated by Spark, which can help us investigate performance issues:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_weather %&amp;gt;%
  normalize_dplyr(temp, &amp;quot;normTemp&amp;quot;) %&amp;gt;%
  dplyr::explain()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;SQL&amp;gt;
## SELECT `id`, `origin`, `year`, `month`, `day`, `hour`, `temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, `precip`, `pressure`, `visib`, `time_hour`, (`temp` - AVG(`temp`) OVER ()) / stddev_samp(`temp`) OVER () AS `normTemp`
## FROM `weather`
## 
## &amp;lt;PLAN&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## == Physical Plan ==
## *(1) Project [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39, ((temp#30 - _we0#948) / _we1#949) AS normTemp#934]
## +- Window [avg(temp#30) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _we0#948, stddev_samp(temp#30) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS _we1#949]
##    +- Exchange SinglePartition
##       +- InMemoryTableScan [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39]
##             +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##                   +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we are only interested in the SQL itself as a character string, we can use dbplyr’s &lt;code&gt;sql_render()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_weather %&amp;gt;%
  normalize_dplyr(temp, &amp;quot;normTemp&amp;quot;) %&amp;gt;%
  dbplyr::sql_render() %&amp;gt;%
  unclass()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;SELECT `id`, `origin`, `year`, `month`, `day`, `hour`, `temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, `precip`, `pressure`, `visib`, `time_hour`, (`temp` - AVG(`temp`) OVER ()) / stddev_samp(`temp`) OVER () AS `normTemp`\nFROM `weather`&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;a-more-complex-use-case---joins-group-bys-and-aggregations&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A more complex use case - Joins, group bys, and aggregations&lt;/h1&gt;
&lt;p&gt;The dplyr syntax makes it very easy to construct more complex aggregations across multiple Spark DataFrames. An example of a function that joins 2 Spark DataFrames and computes a mean of a selected column, grouped by another column can look as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;joingrpagg_dplyr &amp;lt;- function(
  df1, df2, 
  joinColNames = intersect(colnames(df1), colnames(df2)),
  col, groupCol
) {
  df1 %&amp;gt;%
    right_join(df2, by = joinColNames) %&amp;gt;%
    group_by({{groupCol}}) %&amp;gt;%
    summarise(mean({{col}})) %&amp;gt;% 
    arrange({{groupCol}})
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can then use this function, for instance, to look at the mean arrival delay of flights grouped by visibility. Note that we are only collecting heavily aggregated data - 20 rows in total. The overhead of data transfer from the Spark instance to the R session is therefore small. Also, just assigning the function call to &lt;code&gt;delay_by_visib&lt;/code&gt; does not actually execute or collect anything; execution only starts when &lt;code&gt;collect()&lt;/code&gt; is called:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;delay_by_visib &amp;lt;- joingrpagg_dplyr(
  tbl_flights, tbl_weather,
  col = arr_delay, groupCol = visib
)
delay_by_visib %&amp;gt;% collect()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 20 x 2
##    visib `mean(arr_delay)`
##    &amp;lt;dbl&amp;gt;             &amp;lt;dbl&amp;gt;
##  1  0                24.9 
##  2  0.06             28.5 
##  3  0.12             45.4 
##  4  0.25             20.8 
##  5  0.5              39.8 
##  6  0.75             41.4 
##  7  1                37.6 
##  8  1.25             65.1 
##  9  1.5              34.7 
## 10  1.75             45.6 
## 11  2                26.3 
## 12  2.5              21.7 
## 13  3                21.7 
## 14  4                17.7 
## 15  5                18.9 
## 16  6                17.3 
## 17  7                16.4 
## 18  8                16.1 
## 19  9                15.6 
## 20 10                 4.32&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can look at the plan and the generated SQL query as well:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;delay_by_visib %&amp;gt;% dplyr::explain()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;SQL&amp;gt;
## SELECT `visib`, AVG(`arr_delay`) AS `mean(arr_delay)`
## FROM (SELECT `RHS`.`year` AS `year`, `RHS`.`month` AS `month`, `RHS`.`day` AS `day`, `LHS`.`dep_time` AS `dep_time`, `LHS`.`sched_dep_time` AS `sched_dep_time`, `LHS`.`dep_delay` AS `dep_delay`, `LHS`.`arr_time` AS `arr_time`, `LHS`.`sched_arr_time` AS `sched_arr_time`, `LHS`.`arr_delay` AS `arr_delay`, `LHS`.`carrier` AS `carrier`, `LHS`.`flight` AS `flight`, `LHS`.`tailnum` AS `tailnum`, `RHS`.`origin` AS `origin`, `LHS`.`dest` AS `dest`, `LHS`.`air_time` AS `air_time`, `LHS`.`distance` AS `distance`, `RHS`.`hour` AS `hour`, `LHS`.`minute` AS `minute`, `RHS`.`time_hour` AS `time_hour`, `RHS`.`id` AS `id`, `RHS`.`temp` AS `temp`, `RHS`.`dewp` AS `dewp`, `RHS`.`humid` AS `humid`, `RHS`.`wind_dir` AS `wind_dir`, `RHS`.`wind_speed` AS `wind_speed`, `RHS`.`wind_gust` AS `wind_gust`, `RHS`.`precip` AS `precip`, `RHS`.`pressure` AS `pressure`, `RHS`.`visib` AS `visib`
## FROM `flights` AS `LHS`
## RIGHT JOIN `weather` AS `RHS`
## ON (`LHS`.`year` = `RHS`.`year` AND `LHS`.`month` = `RHS`.`month` AND `LHS`.`day` = `RHS`.`day` AND `LHS`.`origin` = `RHS`.`origin` AND `LHS`.`hour` = `RHS`.`hour` AND `LHS`.`time_hour` = `RHS`.`time_hour`)
## ) `dbplyr_003`
## GROUP BY `visib`
## ORDER BY `visib`
## 
## &amp;lt;PLAN&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## == Physical Plan ==
## *(6) Sort [visib#38 ASC NULLS FIRST], true, 0
## +- Exchange rangepartitioning(visib#38 ASC NULLS FIRST, 2)
##    +- *(5) HashAggregate(keys=[visib#38], functions=[avg(arr_delay#409)])
##       +- Exchange hashpartitioning(visib#38, 2)
##          +- *(4) HashAggregate(keys=[visib#38], functions=[partial_avg(arr_delay#409)])
##             +- *(4) Project [arr_delay#409, visib#38]
##                +- SortMergeJoin [cast(year#401 as double), cast(month#402 as double), day#403, origin#413, hour#417, time_hour#419], [year#26, month#27, day#28, origin#25, cast(hour#29 as double), time_hour#39], RightOuter
##                   :- *(2) Sort [cast(year#401 as double) ASC NULLS FIRST, cast(month#402 as double) ASC NULLS FIRST, day#403 ASC NULLS FIRST, origin#413 ASC NULLS FIRST, hour#417 ASC NULLS FIRST, time_hour#419 ASC NULLS FIRST], false, 0
##                   :  +- Exchange hashpartitioning(cast(year#401 as double), cast(month#402 as double), day#403, origin#413, hour#417, time_hour#419, 2)
##                   :     +- *(1) Filter (((((isnotnull(month#402) &amp;amp;&amp;amp; isnotnull(day#403)) &amp;amp;&amp;amp; isnotnull(origin#413)) &amp;amp;&amp;amp; isnotnull(year#401)) &amp;amp;&amp;amp; isnotnull(time_hour#419)) &amp;amp;&amp;amp; isnotnull(hour#417))
##                   :        +- InMemoryTableScan [year#401, month#402, day#403, arr_delay#409, origin#413, hour#417, time_hour#419], [isnotnull(month#402), isnotnull(day#403), isnotnull(origin#413), isnotnull(year#401), isnotnull(time_hour#419), isnotnull(hour#417)]
##                   :              +- InMemoryRelation [year#401, month#402, day#403, dep_time#404, sched_dep_time#405, dep_delay#406, arr_time#407, sched_arr_time#408, arr_delay#409, carrier#410, flight#411, tailnum#412, origin#413, dest#414, air_time#415, distance#416, hour#417, minute#418, time_hour#419], StorageLevel(disk, memory, deserialized, 1 replicas)
##                   :                    +- Scan ExistingRDD[year#401,month#402,day#403,dep_time#404,sched_dep_time#405,dep_delay#406,arr_time#407,sched_arr_time#408,arr_delay#409,carrier#410,flight#411,tailnum#412,origin#413,dest#414,air_time#415,distance#416,hour#417,minute#418,time_hour#419]
##                   +- *(3) Sort [year#26 ASC NULLS FIRST, month#27 ASC NULLS FIRST, day#28 ASC NULLS FIRST, origin#25 ASC NULLS FIRST, cast(hour#29 as double) ASC NULLS FIRST, time_hour#39 ASC NULLS FIRST], false, 0
##                      +- Exchange hashpartitioning(year#26, month#27, day#28, origin#25, cast(hour#29 as double), time_hour#39, 2)
##                         +- InMemoryTableScan [origin#25, year#26, month#27, day#28, hour#29, visib#38, time_hour#39]
##                               +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##                                     +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-the-functions-with-local-versus-remote-datasets&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using the functions with local versus remote datasets&lt;/h1&gt;
&lt;p&gt;Some of the appeal of the dplyr syntax comes from the fact that we can use the same functions to conveniently manipulate local data frames in memory and, with the very same code, data held in other backends such as relational databases, data.tables, and even Spark.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This unified front-end, however, comes with some important differences that we must be aware of when porting code written for local data to remote sources. The same holds for the remote Spark DataFrames that we manipulate using dplyr functions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One example of different behavior is joining. Even the simplest case, an inner join of two tables, can lead to a different number of rows for the remote Spark DataFrames and the local R data frames:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bycols &amp;lt;-  c(&amp;quot;year&amp;quot;, &amp;quot;month&amp;quot;, &amp;quot;day&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;hour&amp;quot;, &amp;quot;time_hour&amp;quot;)

# Look at count of rows of Inner join of the Spark data frames 
tbl_flights %&amp;gt;% inner_join(tbl_weather, by = bycols) %&amp;gt;% count()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##        n
##    &amp;lt;dbl&amp;gt;
## 1 335096&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Look at count of rows of Inner join of the local data frames 
flights %&amp;gt;% inner_join(weather, by = bycols) %&amp;gt;% count()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 1
##        n
##    &amp;lt;int&amp;gt;
## 1 335220&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another example of differences can arise from handling &lt;code&gt;NA&lt;/code&gt; and &lt;code&gt;NaN&lt;/code&gt; values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create left joins; the Spark result is collected into the R session
joined_spark &amp;lt;- tbl_flights %&amp;gt;% left_join(tbl_weather, by = bycols) %&amp;gt;% collect()
joined_local &amp;lt;- flights %&amp;gt;% left_join(weather, by = bycols)

# Look at counts of NA values
joined_local %&amp;gt;% filter(is.na(temp)) %&amp;gt;% count()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 1
##       n
##   &amp;lt;int&amp;gt;
## 1  1573&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;joined_spark %&amp;gt;% filter(is.na(temp)) %&amp;gt;% count()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 1
##       n
##   &amp;lt;int&amp;gt;
## 1  1697&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Look at counts of NaN values
joined_local %&amp;gt;% filter(is.nan(temp)) %&amp;gt;% count()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 1
##       n
##   &amp;lt;int&amp;gt;
## 1     0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;joined_spark %&amp;gt;% filter(is.nan(temp)) %&amp;gt;% count()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 1
##       n
##   &amp;lt;int&amp;gt;
## 1  1697&lt;/code&gt;&lt;/pre&gt;
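&lt;p&gt;The counts above are consistent with how base R treats the two markers: &lt;code&gt;is.na()&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; for both &lt;code&gt;NA&lt;/code&gt; and &lt;code&gt;NaN&lt;/code&gt;, whereas &lt;code&gt;is.nan()&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; only for &lt;code&gt;NaN&lt;/code&gt;. A minimal local illustration, using a made-up vector rather than the flights data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical vector for illustration only
x &amp;lt;- c(1.5, NA, NaN)

is.na(x)   # TRUE for both NA and NaN
is.nan(x)  # TRUE only for NaN

# Counting each kind separately
n_missing_any &amp;lt;- sum(is.na(x))
n_nan_only    &amp;lt;- sum(is.nan(x))
n_na_only     &amp;lt;- sum(is.na(x) &amp;amp; !is.nan(x))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since both counts for the collected Spark result are equal, the missing temperatures there appear to have arrived as &lt;code&gt;NaN&lt;/code&gt; rather than &lt;code&gt;NA&lt;/code&gt;, so code relying on &lt;code&gt;is.nan()&lt;/code&gt; behaves differently on the two data sets.&lt;/p&gt;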
&lt;p&gt;Special care must also be taken when dealing with date/time values and their time zones:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Note the time_hour values are different
weather %&amp;gt;% select(id, time_hour)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 26,115 x 2
##       id time_hour          
##    &amp;lt;int&amp;gt; &amp;lt;dttm&amp;gt;             
##  1     1 2013-01-01 01:00:00
##  2     2 2013-01-01 02:00:00
##  3     3 2013-01-01 03:00:00
##  4     4 2013-01-01 04:00:00
##  5     5 2013-01-01 05:00:00
##  6     6 2013-01-01 06:00:00
##  7     7 2013-01-01 07:00:00
##  8     8 2013-01-01 08:00:00
##  9     9 2013-01-01 09:00:00
## 10    10 2013-01-01 10:00:00
## # … with 26,105 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_weather %&amp;gt;% select(id, time_hour)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##       id time_hour          
##    &amp;lt;int&amp;gt; &amp;lt;dttm&amp;gt;             
##  1     1 2013-01-01 06:00:00
##  2     2 2013-01-01 07:00:00
##  3     3 2013-01-01 08:00:00
##  4     4 2013-01-01 09:00:00
##  5     5 2013-01-01 10:00:00
##  6     6 2013-01-01 11:00:00
##  7     7 2013-01-01 12:00:00
##  8     8 2013-01-01 13:00:00
##  9     9 2013-01-01 14:00:00
## 10    10 2013-01-01 15:00:00
## # … with more rows&lt;/code&gt;&lt;/pre&gt;
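&lt;p&gt;The five-hour shift in the first rows is what we would expect if the same instants were displayed once in New York local time and once in UTC. That reading is an assumption about this setup rather than something the output proves, but the mechanics are easy to reproduce locally:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One instant, created in New York local time (EST in January, UTC-5)
x &amp;lt;- as.POSIXct(&amp;quot;2013-01-01 01:00:00&amp;quot;, tz = &amp;quot;America/New_York&amp;quot;)

# The same instant formatted in two time zones
local_txt &amp;lt;- format(x, tz = &amp;quot;America/New_York&amp;quot;, usetz = TRUE)
utc_txt   &amp;lt;- format(x, tz = &amp;quot;UTC&amp;quot;, usetz = TRUE)

# The underlying number of seconds since the epoch does not depend on the display zone
as.numeric(x)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When comparing local and remote timestamps, it is therefore safer to compare them in one explicitly chosen time zone than to rely on the printed values.&lt;/p&gt;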
&lt;p&gt;And, rather obviously, when a dplyr-based function uses Hive built-in functions, we will most likely not be able to execute it on local data frames, as we have &lt;a href=&#34;https://jozef.io/r201-spark-r-1/#a-hive-built-in-function-not-existing-in-r&#34;&gt;seen previously&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-take-home-message&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The take-home message&lt;/h1&gt;
&lt;p&gt;In this part of the series, we have shown that we can take advantage of the performance of Spark while still writing arbitrary R functions by using dplyr syntax, which supports translation to Spark SQL using the dbplyr backend. We have also looked at some important differences when applying the same dplyr transformations to local and remote data sets.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With this approach, we can use R development best practices, testing, and documentation methods in a standard way when writing our R packages, getting the best of both worlds - Apache Spark for performance and R for convenient development of data science applications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In the next installment, we will look at writing R functions that will be using SQL directly, instead of relying on dbplyr for the translation, and how we can efficiently send them to the Spark instance for execution and optionally retrieve the results to our R session.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://jozef.io/r201-spark-r-1/&#34;&gt;first part&lt;/a&gt; of this series&lt;/li&gt;
&lt;li&gt;Documentation on &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF&#34;&gt;Hive Operators and User-Defined Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A &lt;a href=&#34;https://hub.docker.com/r/jozefhajnala/sparkly&#34;&gt;Docker image&lt;/a&gt; with R, Spark, sparklyr and Arrow available and &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/sparkly/Dockerfile&#34;&gt;its Dockerfile&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Overview of the &lt;a href=&#34;https://dplyr.tidyverse.org/&#34;&gt;dplyr syntax&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Using Spark from R for performance with arbitrary code - Part 1 - Spark SQL translation, custom functions, and Arrow</title>
      <link>https://jozef.io/r201-spark-r-1/</link>
      <pubDate>Sat, 31 Aug 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r201-spark-r-1/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Apache Spark is a popular open-source analytics engine for big data processing and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users.&lt;/p&gt;
&lt;p&gt;This series of articles will attempt to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this first part, we will examine how the sparklyr interface communicates with the Spark instance and what this means for performance with regards to arbitrarily defined R functions. We will also look at how Apache Arrow can improve the performance of object serialization.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#setting-up-spark-with-r-and-sparklyr&#34;&gt;Setting up Spark with R and sparklyr&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#using-a-ready-made-docker-image&#34;&gt;Using a ready-made Docker Image&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#manual-installation&#34;&gt;Manual Installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#connecting-and-using-a-local-spark-instance&#34;&gt;Connecting and using a local Spark instance&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#sparklyr-as-a-spark-interface-provider&#34;&gt;Sparklyr as a Spark interface provider&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#an-r-function-translated-to-spark-sql&#34;&gt;An R function translated to Spark SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#an-r-function-not-translated-to-spark-sql&#34;&gt;An R function not translated to Spark SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-hive-built-in-function-not-existing-in-r&#34;&gt;A Hive built-in function not existing in R&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-non-translated-functions-with-sparklyr&#34;&gt;Using non-translated functions with sparklyr&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#what-is-so-important-about-this-distinction&#34;&gt;What is so important about this distinction?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-happens-when-we-use-custom-functions-with-spark_apply&#34;&gt;What happens when we use custom functions with spark_apply&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-happens-when-we-use-translated-or-hive-built-in-functions&#34;&gt;What happens when we use translated or Hive built-in functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#which-r-functionality-is-currently-translated-and-built-in-to-hive&#34;&gt;Which R functionality is currently translated and built-in to Hive&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#making-serialization-faster-with-apache-arrow&#34;&gt;Making serialization faster with Apache Arrow&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#what-is-apache-arrow-and-how-it-improves-performance&#34;&gt;What is Apache Arrow and how it improves performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#notes-on-the-setup-of-apache-arrow&#34;&gt;Notes on the setup of Apache Arrow&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-take-home-message&#34;&gt;The take-home message&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#but-we-still-need-arbitrary-r-function-to-run-fast-on-spark&#34;&gt;But we still need arbitrary R function to run fast on Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;setting-up-spark-with-r-and-sparklyr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Setting up Spark with R and sparklyr&lt;/h1&gt;
&lt;p&gt;Full instructions on setting up sparklyr are outside the scope of this article. Below we only provide a quick set of instructions to get a local Spark instance working with sparklyr.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r201-01-spark-and-r.png&#34; alt=&#34;Apache Spark and R logos&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Apache Spark and R logos&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-a-ready-made-docker-image&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using a ready-made Docker Image&lt;/h2&gt;
&lt;p&gt;For the purpose of this series, a &lt;a href=&#34;https://hub.docker.com/r/jozefhajnala/sparkly&#34;&gt;Docker image&lt;/a&gt; was built which you can use to experiment in the following ways by running one of the commands below within a terminal. If you are using RStudio 1.1 or newer, &lt;a href=&#34;https://jozef.io/r905-rstudio-terminal/&#34;&gt;Terminal functionality&lt;/a&gt; is built into RStudio itself.&lt;/p&gt;
&lt;div id=&#34;interactively-with-r-and-sparklyr&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Interactively with R and sparklyr&lt;/h3&gt;
&lt;p&gt;Running the following should yield an interactive R session with all prerequisites to start working with the sparklyr package using a local Spark instance.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run --rm -it jozefhajnala/sparkly:test R

# Start using sparklyr
library(sparklyr)
sc &amp;lt;- spark_connect(&amp;quot;local&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;interactively-with-the-spark-shell&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Interactively with the Spark shell&lt;/h3&gt;
&lt;p&gt;Running the following should yield an interactive Scala REPL instance. A Spark context should be available as &lt;code&gt;sc&lt;/code&gt; and a Spark session as &lt;code&gt;spark&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run --rm -it jozefhajnala/sparkly:test /root/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-shell&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;running-an-example-r-script&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Running an example R script&lt;/h3&gt;
&lt;p&gt;Running the following should execute an example R script using sparklyr with output appearing in the terminal:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run --rm jozefhajnala/sparkly:test Rscript /root/.local/spark_script.R&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;manual-installation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Manual Installation&lt;/h2&gt;
&lt;p&gt;The following are very basic instructions. For troubleshooting or more detailed step-by-step guides, you can refer to RStudio’s &lt;a href=&#34;https://spark.rstudio.com/#installation&#34;&gt;spark website&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;sparklyr&amp;quot;)
install.packages(&amp;quot;nycflights13&amp;quot;)
sparklyr::spark_install(version = &amp;quot;2.4.3&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;connecting-and-using-a-local-spark-instance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Connecting and using a local Spark instance&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load packages
library(sparklyr)
library(dplyr)
library(nycflights13)

# Connect
sc &amp;lt;- sparklyr::spark_connect(master = &amp;quot;local&amp;quot;)

# Copy the weather dataset to the instance
tbl_weather &amp;lt;- dplyr::copy_to(
  dest = sc, 
  df = nycflights13::weather,
  name = &amp;quot;weather&amp;quot;,
  overwrite = TRUE
)

# Collect it back
tbl_weather %&amp;gt;% collect()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sparklyr-as-a-spark-interface-provider&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Sparklyr as a Spark interface provider&lt;/h1&gt;
&lt;p&gt;The sparklyr package is an R &lt;em&gt;interface&lt;/em&gt; to Apache Spark. The meaning of the word interface is very important in this context as the way we use this interface can significantly affect the performance benefits we get from using Spark.&lt;/p&gt;
&lt;p&gt;To understand the meaning of the above a bit better, we will examine 3 very simple functions that are different in implementation but intend to provide the same results, and how they behave with regards to Spark. We will use datasets from the &lt;a href=&#34;https://cran.r-project.org/package=nycflights13&#34;&gt;nycflights13&lt;/a&gt; package for our examples.&lt;/p&gt;
&lt;div id=&#34;an-r-function-translated-to-spark-sql&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;An R function translated to Spark SQL&lt;/h2&gt;
&lt;p&gt;Using the following &lt;code&gt;fun_implemented()&lt;/code&gt; function will yield the expected results for both a local data frame &lt;code&gt;nycflights13::weather&lt;/code&gt; and the remote Spark object referenced by &lt;code&gt;tbl_weather&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# An R function translated to Spark SQL
fun_implemented &amp;lt;- function(df, col) {
  df %&amp;gt;% mutate({{col}} := tolower({{col}}))
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fun_implemented(nycflights13::weather, origin)
fun_implemented(tbl_weather, origin)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is because the R function &lt;code&gt;tolower&lt;/code&gt; was translated by &lt;code&gt;dbplyr&lt;/code&gt; to Spark SQL function &lt;code&gt;LOWER&lt;/code&gt; and the resulting query was sent to Spark to be executed. We can see the actual translated SQL by running &lt;code&gt;sql_render()&lt;/code&gt; on the function call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dbplyr::sql_render(
  fun_implemented(tbl_weather, origin)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;sql&#34;&gt;&lt;code&gt;&amp;lt;SQL&amp;gt; SELECT LOWER(`origin`) AS `origin`, `year`, `month`, `day`, `hour`,
`temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, `precip`,
`pressure`, `visib`, `time_hour`
FROM `weather`&lt;/code&gt;&lt;/pre&gt;
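&lt;p&gt;The translation can also be inspected without a running Spark instance. The sketch below uses &lt;code&gt;dbplyr::lazy_frame()&lt;/code&gt; with the generic simulated backend &lt;code&gt;simulate_dbi()&lt;/code&gt;. This backend is an assumption made for illustration, and the SQL produced for a real Spark connection can differ in quoting and other details:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(dbplyr)

# A lazy table with no database behind it; column names borrowed from the weather data
lf &amp;lt;- lazy_frame(origin = &amp;quot;EWR&amp;quot;, temp = 39.0, con = simulate_dbi())

# tolower() has a known translation, so it is rendered as a SQL function call
query &amp;lt;- lf %&amp;gt;% mutate(origin = tolower(origin))
show_query(query)

# The rendered SQL as a character string
sql_txt &amp;lt;- as.character(sql_render(query))&lt;/code&gt;&lt;/pre&gt;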
&lt;/div&gt;
&lt;div id=&#34;an-r-function-not-translated-to-spark-sql&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;An R function not translated to Spark SQL&lt;/h2&gt;
&lt;p&gt;Using the following &lt;code&gt;fun_r_only()&lt;/code&gt; function will only yield the expected results for a local data frame &lt;code&gt;nycflights13::weather&lt;/code&gt;. For the remote Spark object referenced by &lt;code&gt;tbl_weather&lt;/code&gt; we will get an error:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# An R function not translated to Spark SQL
fun_r_only &amp;lt;- function(df, col) {
  df %&amp;gt;% mutate({{col}} := casefold({{col}}, upper = FALSE))
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fun_r_only(nycflights13::weather, origin)
fun_r_only(tbl_weather, origin)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;sql&#34;&gt;&lt;code&gt; Error: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input &amp;#39;AS&amp;#39; expecting &amp;#39;)&amp;#39;(line 1, pos 32)

== SQL ==
SELECT casefold(`origin`, FALSE AS `upper`) AS `origin`, 
`year`, `month`, `day`, `hour`, 
`temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`, 
`precip`, `pressure`, `visib`, `time_hour`
--------------------------------^^^
FROM `weather`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is because there simply is no translation provided by dbplyr for the &lt;code&gt;casefold()&lt;/code&gt; function. The generated Spark SQL will therefore not be valid and throw an error once the Spark SQL parser tries to parse it.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-hive-built-in-function-not-existing-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A Hive built-in function not existing in R&lt;/h2&gt;
&lt;p&gt;On the other hand, using the below &lt;code&gt;fun_hive_builtin()&lt;/code&gt; function will only yield the expected results for the remote Spark object referenced by &lt;code&gt;tbl_weather&lt;/code&gt;. For the local data frame &lt;code&gt;nycflights13::weather&lt;/code&gt; we will get an error:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A Hive built-in function not existing in R
fun_hive_builtin &amp;lt;- function(df, col) {
  df %&amp;gt;% mutate({{col}} := lower({{col}}))
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fun_hive_builtin(tbl_weather, origin)
fun_hive_builtin(nycflights13::weather, origin)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Error: Evaluation error: could not find function &amp;quot;lower&amp;quot;.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is because the function &lt;code&gt;lower&lt;/code&gt; does not exist in R itself. For a non-existing R function there obviously is no dbplyr translation either. In this case, dbplyr keeps it as-is when translating to SQL, and the SQL will be valid and executed without problems because &lt;code&gt;lower&lt;/code&gt; is, in fact, a function built-in to Hive:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dbplyr::sql_render(fun_hive_builtin(tbl_weather, origin))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;sql&#34;&gt;&lt;code&gt;&amp;lt;SQL&amp;gt; SELECT lower(`origin`) AS `origin`,
`year`, `month`, `day`, `hour`,
`temp`, `dewp`, `humid`, `wind_dir`, `wind_speed`, `wind_gust`,
`precip`, `pressure`, `visib`, `time_hour`
FROM `weather`&lt;/code&gt;&lt;/pre&gt;
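&lt;p&gt;The difference between a translated function and one that is passed through unchanged can be seen at the level of a single expression with &lt;code&gt;dbplyr::translate_sql()&lt;/code&gt;. As a hedged sketch, the generic &lt;code&gt;simulate_dbi()&lt;/code&gt; backend stands in for Spark here, so the exact output may not match what sparklyr generates:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dbplyr)

con &amp;lt;- simulate_dbi()  # generic simulated backend, not a Spark connection

# Known to dbplyr: rewritten to the SQL function LOWER
known &amp;lt;- translate_sql(tolower(origin), con = con)

# Unknown to dbplyr: the call is passed through with its name unchanged
passed_through &amp;lt;- translate_sql(lower(origin), con = con)

known
passed_through&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Whether a passed-through call then succeeds depends entirely on whether the engine receiving the SQL knows a function of that name.&lt;/p&gt;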
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-non-translated-functions-with-sparklyr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using non-translated functions with sparklyr&lt;/h1&gt;
&lt;p&gt;It can easily happen that one of the functions we want to use is neither translated nor a Hive built-in function. For this case, sparklyr provides another interface - the &lt;code&gt;spark_apply()&lt;/code&gt; function. Here is an oversimplified example that will reach our goal with &lt;code&gt;casefold()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fun_r_custom &amp;lt;- function(tbl, colName) {
  tbl[[colName]] &amp;lt;- casefold(tbl[[colName]], upper = FALSE)
  tbl
}

spark_apply(tbl_weather, fun_r_custom, context = {colName &amp;lt;- &amp;quot;origin&amp;quot;})&lt;/code&gt;&lt;/pre&gt;
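&lt;p&gt;Since &lt;code&gt;spark_apply()&lt;/code&gt; calls the function on ordinary R data frames, &lt;code&gt;fun_r_custom()&lt;/code&gt; can be checked locally before it is sent to Spark. A small sanity check on a made-up data frame (the values are hypothetical):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Same definition as above, repeated so the snippet runs on its own
fun_r_custom &amp;lt;- function(tbl, colName) {
  tbl[[colName]] &amp;lt;- casefold(tbl[[colName]], upper = FALSE)
  tbl
}

# Hypothetical stand-in for one partition of the weather data
part &amp;lt;- data.frame(
  origin = c(&amp;quot;EWR&amp;quot;, &amp;quot;JFK&amp;quot;),
  temp = c(39.0, 41.2),
  stringsAsFactors = FALSE
)

res &amp;lt;- fun_r_custom(part, &amp;quot;origin&amp;quot;)
res$origin&lt;/code&gt;&lt;/pre&gt;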
&lt;div id=&#34;what-is-so-important-about-this-distinction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is so important about this distinction?&lt;/h2&gt;
&lt;p&gt;We have now shown that we can also send code that was not translated by &lt;code&gt;dbplyr&lt;/code&gt; to Spark and get it executed without issues using &lt;code&gt;spark_apply()&lt;/code&gt;. So what is the catch and where does the importance of the meaning of the word &lt;em&gt;interface&lt;/em&gt; come in?&lt;/p&gt;
&lt;p&gt;Let us quickly examine the performance of the operations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mb = microbenchmark::microbenchmark(
  times = 10,
  hive_builtin = fun_hive_builtin(tbl_weather, origin) %&amp;gt;% collect(),
  translated_dplyr = fun_implemented(tbl_weather, origin) %&amp;gt;% collect(),
  spark_apply = spark_apply(tbl_weather, fun_r_custom, context = {colName &amp;lt;- &amp;quot;origin&amp;quot;}) %&amp;gt;% collect()
)&lt;/code&gt;&lt;/pre&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#r201-01-bench-spark-apply&#39;).highcharts({
  title: {     
    text: &#34;Simple column transformation on a small dataset&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;time (milliseconds)&#34;     
    },     
    min: 0     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    boxplot: {     
      fillColor: &#34;#C9E4FF&#34;,     
      lineWidth: 0.5,     
      medianWidth: 1,     
      stemDashStyle: &#34;dot&#34;,     
      stemWidth: 1,     
      whiskerLength: &#34;40%&#34;,     
      whiskerWidth: 1     
    }     
  },     
  chart: {     
    type: &#34;column&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      g2: null,     
      data: [     
        {     
          name: &#34;hive_builtin&#34;,     
          low: 396,     
          q1: 430,     
          median: 461,     
          q3: 486,     
          high: 495     
        },     
        {     
          name: &#34;translated_dplyr&#34;,     
          low: 407,     
          q1: 431,     
          median: 462.5,     
          q3: 501,     
          high: 511     
        },     
        {     
          name: &#34;spark_apply&#34;,     
          low: 372653,     
          q1: 374472,     
          median: 376849,     
          q3: 381262,     
          high: 381262     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null,     
      color: &#34;blue&#34;,     
      name: &#34;Simple column transformation on a small dataset&#34;     
    }     
  ]     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;r201-01-bench-spark-apply&#34;&gt;

&lt;/div&gt;
&lt;p&gt;Note that the absolute values here will vary based on the setup. The important message is in the relative differences.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We can see that the operations executed via the SQL translation mechanism of dbplyr were executed in around &lt;em&gt;0.5 seconds&lt;/em&gt; while those via spark_apply took orders of magnitude longer - more than &lt;em&gt;6 minutes&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
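&lt;p&gt;To read such results in relative rather than absolute terms, the &lt;code&gt;summary()&lt;/code&gt; method for microbenchmark objects accepts a &lt;code&gt;unit&lt;/code&gt; argument. A self-contained sketch, with two local expressions standing in for the Spark calls and a made-up input vector:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(microbenchmark)

# Hypothetical input, only there to give the expressions something to do
x &amp;lt;- rep(c(&amp;quot;EWR&amp;quot;, &amp;quot;JFK&amp;quot;, &amp;quot;LGA&amp;quot;), times = 1e4)

mb_local &amp;lt;- microbenchmark(
  times = 10,
  tolower_call = tolower(x),
  casefold_call = casefold(x, upper = FALSE)
)

# Median timings in milliseconds, then relative to the fastest expression
summary(mb_local, unit = &amp;quot;ms&amp;quot;)[, c(&amp;quot;expr&amp;quot;, &amp;quot;median&amp;quot;)]
summary(mb_local, unit = &amp;quot;relative&amp;quot;)[, c(&amp;quot;expr&amp;quot;, &amp;quot;median&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;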
&lt;/div&gt;
&lt;div id=&#34;what-happens-when-we-use-custom-functions-with-spark_apply&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What happens when we use custom functions with &lt;code&gt;spark_apply&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;We can now see that the operation with &lt;code&gt;spark_apply()&lt;/code&gt; is extremely slow compared to the other two. The key to understanding the difference is to examine how the custom transformations of data using R functions are performed within &lt;code&gt;spark_apply()&lt;/code&gt;. In simplified terms, this happens in a few steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;the data is moved in row-format from Spark into the R process through a socket connection. This is inefficient as multiple data types need to be deserialized over each row&lt;/li&gt;
&lt;li&gt;the data gets converted to columnar format since this is how R data frames are implemented&lt;/li&gt;
&lt;li&gt;the R functions are applied to compute the results&lt;/li&gt;
&lt;li&gt;the results are again converted to row-format, serialized row-by-row and sent back to Spark over the socket connection&lt;/li&gt;
&lt;/ol&gt;
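&lt;p&gt;The shape of these steps can be mimicked in plain R. The following is purely illustrative and is not the serialization code used by sparklyr. It only shows the row-to-column and column-to-row conversions on a made-up pair of records:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical data as it might arrive row by row: one list per record
rows &amp;lt;- list(
  list(id = 1L, origin = &amp;quot;EWR&amp;quot;),
  list(id = 2L, origin = &amp;quot;JFK&amp;quot;)
)

# Step 2: build a columnar data frame, one vector per column
cols &amp;lt;- data.frame(
  id = vapply(rows, function(r) r$id, integer(1)),
  origin = vapply(rows, function(r) r$origin, character(1)),
  stringsAsFactors = FALSE
)

# Step 3: apply the R function to the whole column at once
cols$origin &amp;lt;- casefold(cols$origin, upper = FALSE)

# Step 4: convert back to one record per row
rows_out &amp;lt;- lapply(
  seq_len(nrow(cols)),
  function(i) as.list(cols[i, , drop = FALSE])
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even in this toy form, every record is touched twice outside of the actual computation, which hints at why the conversions dominate the runtime for a cheap transformation.&lt;/p&gt;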
&lt;/div&gt;
&lt;div id=&#34;what-happens-when-we-use-translated-or-hive-built-in-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What happens when we use translated or Hive built-in functions&lt;/h2&gt;
&lt;p&gt;When using functions that can be translated to Spark SQL, the process is very different:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The call is translated to Spark SQL using the dbplyr backend&lt;/li&gt;
&lt;li&gt;The constructed query is sent to Spark for execution using DBI&lt;/li&gt;
&lt;li&gt;Only when &lt;code&gt;collect()&lt;/code&gt; or &lt;code&gt;compute()&lt;/code&gt; is called, the SQL is executed within Spark&lt;/li&gt;
&lt;li&gt;Only when &lt;code&gt;collect()&lt;/code&gt; is called the results are also sent to the R session&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means that the transfer of data only happens once and only when &lt;code&gt;collect()&lt;/code&gt; is called, which saves a vast amount of overhead.&lt;/p&gt;
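&lt;p&gt;The lazy behavior described above can be observed step by step. The sketch below assumes a local Spark installation as set up earlier in this article. The table name &lt;code&gt;weather_lower&lt;/code&gt; is a hypothetical name chosen for the example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(sparklyr)
library(dplyr)

# Assumes a local Spark installation is available
sc &amp;lt;- spark_connect(master = &amp;quot;local&amp;quot;)
tbl_weather &amp;lt;- copy_to(sc, nycflights13::weather, name = &amp;quot;weather&amp;quot;, overwrite = TRUE)

# Building the pipeline only constructs a lazy query; no data is moved to R
lazy_query &amp;lt;- tbl_weather %&amp;gt;% mutate(origin = tolower(origin))
show_query(lazy_query)

# compute() materializes the result as a temporary table inside Spark
in_spark &amp;lt;- compute(lazy_query, name = &amp;quot;weather_lower&amp;quot;)

# collect() is the step that transfers the rows into the R session
in_r &amp;lt;- collect(lazy_query)

spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;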
&lt;/div&gt;
&lt;div id=&#34;which-r-functionality-is-currently-translated-and-built-in-to-hive&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Which R functionality is currently translated and built-in to Hive&lt;/h2&gt;
&lt;p&gt;An important question to answer with regards to performance then is what amount of functionality is available using the fast dbplyr backend. As seen above, these features can be categorized into two groups:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;R functions translatable to Spark SQL via dbplyr. The full list of such functions is available on &lt;a href=&#34;https://spark.rstudio.com/dplyr/#sql-translation&#34;&gt;RStudio’s sparklyr website&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hive built-in functions that get translated as they are and can be evaluated by Spark. The full list is available on the &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF&#34;&gt;Hive Operators and User-Defined Functions&lt;/a&gt; website.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;making-serialization-faster-with-apache-arrow&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Making serialization faster with Apache Arrow&lt;/h1&gt;
&lt;div id=&#34;what-is-apache-arrow-and-how-it-improves-performance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is Apache Arrow and how it improves performance&lt;/h2&gt;
&lt;p&gt;Our benchmarks have shown that using &lt;code&gt;spark_apply()&lt;/code&gt; does not scale well and the penalty of the bottleneck in performance caused by serialization, deserialization, and transfer is too high.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To partially mitigate this we can take advantage of &lt;a href=&#34;https://arrow.apache.org/&#34;&gt;Apache Arrow&lt;/a&gt;, a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With Arrow support in sparklyr, the row-format to column-format conversion is performed in parallel within Spark. The data is then transferred through the socket without any custom serialization, and all the R process needs to do is copy the data from the socket into its heap, transform it, and copy it back to the socket connection.&lt;/p&gt;
&lt;p&gt;This makes the process significantly faster:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mb = microbenchmark::microbenchmark(
  times = 10, 
  setup = library(arrow),
  hive_builtin = fun_hive_builtin(tbl_weather, origin) %&amp;gt;% collect(),
  translated_dplyr = fun_implemented(tbl_weather, origin) %&amp;gt;% collect(),
  spark_apply_arrow = spark_apply(tbl_weather, fun_r_custom, context = {colName &amp;lt;- &amp;quot;origin&amp;quot;}) %&amp;gt;% collect()
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see that the timing on &lt;code&gt;spark_apply()&lt;/code&gt; decreased from more than 6 minutes to around 4.5 seconds, which is a very significant performance boost. Compared to the other methods, however, we still see an order of magnitude difference.&lt;/p&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#r201-02-bench-spark-apply&#39;).highcharts({
  title: {     
    text: &#34;Simple column transformation on a small dataset&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;time (milliseconds)&#34;     
    },     
    min: 0     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    boxplot: {     
      fillColor: &#34;#C9E4FF&#34;,     
      lineWidth: 0.5,     
      medianWidth: 1,     
      stemDashStyle: &#34;dot&#34;,     
      stemWidth: 1,     
      whiskerLength: &#34;40%&#34;,     
      whiskerWidth: 1     
    }     
  },     
  chart: {     
    type: &#34;column&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      g2: null,     
      data: [     
        {     
          name: &#34;hive_builtin&#34;,     
          low: 494,     
          q1: 510,     
          median: 524.5,     
          q3: 544,     
          high: 577     
        },     
        {     
          name: &#34;translated_dplyr&#34;,     
          low: 439,     
          q1: 460,     
          median: 556.5,     
          q3: 571,     
          high: 572     
        },     
        {     
          name: &#34;spark_apply_arrow&#34;,     
          low: 4491,     
          q1: 4498,     
          median: 4526.5,     
          q3: 4571,     
          high: 4571     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null,     
      color: &#34;blue&#34;,     
      name: &#34;Simple column transformation on a small dataset&#34;     
    }     
  ]     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;r201-02-bench-spark-apply&#34;&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;notes-on-the-setup-of-apache-arrow&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Notes on the setup of Apache Arrow&lt;/h2&gt;
&lt;p&gt;It is worth noting that the implementation of Apache Arrow &lt;a href=&#34;https://github.com/apache/arrow/tree/master/r&#34;&gt;for R&lt;/a&gt; arrived on &lt;a href=&#34;https://cran.r-project.org/package=arrow&#34;&gt;CRAN&lt;/a&gt; in early August 2019, which means that at the time of writing it has been on CRAN for about 3 weeks. The functionality also depends on the &lt;a href=&#34;https://arrow.apache.org/install/&#34;&gt;Arrow C++ library&lt;/a&gt;, so installation is a bit more difficult than with some other R packages.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Care should also be taken with regards to the compatibility of the C++ library version, the arrow R package version and the sparklyr version. We had good results using version 0.14.1 of the R package arrow, sparklyr 1.0.2 and version 0.14.1 of the C++ libraries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The aforementioned &lt;a href=&#34;https://hub.docker.com/r/jozefhajnala/sparkly&#34;&gt;Docker image&lt;/a&gt; has both the C++ libraries and the R arrow package available for use.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-take-home-message&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The take-home message&lt;/h1&gt;
&lt;p&gt;Adding Arrow to the mix significantly improved the performance of our example code, but it is still quite slow compared to the native approach. Based on the above, we could conclude that&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Performance benefits are present mainly when all the computation is performed within Spark and R serves merely as a “messaging agent”, sending commands to Spark to be executed. If object serialization and transfer of larger objects are involved, performance is strongly impacted.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The take-home message from this exercise is that we should strive to only use R code that can be executed within the Spark instance. If we need some data retrieved, it is advisable that this is data that was previously heavily aggregated within Spark and only a small amount is transferred to the R session.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;but-we-still-need-arbitrary-r-function-to-run-fast-on-spark&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;But we still need arbitrary R functions to run fast on Spark&lt;/h1&gt;
&lt;p&gt;In the next installments of this series, we will investigate a few options that allow us to retain the performance of Spark while still being able to write arbitrary R functions (i.e. functions not directly provided by the sparklyr interface that use methods already implemented and available in the Spark API) by:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Rewriting the functions as collections of dplyr verbs that all support translation to Spark SQL&lt;/li&gt;
&lt;li&gt;Rewriting the functions as series of Scala method invocations&lt;/li&gt;
&lt;li&gt;Rewriting the functions into Spark SQL and using &lt;code&gt;DBI&lt;/code&gt; to execute directly&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://arrow.apache.org/&#34;&gt;Apache Arrow&lt;/a&gt; and RStudio’s &lt;a href=&#34;https://spark.rstudio.com/&#34;&gt;Spark website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Homepage of &lt;a href=&#34;https://arrow.apache.org/&#34;&gt;Apache Arrow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R Apache Arrow &lt;a href=&#34;https://github.com/apache/arrow/tree/master/r&#34;&gt;on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R package &lt;a href=&#34;https://cran.r-project.org/package=arrow&#34;&gt;arrow on CRAN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Arrow C++ library &lt;a href=&#34;https://arrow.apache.org/install/&#34;&gt;installation guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation on &lt;a href=&#34;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF&#34;&gt;Hive Operators and User-Defined Functions&lt;/a&gt; website.&lt;/li&gt;
&lt;li&gt;A &lt;a href=&#34;https://hub.docker.com/r/jozefhajnala/sparkly&#34;&gt;Docker image&lt;/a&gt; with R, Spark, sparklyr and Arrow available and &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/sparkly/Dockerfile&#34;&gt;its Dockerfile&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Using parallelization, multiple git repositories and setting permissions when automating R applications with Jenkins</title>
      <link>https://jozef.io/r919-jenkins-pipelines-parallel/</link>
      <pubDate>Sat, 10 Aug 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r919-jenkins-pipelines-parallel/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the &lt;a href=&#34;https://jozef.io/r918-jenkins-pipelines/&#34;&gt;previous post&lt;/a&gt;, we focused on setting up declarative Jenkins pipelines with emphasis on parametrizing builds and using environment variables across pipeline stages.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we look at various tips that can be useful when automating R application testing and continuous integration, with regard to orchestrating parallelization, combining sources from multiple git repositories and ensuring proper access rights for the Jenkins agent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#running-stages-in-parallel&#34;&gt;Running stages in parallel&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#parallel-computation-using-r&#34;&gt;Parallel computation using R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#orchestrating-parallelization-of-r-jobs-with-jenkins&#34;&gt;Orchestrating parallelization of R jobs with Jenkins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#failing-early&#34;&gt;Failing early&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#cloning-multiple-git-repositories&#34;&gt;Cloning multiple git repositories&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#cloning-into-a-separate-subdirectory&#34;&gt;Cloning into a separate subdirectory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#cleaning-up&#34;&gt;Cleaning up&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#changing-permissions-to-allow-the-jenkins-user-to-read&#34;&gt;Changing permissions to allow the Jenkins user to read&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;running-stages-in-parallel&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Running stages in parallel&lt;/h1&gt;
&lt;div id=&#34;parallel-computation-using-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Parallel computation using R&lt;/h2&gt;
&lt;p&gt;There are numerous ways to achieve parallel computation in the context of an R application. Those native to R include, for example,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf&#34;&gt;parallel package&lt;/a&gt;, which is included with base R since version 2.14 and very stable, or&lt;/li&gt;
&lt;li&gt;the more recent &lt;a href=&#34;https://cran.r-project.org/package=future&#34;&gt;future package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;the CRAN Task View: &lt;a href=&#34;https://cran.r-project.org/web/views/HighPerformanceComputing.html&#34;&gt;High-Performance and Parallel Computing with R&lt;/a&gt; provides a useful and extensive overview of multiple topics, including parallelism with R&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Governing parallelism directly within R code requires tackling many aspects, starting with logging and ending with handling conditions and exceptions. We might therefore also be interested in leaving the orchestration of parallelism to a layer above the R application code itself. This approach has both benefits and limitations, so it should be considered carefully before the implementation starts.&lt;/p&gt;
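&lt;p&gt;For comparison, a minimal sketch of parallelism handled within the R code itself, using the parallel package, could look as follows. The function &lt;code&gt;slow_square()&lt;/code&gt; and the cluster size of 2 are made up for illustration only. Even this small example has to start and stop the worker processes itself:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(parallel)

# Hypothetical stand-in for an expensive computation
slow_square &amp;lt;- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Start a small cluster of worker processes (2 is an arbitrary choice)
cl &amp;lt;- makeCluster(2)

# Apply the function to each element, spreading the work across the workers
res &amp;lt;- parLapply(cl, 1:4, slow_square)

# Release the workers when done
stopCluster(cl)

unlist(res)&lt;/code&gt;&lt;/pre&gt;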
&lt;/div&gt;
&lt;div id=&#34;orchestrating-parallelization-of-r-jobs-with-jenkins&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Orchestrating parallelization of R jobs with Jenkins&lt;/h2&gt;
&lt;p&gt;Declarative Jenkins pipelines are one way to orchestrate parallelism, with many options available. A very simple example of a parallelized process can look as follows:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;pipeline {
  agent any
  stages {

    stage(&amp;#39;Preparation&amp;#39;) {
      steps {
        // Cleanup, Environment setup, etc.
      }
    }

    stage(&amp;#39;Tests&amp;#39;) {
      parallel {
        stage(&amp;#39;Unit Tests&amp;#39;) {
          steps {
            // Invoke unit tests
          }
        }
        stage(&amp;#39;Integration Tests&amp;#39;) {
          steps {
            // Invoke integration tests
          }
        }
        stage(&amp;#39;Regression Tests&amp;#39;) {
          steps {
            // Invoke regression tests
          }
        }
        stage(&amp;#39;Technical checks&amp;#39;) {
          steps {
            // Invoke Technical checks
          }
        }
      }
    }

  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the &lt;code&gt;parallel&lt;/code&gt; directive, which will ensure that the (sub)stages within it&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unit Tests&lt;/li&gt;
&lt;li&gt;Integration Tests&lt;/li&gt;
&lt;li&gt;Regression Tests and&lt;/li&gt;
&lt;li&gt;Technical checks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;will be executed in parallel.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The parallel stages will only start after the first stage, “Preparation”, has finished. This is useful in case we need a stage that is shared among the parallel stages to be executed first.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;failing-early&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Failing early&lt;/h2&gt;
&lt;p&gt;If we want to fail the parallel stages early (as soon as any of them fails), we can add &lt;code&gt;failFast true&lt;/code&gt; into the parallel stage:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;stage(&amp;#39;Tests&amp;#39;) {
  failFast true
  parallel {
    // ...
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r919-01-parallel-stages-blueocean.png&#34; alt=&#34;An example parallel Jenkins pipeline shown by BlueOcean. Image credit https://bit.ly/31e8cAy&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;An example parallel Jenkins pipeline shown by BlueOcean. Image credit &lt;a href=&#34;https://bit.ly/31e8cAy&#34; class=&#34;uri&#34;&gt;https://bit.ly/31e8cAy&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;cloning-multiple-git-repositories&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Cloning multiple git repositories&lt;/h1&gt;
&lt;p&gt;In certain situations, we may need to clone not just the main repository that is subject to our multibranch pipeline, but also secondary repositories.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An example of such a setup is when we store modeling parameters for our run in a separate repository, or when configurations governing the runs are stored in a separate repository.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;code&gt;git&lt;/code&gt; step allows us to clone another repository. Note that if you need to use credentials for the process, those are configured in &lt;a href=&#34;https://jenkins.io/doc/book/using/using-credentials/#configuring-credentials&#34;&gt;Jenkins’ credential configuration&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;stage(&amp;#39;Clone another repository&amp;#39;) {
  steps {
    git branch: &amp;#39;master&amp;#39;,
    credentialsId: &amp;#39;my-credential-id&amp;#39;,
    url: &amp;#39;git@github.com:user/repo.git&amp;#39;
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;cloning-into-a-separate-subdirectory&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Cloning into a separate subdirectory&lt;/h2&gt;
&lt;p&gt;Note, however, that this will clone the repository into the current working directory, where the main repository subject to the pipeline is likely already checked out. This may have unintended consequences, so a safer approach is to check out the secondary repository into a separate directory. We can achieve this using the &lt;code&gt;dir&lt;/code&gt; step:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;stage(&amp;#39;Clone another repository to subdir&amp;#39;) {
  steps {
    sh &amp;#39;rm -rf subdir; mkdir subdir&amp;#39;
    dir (&amp;#39;subdir&amp;#39;) {
      git branch: &amp;#39;master&amp;#39;,
        credentialsId: &amp;#39;my-credential-id&amp;#39;,
        url: &amp;#39;git@github.com:user/repo.git&amp;#39;
    }
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;cleaning-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Cleaning up&lt;/h2&gt;
&lt;p&gt;After the pipeline is done, it may be useful to perform cleanup steps, for example removing unneeded directories. Since we likely want to clean those up regardless of the pipeline results, we can take advantage of the &lt;code&gt;post&lt;/code&gt; directive running &lt;code&gt;always&lt;/code&gt;, which will be executed regardless of the outcome of the pipeline stages.&lt;/p&gt;
&lt;p&gt;One example use is to remove the hidden &lt;code&gt;.git&lt;/code&gt; directories from both the working directory, where the main repository is checked out and the &lt;code&gt;&amp;quot;subdir&amp;quot;&lt;/code&gt;, where we checked out the secondary repository:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;post {
  always {
    sh &amp;#39;rm -rf .git&amp;#39;
    sh &amp;#39;rm -rf subdir/.git&amp;#39;
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;changing-permissions-to-allow-the-jenkins-user-to-read&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Changing permissions to allow the Jenkins user to read&lt;/h1&gt;
&lt;p&gt;One aspect of using Jenkins to execute our R code is to ensure that the Jenkins user executing the code on the worker node has access to all the necessary files. The following is a list of useful Linux commands that can help with the setup. These should, of course, be used with care.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Add user `jenkins` to group `somegroup`
usermod -a -G somegroup jenkins

# Change group of somedir/ to somegroup, recursively
chgrp -R somegroup somedir/

# Allow group to read `somedir`, recursively
chmod -R g+r somedir/

# Find all directories in a path and allow group to traverse
find /dir/moredir/somedir -type d -exec chmod g+x {} \;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Jenkins documentation on &lt;a href=&#34;https://jenkins.io/doc/book/pipeline/syntax/#parallel&#34;&gt;parallel blocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jenkins documentation on &lt;a href=&#34;https://jenkins.io/doc/book/using/using-credentials/#configuring-credentials&#34;&gt;credential configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;UnixExchange: &lt;a href=&#34;https://unix.stackexchange.com/a/13891&#34;&gt;Traversing directories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;StackOverflow: &lt;a href=&#34;https://stackoverflow.com/questions/38461705/checkout-jenkins-pipeline-git-scm-with-credentials&#34;&gt;Checkout multiple git repos into same Jenkins workspace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;StackOverflow: &lt;a href=&#34;https://stackoverflow.com/questions/38461705/checkout-jenkins-pipeline-git-scm-with-credentials&#34;&gt;Checkout Jenkins Pipeline Git SCM with credentials?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Using environment variables and parametrized builds for automating R applications with Jenkins</title>
      <link>https://jozef.io/r918-jenkins-pipelines/</link>
      <pubDate>Sat, 27 Jul 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r918-jenkins-pipelines/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Jenkins is a popular open-source tool that helps teams with automation and implementation of continuous integration and deployment pipelines, comparable to, for example, Atlassian’s Bamboo, GitLab CI or, to some extent, Travis.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we share some practical lessons learned when integrating R applications via Jenkins for the purpose of continuous integration and regression testing on runner nodes configured using Jenkins via declarative pipelines defined in a Jenkinsfile.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#propagating-environment-variables-to-r-sessions&#34;&gt;Propagating environment variables to R sessions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#checking-and-accessing-the-propagated-variables&#34;&gt;Checking and accessing the propagated variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-a-per-pipeline-r-library&#34;&gt;Using a per-pipeline R library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#working-with-parametrized-builds-from-r&#34;&gt;Working with parametrized builds from R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r918-01-jenkins-pipeline.png&#34; alt=&#34;Example jenkins pipeline. Image credit https://bit.ly/2fpnBWI&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Example jenkins pipeline. Image credit &lt;a href=&#34;https://bit.ly/2fpnBWI&#34; class=&#34;uri&#34;&gt;https://bit.ly/2fpnBWI&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;propagating-environment-variables-to-r-sessions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Propagating environment variables to R sessions&lt;/h1&gt;
&lt;p&gt;When running R code on a local machine or a remote server from a user perspective, we count on a lot of configuration that is already present potentially without the user even noticing or knowing about the details of that configuration. One example of such configuration is the environment variables that configure some of R’s behavior.&lt;/p&gt;
&lt;p&gt;When running R code on a computer that is connected to the Jenkins server as a node (a place where Jenkins sends the jobs to run), those environment variables likely need to be passed to the worker process, including configuration present for example in &lt;code&gt;.Renviron&lt;/code&gt; files and &lt;code&gt;.Rprofile&lt;/code&gt; files.&lt;/p&gt;
&lt;p&gt;To propagate environment variables to all the stages of a declarative pipeline, we can use the &lt;code&gt;environment&lt;/code&gt; directive in the pipeline definition. For example, to propagate a path to a user library, an example Jenkinsfile could look as follows:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;pipeline {
  environment {
    R_LIBS_USER = &amp;#39;/path/to/lib&amp;#39;
  }
  // ... pipeline continues ...
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will ensure that the environment variables defined will be propagated to all the stages defined in the pipeline.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: We might be tempted to simply use &lt;code&gt;export&lt;/code&gt; on the variables that need to be propagated to other stages. While this will likely work in a classic setup where we are running multiple R scripts under the same shell, Jenkins runs each of the stages in a separate shell, meaning that &lt;code&gt;export&lt;/code&gt; does &lt;em&gt;not&lt;/em&gt; ensure that the variables will be propagated to other stages. The same of course applies to using &lt;code&gt;Sys.setenv()&lt;/code&gt; from R itself.&lt;/p&gt;
&lt;/blockquote&gt;
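&lt;p&gt;The behavior of &lt;code&gt;Sys.setenv()&lt;/code&gt; described in the note can be illustrated with a small sketch that starts two separate R processes from R. The variable name &lt;code&gt;MY_VAR&lt;/code&gt; is made up, and the example assumes a Unix-like shell and that the variable is not already set in the parent environment:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rscript &amp;lt;- file.path(R.home(&amp;quot;bin&amp;quot;), &amp;quot;Rscript&amp;quot;)

# The first R process sets the variable, which only exists within that process
system2(rscript, c(&amp;quot;-e&amp;quot;, shQuote(&amp;#39;Sys.setenv(MY_VAR = &amp;quot;some value&amp;quot;)&amp;#39;)))

# A second, separate R process does not see it and prints the fallback value
system2(rscript, c(&amp;quot;-e&amp;quot;, shQuote(&amp;#39;cat(Sys.getenv(&amp;quot;MY_VAR&amp;quot;, &amp;quot;not set&amp;quot;))&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;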
&lt;/div&gt;
&lt;div id=&#34;checking-and-accessing-the-propagated-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Checking and accessing the propagated variables&lt;/h1&gt;
&lt;p&gt;To test whether our environment variables were propagated as intended, we can use &lt;code&gt;printenv&lt;/code&gt;, for example in a stage dedicated to showing the environment variables:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;pipeline {
  environment {
    R_LIBS_USER = &amp;#39;/path/to/lib&amp;#39;
  }
  agent any
  stages {
    stage(&amp;#39;Show env vars&amp;#39;) {
      steps {
        sh &amp;#39;printenv&amp;#39;
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From R, we can access the environment variables using &lt;code&gt;Sys.getenv()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# List all environment variables
Sys.getenv()

# Get a specific one
Sys.getenv(&amp;quot;R_LIBS_USER&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-a-per-pipeline-r-library&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using a per-pipeline R library&lt;/h1&gt;
&lt;p&gt;For continuous integration purposes, it is useful to get our code checked out and tested on each commit. To get our packages installed into a separate library for each branch, one of the options is setting a user library path.&lt;/p&gt;
&lt;p&gt;Doing that we can also choose the granularity of the separation we want to achieve. For example, using a library per branch in a multibranch pipeline context:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;environment {
  R_LIBS_USER = &amp;quot;&amp;quot;&amp;quot;${sh(
    returnStdout: true,
    script: &amp;#39;echo $PWD/test-lib&amp;#39;
  )}&amp;quot;&amp;quot;&amp;quot;.trim()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means the same library is used for each build of the same branch. If we need more granularity, we can use a separate library per branch and build by adding the &lt;code&gt;BUILD_ID&lt;/code&gt; variable to the path:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;environment {
  R_LIBS_USER = &amp;quot;&amp;quot;&amp;quot;${sh(
    returnStdout: true,
    script: &amp;#39;echo $PWD/$BUILD_ID/test-lib&amp;#39;
  )}&amp;quot;&amp;quot;&amp;quot;.trim()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the need to apply the &lt;code&gt;trim()&lt;/code&gt; method on the constructed paths to strip the whitespace and line breaks that get produced when retrieving the value from standard output.&lt;/p&gt;
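&lt;p&gt;On the R side, it can be worth checking that the library is actually picked up, because R only adds the &lt;code&gt;R_LIBS_USER&lt;/code&gt; directory to its library paths at startup if that directory already exists. A minimal sketch that could run at the start of a build script, assuming the variable was set by the pipeline as shown above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lib &amp;lt;- Sys.getenv(&amp;quot;R_LIBS_USER&amp;quot;)

if (nzchar(lib)) {
  # Create the library directory if this is the first build using it
  if (!dir.exists(lib)) dir.create(lib, recursive = TRUE)
  # Put the per-pipeline library first on the search path
  .libPaths(c(lib, .libPaths()))
}

# Inspect the resulting library paths
.libPaths()&lt;/code&gt;&lt;/pre&gt;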
&lt;/div&gt;
&lt;div id=&#34;working-with-parametrized-builds-from-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Working with parametrized builds from R&lt;/h1&gt;
&lt;p&gt;Jenkins also offers the option to parametrize builds, such that parameters of several types can be passed as environment variables to the shell through which the staged jobs are executed.&lt;/p&gt;
&lt;p&gt;For usage with R applications, this means we can retrieve such parameters using the &lt;code&gt;Sys.getenv()&lt;/code&gt; function. For example, if we create a parameter named &lt;code&gt;r_num_cores&lt;/code&gt; in Jenkins, we can easily access its value within the build:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Sys.getenv(&amp;quot;r_num_cores&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A small caveat to this is that all the parameters are passed as strings, so in case we want to pass R objects as parameters (for example a vector &lt;code&gt;c(1, 2)&lt;/code&gt;), we would need to parse the string values, for example by writing a wrapper function. A naive implementation of such a wrapper can look as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;env_get &amp;lt;- function(varName, parse = TRUE) {
  res &amp;lt;- Sys.getenv(varName)
  if (isTRUE(parse)) res &amp;lt;- eval(parse(text = res))
  res
}&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;It is also worth noting that syntactical differences can require some further tweaking. For example, boolean Jenkins parameters are passed as &lt;code&gt;&amp;quot;true&amp;quot;&lt;/code&gt; or &lt;code&gt;&amp;quot;false&amp;quot;&lt;/code&gt;, so they would not work with the &lt;code&gt;eval(parse(...))&lt;/code&gt; approach unless changed to uppercase first.&lt;/p&gt;
&lt;/blockquote&gt;
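&lt;p&gt;One possible way to handle such boolean parameters is a separate small helper that converts the string directly, without evaluating it as code. The helper name &lt;code&gt;env_get_lgl()&lt;/code&gt; and the parameter name &lt;code&gt;r_run_slow_tests&lt;/code&gt; are hypothetical and only serve as an illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper for boolean Jenkins parameters
env_get_lgl &amp;lt;- function(varName) {
  res &amp;lt;- Sys.getenv(varName)
  # Uppercase &amp;quot;true&amp;quot;/&amp;quot;false&amp;quot; to match R&amp;#39;s TRUE/FALSE, then convert;
  # unset or unrecognized values yield NA
  as.logical(toupper(res))
}

# Simulate a parameter passed by Jenkins (the name is made up)
Sys.setenv(r_run_slow_tests = &amp;quot;true&amp;quot;)
env_get_lgl(&amp;quot;r_run_slow_tests&amp;quot;)&lt;/code&gt;&lt;/pre&gt;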
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Jenkins documentation on &lt;a href=&#34;https://jenkins.io/doc/book/pipeline/multibranch/#creating-a-multibranch-pipeline&#34;&gt;Creating multibranch pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jenkins documentation on &lt;a href=&#34;https://jenkins.io/doc/book/pipeline/syntax/#declarative-pipeline&#34;&gt;Declarative pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jenkins documentation on &lt;a href=&#34;https://jenkins.io/doc/book/pipeline/jenkinsfile/#setting-environment-variables&#34;&gt;Setting environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jenkins documentation on &lt;a href=&#34;https://wiki.jenkins.io/display/JENKINS/Parameterized+Build&#34;&gt;Parametrized builds&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How data.table&#39;s fread can save you a lot of time and memory, and take input from shell commands</title>
      <link>https://jozef.io/r917-fread-comparisons/</link>
      <pubDate>Sat, 22 Jun 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r917-fread-comparisons/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Recently I was involved in a task that included reading and writing quite large amounts of data, totaling more than 1 TB worth of csvs without the standard big data infrastructure. After trying multiple approaches, the one that made this possible was using data.table’s reading and writing facilities - &lt;code&gt;fread()&lt;/code&gt; and &lt;code&gt;fwrite()&lt;/code&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This motivated me to look at benchmarking data.table’s &lt;code&gt;fread()&lt;/code&gt; and how it compares to other packages such as tidyverse’s readr and base R for reading tabular data from text files such as csvs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#comparing-fread-readrs-read_csv-and-base-r&#34;&gt;Comparing fread, readr’s read_csv and base R&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#the-benchmarked-data&#34;&gt;The benchmarked data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#base-r-code-to-be-benchmarked&#34;&gt;Base R code to be benchmarked&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data.table-fread-code-to-be-benchmarked&#34;&gt;data.table fread code to be benchmarked&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#readrread_csv-code-to-be-benchmarked&#34;&gt;readr::read_csv code to be benchmarked&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-benchmarking-method&#34;&gt;The benchmarking method&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-results&#34;&gt;The results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#when-your-mind-gets-blown---fread-from-shell-command-outputs&#34;&gt;When your mind gets blown - fread() from shell command outputs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#optimizing-further&#34;&gt;Optimizing further&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-show-me-the-code&#34;&gt;TL;DR - Just show me the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;comparing-fread-readrs-read_csv-and-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Comparing fread, readr’s read_csv and base R&lt;/h1&gt;
&lt;p&gt;The data.table package is a bit lesser known in the R community, but if people know it, it is most likely for its speed when working with data tables themselves within R. The package however also provides functions for efficient reading and writing of tabular data from and into text files - &lt;code&gt;fread()&lt;/code&gt; for fast reading and &lt;code&gt;fwrite()&lt;/code&gt; for fast writing.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another underrated property of &lt;code&gt;fread()&lt;/code&gt;, apart from speed, is memory efficiency, which can be crucial if we need to read in a lot of data without big data infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;the-benchmarked-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The benchmarked data&lt;/h2&gt;
&lt;p&gt;As the data for this quick benchmark, we used the &lt;a href=&#34;http://stat-computing.org/dataexpo/2009/the-data.html&#34;&gt;Airline on-time performance&lt;/a&gt; data for the years 2000 to 2008. This simple code chunk can be used to retrieve and extract the data. The download size is 868 MB in bz2 files. The extracted size is 5.34 GB in csv files and, when combined, translates to a data frame with some 59 million rows and 29 columns. The data size is quite limited due to the specs of the machine used, but it is enough to show significant differences between packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;destDir &amp;lt;- path.expand(&amp;quot;~/dataexpo&amp;quot;)
years &amp;lt;- 2000:2008
baseUrl &amp;lt;- &amp;quot;http://stat-computing.org/dataexpo/2009&amp;quot;

bz2Names &amp;lt;- file.path(destDir, paste0(years, &amp;quot;.csv.bz2&amp;quot;))
dlUrls   &amp;lt;- file.path(baseUrl, paste0(years, &amp;quot;.csv.bz2&amp;quot;))

if (!dir.exists(destDir)) {
  dir.create(destDir, recursive = TRUE)
}

# download files
mapply(download.file, dlUrls, bz2Names)

# extract
system(paste0(
  &amp;quot;cd &amp;quot;, destDir, &amp;quot;; &amp;quot;,
  &amp;quot;bzip2 -d -k &amp;quot;, paste(bz2Names, collapse = &amp;quot; &amp;quot;)
))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;base-r-code-to-be-benchmarked&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Base R code to be benchmarked&lt;/h2&gt;
&lt;p&gt;Loading csv data from multiple files into a single data frame with base R is very simple:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dataDir &amp;lt;- path.expand(&amp;quot;~/dataexpo&amp;quot;)
dataFls &amp;lt;- dir(dataDir, pattern = &amp;quot;csv$&amp;quot;, full.names = TRUE)
df &amp;lt;- do.call(rbind, lapply(dataFls, read.csv))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data.table-fread-code-to-be-benchmarked&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;data.table &lt;code&gt;fread&lt;/code&gt; code to be benchmarked&lt;/h2&gt;
&lt;p&gt;For data.table, we use &lt;code&gt;rbindlist()&lt;/code&gt; for row binding instead of &lt;code&gt;do.call(rbind, ...)&lt;/code&gt; and &lt;code&gt;fread()&lt;/code&gt; for reading:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
dataDir &amp;lt;- path.expand(&amp;quot;~/dataexpo&amp;quot;)
dataFls &amp;lt;- dir(dataDir, pattern = &amp;quot;csv$&amp;quot;, full.names = TRUE)
dt &amp;lt;- data.table::rbindlist(
  lapply(dataFls, data.table::fread, showProgress = FALSE)
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;readrread_csv-code-to-be-benchmarked&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;code&gt;readr::read_csv&lt;/code&gt; code to be benchmarked&lt;/h2&gt;
&lt;p&gt;The script for readr’s read_csv is also simple, with the small caveat that we need to predefine the column types, as &lt;code&gt;bind_rows&lt;/code&gt; does not like to coerce the data. Doing things the tidyverse way, we also use &lt;code&gt;purrr::map_dfr()&lt;/code&gt; for row binding and &lt;code&gt;readr::read_csv()&lt;/code&gt; for reading:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readr)
library(purrr)
library(magrittr)
dataDir &amp;lt;- path.expand(&amp;quot;~/dataexpo&amp;quot;)
dataFiles &amp;lt;- dir(dataDir, pattern = &amp;quot;csv$&amp;quot;, full.names = TRUE)

# bind_rows won&amp;#39;t coerce, so predefine the column types
col_types &amp;lt;- readr::cols(
  .default = col_double(),
  UniqueCarrier = col_character(),
  TailNum = col_character(),
  Origin = col_character(),
  Dest = col_character(),
  CancellationCode = col_character(),
  CarrierDelay = col_double(),
  WeatherDelay = col_double(),
  NASDelay = col_double(),
  SecurityDelay = col_double(),
  LateAircraftDelay = col_double()
)

df &amp;lt;- dataFiles %&amp;gt;% 
  purrr::map_dfr(
    readr::read_csv,
    col_types = col_types,
    progress = FALSE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-benchmarking-method&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The benchmarking method&lt;/h2&gt;
&lt;p&gt;A simple bash script was used to measure the maximum memory needed (Maximum resident set size to be precise) and to time the run of the script 10 times:&lt;/p&gt;
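&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;#!/bin/bash
# The path to the R script is passed as the first argument
scriptf=&amp;quot;$1&amp;quot;
printf &amp;#39;%s\n\n&amp;#39; &amp;quot;$scriptf&amp;quot;

# Peak memory: GNU time writes its report to stderr, which is piped
# to grep; the stdout of the R script is discarded
/usr/bin/time -v Rscript &amp;quot;$scriptf&amp;quot; \
 2&amp;gt;&amp;amp;1 &amp;gt;/dev/null | \
 grep -E &amp;#39;Maximum resident&amp;#39;

# Total time of 10 consecutive runs
time for i in {1..10}; do Rscript &amp;quot;$scriptf&amp;quot; &amp;gt;/dev/null; done&lt;/code&gt;&lt;/pre&gt;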
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The results&lt;/h1&gt;
&lt;p&gt;The results speak for themselves. Not only was &lt;code&gt;fread()&lt;/code&gt; almost 2.5 times faster than readr’s functionality in reading and binding the data, but perhaps even more importantly, the maximum memory used was only 15.25 GB, compared to readr’s 27.02 GB. Interestingly, even though it was very slow, base R also used less memory than the tidyverse suite.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For larger data sets, data.table’s efficiency can save not only very significant amounts of time, but also memory, which can have important implications for the cost of the hardware needed for processing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;method&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;max. memory&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;avg. time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;code&gt;utils::read.csv&lt;/code&gt; + &lt;code&gt;base::rbind&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.70 GB&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.13 m&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;code&gt;readr::read_csv&lt;/code&gt; + &lt;code&gt;purrr::map_dfr&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;27.02 GB&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.43 m&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;code&gt;data.table::fread&lt;/code&gt; + &lt;code&gt;rbindlist&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15.25 GB&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.40 m&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;when-your-mind-gets-blown---fread-from-shell-command-outputs&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;When your mind gets blown - &lt;code&gt;fread()&lt;/code&gt; from shell command outputs&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;../img/r917-01-datatable-logo.png&#34; alt=&#34;data.table&#39;s logo&#34; class=&#34;leftsmall&#34;&gt; And it gets better than that. Consider a scenario where we need to read the data, subset or split into groups and compute on the processed data. The classic approach would be to load the data from files into R as seen above and then do the data processing.&lt;/p&gt;
&lt;p&gt;For scenarios like these, &lt;code&gt;fread()&lt;/code&gt; provides an even more powerful facility - the &lt;code&gt;cmd&lt;/code&gt; argument with a shell command that pre-processes the file(s). If we want to filter our data used above to only look at flights operated by American Airlines, the classic approach would be to read the data in and filter. With &lt;code&gt;fread()&lt;/code&gt; we can, however, use &lt;code&gt;grep&lt;/code&gt; first and only have &lt;code&gt;fread()&lt;/code&gt; process the output of that command:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
dataDir &amp;lt;- path.expand(&amp;quot;~/dataexpo&amp;quot;)
dataFiles &amp;lt;- dir(dataDir, pattern = &amp;quot;csv$&amp;quot;, full.names = TRUE)

# All flights by American Airlines
command &amp;lt;- sprintf(
  &amp;quot;grep --text &amp;#39;,AA,&amp;#39; %s&amp;quot;,
  paste(dataFiles, collapse = &amp;quot; &amp;quot;)
)

dt &amp;lt;- data.table::fread(cmd = command)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at our benchmarks, this approach only cost us 1.68 GB of memory and about 24 seconds of runtime on average:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;method&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;max. memory&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;avg. time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;code&gt;data.table::fread&lt;/code&gt; from &lt;code&gt;grep&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.68 GB&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.40 m&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
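&lt;p&gt;One detail worth keeping in mind with this approach: &lt;code&gt;grep&lt;/code&gt; only passes matching lines through, so the header row of each file is dropped, and GNU &lt;code&gt;grep&lt;/code&gt; prefixes every output line with the file name when it is given more than one file. Below is a minimal sketch of one way to handle both, assuming GNU &lt;code&gt;grep&lt;/code&gt; (for the &lt;code&gt;-h&lt;/code&gt; flag) and assuming that all files share the header of the first one. This variant was not part of the benchmarks above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
dataDir &amp;lt;- path.expand(&amp;quot;~/dataexpo&amp;quot;)
dataFiles &amp;lt;- dir(dataDir, pattern = &amp;quot;csv$&amp;quot;, full.names = TRUE)

# Read only the header of the first file to recover the column names
# (assumption: every file has the same columns in the same order)
colNames &amp;lt;- names(data.table::fread(dataFiles[1], nrows = 0))

# -h suppresses the file name prefix that grep adds for multiple files,
# shQuote() protects paths that contain spaces
command &amp;lt;- sprintf(
  &amp;quot;grep -h --text &amp;#39;,AA,&amp;#39; %s&amp;quot;,
  paste(shQuote(dataFiles), collapse = &amp;quot; &amp;quot;)
)

# The grep output has no header row, so the names are supplied explicitly
dt &amp;lt;- data.table::fread(
  cmd = command,
  header = FALSE,
  col.names = colNames
)&lt;/code&gt;&lt;/pre&gt;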
&lt;/div&gt;
&lt;div id=&#34;optimizing-further&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Optimizing further&lt;/h1&gt;
&lt;p&gt;The above is of course only the beginning of potential optimizations. We could probably save a lot of time taking advantage of &lt;a href=&#34;https://www.gnu.org/software/parallel/&#34;&gt;GNU parallel&lt;/a&gt; to process the files with &lt;code&gt;grep&lt;/code&gt; much faster. The key here is the flexibility of inputs that &lt;code&gt;fread&lt;/code&gt; can process, without splitting the inputs into multiple files and other maintenance-heavy pre-processing.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In a bigger data setting, this can have a significant impact on the cost of a data science project and even investments in big data infrastructure, engineers and maintenance related to managing such a project.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-show-me-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just show me the code&lt;/h1&gt;
&lt;p&gt;The benchmarking code &lt;a href=&#34;https://gitlab.com/jozefhajnala/fread-benchmarks&#34;&gt;can be found on GitLab&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread&#34;&gt;Convenience features of fread&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/Rdatatable/data.table/wiki&#34;&gt;data.table wiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://h2oai.github.io/db-benchmark/&#34;&gt;Proper benchmarks on group-by operations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How to interactively examine any R code - 4 ways to not just read the code, but delve into it step-by-step</title>
      <link>https://jozef.io/r916-exploring-r-code-interactively/</link>
      <pubDate>Sat, 25 May 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r916-exploring-r-code-interactively/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;As pointed out by a recent &lt;a href=&#34;https://blog.r-hub.io/2019/05/14/read-the-source/&#34;&gt;read the R source&lt;/a&gt; post on the R hub’s website, reading the actual code, not just the documentation is a great way to learn more about programming and implementation details. But there is one more activity to get even more hands-on experience and understanding of the code in practice.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we provide tips on how to interactively debug R code step-by-step and investigate the values of objects in the middle of function execution. We will look at doing this for both exported and non-exported functions from different packages. We will also look at interactively debugging generics and methods, using functionality provided by base R.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#interactively-examining-functions-with-debug-and-debugonce&#34;&gt;Interactively examining functions with debug() and debugonce()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#debugging-non-exported-functions-using&#34;&gt;Debugging non-exported functions using &lt;code&gt;:::&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conveniently-debugging-methods-with-debugcall&#34;&gt;Conveniently debugging methods with debugcall()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#inserting-debugging-code-anywhere-inside-a-function-body-with-trace&#34;&gt;Inserting debugging code anywhere inside a function body with trace()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;interactively-examining-functions-with-debug-and-debugonce&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Interactively examining functions with &lt;code&gt;debug()&lt;/code&gt; and &lt;code&gt;debugonce()&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;The 2 key functions we will be using for our interactive investigation of code are &lt;code&gt;debug()&lt;/code&gt; and &lt;code&gt;debugonce()&lt;/code&gt;. When &lt;code&gt;debug()&lt;/code&gt; is called on a function, it will set a debugging flag on that function. When the function is executed, the execution will proceed one step at a time, giving us the option to investigate exactly what is going on in the context of that function call similarly to placing &lt;code&gt;browser()&lt;/code&gt; at a certain point in our code.&lt;/p&gt;
&lt;p&gt;Let us see a quick example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;debug(order)
order(10:1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When running the second line, the code execution will stop inside &lt;code&gt;order()&lt;/code&gt; and we can freely run the function line by line.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r916-01-debugging.png&#34; alt=&#34;Debugging an R function interactively with debugonce()&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Debugging an R function interactively with debugonce()&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;When we no longer want to have the function flagged for debugging, call &lt;code&gt;undebug()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;undebug(order)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, if we only want to have the function in debug mode for one execution, we can call &lt;code&gt;debugonce()&lt;/code&gt; on the function. This approach may also be safer, as there is no need to call &lt;code&gt;undebug()&lt;/code&gt; later:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;debugonce(order)
order(10:1)&lt;/code&gt;&lt;/pre&gt;
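&lt;p&gt;To see whether a function is currently flagged, base R also provides &lt;code&gt;isdebugged()&lt;/code&gt;. The sketch below uses a small hypothetical function, &lt;code&gt;add_one()&lt;/code&gt;, rather than a base function, and lists the most common commands available at the browser prompt as documented in &lt;code&gt;?browser&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# add_one() is a hypothetical function used only for illustration
add_one &amp;lt;- function(x) {
  y &amp;lt;- x + 1
  y
}

debug(add_one)
isdebugged(add_one)
## [1] TRUE

# Calling add_one(1) now would open the browser; at the prompt:
#   n      run the next statement
#   s      step into the function call on the current line
#   c      continue execution without stepping
#   Q      quit the browser and return to the top level
#   ls()   list the objects in the function environment

undebug(add_one)
isdebugged(add_one)
## [1] FALSE&lt;/code&gt;&lt;/pre&gt;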
&lt;/div&gt;
&lt;div id=&#34;debugging-non-exported-functions-using&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Debugging non-exported functions using &lt;code&gt;:::&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;The great thing about &lt;code&gt;debug()&lt;/code&gt; and &lt;code&gt;debugonce()&lt;/code&gt; is that they allow us to interactively investigate not just the code that we are currently writing, but any interpreted R function. To debug functions not even exported from package namespaces, we can use &lt;code&gt;:::&lt;/code&gt;. For example, we normally cannot access the &lt;code&gt;list_rmds()&lt;/code&gt; function from the blogdown package as it is not exported.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This will not work
library(blogdown)
debugonce(list_rmds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error in debugonce(list_rmds): object &amp;#39;list_rmds&amp;#39; not found&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This will not work either
debugonce(blogdown::list_rmds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error: &amp;#39;list_rmds&amp;#39; is not an exported object from &amp;#39;namespace:blogdown&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we need to, we can still debug it using &lt;code&gt;:::&lt;/code&gt; to access it in the package namespace:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This will work
debugonce(blogdown:::list_rmds)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is particularly useful when debugging nested calls inside package code, which tend to use unexported functions.&lt;/p&gt;
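&lt;p&gt;If we are not sure whether a function is exported, we can inspect the package namespace first. A short sketch using base R helpers, assuming the blogdown package is installed and that its namespace still contains &lt;code&gt;list_rmds&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Is the function among the exported objects?
&amp;quot;list_rmds&amp;quot; %in% getNamespaceExports(&amp;quot;blogdown&amp;quot;)

# Is it defined in the namespace at all, exported or not?
exists(
  &amp;quot;list_rmds&amp;quot;,
  envir = asNamespace(&amp;quot;blogdown&amp;quot;),
  inherits = FALSE
)

# Names of all non-exported objects in the namespace
setdiff(
  ls(asNamespace(&amp;quot;blogdown&amp;quot;)),
  getNamespaceExports(&amp;quot;blogdown&amp;quot;)
)

# Retrieves the same function as blogdown:::list_rmds,
# handy when the names are stored in character variables
f &amp;lt;- utils::getFromNamespace(&amp;quot;list_rmds&amp;quot;, &amp;quot;blogdown&amp;quot;)&lt;/code&gt;&lt;/pre&gt;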
&lt;/div&gt;
&lt;div id=&#34;conveniently-debugging-methods-with-debugcall&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conveniently debugging methods with &lt;code&gt;debugcall()&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;Many R functions are implemented as S3 generics, which call the appropriate method based on the class of the first argument. A good example of this approach is &lt;code&gt;aggregate()&lt;/code&gt;. Looking at its code, we see that it only dispatches to the proper method:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;body(stats::aggregate)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## UseMethod(&amp;quot;aggregate&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using &lt;code&gt;debug(aggregate)&lt;/code&gt; would therefore not be very useful for interactive investigation, as we most likely want to look at the method that is called to actually see what is going on.&lt;/p&gt;
&lt;p&gt;For this purpose, we can use &lt;code&gt;debugcall()&lt;/code&gt;, which will conveniently take us directly to the method. In the following case, it is the &lt;code&gt;data.frame&lt;/code&gt; method of the &lt;code&gt;aggregate()&lt;/code&gt; generic:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eval(debugcall(
  aggregate(mtcars[&amp;quot;hp&amp;quot;], mtcars[&amp;quot;carb&amp;quot;], FUN = mean),
  once = TRUE
))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As seen above, we can also use the &lt;code&gt;once = TRUE&lt;/code&gt; argument to only debug the call once.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For more technical details, the reference provided by &lt;code&gt;?debugcall&lt;/code&gt; is a great resource. This is also true for &lt;code&gt;?debug&lt;/code&gt; and &lt;code&gt;?trace&lt;/code&gt; which I also strongly recommend reading.&lt;/p&gt;
&lt;/blockquote&gt;
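&lt;p&gt;When it is not obvious which method a call will end up in, we can also ask R before starting the debugger. A short sketch for the same &lt;code&gt;aggregate()&lt;/code&gt; call, using only functions from base R and utils:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The class of the first argument decides which method is used
class(mtcars[&amp;quot;hp&amp;quot;])
## [1] &amp;quot;data.frame&amp;quot;

# List the methods that exist for the generic
methods(&amp;quot;aggregate&amp;quot;)

# Retrieve the method that handles data.frame input
getS3method(&amp;quot;aggregate&amp;quot;, &amp;quot;data.frame&amp;quot;)&lt;/code&gt;&lt;/pre&gt;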
&lt;/div&gt;
&lt;div id=&#34;inserting-debugging-code-anywhere-inside-a-function-body-with-trace&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Inserting debugging code anywhere inside a function body with &lt;code&gt;trace()&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;If &lt;code&gt;debugonce()&lt;/code&gt; and friends are not sufficient for our purposes and we want to insert advanced debugging code at different places within a function body, we can use &lt;code&gt;trace()&lt;/code&gt; to do just that.&lt;/p&gt;
&lt;p&gt;Imagine for example we would like to investigate a specific place in the code of the aforementioned &lt;code&gt;stats::aggregate.data.frame&lt;/code&gt; method. First, we can explore the function body:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.list(body(stats::aggregate.data.frame))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## `{`
## 
## [[2]]
## if (!is.data.frame(x)) x &amp;lt;- as.data.frame(x)
## 
## [[3]]
## FUN &amp;lt;- match.fun(FUN)
## 
## [[4]]
## if (NROW(x) == 0L) stop(&amp;quot;no rows to aggregate&amp;quot;)
## 
## [[5]]
## if (NCOL(x) == 0L) {
##     x &amp;lt;- data.frame(x = rep(1, NROW(x)))
##     return(aggregate.data.frame(x, by, function(x) 0L)[seq_along(by)])
## }
## 
## [[6]]
## if (!is.list(by)) stop(&amp;quot;&amp;#39;by&amp;#39; must be a list&amp;quot;)
## 
## [[7]]
## if (is.null(names(by)) &amp;amp;&amp;amp; length(by)) names(by) &amp;lt;- paste0(&amp;quot;Group.&amp;quot;, 
##     seq_along(by)) else {
##     nam &amp;lt;- names(by)
##     ind &amp;lt;- which(!nzchar(nam))
##     names(by)[ind] &amp;lt;- paste0(&amp;quot;Group.&amp;quot;, ind)
## }
## 
## [[8]]
## if (any(lengths(by) != NROW(x))) stop(&amp;quot;arguments must have same length&amp;quot;)
## 
## [[9]]
## y &amp;lt;- as.data.frame(by, stringsAsFactors = FALSE)
## 
## [[10]]
## keep &amp;lt;- complete.cases(by)
## 
## [[11]]
## y &amp;lt;- y[keep, , drop = FALSE]
## 
## [[12]]
## x &amp;lt;- x[keep, , drop = FALSE]
## 
## [[13]]
## nrx &amp;lt;- NROW(x)
## 
## [[14]]
## ident &amp;lt;- function(x) {
##     y &amp;lt;- as.factor(x)
##     l &amp;lt;- length(levels(y))
##     s &amp;lt;- as.character(seq_len(l))
##     n &amp;lt;- nchar(s)
##     levels(y) &amp;lt;- paste0(strrep(&amp;quot;0&amp;quot;, n[l] - n), s)
##     as.character(y)
## }
## 
## [[15]]
## grp &amp;lt;- lapply(y, ident)
## 
## [[16]]
## multi.y &amp;lt;- !drop &amp;amp;&amp;amp; ncol(y)
## 
## [[17]]
## if (multi.y) {
##     lev &amp;lt;- lapply(grp, function(e) sort(unique(e)))
##     y &amp;lt;- as.list(y)
##     for (i in seq_along(y)) y[[i]] &amp;lt;- y[[i]][match(lev[[i]], 
##         grp[[i]])]
##     eGrid &amp;lt;- function(L) expand.grid(L, KEEP.OUT.ATTRS = FALSE, 
##         stringsAsFactors = FALSE)
##     y &amp;lt;- eGrid(y)
## }
## 
## [[18]]
## grp &amp;lt;- if (ncol(y)) {
##     names(grp) &amp;lt;- NULL
##     do.call(paste, c(rev(grp), list(sep = &amp;quot;.&amp;quot;)))
## } else integer(nrx)
## 
## [[19]]
## if (multi.y) {
##     lev &amp;lt;- as.list(eGrid(lev))
##     names(lev) &amp;lt;- NULL
##     lev &amp;lt;- do.call(paste, c(rev(lev), list(sep = &amp;quot;.&amp;quot;)))
##     grp &amp;lt;- factor(grp, levels = lev)
## } else y &amp;lt;- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
## 
## [[20]]
## nry &amp;lt;- NROW(y)
## 
## [[21]]
## z &amp;lt;- lapply(x, function(e) {
##     ans &amp;lt;- lapply(X = split(e, grp), FUN = FUN, ...)
##     if (simplify &amp;amp;&amp;amp; length(len &amp;lt;- unique(lengths(ans))) == 1L) {
##         if (len == 1L) {
##             cl &amp;lt;- lapply(ans, oldClass)
##             cl1 &amp;lt;- cl[[1L]]
##             ans &amp;lt;- unlist(ans, recursive = FALSE)
##             if (!is.null(cl1) &amp;amp;&amp;amp; all(vapply(cl, identical, NA, 
##                 y = cl1))) 
##                 class(ans) &amp;lt;- cl1
##         }
##         else if (len &amp;gt; 1L) 
##             ans &amp;lt;- matrix(unlist(ans, recursive = FALSE), nrow = nry, 
##                 ncol = len, byrow = TRUE, dimnames = if (!is.null(nms &amp;lt;- names(ans[[1L]]))) 
##                   list(NULL, nms))
##     }
##     ans
## })
## 
## [[22]]
## len &amp;lt;- length(y)
## 
## [[23]]
## for (i in seq_along(z)) y[[len + i]] &amp;lt;- z[[i]]
## 
## [[24]]
## names(y) &amp;lt;- c(names(by), names(x))
## 
## [[25]]
## row.names(y) &amp;lt;- NULL
## 
## [[26]]
## y&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can choose a point in the function body that we would like to explore interactively. For example, the 21st element starting with &lt;code&gt;z &amp;lt;- lapply(x, function(e) {&lt;/code&gt; may be of interest. In that case, we can call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trace(stats::aggregate.data.frame, tracer = browser, at = 21)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Tracing function &amp;quot;aggregate.data.frame&amp;quot; in package &amp;quot;stats&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;aggregate.data.frame&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And see that this has added a call to &lt;code&gt;.doTrace()&lt;/code&gt; to the function body:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.list(body(stats::aggregate.data.frame))[[21L]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## {
##     .doTrace(browser(), &amp;quot;step 21&amp;quot;)
##     z &amp;lt;- lapply(x, function(e) {
##         ans &amp;lt;- lapply(X = split(e, grp), FUN = FUN, ...)
##         if (simplify &amp;amp;&amp;amp; length(len &amp;lt;- unique(lengths(ans))) == 
##             1L) {
##             if (len == 1L) {
##                 cl &amp;lt;- lapply(ans, oldClass)
##                 cl1 &amp;lt;- cl[[1L]]
##                 ans &amp;lt;- unlist(ans, recursive = FALSE)
##                 if (!is.null(cl1) &amp;amp;&amp;amp; all(vapply(cl, identical, 
##                   NA, y = cl1))) 
##                   class(ans) &amp;lt;- cl1
##             }
##             else if (len &amp;gt; 1L) 
##                 ans &amp;lt;- matrix(unlist(ans, recursive = FALSE), 
##                   nrow = nry, ncol = len, byrow = TRUE, dimnames = if (!is.null(nms &amp;lt;- names(ans[[1L]]))) 
##                     list(NULL, nms))
##         }
##         ans
##     })
## }&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we now call the &lt;code&gt;aggregate()&lt;/code&gt; function on a data.frame, we will have the code stop at our selected point in the execution of the data.frame method:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggregate(mtcars[&amp;quot;hp&amp;quot;], mtcars[&amp;quot;carb&amp;quot;], FUN = mean)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When done debugging, use &lt;code&gt;untrace()&lt;/code&gt; to cancel the tracing:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;untrace(stats::aggregate.data.frame)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Untracing function &amp;quot;aggregate.data.frame&amp;quot; in package &amp;quot;stats&amp;quot;&lt;/code&gt;&lt;/pre&gt;
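&lt;p&gt;The tracer does not have to be &lt;code&gt;browser&lt;/code&gt;. As a sketch, we can pass an unevaluated expression that only reports on the state of the function without stopping. This assumes that step 21 is still the &lt;code&gt;z &amp;lt;- lapply(...)&lt;/code&gt; call in the installed R version, since the positions within the body can differ between R versions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trace(
  stats::aggregate.data.frame,
  # Unevaluated expression, evaluated inside the function before step 21,
  # where the grouping vector grp has already been created
  tracer = quote(message(&amp;quot;Number of groups: &amp;quot;, length(unique(grp)))),
  at = 21,
  # Do not print the standard tracing message on each call
  print = FALSE
)

aggregate(mtcars[&amp;quot;hp&amp;quot;], mtcars[&amp;quot;carb&amp;quot;], FUN = mean)

untrace(stats::aggregate.data.frame)&lt;/code&gt;&lt;/pre&gt;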
&lt;blockquote&gt;
&lt;p&gt;Happy investigating and debugging!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;r-documentation-of-the-referenced-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R documentation of the referenced functions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;R documentation on &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/debug.html&#34;&gt;&lt;code&gt;debug()&lt;/code&gt;, &lt;code&gt;debugonce()&lt;/code&gt;, etc.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R documentation on &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/trace.html&#34;&gt;&lt;code&gt;trace()&lt;/code&gt;, &lt;code&gt;untrace()&lt;/code&gt;, etc.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R documentation on &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/utils/html/debugcall.html&#34;&gt;&lt;code&gt;debugcall()&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;more-general-debugging-related-topics&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;More general debugging-related topics&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Debugging&#34;&gt;Debugging chapter&lt;/a&gt; of Writing R Extensions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio#entering-debug-mode-stopping&#34;&gt;Debugging with RStudio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jenny Bryan’s &lt;a href=&#34;https://github.com/jennybc/access-r-source#accessing-r-source&#34;&gt;Accessing R Source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R-hub blog’s &lt;a href=&#34;https://blog.r-hub.io/2019/05/14/read-the-source/&#34;&gt;Read the R source!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Advanced R’s &lt;a href=&#34;https://adv-r.hadley.nz/debugging.html&#34;&gt;Debugging&lt;/a&gt; and &lt;a href=&#34;http://adv-r.had.co.nz/Exceptions-Debugging.html&#34;&gt;Debugging, condition handling, and defensive programming&lt;/a&gt; chapters&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Porting and redirecting a Hugo-based blogdown website to an HTTPS-enabled custom domain and how to do it the easy way</title>
      <link>https://jozef.io/r915-gitlab-pages-own-domain/</link>
      <pubDate>Sat, 11 May 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r915-gitlab-pages-own-domain/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;As we wrote in &lt;a href=&#34;https://jozef.io/r914-one-year-r-blogging/&#34;&gt;Should you start your R blog now?&lt;/a&gt;, blogging has probably never been more accessible to the general population, R users included. Usually, the simplest solution is to host your blog via a service that provides it for free, such as Netlify, GitHub or GitLab Pages. But what if you want to host that awesome blog on your own, HTTPS enabled domain?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will look at how to port a Hugo-based website, such as a blogdown blog to our own domain, specifically focusing on GitLab Pages. We will also cover setting up SSL certificates, redirects from www to non-www sites and other details that I had to solve when porting my blogdown blog from GitLab’s hosting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#if-you-are-just-starting---there-is-an-easy-way&#34;&gt;If you are just starting - there is an easy way&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#serving-a-page-on-a-custom-domain-via-gitlab-pages&#34;&gt;Serving a page on a custom domain via GitLab Pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#redirecting-the-gitlab.io-address-to-the-custom-domain&#34;&gt;Redirecting the gitlab.io address to the custom domain&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#redirecting-www-and-non-www-urls-to-a-single-address&#34;&gt;Redirecting www and non-www URLs to a single address&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;if-you-are-just-starting---there-is-an-easy-way&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;If you are just starting - there is an easy way&lt;/h1&gt;
&lt;p&gt;This post is mostly a reminder-to-self of what porting this blog from GitLab Pages hosting to a custom domain entailed. The route I took was heavily influenced by the way I was serving the website at the beginning - using GitLab Pages on a project address. When migrating to a new domain, I wanted to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Keep all the functionality that a GitLab repository with GitLab CI/CD provides&lt;/li&gt;
&lt;li&gt;Serve the content at a custom HTTPS-enabled domain&lt;/li&gt;
&lt;li&gt;Redirect to the new domain with minimal to no content duplication&lt;/li&gt;
&lt;li&gt;Make sure that the website works on both www and non-www addresses&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;If you have your blog hosted on GitLab pages and want to port and redirect it to your own HTTPS-enabled domain, while keeping the functionality that GitLab provides, you might find my journey useful.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r915-01-blogdown-hugo-le.png&#34; alt=&#34;Blogdown, Hugo &amp;amp; Let’s Encrypt logos&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Blogdown, Hugo &amp;amp; Let’s Encrypt logos&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;If you are just starting, it is easier to choose a different approach to publish your blog - here are 2 tips to consider if you want to prevent the pain I went through because of my past decisions:&lt;/p&gt;
&lt;div id=&#34;what-would-i-do-if-starting-today&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What would I do if starting today&lt;/h2&gt;
&lt;p&gt;With the knowledge I gained when investigating the process, I would probably take the following route:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Register a custom domain via &lt;a href=&#34;https://www.cloudflare.com/&#34;&gt;CloudFlare&lt;/a&gt;. This should make non-www/www redirects and getting SSL certificates seamless&lt;/li&gt;
&lt;li&gt;Deploy the pages by connecting the GitLab repository to &lt;a href=&#34;https://www.netlify.com/&#34;&gt;Netlify&lt;/a&gt;, which should be as easy as using GitLab CI/CD&lt;/li&gt;
&lt;li&gt;Set up deployment to the custom domain via Netlify. This should make the redirects to the custom domain seamless and technically sound&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;doing-it-the-simplest-way&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Doing it the simplest way&lt;/h2&gt;
&lt;p&gt;Serving a Hugo-based website can in principle be even simpler - in fact, all that is really necessary is copying/uploading the contents of the &lt;code&gt;public&lt;/code&gt; directory, generated for example with &lt;code&gt;blogdown::build_site()&lt;/code&gt;, to the proper place. Everything else we do around it are processes that make our lives nicer at the cost of extra effort.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;serving-a-page-on-a-custom-domain-via-gitlab-pages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Serving a page on a custom domain via GitLab Pages&lt;/h1&gt;
&lt;div id=&#34;choose-and-register-a-domain-name&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;1. Choose and register a domain name&lt;/h2&gt;
&lt;p&gt;The first step is to choose a domain name (i.e. the web address) for your brand new website. This is completely up to you and the internet is full of &lt;a href=&#34;https://www.hover.com/blog/choosing-domain-name-for-portfolio-website/&#34;&gt;tips like this one&lt;/a&gt;. Next, register that domain name with a provider of your choice. I have used a &lt;a href=&#34;https://websupport.sk&#34;&gt;local provider&lt;/a&gt; for all of my websites for many years, so the choice was easy.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To register your domain name, you can pick from a &lt;a href=&#34;https://www.techradar.com/news/best-domain-name-registrar&#34;&gt;plethora of providers&lt;/a&gt;, each with their own pros and cons.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;setup-an-ssl-certificate&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;2. Set up an SSL certificate&lt;/h2&gt;
&lt;p&gt;Setting up your website such that it can be accessed via HTTPS should be the standard these days, so we also need to set up an SSL certificate. Once again, this should be simple as we can use free &lt;a href=&#34;https://en.wikipedia.org/wiki/Let%27s_Encrypt&#34;&gt;Let’s Encrypt&lt;/a&gt; certificates to achieve that.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The actual process once again depends on the provider of your domain services - in practice, it should entail just a few clicks in their web UI&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;setup-gitlab-to-serve-your-pages-to-a-custom-domain&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;3. Set up GitLab to serve your pages on a custom domain&lt;/h2&gt;
&lt;p&gt;Setting up GitLab Pages to be served to your own domain is well documented in &lt;a href=&#34;https://about.gitlab.com/2016/04/07/gitlab-pages-setup/#custom-domains&#34;&gt;GitLab’s documentation here&lt;/a&gt; and even &lt;a href=&#34;https://docs.gitlab.com/ee/user/project/pages/getting_started_part_three.html&#34;&gt;better here&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you have chosen CloudFlare as your service to manage the DNS, Nick Zeng &lt;a href=&#34;https://blog.zenggyu.com/en/post/2019-02-08/deploying-a-blogdown-website-with-gitlab-pages/#adding-a-custom-domain-and-enabling-https-protocol&#34;&gt;wrote a detailed guide&lt;/a&gt; on how to set up GitLab Pages with a custom domain. GitLab also has links to setting up DNS records for &lt;a href=&#34;https://docs.gitlab.com/ee/user/project/pages/getting_started_part_three.html#dns-records&#34;&gt;other hosting providers&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After these 3 steps, you should see your website served on your new domain and HTTPS should work just fine. Now onto the not-so-simple issues.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;redirecting-the-gitlab.io-address-to-the-custom-domain&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Redirecting the gitlab.io address to the custom domain&lt;/h1&gt;
&lt;div id=&#34;server-side-redirects-are-not-supported-with-gitlab-pages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Server-side redirects are not supported with GitLab Pages&lt;/h2&gt;
&lt;p&gt;Now that we can see our content on our new domain, we may want to take care of the fact that it is now visible on 2 addresses - the new custom domain and the original GitLab pages address. The traditional way of handling this is with server-side redirects - the server would issue an HTTP 301 Moved Permanently redirect to the new domain. The issue with that is that GitLab Pages does not support &lt;a href=&#34;https://gitlab.com/gitlab-org/gitlab-ee/issues/302&#34;&gt;server-side redirects&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On the other hand, redirects are supported by GitHub Pages, which has &lt;a href=&#34;https://help.github.com/en/articles/custom-domain-redirects-for-github-pages-sites&#34;&gt;this feature&lt;/a&gt;, and by Netlify &lt;a href=&#34;https://www.netlify.com/docs/redirects/&#34;&gt;as well&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;javascript-to-the-rescue&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;JavaScript to the rescue&lt;/h2&gt;
&lt;p&gt;We can also find a suggestion to use &lt;a href=&#34;https://docs.gitlab.com/ee/user/project/pages/introduction.html#redirects-in-gitlab-pages&#34;&gt;meta refresh tags&lt;/a&gt;, but since using them is not always simple and server-side functionality is not available, we can opt for client-side JavaScript to solve our redirection issues.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It may not seem like a good idea from an SEO perspective at first, but research on &lt;a href=&#34;https://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157&#34;&gt;how Google handles JavaScript redirects&lt;/a&gt; suggests that JavaScript redirects are quickly followed by Google. From an indexing standpoint, they are interpreted as 301s: the end-state URLs replaced the redirected URLs in Google’s index.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;An example of a JavaScript implementation using the &lt;code&gt;window.location&lt;/code&gt; object that can be used to get the current page address and to redirect the browser to a new page can look as follows:&lt;/p&gt;
&lt;pre class=&#34;javascript&#34;&gt;&lt;code&gt;function replacePath(path, old_d, new_d) {
  path = path.replace(old_d, new_d);
  if (path.includes(new_d)) {
    // only if really on the new domain
    path = path.replace(&amp;quot;http:&amp;quot;, &amp;quot;https:&amp;quot;);
  }
  return path;
}
const newpath = replacePath(
  window.location.href,
  &amp;quot;://jozefhajnala.gitlab.io/r&amp;quot;,
  &amp;quot;://jozef.io&amp;quot;
);

// Prevent infinite redirect
if (window.location.href !== newpath) {
  window.location.replace(newpath);
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We would of course replace the example addresses with our own and omit the &lt;code&gt;https&lt;/code&gt; replacement if the new domain does not have SSL enabled.&lt;/p&gt;
&lt;p&gt;We can test our JavaScript with a very simple function to see if all the URLs will be translated correctly:&lt;/p&gt;
&lt;pre class=&#34;javascript&#34;&gt;&lt;code&gt;// Place all urls into a variable
// this is just an example with a few
var oldLinks = [
  &amp;quot;https://jozefhajnala.gitlab.io/r&amp;quot;,
  &amp;quot;https://jozefhajnala.gitlab.io/r/categories/rcase4base&amp;quot;,
  &amp;quot;https://jozefhajnala.gitlab.io/r/categories/rstudioaddins&amp;quot;,
  &amp;quot;https://jozefhajnala.gitlab.io/r/categories/various&amp;quot;
];

const old_d = &amp;quot;://jozefhajnala.gitlab.io/r&amp;quot;;
const new_d = &amp;quot;://jozef.io&amp;quot;;

// get the new links
var newLinks = oldLinks.map(x =&amp;gt; replacePath(x, old_d, new_d));

function getStatus(url) {
  var req = new XMLHttpRequest();
  req.open(&amp;quot;GET&amp;quot;, url, false);
  req.send(null);
  return req.status;
}

// check the response statuses
// we want no 404s here, all 200 would be ideal
var statuses = newLinks.map(getStatus);&lt;/code&gt;&lt;/pre&gt;
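&lt;p&gt;The status check above needs network access. As an offline complement, the following minimal sketch checks the URL mapping itself. It repeats &lt;code&gt;replacePath&lt;/code&gt; so that it runs on its own and uses a hypothetical helper, &lt;code&gt;isStable&lt;/code&gt; (not part of the original script), to confirm that mapping an already-redirected address changes nothing, which is what keeps the redirect from looping. The sample URLs are illustrative only.&lt;/p&gt;
&lt;pre class=&#34;javascript&#34;&gt;&lt;code&gt;// Same function as above, repeated so the snippet runs on its own
function replacePath(path, old_d, new_d) {
  path = path.replace(old_d, new_d);
  if (path.includes(new_d)) {
    // only if really on the new domain
    path = path.replace('http:', 'https:');
  }
  return path;
}

const OLD_D = '://jozefhajnala.gitlab.io/r';
const NEW_D = '://jozef.io';

// Hypothetical helper: true when applying the mapping
// a second time no longer changes the address
function isStable(url) {
  const once = replacePath(url, OLD_D, NEW_D);
  const twice = replacePath(once, OLD_D, NEW_D);
  return once === twice;
}

// Illustrative sample addresses: old domain, plain http,
// new domain over http, and an unrelated domain
const samples = [
  'https://jozefhajnala.gitlab.io/r/categories/various',
  'http://jozefhajnala.gitlab.io/r',
  'http://jozef.io/categories/various',
  'https://example.com/unrelated'
];

const mapped = samples.map(function (url) {
  return replacePath(url, OLD_D, NEW_D);
});
const allStable = samples.every(isStable);

console.log(mapped);
console.log(allStable);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;allStable&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt;, a visitor who has already landed on the new address will not be redirected again, and addresses on unrelated domains are left untouched.&lt;/p&gt;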
&lt;/div&gt;
&lt;div id=&#34;setting-canonical-links&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Setting canonical links&lt;/h2&gt;
&lt;p&gt;To make sure search engines do not treat several similar versions of the same content as duplicates, we can choose one version and point the search engines at it by specifying a &lt;a href=&#34;https://support.google.com/webmasters/answer/139066?hl=en&#34;&gt;canonical URL&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Specifying canonical URLs with Hugo is very simple thanks to the way it provides variables and the partials approach to building themes. Simply add a line like this to your &lt;code&gt;header.html&lt;/code&gt; or &lt;code&gt;head.html&lt;/code&gt; partial file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;link rel=&amp;quot;canonical&amp;quot; href=&amp;quot;{{ .Permalink }}&amp;quot;&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;The .html files for partials are usually located in the &lt;code&gt;themes/&amp;lt;your_theme&amp;gt;/layouts/partials/&lt;/code&gt; directory.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;redirecting-www-and-non-www-urls-to-a-single-address&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Redirecting www and non-www URLs to a single address&lt;/h1&gt;
&lt;p&gt;Another aspect of the move is to make sure that the content is available via both &lt;code&gt;www.example.com&lt;/code&gt; and &lt;code&gt;example.com&lt;/code&gt;, but not duplicated. Which of those is preferred is once again up to you. One solution would be to tell GitLab to serve the content to both and use the canonical link or use the JavaScript redirect again. However, there is a much nicer solution on offer here since for our own domain we can use server-side redirects.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If your provider of choice is CloudFlare, it seems that this redirect can be done &lt;a href=&#34;https://antonyagnel.com/redirect-non-www-urls-to-www-using-cloudflare/&#34;&gt;in a few clicks&lt;/a&gt; via the web UI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;using-.htaccess&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using .htaccess&lt;/h2&gt;
&lt;p&gt;One way to create this redirect is by using a &lt;code&gt;.htaccess&lt;/code&gt; file. Example content for redirecting to the www address with HTTPS can look as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule .* https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule .* https://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once you have the file ready, upload it to your site through an FTP client. If you host directly on your own server:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;place the &lt;code&gt;.htaccess&lt;/code&gt; file into the proper directory, for example, &lt;code&gt;/var/www/example.com/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;do not forget to activate the Apache mod_rewrite module using &lt;code&gt;sudo a2enmod rewrite&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;an extra SSL certificate is likely to be needed for HTTPS to work correctly. Generating one with Let’s Encrypt is simple using &lt;code&gt;certbot&lt;/code&gt;, as &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-secure-apache-with-let-s-encrypt-on-ubuntu-18-04&#34;&gt;described for example here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Read more details on using &lt;code&gt;.htaccess&lt;/code&gt; &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-use-the-htaccess-file&#34;&gt;here&lt;/a&gt; and more details on using mod_rewrite for redirects &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-set-up-mod_rewrite&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
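&lt;p&gt;To reason about what the two rules in the &lt;code&gt;.htaccess&lt;/code&gt; example do, their effect can be modeled in a few lines of JavaScript. This is only a rough sketch under simplifying assumptions: &lt;code&gt;nextHop&lt;/code&gt; and &lt;code&gt;followRedirects&lt;/code&gt; are hypothetical helper names, and the model mirrors the intent of the rules, not the way Apache itself evaluates them.&lt;/p&gt;
&lt;pre class=&#34;javascript&#34;&gt;&lt;code&gt;// Hypothetical model of the two rewrite rules,
// not a reimplementation of Apache's rule engine
function nextHop(url) {
  const u = new URL(url);
  const rest = u.host + u.pathname + u.search;
  if (u.protocol === 'http:') {
    // Rule 1: plain HTTP is sent to HTTPS on the same host
    return 'https://' + rest;
  }
  if (!/^www\./i.test(u.host)) {
    // Rule 2: a host without the www. prefix gets it added
    return 'https://www.' + rest;
  }
  // No rule matches, the page is served as requested
  return null;
}

// Collect the requested address and every redirect target after it
function followRedirects(url) {
  const chain = [url];
  let target = nextHop(url);
  while (target !== null) {
    chain.push(target);
    target = nextHop(target);
  }
  return chain;
}

console.log(followRedirects('http://example.com/post/?a=1'));&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a plain HTTP request to the non-www address, the model yields two hops: first to HTTPS on the same host, then to the www host. A request that already uses HTTPS and the www host is served without any redirect.&lt;/p&gt;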
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;using-custom-domains-with-gitlab-pages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using custom domains with GitLab Pages&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Adding &lt;a href=&#34;https://about.gitlab.com/2016/04/07/gitlab-pages-setup/#custom-domains&#34;&gt;custom domains to GitLab Pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;GitLab Pages &lt;a href=&#34;https://docs.gitlab.com/ee/user/project/pages/getting_started_part_three.html&#34;&gt;custom domains and SSL/TLS Certificates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Setting up &lt;a href=&#34;https://docs.gitlab.com/ee/user/project/pages/getting_started_part_three.html#dns-records&#34;&gt;DNS Records with popular hosting providers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;redirects-and-duplicate-content&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Redirects and duplicate content&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href=&#34;https://stackoverflow.com/questions/503093/how-do-i-redirect-to-another-webpage&#34;&gt;JavaScript to redirect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wiki on &lt;a href=&#34;https://en.wikipedia.org/wiki/HTTP_301&#34;&gt;HTTP 301 redirects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Consolidating &lt;a href=&#34;https://support.google.com/webmasters/answer/139066?hl=en&#34;&gt;duplicate URLs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;setting-up-virtual-hosts-securing-apache-with-ssl-using-.htaccess&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Setting up virtual hosts, securing Apache with SSL, using .htaccess&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;How To &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-set-up-apache-virtual-hosts-on-ubuntu-16-04&#34;&gt;Set Up Apache Virtual Hosts on Ubuntu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;How To &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-secure-apache-with-let-s-encrypt-on-ubuntu-18-04&#34;&gt;Secure Apache with Let’s Encrypt on Ubuntu 18.04&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;How To &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-use-the-htaccess-file&#34;&gt;Use the .htaccess File&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;How To &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-set-up-mod_rewrite&#34;&gt;Set Up Mod_Rewrite&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Setting up continuous multi-platform R package building, checking and testing with R-Hub, Docker and GitLab CI/CD for free, with a working example</title>
      <link>https://jozef.io/r107-multiplatform-gitlabci-rhub/</link>
      <pubDate>Sat, 27 Apr 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r107-multiplatform-gitlabci-rhub/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the &lt;a href=&#34;https://jozef.io/r106-r-package-gitlab-ci/&#34;&gt;previous post&lt;/a&gt;, we looked at how to easily automate R analysis, modeling, and development work for free using GitLab’s CI/CD. Together with the fantastic &lt;a href=&#34;https://builder.r-hub.io/&#34;&gt;R-hub project&lt;/a&gt;, we can use GitLab CI/CD to do much more.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will take it to the next level by using R-hub to test our development work on many different platforms such as multiple Linux setups, MS Windows and macOS. We will also show how to automate and continuously execute those multiplatform checks using GitLab CI/CD integration and Docker images.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For those too busy to read, we also provide a working example implementation in a public GitLab repository.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#using-r-hub-to-build-check-and-test-our-r-package-on-many-platforms&#34;&gt;Using R-hub to build, check and test our R package on many platforms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-and-evaluating-r-hub-check-results-via-r-scripts&#34;&gt;Using and evaluating R-hub check results via R scripts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#preparing-a-private-docker-image-to-use-with-r-hub&#34;&gt;Preparing a private docker image to use with R-hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#creating-a-gitlab-cicd-pipeline&#34;&gt;Creating a GitLab CI/CD pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr-just-show-it-to-me-in-action&#34;&gt;TL;DR: Just show it to me in action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;using-r-hub-to-build-check-and-test-our-r-package-on-many-platforms&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using R-hub to build, check and test our R package on many platforms&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://builder.r-hub.io/about.html&#34;&gt;R-hub&lt;/a&gt; is a project supported by the &lt;a href=&#34;https://www.r-consortium.org/&#34;&gt;R Consortium&lt;/a&gt; and offers free R CMD check as a service on different platforms. This enables us to quickly and efficiently check the R package we are developing to make sure it passes all necessary checks on several platforms. As an added bonus, the checks seem to run in a very short time span, which means we can have our results at hand in a few minutes.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I also recommend that you read the &lt;a href=&#34;https://blog.r-hub.io/2019/03/26/why-care/&#34;&gt;why should you care about R-hub?&lt;/a&gt; blog post for more info.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r107-02-gitlab-rhub-run.gif&#34; alt=&#34;CI/CD running checks on multiple platforms with R-hub&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;CI/CD running checks on multiple platforms with R-hub&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-started-with-r-hub&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting started with R-hub&lt;/h2&gt;
&lt;p&gt;Getting started with R-hub is also very simple and can be achieved in 3 lines of code, from a package directory or an RStudio project for a package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install the package
install.packages(&amp;quot;rhub&amp;quot;)

# Validate your e-mail address
# Provide the email argument if not detected automatically
rhub::validate_email()

# In an interactive session, 
# this will offer a list of platforms to choose from
cr &amp;lt;- rhub::check()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Your &lt;code&gt;validated_emails.csv&lt;/code&gt; should be saved into the &lt;code&gt;rappdirs::user_data_dir(&amp;quot;rhub&amp;quot;, &amp;quot;rhub&amp;quot;)&lt;/code&gt; directory once &lt;code&gt;validate_email()&lt;/code&gt; has run successfully.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For more details on getting started, see the &lt;a href=&#34;https://r-hub.github.io/rhub/articles/rhub.html&#34;&gt;Get started with rhub&lt;/a&gt; article.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-and-evaluating-r-hub-check-results-via-r-scripts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using and evaluating R-hub check results via R scripts&lt;/h1&gt;
&lt;p&gt;For continuous integration purposes, we may want to evaluate the results of the check based on the number of errors, warnings, and notes that the check gives for each platform. To achieve this goal, we need to tackle 2 issues:&lt;/p&gt;
&lt;div id=&#34;getting-the-results-in-a-non-interactive-context&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the results in a non-interactive context&lt;/h2&gt;
&lt;p&gt;In a non-interactive session, R-hub will run the check asynchronously and end the process we used to request the service, which frees up resources. This is great but can pose some challenges in the CI context, as we would have to keep a job around to repeatedly query the R-hub job’s status and process the results once done, or implement a much smarter reporting solution.&lt;/p&gt;
&lt;p&gt;Luckily, since maximizing efficiency is not our top concern for this purpose, the simple workaround is to execute the check as if in an interactive session via the CI tool. This will provide us with the actual results of the check as soon as it is done and also write the log into our CI’s run log, at the obvious cost of having the process blocked while waiting for the check to finish on R-hub’s servers.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;processing-the-check-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Processing the check results&lt;/h2&gt;
&lt;p&gt;The public methods for an &lt;code&gt;rhub_check&lt;/code&gt; object currently seem to provide only side-effecting results such as printing them in various levels of detail and returning &lt;code&gt;self&lt;/code&gt;, so investigating results via code may be challenging.&lt;/p&gt;
&lt;p&gt;The simplest current solution is to use the object’s private fields to access the results in the desired format. The example below looks at the &lt;code&gt;status_&lt;/code&gt; private field and returns a data frame with the number of errors, warnings, and notes for each platform. For an object containing only 1 check result it can look as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Access the private status_ field of the rhub_check object
statuses &amp;lt;- cr[[&amp;quot;.__enclos_env__&amp;quot;]][[&amp;quot;private&amp;quot;]][[&amp;quot;status_&amp;quot;]]
# Build one row per checked platform
res &amp;lt;- do.call(rbind, lapply(statuses, function(thisStatus) {
  data.frame(
    platform = thisStatus[[&amp;quot;platform&amp;quot;]][[&amp;quot;name&amp;quot;]],
    errors   = length(thisStatus[[&amp;quot;result&amp;quot;]][[&amp;quot;errors&amp;quot;]]),
    warnings = length(thisStatus[[&amp;quot;result&amp;quot;]][[&amp;quot;warnings&amp;quot;]]),
    notes    = length(thisStatus[[&amp;quot;result&amp;quot;]][[&amp;quot;notes&amp;quot;]]),
    stringsAsFactors = FALSE
  )
}))
res&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##             platform errors warnings notes
## 1 debian-gcc-release      0        0     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have a data frame which we can use to signal the CI/CD job to succeed or fail based on our wishes. For example, if we want to fail if the check discovered any notes, warnings or errors, a simple statement like the following will suffice:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;if (any(colSums(res[2L:4L]) &amp;gt; 0)) {
  stop(&amp;quot;Some checks resulted in errors, warnings or notes.&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-it-together-into-a-script&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Putting it together into a script&lt;/h2&gt;
&lt;p&gt;Now that we have solved the above challenges, we can put it all together into a script that can be later used in the context of a CI/CD job:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Retrieve passed command line arguments
args &amp;lt;- commandArgs(trailingOnly = TRUE)
if (length(args) != 1L) {
  stop(&amp;quot;Incorrect number of args, needs 1: platform (string)&amp;quot;)
}
platform &amp;lt;- args[[1L]]

# Check if passed platform is valid 
if (!is.element(platform, rhub::platforms()[[1L]])) {
  stop(paste(
    &amp;quot;Given platform not in rhub::platforms()[[1L]]:&amp;quot;,
    platform
  ))
}

# Run the check on the selected platform
# Use show_status = TRUE to wait for results
cr &amp;lt;- rhub::check(platform = platform, show_status = TRUE)

# Get the statuses from private field status_
statuses &amp;lt;- cr[[&amp;quot;.__enclos_env__&amp;quot;]][[&amp;quot;private&amp;quot;]][[&amp;quot;status_&amp;quot;]]

# Create and print a data frame with results
res &amp;lt;- do.call(rbind, lapply(statuses, function(thisStatus) {
  data.frame(
    platform = thisStatus[[&amp;quot;platform&amp;quot;]][[&amp;quot;name&amp;quot;]],
    errors   = length(thisStatus[[&amp;quot;result&amp;quot;]][[&amp;quot;errors&amp;quot;]]),
    warnings = length(thisStatus[[&amp;quot;result&amp;quot;]][[&amp;quot;warnings&amp;quot;]]),
    notes    = length(thisStatus[[&amp;quot;result&amp;quot;]][[&amp;quot;notes&amp;quot;]]),
    stringsAsFactors = FALSE
  )
}))
print(res)

# Fail if any errors, warnings or notes found
if (any(colSums(res[2L:4L]) &amp;gt; 0)) {
  stop(&amp;quot;Some checks had errors, warnings or notes. See above for details.&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;preparing-a-private-docker-image-to-use-with-r-hub&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Preparing a private docker image to use with R-hub&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are new to Docker, Colin Fay has you covered with his &lt;a href=&#34;https://colinfay.me/docker-r-reproducibility/&#34;&gt;Introduction to Docker for R Users&lt;/a&gt; blog post.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;creating-and-testing-an-image&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating and testing an image&lt;/h2&gt;
&lt;p&gt;Thanks to all the hard work done by the maintainers of the &lt;a href=&#34;https://www.rocker-project.org/images/&#34;&gt;Rocker images&lt;/a&gt;, our task of creating an image suitable for use with R-hub is very simple. Essentially we only need 2 additions to the &lt;a href=&#34;https://hub.docker.com/_/r-base&#34;&gt;r-base image&lt;/a&gt;:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The &lt;code&gt;rhub&lt;/code&gt; package and a few system dependencies&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;validated_emails.csv&lt;/code&gt; file placed into the correct directory, providing R-hub with the information on the validated e-mail to use for the checks&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/rhub/Dockerfile&#34;&gt;following Dockerfile&lt;/a&gt; can be used to create such an image for yourself. Just make sure you have your &lt;code&gt;validated_emails.csv&lt;/code&gt; file present in the &lt;code&gt;resources&lt;/code&gt; folder when running &lt;code&gt;docker build&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To test our docker image, we can use a command like the following to create a container and run R within it in an interactive session:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker run --rm -it &amp;lt;hub-username&amp;gt;/&amp;lt;repo-name&amp;gt;:&amp;lt;tag&amp;gt; R&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can see the list of validated e-mails in that R session:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rhub::list_validated_emails()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                  email                token
## 1 myemail@somemail.com 00000000000000000000&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;pushing-the-image-into-a-private-repository&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Pushing the image into a private repository&lt;/h2&gt;
&lt;p&gt;Now that we have our image created, we need to push it to a repository for GitLab CI to be able to use it. Normally this is very simple:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;docker push &amp;lt;hub-username&amp;gt;/&amp;lt;repo-name&amp;gt;:&amp;lt;tag&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, as we are storing some relatively sensitive data in our image, namely our R-hub token, we should probably make this image private. Thanks to Docker Hub, this process is very easy - just click the proper buttons as shown in &lt;a href=&#34;https://docs.docker.com/docker-hub/repos/#private-repositories&#34;&gt;this page in the Docker Hub docs&lt;/a&gt;. Note that a free Docker Hub user has only 1 private repository available.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-a-gitlab-cicd-pipeline&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Creating a GitLab CI/CD pipeline&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;For an introduction to using GitLab CI/CD for R work, look at the previous post on &lt;a href=&#34;https://jozef.io/r106-r-package-gitlab-ci/&#34;&gt;How to easily automate R analysis, modeling and development work using CI/CD, with working examples&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;setting-up-a-pipeline-with-.gitlab-ci.yml&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Setting up a pipeline with .gitlab-ci.yml&lt;/h2&gt;
&lt;p&gt;Now that our private Docker image and the script to run and evaluate our R-hub checks are ready, all that is left is to create and set up a CI/CD pipeline. For GitLab CI/CD, this means creating a &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; file in the root of our GitLab repository directory. Without much extra talk, that file can look as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image: index.docker.io/jozefhajnala/rhub:rbase

stages:
  - check

variables:
  _R_CHECK_CRAN_INCOMING_: &amp;quot;false&amp;quot;
  _R_CHECK_FORCE_SUGGESTS_: &amp;quot;true&amp;quot;

before_script:
  - apt-get update

check_ubuntu:
  stage: check
  script:
    - Rscript inst/rhubcheck.R &amp;quot;ubuntu-gcc-release&amp;quot;

check_fedora:
  stage: check
  script:
    - Rscript inst/rhubcheck.R &amp;quot;fedora-clang-devel&amp;quot;

check_mswin:
  stage: check
  script:
    - Rscript inst/rhubcheck.R &amp;quot;windows-x86_64-devel&amp;quot;

check_macos:
  stage: check
  script:
    - Rscript inst/rhubcheck.R &amp;quot;macos-elcapitan-release&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This file will make sure that:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The CI/CD jobs start from the image we have created&lt;/li&gt;
&lt;li&gt;The pipeline has one stage named &lt;code&gt;check&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A couple of environment variables are set for R&lt;/li&gt;
&lt;li&gt;Four jobs are run: &lt;code&gt;check_ubuntu&lt;/code&gt;, &lt;code&gt;check_fedora&lt;/code&gt;, &lt;code&gt;check_mswin&lt;/code&gt;, and &lt;code&gt;check_macos&lt;/code&gt; - each of them using &lt;code&gt;Rscript&lt;/code&gt; to execute an R script stored under &lt;code&gt;inst/rhubcheck.R&lt;/code&gt;, with a different argument specifying the platform to check on&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;authenticating-to-use-a-private-repository&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Authenticating to use a private repository&lt;/h2&gt;
&lt;p&gt;Since we have made our Docker image private, GitLab will not be able to use it out of the box. We need to provide it with information on how to authenticate against Docker Hub to be able to pull the private image. There are a &lt;a href=&#34;https://docs.gitlab.com/ee/ci/docker/using_docker_images.html#define-an-image-from-a-private-container-registry&#34;&gt;few ways to reach this&lt;/a&gt; goal; I have used the one that sets up a variable via the &lt;code&gt;Settings -&amp;gt; CI/CD -&amp;gt; Variables&lt;/code&gt; option in GitLab’s web UI:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r107-01-cicd-variable.png&#34; alt=&#34;Creating CI/CD variable with GitLab&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Creating CI/CD variable with GitLab&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The variable name should be &lt;code&gt;DOCKER_AUTH_CONFIG&lt;/code&gt; and the value:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;{
  &amp;quot;auths&amp;quot;: {
    &amp;quot;registry.example.com:5000&amp;quot;: {
      &amp;quot;auth&amp;quot;: &amp;quot;bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ=&amp;quot;
    }
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;quot;registry.example.com:5000&amp;quot;&lt;/code&gt; is replaced by our registry, for example &lt;code&gt;&amp;quot;index.docker.io&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the value for &lt;code&gt;&amp;quot;auth&amp;quot;&lt;/code&gt; is replaced by a base64-encoded version of our &lt;code&gt;&amp;quot;&amp;lt;username&amp;gt;:&amp;lt;password&amp;gt;&amp;quot;&lt;/code&gt;, which we can retrieve for example using R:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;base64enc::base64encode(charToRaw(&amp;quot;my_username:my_password&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ=&amp;quot;&lt;/code&gt;&lt;/pre&gt;
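&lt;p&gt;The whole value of the variable can also be assembled programmatically instead of by hand. The sketch below is one possible way to do so, assuming Node.js is available (it relies on &lt;code&gt;Buffer&lt;/code&gt;); &lt;code&gt;dockerAuthConfig&lt;/code&gt; is a hypothetical helper name, and the registry and credentials are the placeholder values used in the example above.&lt;/p&gt;
&lt;pre class=&#34;javascript&#34;&gt;&lt;code&gt;// Hypothetical helper, assuming Node.js
// (Buffer is not available in browsers)
function dockerAuthConfig(registry, username, password) {
  // base64-encode the username:password pair
  const auth = Buffer.from(username + ':' + password).toString('base64');
  const auths = {};
  auths[registry] = { auth: auth };
  // Pretty-print so the value is easy to paste into the web UI
  return JSON.stringify({ auths: auths }, null, 2);
}

// Placeholder registry and credentials, replace with your own
const value = dockerAuthConfig(
  'index.docker.io',
  'my_username',
  'my_password'
);
console.log(value);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The printed JSON has the same shape as the value shown above and can be pasted into the variable as-is. As it contains the encoded credentials, it should be treated like a password.&lt;/p&gt;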
&lt;p&gt;And that is all! We are now ready to run our checks using a Docker image stored in a private repository. Once we push the &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; and &lt;code&gt;inst/rhubcheck.R&lt;/code&gt; files to a GitLab repository, the pipeline will be automatically executed every time we push a commit to that repository.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr-just-show-it-to-me-in-action&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR: Just show it to me in action&lt;/h1&gt;
&lt;p&gt;In case you are only interested in seeing the CI/CD pipeline with R-hub implemented for an R package, look at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/experimental/.gitlab-ci.yml&#34;&gt;.gitlab-ci.yml&lt;/a&gt; file for the &lt;code&gt;jhaddins&lt;/code&gt; package on branch experimental&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/rhub/Dockerfile&#34;&gt;Dockerfile&lt;/a&gt; used to build the image used in the above .gitlab-ci.yml&lt;/li&gt;
&lt;li&gt;An &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/experimental/inst/rhubcheck.R&#34;&gt;R script that runs the checks&lt;/a&gt; via R-hub and evaluates the results&lt;/li&gt;
&lt;li&gt;An example &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/pipelines/57477509&#34;&gt;of a successful run&lt;/a&gt; with checks on 3 platforms&lt;/li&gt;
&lt;li&gt;An example output of a &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/-/jobs/199152548&#34;&gt;check on Windows&lt;/a&gt; provided by R-hub&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;r-hub&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R-hub&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://blog.r-hub.io/2019/03/26/why-care/&#34;&gt;why should you care about R-hub?&lt;/a&gt; blog post&lt;/li&gt;
&lt;li&gt;Get &lt;a href=&#34;https://r-hub.github.io/rhub/articles/rhub.html&#34;&gt;started with R-hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R-hub on the &lt;a href=&#34;https://www.r-consortium.org/projects/r-hub&#34;&gt;R Consortium website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;R-hub’s &lt;a href=&#34;https://r-hub.github.io/rhub/reference/index.html&#34;&gt;reference online&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation on &lt;a href=&#34;https://r-hub.github.io/rhub/reference/rhub_check.html&#34;&gt;rhub_check R6 objects&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-work-and-gitlab&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R work and GitLab&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Blog post on &lt;a href=&#34;https://jozef.io/r106-r-package-gitlab-ci/&#34;&gt;automating R analysis, modeling and development using CI/CD, with working examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;GitLab &lt;a href=&#34;https://docs.gitlab.com/ee/ci/README.html&#34;&gt;Continuous Integration documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;GitLab CI/CD &lt;a href=&#34;https://docs.gitlab.com/ee/ci/variables/&#34;&gt;environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Using a &lt;a href=&#34;https://docs.gitlab.com/ee/ci/docker/using_docker_images.html#define-an-image-from-a-private-container-registry&#34;&gt;private container registry&lt;/a&gt; with GitLab CI/CD&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-work-and-docker&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R work and Docker&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rocker-project.org/images/&#34;&gt;Docker images for R&lt;/a&gt; on the Rocker Project&lt;/li&gt;
&lt;li&gt;Colin Fay’s &lt;a href=&#34;https://colinfay.me/docker-r-reproducibility/&#34;&gt;Introduction to Docker for R Users&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.docker.com/get-started/&#34;&gt;Get started with Docker&lt;/a&gt; official documentation&lt;/li&gt;
&lt;li&gt;Using &lt;a href=&#34;https://docs.docker.com/docker-hub/repos/#private-repositories&#34;&gt;private repositories&lt;/a&gt; in DockerHub&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How to easily automate R analysis, modeling and development work using CI/CD, with working examples</title>
      <link>https://jozef.io/r106-r-package-gitlab-ci/</link>
      <pubDate>Sat, 13 Apr 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r106-r-package-gitlab-ci/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Automating the execution, testing and deployment of R work is a powerful way to ensure the reproducibility, quality and overall robustness of the code that we are building, be it for data analysis and modeling purposes, developing R packages or even blogging. Modern tools also provide a free and easy-to-use way of achieving this goal.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will show a quick and simple way to automate R data analysis and package development checking, testing and installation with GitLab CI/CD and provide example files that can be used for testing packages and deploying blogdown-based websites.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#a-quick-overview-of-cicd-and-gitlabs-approach-to-it&#34;&gt;A quick overview of CI/CD and GitLab’s approach to it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-simplest-example-with-r-use&#34;&gt;The simplest example with R use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#example-pipeline-for-r-package-testing-and-deployment&#34;&gt;Example pipeline for R package testing and deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#docker-images-for-r-users-and-developers&#34;&gt;Docker images for R users and developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr-just-show-it-to-me-in-action&#34;&gt;TL;DR: Just show it to me in action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;a-quick-overview-of-cicd-and-gitlabs-approach-to-it&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A quick overview of CI/CD and GitLab’s approach to it&lt;/h1&gt;
&lt;p&gt;In this section, we will try to introduce GitLab CI/CD and its prerequisites in very &lt;em&gt;simple and practical terms, at the cost of technical precision&lt;/em&gt;. Terms will be linked to relevant pages for those interested in precise definitions. We will also focus on using the CI/CD provided directly on GitLab. It is also possible to use it on your own infrastructure, but this is outside the scope of this introductory post.&lt;/p&gt;
&lt;div id=&#34;what-is-cicd-and-how-to-use-it-with-gitlab&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is CI/CD and how to use it with GitLab&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Continuous_integration&#34;&gt;Continuous integration (CI)&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Continuous_deployment&#34;&gt;Continuous deployment (CD)&lt;/a&gt; are IT practices that encourage checking and testing code often (e.g. on every change pushed to a repository) and being able to provide the resulting product (e.g. an application) to the users automatically.&lt;/p&gt;
&lt;p&gt;For the purpose of this post, we will be focusing on R code and will be happy with the CI/CD&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;detecting any code changes that we make in our repository&lt;/li&gt;
&lt;li&gt;automatically running a set of actions that we define when changes are detected&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-gitlab-cicd-what-can-it-do&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is GitLab CI/CD, what can it do&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://about.gitlab.com/product/continuous-integration/&#34;&gt;GitLab CI/CD&lt;/a&gt; is a service provided by GitLab that makes using basic CI/CD easy even for non-IT professionals, such as R users&lt;/li&gt;
&lt;li&gt;Free for both public and private repositories hosted on GitLab&lt;/li&gt;
&lt;li&gt;Can execute a &lt;a href=&#34;https://docs.gitlab.com/ee/ci/examples/&#34;&gt;wide variety of tasks&lt;/a&gt;, ranging from executing custom scripts, deployment of Java applications, building Docker images, checking and testing R packages to publishing blogs and more&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-prerequisites-how-to-make-it-work&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What are the prerequisites, how to make it work?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;To use GitLab CI/CD your project’s code should be hosted on &lt;a href=&#34;https://gitlab.com/&#34;&gt;GitLab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;To make it work, you need to create a &lt;a href=&#34;https://docs.gitlab.com/ee/ci/yaml/&#34;&gt;yaml file&lt;/a&gt; called &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; with the instructions and push it to the root directory of your project’s repository&lt;/li&gt;
&lt;li&gt;Once you do that, instructions in that yaml file will be executed by GitLab automatically each time you push a code change to the repository. Different triggers such as specified times, etc. can also be used&lt;/li&gt;
&lt;/ul&gt;
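&lt;p&gt;As a rough sketch of a time-based trigger, a job can be restricted to scheduled pipelines with the &lt;code&gt;rules&lt;/code&gt; keyword available in more recent GitLab versions. The job name &lt;code&gt;nightly&lt;/code&gt; is hypothetical, and the schedule itself is assumed to be defined in the project’s CI/CD settings rather than in the yaml file:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image: r-base

# hypothetical job, assumed to run only when a pipeline schedule starts the pipeline
nightly:
  rules:
    - if: $CI_PIPELINE_SOURCE == &amp;quot;schedule&amp;quot;
  script:
    - R -e &amp;#39;print(Sys.time())&amp;#39;&lt;/code&gt;&lt;/pre&gt;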
&lt;/div&gt;
&lt;div id=&#34;why-gitlab-what-if-my-code-is-on-github&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Why GitLab? What if my code is on GitHub?&lt;/h2&gt;
&lt;p&gt;This post is by no means meant as an advertisement for GitLab. I chose it some time ago for 2 very simple reasons:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;It allowed for free private repositories, which is now also true for GitHub&lt;/li&gt;
&lt;li&gt;The CI/CD is fully integrated, with no need for other tools&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you use GitHub, the favorite CI tool for R code hosted there seems to be &lt;a href=&#34;https://travis-ci.com/&#34;&gt;Travis&lt;/a&gt;. Some examples specific for R can be &lt;a href=&#34;https://docs.travis-ci.com/user/languages/r/#examples&#34;&gt;found here&lt;/a&gt;. You can also read a more generic &lt;a href=&#34;https://docs.travis-ci.com/user/tutorial/&#34;&gt;Travis CI Tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-simplest-example-with-r-use&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The simplest example with R use&lt;/h1&gt;
&lt;p&gt;To make the post a bit less abstract and more practical, here is an overly simplified example of GitLab CI/CD used with R, which just runs the current version of R and prints the &lt;code&gt;mtcars&lt;/code&gt; dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci&#34;&gt;repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci/blob/master/.gitlab-ci.yml&#34;&gt;.gitlab-ci.yml&lt;/a&gt; file&lt;/li&gt;
&lt;li&gt;An overview of &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci/pipelines&#34;&gt;pipeline runs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;An example &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci/-/jobs/194646370&#34;&gt;output of a run&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s have a look at this very simplistic &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image: r-base

test:
  script:
  - R -e &amp;#39;print(datasets::mtcars)&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;image: r-base&lt;/code&gt; tells GitLab CI/CD to use the r-base Docker image for the run - more on that later&lt;/li&gt;
&lt;li&gt;the rest of the yaml tells GitLab CI/CD to run a job named &lt;code&gt;test&lt;/code&gt;, its task is to execute a &lt;code&gt;script&lt;/code&gt; defined as &lt;code&gt;R -e &#39;print(datasets::mtcars)&#39;&lt;/code&gt;, meaning just to run R and print the &lt;code&gt;mtcars&lt;/code&gt; dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now let us take a look at a more useful example for developing R packages.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-pipeline-for-r-package-testing-and-deployment&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example pipeline for R package testing and deployment&lt;/h1&gt;
&lt;p&gt;We can use GitLab CI/CD to automatically&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Build our package&lt;/li&gt;
&lt;li&gt;Perform the &lt;code&gt;R CMD check&lt;/code&gt; and investigate if we have any errors, warnings or notes&lt;/li&gt;
&lt;li&gt;Run our unit tests&lt;/li&gt;
&lt;li&gt;Check our testing coverage and finally&lt;/li&gt;
&lt;li&gt;Install the package and potentially use it to perform more actions&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r106-01-ci.gif&#34; alt=&#34;GitLab CI/CD Pipeline for an R package&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;GitLab CI/CD Pipeline for an R package&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;An example .gitlab-ci.yml with a pipeline based on a Docker image to test an R package can look as follows. Note that this is most likely overkill and more verbose than necessary; a much shorter pipeline would serve the same purpose:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image: jozefhajnala/rdev:3.4.4

stages:
  - build
  - document
  - check
  - test
  - deploy

variables:
  _R_CHECK_CRAN_INCOMING_: &amp;quot;false&amp;quot;
  _R_CHECK_FORCE_SUGGESTS_: &amp;quot;true&amp;quot;
  CODECOV_TOKEN: &amp;quot;2329aed3-de38-468c-9a06-95564363211c&amp;quot;

before_script:
  - apt-get update

buildbinary:
  stage: build
  script:
    - r -e &amp;#39;devtools::build(binary = TRUE)&amp;#39;

documentation:
  stage: document
  script:
    - r inst/ci/document.R

checkerrors:
  stage: check
  script:
    - r -e &amp;#39;if (!identical(devtools::check(document = FALSE, args = &amp;quot;--no-tests&amp;quot;)[[&amp;quot;errors&amp;quot;]], character(0))) stop(&amp;quot;Check with Errors&amp;quot;)&amp;#39;

checkwarnings:
  stage: check
  script:
    - r -e &amp;#39;if (!identical(devtools::check(document = FALSE, args = &amp;quot;--no-tests&amp;quot;)[[&amp;quot;warnings&amp;quot;]], character(0))) stop(&amp;quot;Check with Warnings&amp;quot;)&amp;#39;

checknotes:
  stage: check
  script:
    - r -e &amp;#39;if (!identical(devtools::check(document = FALSE, args = &amp;quot;--no-tests&amp;quot;)[[&amp;quot;notes&amp;quot;]], character(0))) stop(&amp;quot;Check with Notes&amp;quot;)&amp;#39;

unittests:
  stage: test
  script:
    - r -e &amp;#39;if (any(as.data.frame(devtools::test())[[&amp;quot;failed&amp;quot;]] &amp;gt; 0)) stop(&amp;quot;Some tests failed.&amp;quot;)&amp;#39;

codecov:
  stage: test
  script:
    - r -e &amp;#39;covr::codecov()&amp;#39;

install:
  stage: deploy
  script:
    - r -e &amp;#39;devtools::install()&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s again take a look at the content:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;image&lt;/code&gt; - a docker image to use for the pipeline&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stages&lt;/code&gt; - defines the ordering of job execution. Jobs of the same stage are run in parallel, and jobs of the next stage are run after the jobs from the previous stage complete successfully. Stages are the “columns” of the chart above.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;variables&lt;/code&gt; - used to pass environment variables to the jobs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;before_script&lt;/code&gt; - used to define commands that should be run before all jobs. An example is installing a needed R package that is not contained in our Docker image.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rest are job definitions. Each job has a &lt;code&gt;stage&lt;/code&gt;, which defines the stage in which it is run; multiple jobs can belong to one stage. A &lt;code&gt;script&lt;/code&gt; essentially defines what to do. For R use, this is usually one of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Rscript -e &#39;&amp;lt;r commands&amp;gt;&#39;&lt;/code&gt; to execute R commands specified between the quotes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Rscript pathtoscript.R&lt;/code&gt; to execute a script stored in a file&lt;/li&gt;
&lt;li&gt;for &lt;a href=&#34;http://dirk.eddelbuettel.com/code/littler.html&#34;&gt;littler&lt;/a&gt; users, we can replace &lt;code&gt;Rscript&lt;/code&gt; with &lt;code&gt;r&lt;/code&gt; for similar purposes as above&lt;/li&gt;
&lt;/ul&gt;
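&lt;p&gt;The script forms listed above could be sketched as two jobs in one stage. This is only an illustration: the job names and the path &lt;code&gt;inst/ci/checks.R&lt;/code&gt; are hypothetical, and the image is assumed to provide &lt;code&gt;Rscript&lt;/code&gt; and the devtools package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;stages:
  - test

# R commands passed inline between the quotes
inlinecommands:
  stage: test
  script:
    - Rscript -e &amp;#39;devtools::install()&amp;#39;

# R code stored in a file; the path is a hypothetical example
scriptfile:
  stage: test
  script:
    - Rscript inst/ci/checks.R&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A job is marked as failed when one of its commands exits with a non-zero status, which is why the check and test jobs in the pipeline above call &lt;code&gt;stop()&lt;/code&gt; when they find a problem.&lt;/p&gt;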
&lt;/div&gt;
&lt;div id=&#34;docker-images-for-r-users-and-developers&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Docker images for R users and developers&lt;/h1&gt;
&lt;p&gt;As we have seen, most of the pipelines start with &lt;code&gt;image: &amp;lt;image&amp;gt;&lt;/code&gt;. This tells GitLab to use the specified Docker image for the run, which is extremely useful because a suitable Docker image will include all the software that we need to execute our analyses, modeling or other tasks, without us having to install that software within the .yaml file. Examples of such software available in an image for R use are, obviously, R itself and other dependencies such as additional packages.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you would like to read more about Docker, Colin Fay has you covered with his &lt;a href=&#34;https://colinfay.me/docker-r-reproducibility/&#34;&gt;Introduction to Docker for R Users&lt;/a&gt;. For now, let’s just assume that using this image provides GitLab with a place that has R (and all needed packages) installed and can run the specified scripts for us.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the great things about Docker images is that they are easy to share and adapt. A huge thank you and kudos go to Carl Boettiger and Dirk Eddelbuettel, who maintain the &lt;a href=&#34;https://www.rocker-project.org/&#34;&gt;Rocker project&lt;/a&gt; which provides a &lt;a href=&#34;https://www.rocker-project.org/images&#34;&gt;collection of images&lt;/a&gt; suited for different R needs built on Debian.&lt;/p&gt;
&lt;p&gt;My personal favorites from the Rocker project are&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://hub.docker.com/r/rocker/r-ver&#34;&gt;r-ver&lt;/a&gt; images - providing an environment fixed in time, including using a specifically dated MRAN repository. Have a look at their &lt;a href=&#34;https://github.com/rocker-org/rocker-versioned/tree/master/r-ver&#34;&gt;Dockerfiles on GitHub&lt;/a&gt;. The image used by my CI pipeline for testing packages is &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/r3.4.4/Dockerfile&#34;&gt;adapted from r-ver:3.4.4&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://hub.docker.com/_/r-base&#34;&gt;r-base&lt;/a&gt; for the current version of base R. Have a look at the &lt;a href=&#34;https://github.com/rocker-org/rocker/blob/master/r-base/Dockerfile&#34;&gt;Dockerfile on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
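&lt;p&gt;As a minimal sketch of how these images might be referenced in a &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;, the fragment below sets a version-fixed image as the default and overrides it for a single job. The job names are hypothetical and the &lt;code&gt;3.4.4&lt;/code&gt; tag is assumed from the r-ver version mentioned above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# default image for all jobs: a fixed R version from the Rocker project
image: rocker/r-ver:3.4.4

fixedversion:
  script:
    - Rscript -e &amp;#39;print(R.version.string)&amp;#39;

# a job-level image overrides the default for this job only
currentversion:
  image: r-base
  script:
    - Rscript -e &amp;#39;print(R.version.string)&amp;#39;&lt;/code&gt;&lt;/pre&gt;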
&lt;/div&gt;
&lt;div id=&#34;tldr-just-show-it-to-me-in-action&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR: Just show it to me in action&lt;/h1&gt;
&lt;p&gt;In case you are only interested in seeing the CI/CD pipeline work in action for some R uses, you can look at:&lt;/p&gt;
&lt;div id=&#34;the-simplest-example-using-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The simplest example using R&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci&#34;&gt;repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci/blob/master/.gitlab-ci.yml&#34;&gt;.gitlab-ci.yml&lt;/a&gt; file&lt;/li&gt;
&lt;li&gt;An overview of the &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci/pipelines&#34;&gt;pipeline runs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;An example &lt;a href=&#34;https://gitlab.com/jozefhajnala/simplestrci/-/jobs/194646370&#34;&gt;output of a run&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;the-mentioned-r-package-testing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The mentioned R package testing&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/develop/.gitlab-ci.yml&#34;&gt;.gitlab-ci.yml&lt;/a&gt; file for the &lt;code&gt;jhaddins&lt;/code&gt; package on branch develop&lt;/li&gt;
&lt;li&gt;An overview of the &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/pipelines&#34;&gt;pipeline runs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;An example &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/pipelines/55954699&#34;&gt;output of a successful run&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;An example output of a run that &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/-/jobs/139893165&#34;&gt;discovered a NOTE&lt;/a&gt; in the check process&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;building-a-docker-image&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Building a Docker image&lt;/h2&gt;
&lt;p&gt;The Docker image &lt;code&gt;jozefhajnala/rdev:3.4.4&lt;/code&gt; used above&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;is based on the following &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/r3.4.4/Dockerfile&#34;&gt;Dockerfile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;is also built with GitLab CI/CD, look at the &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/blob/master/.gitlab-ci.yml&#34;&gt;.gitlab-ci.yml file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;the build &lt;a href=&#34;https://gitlab.com/jozefhajnala/dockerfiles/-/jobs/194667135&#34;&gt;pipeline in action&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;publishing-a-hugo-based-blogdown-blog&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Publishing a Hugo-based blogdown blog&lt;/h2&gt;
&lt;p&gt;Also, this blog itself is deployed on a schedule via GitLab CI/CD, using a file very similar to the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://pages.gitlab.io/hugo/&#34;&gt;Example Hugo site&lt;/a&gt; deployed with GitLab CI/CD&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://gitlab.com/pages/hugo/blob/master/.gitlab-ci.yml&#34;&gt;.gitlab-ci.yml&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;GitLab &lt;a href=&#34;https://docs.gitlab.com/ee/ci/README.html&#34;&gt;Continuous Integration documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;GitLab CI/CD &lt;a href=&#34;https://docs.gitlab.com/ee/ci/yaml/README.html&#34;&gt;Pipeline Configuration Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Colin Fay’s &lt;a href=&#34;https://colinfay.me/docker-r-reproducibility/&#34;&gt;Introduction to Docker for R Users&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.travis-ci.com/user/languages/r&#34;&gt;Building an R Project&lt;/a&gt; with Travis CI&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rocker-project.org/images/&#34;&gt;Docker images for R&lt;/a&gt; on the Rocker Project&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Should you start your R blog now? 6 reasons I found in my first year of R blogging</title>
      <link>https://jozef.io/r914-one-year-r-blogging/</link>
      <pubDate>Sat, 30 Mar 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r914-one-year-r-blogging/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;It has been a year since I posted &lt;a href=&#34;https://jozef.io/r000-about-case4base/&#34;&gt;the first post&lt;/a&gt; on this blog. Since that time, I have learned many lessons, but the main one is probably that blogging has never been as accessible as it is now.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this anniversary post, I would like to give you a few reasons to start your own R blog and write about what I have learned in my first year of blogging about R.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#the-barrier-to-entry-is-low-and-the-tools-excellent&#34;&gt;The barrier to entry is low and the tools excellent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#writing-is-a-great-way-to-learn-and-discover&#34;&gt;Writing is a great way to learn and discover&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#getting-some-readers-is-easier-than-expected&#34;&gt;Getting some readers is easier than expected&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-community-is-amazing&#34;&gt;The community is amazing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#blogging-is-fun&#34;&gt;Blogging is fun&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#write-for-yourself-the-inspiration-will-come&#34;&gt;Write for yourself, the inspiration will come&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;the-barrier-to-entry-is-low-and-the-tools-excellent&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The barrier to entry is low and the tools excellent&lt;/h1&gt;
&lt;p&gt;For many people, writing a blog on their own can seem like a challenge. In the end, you are basically creating a full-blown website, with styles, content, hopefully also with a responsive design. Then you need to setup hosting, publishing, and all the other necessities to actually get the content online. How does one do all that on their own?&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://bookdown.org/yihui/blogdown/images/logo.png&#34; alt=&#34;Blogdown Logo&#34; class=&#34;leftsmall&#34;&gt;&lt;/p&gt;
&lt;p&gt;As in many other areas where a task would have been difficult years ago, the tools have come a long way and we need very little technical knowledge to have a blog up and running in under an hour. I &lt;a href=&#34;https://jozef.io/r907-christmas-praise/&#34;&gt;wrote about the amazing free tools&lt;/a&gt; that I personally use for this blog and I really believe that thanks to those tools blogging about R is very accessible to a wide range of R developers and users.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are interested in starting up right now, I would recommend taking a look at the &lt;a href=&#34;https://bookdown.org/yihui/blogdown/get-started.html&#34;&gt;get started chapter&lt;/a&gt; of the blogdown: Creating Websites with R Markdown written by Yihui Xie, the author of the blogdown package himself.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;writing-is-a-great-way-to-learn-and-discover&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Writing is a great way to learn and discover&lt;/h1&gt;
&lt;p&gt;When I was starting to write the blog, the intention was mainly to provide more exposure to base R functionality, which I felt has too little presence and popularity online, at least relative to other packages with much better presence and marketing - hence the &lt;a href=&#34;https://jozef.io/categories/rcase4base/&#34;&gt;R:case4base section&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Regardless of whether this mission was successful or not, it surprised me how much I learned during the writing process, be it technical details, alternative ways of implementation, other class methods or even function arguments I had never used before. Writing about R requires a lot of reading, which in turn resulted in learning many new approaches and exploring ideas.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Even writing this post I have discovered a cool new R package - &lt;a href=&#34;https://github.com/gaborcsardi/prompt&#34;&gt;prompt&lt;/a&gt; by Gábor Csárdi.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Writing will of course force you, well, to write your thoughts down, which is more difficult than it seems, especially if you are not a trained writer already. It also helps you express your thoughts in a more concise way. I find that even if no one except me read the blog posts, the added value of the writing itself would be worth the effort.&lt;/p&gt;
&lt;p&gt;Last but not least, I have used the posts on my blog as a reference for work since most of the time I write about issues I come across and try to propose solutions. Keeping a blog is a good way to have a written resource that you can get back to when you approach the same challenge at a later point in time. For example, I come back to the &lt;a href=&#34;https://jozef.io/r902-primer-java-from-r-2/#handling-java-exceptions-in-r&#34;&gt;handling Java exceptions&lt;/a&gt; regularly to refresh my memory.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you think without writing, you only think you’re thinking. &lt;br/&gt;&lt;em&gt;Leslie Lamport&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-some-readers-is-easier-than-expected&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting some readers is easier than expected&lt;/h1&gt;
&lt;p&gt;Apart from the learning experience, most of the people who write are happy when people actually read their blog and, if that is the goal, find it helpful. When starting, my optimistic expectation was that it would start to get some readers and crawl out of obscurity about a year from the first post, provided I keep posting consistently.&lt;/p&gt;
&lt;p&gt;Much to my surprise, getting exposure proved much easier than anticipated, mainly thanks to the amazing &lt;a href=&#34;https://r-bloggers.com/&#34;&gt;r-bloggers&lt;/a&gt;, an aggregator of R blogs with a huge reader base. In fact, in the first 3 months since it was added, 40% of all the readers came to this blog from r-bloggers. There are also other aggregators and websites that you can add your content to for some extra exposure, such as &lt;a href=&#34;https://rweekly.org/&#34;&gt;R Weekly&lt;/a&gt; and &lt;a href=&#34;https://awesome-blogdown.com/&#34;&gt;Awesome Blogdown&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Resources online that provide some very useful tips include Maëlle’s &lt;a href=&#34;https://masalmon.eu/2018/07/16/soapbox/&#34;&gt;Get on your soapbox! R blog content and promotion&lt;/a&gt; post. I only discovered that the Twitter hashtag for R is #rstats reading this blog post. &lt;br /&gt;&lt;em&gt;Thanks, Maëlle!&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;the-community-is-amazing&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The community is amazing&lt;/h1&gt;
&lt;p&gt;In an effort to gain more exposure, one can also turn to social media. And since my poor skills predispose me to not much more than Twitter (which I still cannot use properly), I try to at least post a tweet when I publish a new post - with variable success. Twitter is a lot of work, in fact much more than one would expect, so I failed miserably on my goals to publish &lt;em&gt;n&lt;/em&gt; tweets each week. There probably were months where I did not even open Twitter at all. The good thing about Twitter is that, at least in my experience, there seems to be a very strong correlation between the amount of work you put in and the exposure you get. And the amount variable is fully in your hands.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On an even more positive note, the #rstats community is just full of helpful and nice individuals, so any worries you might have will disappear soon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It can also happen that you get lucky and some of the #rstats superstars such as &lt;a href=&#34;https://twitter.com/dataandme&#34;&gt;Mara Averick&lt;/a&gt; will notice and retweet your tweet, which can really help boost your exposure. And you can easily communicate with other well-known figures of the community as well.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;blogging-is-fun&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Blogging is fun&lt;/h1&gt;
&lt;p&gt;When push comes to shove, writing a blog is a spare time activity and to invest part of those precious moments, one must enjoy it. And there is a lot to enjoy when creating your own blog, especially with blogdown and hugo, where you have full control over the entire content and infrastructure of the blog. You can enjoy a multitude of activities related to content, design and more. To put this in perspective, my personal favorite time wasters include&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;endlessly obsessing about tiny details in the css, making sure everything is exactly as I want it to be, resulting in commits with messages like &lt;code&gt;Dont use -1em top margin for pre.r, use 0 instead&lt;/code&gt;. Obviously, “as I want it to be” changes pretty much on a monthly basis, with current weather potentially having a significant effect.&lt;/li&gt;
&lt;li&gt;trying to make the site light to load, resulting in spending hours on editing the &lt;a href=&#34;https://github.com/encharm/Font-Awesome-SVG-PNG&#34;&gt;svg representations of Font-Awesome&lt;/a&gt; icons to save 75KB of resources on page load. Check the footer of the page source code if you are interested in the result.&lt;/li&gt;
&lt;li&gt;related to the above, make interactive charts light to load by &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/develop/R/makeHighChart.R&#34;&gt;writing a wrapper&lt;/a&gt; to minimize the rendered highchart size to the necessary JavaScript minimum&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;write-for-yourself-the-inspiration-will-come&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Write for yourself, the inspiration will come&lt;/h1&gt;
&lt;p&gt;One of my worries when starting a blog was that once I was done with the first few posts that I had planned, I would not find more inspiration and topics to write about. After sticking to my schedule of posting every other Saturday, instead of running out of inspiration, it seems that the topics that I want to write about are coming at a pace faster than 2 per month, which essentially means that if this keeps up, I will never run out of ideas. And if that tragic moment comes when I have nothing to write about, I guess the internet will happily &lt;a href=&#34;https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read&#34;&gt;keep growing&lt;/a&gt; without those few kilobytes I would add.&lt;/p&gt;
&lt;p&gt;In terms of popularity, I write mostly what I find interesting and helpful for my future self. Even if I tried to write what others like, my estimation of what others may be interested in reading is so bad I would fail miserably. A case in point, my personal estimation was that the posts on &lt;a href=&#34;https://jozef.io/tags/rjava/&#34;&gt;interfacing Java from R&lt;/a&gt; would be the most read posts of the year. They took a lot of work and investigation to write and I find them really interesting. As it happens, both of them combined only have 10% of the reads compared to the post I called &lt;a href=&#34;https://jozef.io/r907-christmas-praise/&#34;&gt;Christmas praise&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Happy R blogging!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How to create professional reports from R scripts, with custom styles</title>
      <link>https://jozef.io/r913-spin-with-style/</link>
      <pubDate>Sat, 16 Mar 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r913-spin-with-style/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the &lt;a href=&#34;https://jozef.io/r909-rmarkdown-tips/&#34;&gt;practical tips for R Markdown&lt;/a&gt; post we talked briefly about how we can easily create professional reports directly from R scripts, without the need for converting them manually to Rmd and creating code chunks. In this one, we will provide useful tips on advanced options for styling, using themes and producing light-weight HTML reports directly from R scripts. We will also provide a repository with an example R script and rendering code to get differently styled and sized outputs easily.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#creating-reports-directly-from-r-scripts&#34;&gt;Creating reports directly from R scripts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-knitrs-spin-directly&#34;&gt;Using knitr’s spin directly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-rmarkdowns-render&#34;&gt;Using rmarkdown’s render&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr-just-show-me-the-examples&#34;&gt;TL;DR: Just show me the examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-reports-directly-from-r-scripts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Creating reports directly from R scripts&lt;/h1&gt;
&lt;p&gt;For an introduction on creating nice reports directly from R scripts, look into the &lt;a href=&#34;https://jozef.io/r909-rmarkdown-tips/#creating-beautiful-multi-format-reports-directly-from-r-scripts&#34;&gt;relevant section&lt;/a&gt; of the previous blog post. In one sentence, we can just call one of the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# with knitr directly
knitr::spin(&amp;quot;path-to-r-script.R&amp;quot;)

# or with rmarkdown
rmarkdown::render(&amp;quot;path-to-r-script.R&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to create a report from an R script directly. Both &lt;code&gt;spin()&lt;/code&gt; and &lt;code&gt;render()&lt;/code&gt; provide a default style that will be used to render an R script to HTML. The same is true for RStudio’s built-in &lt;code&gt;File -&amp;gt; Compile report...&lt;/code&gt; functionality, which will call &lt;code&gt;render()&lt;/code&gt; in the background when used.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We might, however, be interested in using styles other than the default one when rendering our R scripts into HTML reports, and there are multiple ways to achieve this.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;including-styles-the-quick-dirty-and-risky-way&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Including styles the quick, dirty and risky way&lt;/h2&gt;
&lt;p&gt;The fastest way to include a custom css stored in a file is to simply include a line like the following at the beginning of the R script that we are using &lt;code&gt;spin()&lt;/code&gt; on:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#&amp;#39; &amp;lt;link rel=&amp;quot;stylesheet&amp;quot; type=&amp;quot;text/css&amp;quot; href=&amp;quot;path-to-our.css&amp;quot;&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This simple approach however has many caveats, as the line is just inserted into the body of the document within a paragraph, completely oblivious to what else was inserted. Unless there is a very good reason, we should use one of the safer and more robust approaches mentioned below.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-knitrs-spin-directly&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using knitr’s spin directly&lt;/h1&gt;
&lt;div id=&#34;under-the-spins-hood&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Under the spin’s hood&lt;/h2&gt;
&lt;p&gt;Under the hood, &lt;code&gt;spin()&lt;/code&gt; calls &lt;code&gt;knit2html()&lt;/code&gt;, which passes many useful arguments to &lt;code&gt;markdownToHTML()&lt;/code&gt;, the function that actually converts a markdown file to the final HTML format. Unfortunately, many of those useful arguments are not exposed via &lt;code&gt;spin()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Bearing this in mind, we have a few ways to access and provide them with the desired values:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Changing the options that govern the default values and just call &lt;code&gt;spin()&lt;/code&gt; as before&lt;/li&gt;
&lt;li&gt;Perform the spinning in 2 steps&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;changing-the-options-that-govern-the-default-values-and-just-call-spin-as-before&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Changing the options that govern the default values and just call &lt;code&gt;spin()&lt;/code&gt; as before&lt;/h2&gt;
&lt;p&gt;As mentioned above, &lt;code&gt;spin()&lt;/code&gt; does not expose the arguments of &lt;code&gt;markdownToHTML()&lt;/code&gt; directly, so what happens in practice is that the default values for those arguments are used when &lt;code&gt;spin()&lt;/code&gt; is called. Some of the interesting arguments are by default selected in the following way:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options = getOption(&amp;quot;markdown.HTML.options&amp;quot;),
extensions = getOption(&amp;quot;markdown.extensions&amp;quot;),
stylesheet = getOption(&amp;quot;markdown.HTML.stylesheet&amp;quot;),
template = getOption(&amp;quot;markdown.HTML.template&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s have a look at some interesting default options’ values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(markdown)
options()[c(
  &amp;quot;markdown.HTML.options&amp;quot;,
  &amp;quot;markdown.extensions&amp;quot;,
  &amp;quot;markdown.HTML.stylesheet&amp;quot;
)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $markdown.HTML.options
## [1] &amp;quot;use_xhtml&amp;quot;      &amp;quot;smartypants&amp;quot;    &amp;quot;base64_images&amp;quot;  &amp;quot;mathjax&amp;quot;       
## [5] &amp;quot;highlight_code&amp;quot;
## 
## $markdown.extensions
## [1] &amp;quot;no_intra_emphasis&amp;quot; &amp;quot;tables&amp;quot;            &amp;quot;fenced_code&amp;quot;      
## [4] &amp;quot;autolink&amp;quot;          &amp;quot;strikethrough&amp;quot;     &amp;quot;lax_spacing&amp;quot;      
## [7] &amp;quot;space_headers&amp;quot;     &amp;quot;superscript&amp;quot;       &amp;quot;latex_math&amp;quot;       
## 
## $markdown.HTML.stylesheet
## [1] &amp;quot;/usr/local/lib/R/site-library/markdown/resources/markdown.css&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we want to keep the spinning in one step, we can simply update those options before calling &lt;code&gt;spin()&lt;/code&gt; (and ideally change them back afterwards). For a somewhat minimalistic HTML output that still keeps images self-contained, we can do:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(
  markdown.extensions = &amp;quot;fenced_code&amp;quot;,
  markdown.HTML.options = &amp;quot;base64_images&amp;quot;,
  markdown.HTML.stylesheet = &amp;quot;{}&amp;quot;
)
knitr::spin(&amp;quot;path-to-r-script.R&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
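&lt;p&gt;Changing the options back afterwards can be sketched with base R alone, since &lt;code&gt;options()&lt;/code&gt; invisibly returns the previous values of the options it changes. The helper name &lt;code&gt;with_markdown_options()&lt;/code&gt; below is hypothetical and not part of knitr or markdown. The final call only inspects an option to show the mechanism; we assume that in real use a call to &lt;code&gt;knitr::spin()&lt;/code&gt; would take its place:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: set options temporarily, evaluate code, restore
with_markdown_options &amp;lt;- function(new_opts, code) {
  # options() returns the old values of the options it changes
  old_opts &amp;lt;- options(new_opts)
  # Restore the old values even if evaluating the code fails
  on.exit(options(old_opts), add = TRUE)
  # The code argument is evaluated lazily, so it sees the temporary options
  force(code)
}

# In practice the second argument would be knitr::spin(&amp;quot;path-to-r-script.R&amp;quot;)
with_markdown_options(
  list(markdown.HTML.options = &amp;quot;base64_images&amp;quot;),
  getOption(&amp;quot;markdown.HTML.options&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Wrapping the restore step in &lt;code&gt;on.exit()&lt;/code&gt; means the session options are left untouched even when the rendering stops with an error.&lt;/p&gt;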
&lt;p&gt;To use a custom css stylesheet instead of the one &lt;a href=&#34;https://github.com/rstudio/markdown/blob/master/inst/resources/markdown.css&#34;&gt;provided by default&lt;/a&gt; with the markdown package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(markdown.HTML.stylesheet = &amp;quot;path_to_custom.css&amp;quot;)
knitr::spin(&amp;quot;path-to-r-script.R&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;perform-the-report-creation-in-2-steps&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Perform the report creation in 2 steps&lt;/h2&gt;
&lt;p&gt;The method above works but can seem quite workaround-ish. The method that could be considered more proper is to actually split the production of the final output into 2 steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Generate an intermediate .Rmd file via &lt;code&gt;spin()&lt;/code&gt;, using &lt;code&gt;spin(..., knit = FALSE)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;knit2html()&lt;/code&gt; on the created .Rmd file with the desired options directly specified as arguments&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This allows us to provide additional arguments &lt;code&gt;extensions&lt;/code&gt;, &lt;code&gt;stylesheet&lt;/code&gt;, &lt;code&gt;header&lt;/code&gt;, &lt;code&gt;template&lt;/code&gt; and &lt;code&gt;encoding&lt;/code&gt; in the second step, instead of relying on the changed options to be passed as defaults.&lt;/p&gt;
&lt;p&gt;The example below will embed the styles present in &lt;code&gt;path_to_custom.css&lt;/code&gt; into the resulting HTML:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Creates the intermediate path-to-r-script.Rmd
knitr::spin(&amp;quot;path-to-r-script.R&amp;quot;, knit = FALSE)

# Now create the final HTML output from
# path-to-r-script.Rmd, with desired options
knitr::knit2html(
  input = &amp;quot;path-to-r-script.Rmd&amp;quot;,
  stylesheet = &amp;quot;path_to_custom.css&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Either of the above approaches will embed the css directly into the HTML output that is produced, making the output larger in size.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that the arguments we are looking to provide to &lt;code&gt;knit2html()&lt;/code&gt; are implemented as part of &lt;code&gt;...&lt;/code&gt;, so we will have to name them. To look at the details, study the &lt;a href=&#34;https://www.rdocumentation.org/packages/markdown/versions/0.9/topics/markdownToHTML&#34;&gt;documentation of &lt;code&gt;markdownToHTML()&lt;/code&gt;&lt;/a&gt;, to which those arguments get passed.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r913-01-spin-with-css.png&#34; alt=&#34;spin with custom air.css&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;spin with custom &lt;a href=&#34;https://github.com/markdowncss/air&#34;&gt;air.css&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-rmarkdowns-render&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using rmarkdown’s render()&lt;/h1&gt;
&lt;p&gt;To produce an HTML report from an R script we can also use &lt;code&gt;rmarkdown::render()&lt;/code&gt; on an R script file. This will create a report that differs slightly from the default &lt;code&gt;knit()&lt;/code&gt; output. One notable difference for HTML output is that &lt;code&gt;render()&lt;/code&gt; will by default include inline base64 representations of fonts and JavaScript sources. It will also include some potentially useful metadata, such as the author’s name and the date of rendering.&lt;/p&gt;
&lt;div id=&#34;the-output_format-powerhouse&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The output_format powerhouse&lt;/h2&gt;
&lt;p&gt;The output of &lt;code&gt;render()&lt;/code&gt; is governed mainly by the &lt;code&gt;output_format&lt;/code&gt; argument. Most of the time users will pass on just the name of the format, such as &lt;code&gt;&amp;quot;html_document&amp;quot;&lt;/code&gt;, as most of the options are governed by the yaml metadata present at the beginning of our Rmd files.&lt;/p&gt;
&lt;p&gt;For R scripts we usually do not use the yaml metadata. In this case, we can take full advantage of the flexibility of that argument, passing a call to &lt;code&gt;rmarkdown::html_document()&lt;/code&gt; with the desired parameters as &lt;code&gt;output_format&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;minimalistic-output-with-render&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Minimalistic output with render()&lt;/h2&gt;
&lt;p&gt;To produce a minimalistic HTML output from our &lt;code&gt;path-to-r-script.R&lt;/code&gt; script, we can for example specify the following as &lt;code&gt;output_format&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rmarkdown::render(
  &amp;quot;path-to-r-script.R&amp;quot;, 
  output_format = rmarkdown::html_document(
    theme = NULL,
    mathjax = NULL,
    highlight = NULL
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;custom-css-with-render&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Custom css with render()&lt;/h2&gt;
&lt;p&gt;Including a custom css stylesheet is equally simple: just provide a &lt;code&gt;css&lt;/code&gt; argument with the css file path to the &lt;code&gt;html_document()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rmarkdown::render(
  &amp;quot;path-to-r-script.R&amp;quot;, 
  output_format = rmarkdown::html_document(
    theme = NULL,
    mathjax = NULL,
    highlight = NULL,
    css = &amp;quot;path_to_custom.css&amp;quot;
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An interesting property of including custom css styles is that by default the argument &lt;code&gt;self_contained&lt;/code&gt; is set to &lt;code&gt;TRUE&lt;/code&gt;, meaning that the full stylesheet will be embedded into the output HTML file, including all the external css imported into the one we are using. This means that if your stylesheets import external fonts such as the following, those will also be pasted directly into the output:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;@import url(http://fonts.googleapis.com/css?family=Open+Sans:300italic,300);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This behavior is different for &lt;code&gt;spin()&lt;/code&gt;, which will paste the &lt;code&gt;@import&lt;/code&gt; clause into the output as-is, instead of parsing and pasting the actual content of the provided url.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr-just-show-me-the-examples&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR: Just show me the examples&lt;/h1&gt;
&lt;p&gt;If instead of reading about it you would like to just test it yourself, I created a very simple R project showcasing the mentioned methods and some more &lt;a href=&#34;https://gitlab.com/jozefhajnala/gists/tree/master/rmarkdown/spin&#34;&gt;available via a GitLab repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The project has the following files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;src/path-to-r-script.R&lt;/code&gt; - an R script with custom formatted comments to be used as the source for creating reports with &lt;code&gt;knitr::spin()&lt;/code&gt; and &lt;code&gt;rmarkdown::render()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rendering_render.R&lt;/code&gt; - an R script that uses &lt;code&gt;rmarkdown::render()&lt;/code&gt; to create multiple different output reports based on &lt;code&gt;path-to-r-script.R&lt;/code&gt; and save them to &lt;code&gt;outputs/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rendering_spin.R&lt;/code&gt; - an R script that uses &lt;code&gt;knitr::spin()&lt;/code&gt; to create multiple different output reports based on &lt;code&gt;path-to-r-script.R&lt;/code&gt; and save them to &lt;code&gt;outputs/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;outputs/&lt;/code&gt; - HTML reports generated from the content of &lt;code&gt;path-to-r-script.R&lt;/code&gt; by running &lt;code&gt;rendering_spin.R&lt;/code&gt; and &lt;code&gt;rendering_render.R&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;css/&lt;/code&gt; - Example css used for creating &lt;code&gt;outputs/ex_04_spin_air_css.html&lt;/code&gt;, all credit for the &lt;code&gt;air.css&lt;/code&gt; goes to &lt;a href=&#34;https://github.com/markdowncss/air&#34;&gt;https://github.com/markdowncss/air&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/html-document.html&#34;&gt;HTML document&lt;/a&gt; chapter of the R Markdown: The Definitive Guide book&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/r909-rmarkdown-tips/&#34;&gt;Create R Markdown reports and presentations even better with these 3 practical tips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/markdowncss/air&#34;&gt;air.css&lt;/a&gt; style used to create the report on the screenshot above&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Creating blazing fast pivot tables from R with data.table - now with subtotals using grouping sets</title>
      <link>https://jozef.io/r912-datatable-grouping-sets/</link>
      <pubDate>Sat, 02 Mar 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r912-datatable-grouping-sets/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Data manipulation and aggregation is one of the classic tasks anyone working with data will come across. We of course can perform data transformation and aggregation &lt;a href=&#34;https://jozef.io/categories/rcase4base/&#34;&gt;with base R&lt;/a&gt;, but when speed and memory efficiency come into play, data.table is my package of choice.&lt;/p&gt;
&lt;p&gt;In this post we will look at one of the fresh and very useful features that came to data.table only last year - grouping sets, enabling us, for example, to create pivot table-like reports with sub-totals and a grand total quickly and easily.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#basic-by-group-summaries-with-data.table&#34;&gt;Basic by-group summaries with data.table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#quick-pivot-tables-with-subtotals-and-a-grand-total&#34;&gt;Quick pivot tables with subtotals and a grand total&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#custom-grouping-sets&#34;&gt;Custom grouping sets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#cube-and-rollup-as-special-cases-of-grouping-sets&#34;&gt;Cube and rollup as special cases of grouping sets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;basic-by-group-summaries-with-data.table&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basic by-group summaries with data.table&lt;/h1&gt;
&lt;p&gt;To showcase the functionality, we will use a very slightly modified dataset provided by Hadley Wickham’s &lt;a href=&#34;https://cran.r-project.org/package=nycflights13&#34;&gt;nycflights13&lt;/a&gt; package, mainly the &lt;code&gt;flights&lt;/code&gt; data frame. Let’s prepare a small dataset suitable for the showcase:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
dataurl &amp;lt;- &amp;quot;https://jozef.io/post/data/&amp;quot;
flights &amp;lt;- readRDS(url(paste0(dataurl, &amp;quot;r006/flights.rds&amp;quot;)))
flights &amp;lt;- as.data.table(flights)[month &amp;lt; 3]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, for those unfamiliar with data.table, to create a summary of distances flown per month and originating airport, we could simply use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;flights[, sum(distance), by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    month origin       V1
## 1:     1    EWR  9524521
## 2:     1    LGA  6359510
## 3:     1    JFK 11304774
## 4:     2    EWR  8725657
## 5:     2    LGA  5917983
## 6:     2    JFK 10331869&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To also name the new column nicely, say &lt;code&gt;distance&lt;/code&gt; instead of the default &lt;code&gt;V1&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;flights[, .(distance = sum(distance)), by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    month origin distance
## 1:     1    EWR  9524521
## 2:     1    LGA  6359510
## 3:     1    JFK 11304774
## 4:     2    EWR  8725657
## 5:     2    LGA  5917983
## 6:     2    JFK 10331869&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For more on basic data.table operations, look at the &lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html&#34;&gt;Introduction to data.table&lt;/a&gt; vignette.&lt;/p&gt;
&lt;p&gt;As you have probably noticed, the above gave us the sums of distances by months and origins. When creating reports, readers, especially those coming from Excel, may expect 2 extra perks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Looking at sub-totals and grand total&lt;/li&gt;
&lt;li&gt;Seeing the data in wide format&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since the wide format is just a reshape and data.table has had the &lt;a href=&#34;https://www.rdocumentation.org/packages/data.table/versions/1.12.0/topics/dcast.data.table&#34;&gt;&lt;code&gt;dcast()&lt;/code&gt;&lt;/a&gt; function for that for quite a while now, we will only briefly show it in practice. The focus of this post will be on the new functionality that was only released in &lt;a href=&#34;https://github.com/Rdatatable/data.table/blob/master/NEWS.md#changes-in-v1110--01-may-2018&#34;&gt;data.table v1.11&lt;/a&gt; in May last year - creating the grand- and sub-totals.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-pivot-tables-with-subtotals-and-a-grand-total&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick pivot tables with subtotals and a grand total&lt;/h1&gt;
&lt;p&gt;To create a “classic” pivot table as known from Excel, we need to aggregate the data and also compute the subtotals for all combinations of the selected dimensions and a grand total. In comes &lt;code&gt;cube()&lt;/code&gt;, the function that will do just that:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get subtotals for origin, month and month&amp;amp;origin with `cube()`:
cubed &amp;lt;- data.table::cube(
  flights,
  .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;)
)
cubed&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     month origin distance
##  1:     1    EWR  9524521
##  2:     1    LGA  6359510
##  3:     1    JFK 11304774
##  4:     2    EWR  8725657
##  5:     2    LGA  5917983
##  6:     2    JFK 10331869
##  7:     1   &amp;lt;NA&amp;gt; 27188805
##  8:     2   &amp;lt;NA&amp;gt; 24975509
##  9:    NA    EWR 18250178
## 10:    NA    LGA 12277493
## 11:    NA    JFK 21636643
## 12:    NA   &amp;lt;NA&amp;gt; 52164314&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, compared to the simple group by summary we did earlier, we have extra rows in the output:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Rows &lt;code&gt;7,8&lt;/code&gt; with months &lt;code&gt;1,2&lt;/code&gt; and origin &lt;code&gt;&amp;lt;NA&amp;gt;, &amp;lt;NA&amp;gt;&lt;/code&gt; - these are the subtotals per month across all origins&lt;/li&gt;
&lt;li&gt;Rows &lt;code&gt;9,10,11&lt;/code&gt; with months &lt;code&gt;NA, NA, NA&lt;/code&gt; and origins &lt;code&gt;EWR, LGA, JFK&lt;/code&gt; - these are the subtotals per origin across all months&lt;/li&gt;
&lt;li&gt;Row &lt;code&gt;12&lt;/code&gt; with &lt;code&gt;NA&lt;/code&gt; month and &lt;code&gt;&amp;lt;NA&amp;gt;&lt;/code&gt; origin - this is the grand total across all origins and months&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All that is left to get a familiar pivot table shape is to reshape the data to wide format with the aforementioned &lt;code&gt;dcast()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# - Origins in columns, months in rows
data.table::dcast(cubed, month ~ origin,  value.var = &amp;quot;distance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    month       NA      EWR      JFK      LGA
## 1:    NA 52164314 18250178 21636643 12277493
## 2:     1 27188805  9524521 11304774  6359510
## 3:     2 24975509  8725657 10331869  5917983&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# - Origins in rows, months in columns
data.table::dcast(cubed, origin ~ month,  value.var = &amp;quot;distance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    origin       NA        1        2
## 1:   &amp;lt;NA&amp;gt; 52164314 27188805 24975509
## 2:    EWR 18250178  9524521  8725657
## 3:    JFK 21636643 11304774 10331869
## 4:    LGA 12277493  6359510  5917983&lt;/code&gt;&lt;/pre&gt;
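&lt;p&gt;In the wide outputs above, the totals sit in the row and column labelled &lt;code&gt;NA&lt;/code&gt;. A minimal sketch of giving them a readable label before reshaping could look as follows. The toy data and the “Total” label are hypothetical, and the approach assumes that the dimension columns contain no genuine missing values, as those would be relabelled too:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

# Hypothetical toy data standing in for the flights table
toy &amp;lt;- data.table(
  month = c(1L, 1L, 2L, 2L),
  origin = c(&amp;quot;EWR&amp;quot;, &amp;quot;JFK&amp;quot;, &amp;quot;EWR&amp;quot;, &amp;quot;JFK&amp;quot;),
  distance = c(100, 200, 300, 400)
)
cubed_toy &amp;lt;- cube(
  toy,
  .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;)
)

# month is an integer column, convert it so it can hold a text label
cubed_toy[, month := as.character(month)]

# NA marks an aggregated dimension here, so relabel it
cubed_toy[is.na(month), month := &amp;quot;Total&amp;quot;]
cubed_toy[is.na(origin), origin := &amp;quot;Total&amp;quot;]

# Reshape to wide format as before
wide_toy &amp;lt;- dcast(cubed_toy, month ~ origin, value.var = &amp;quot;distance&amp;quot;)
wide_toy&lt;/code&gt;&lt;/pre&gt;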
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r912-01-datatable-pivot.gif&#34; alt=&#34;Pivot table with data.table&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Pivot table with data.table&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-more-dimensions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using more dimensions&lt;/h2&gt;
&lt;p&gt;We can use the same approach to create summaries with more than two dimensions. For example, apart from months and origins, we can also look at carriers, simply by adding &lt;code&gt;&amp;quot;carrier&amp;quot;&lt;/code&gt; into the &lt;code&gt;by&lt;/code&gt; argument:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# With 3 dimensions:
cubed2 &amp;lt;- cube(
  flights, 
  .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;)
)
cubed2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      month origin carrier distance
##   1:     1    EWR      UA  5084378
##   2:     1    LGA      UA   729667
##   3:     1    JFK      AA  2013434
##   4:     1    JFK      B6  3672655
##   5:     1    LGA      DL  1678965
##  ---                              
## 153:    NA   &amp;lt;NA&amp;gt;      F9   174960
## 154:    NA   &amp;lt;NA&amp;gt;      HA   293997
## 155:    NA   &amp;lt;NA&amp;gt;      YV    21526
## 156:    NA   &amp;lt;NA&amp;gt;      OO      733
## 157:    NA   &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt; 52164314&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then &lt;code&gt;dcast()&lt;/code&gt; to the wide format that suits our needs best:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# For example, with month and carrier in rows, origins in columns:
dcast(cubed2, month + carrier ~ origin,  value.var = &amp;quot;distance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     month carrier       NA      EWR      JFK      LGA
##  1:    NA    &amp;lt;NA&amp;gt; 52164314 18250178 21636643 12277493
##  2:    NA      9E  1431961    88706  1271194    72061
##  3:    NA      AA  7171819   789591  3830482  2551746
##  4:    NA      AS   283436   283436       NA       NA
##  5:    NA      B6  9036256   940582  7062702  1032972
##  6:    NA      DL  8729015   465275  4963047  3300693
##  7:    NA      EV  4188259  3940295    48792   199172
##  8:    NA      F9   174960       NA       NA   174960
##  9:    NA      FL   431194       NA       NA   431194
## 10:    NA      HA   293997       NA   293997       NA
## 11:    NA      MQ  2439609   293352   425390  1720867
## 12:    NA      OO      733       NA       NA      733
## 13:    NA      UA 13016872  9770500  1834968  1411404
## 14:    NA      US  1677108   641427   442107   593574
## 15:    NA      VX  1463964       NA  1463964       NA
## 16:    NA      WN  1803605  1037014       NA   766591
## 17:    NA      YV    21526       NA       NA    21526
## 18:     1    &amp;lt;NA&amp;gt; 27188805  9524521 11304774  6359510
## 19:     1      9E   749305    46125   666109    37071
## 20:     1      AA  3773186   415707  2013434  1344045
## 21:     1      AS   148924   148924       NA       NA
## 22:     1      B6  4699834   484431  3672655   542748
## 23:     1      DL  4503241   245277  2578999  1678965
## 24:     1      EV  2178833  2067900    24624    86309
## 25:     1      F9    95580       NA       NA    95580
## 26:     1      FL   226658       NA       NA   226658
## 27:     1      HA   154473       NA   154473       NA
## 28:     1      MQ  1284653   152428   223510   908715
## 29:     1      OO      733       NA       NA      733
## 30:     1      UA  6777189  5084378   963144   729667
## 31:     1      US   858820   339595   219387   299838
## 32:     1      VX   788439       NA   788439       NA
## 33:     1      WN   938403   539756       NA   398647
## 34:     1      YV    10534       NA       NA    10534
## 35:     2    &amp;lt;NA&amp;gt; 24975509  8725657 10331869  5917983
## 36:     2      9E   682656    42581   605085    34990
## 37:     2      AA  3398633   373884  1817048  1207701
## 38:     2      AS   134512   134512       NA       NA
## 39:     2      B6  4336422   456151  3390047   490224
## 40:     2      DL  4225774   219998  2384048  1621728
## 41:     2      EV  2009426  1872395    24168   112863
## 42:     2      F9    79380       NA       NA    79380
## 43:     2      FL   204536       NA       NA   204536
## 44:     2      HA   139524       NA   139524       NA
## 45:     2      MQ  1154956   140924   201880   812152
## 46:     2      UA  6239683  4686122   871824   681737
## 47:     2      US   818288   301832   222720   293736
## 48:     2      VX   675525       NA   675525       NA
## 49:     2      WN   865202   497258       NA   367944
## 50:     2      YV    10992       NA       NA    10992
##     month carrier       NA      EWR      JFK      LGA&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;custom-grouping-sets&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Custom grouping sets&lt;/h1&gt;
&lt;p&gt;So far we have focused on the “default” pivot table shapes with all sub-totals and a grand total, however the &lt;code&gt;cube()&lt;/code&gt; function could be considered just a useful special case shortcut for a more generic concept - grouping sets. You can read more on grouping sets with &lt;a href=&#34;https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/bb522495(v%3dsql.105)&#34;&gt;MS SQL Server&lt;/a&gt; or with &lt;a href=&#34;https://www.postgresql.org/docs/devel/queries-table-expressions.html#QUERIES-GROUPING-SETS&#34;&gt;PostgreSQL&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;groupingsets()&lt;/code&gt; function allows us to create sub-totals on arbitrary groups of dimensions. Custom subtotals are defined by the &lt;code&gt;sets&lt;/code&gt; argument, a list of character vectors, each of them defining one subtotal. Now let us have a look at a few practical examples:&lt;/p&gt;
&lt;div id=&#34;replicate-a-simple-group-by-without-any-subtotals-or-grand-total&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Replicate a simple group by, without any subtotals or grand total&lt;/h2&gt;
&lt;p&gt;For reference, to replicate a simple group by with grouping sets, we could use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
  sets = list(c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;))
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which would give the same results as&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;flights[, .(distance = sum(distance)), by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
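&lt;p&gt;This equivalence can be checked on a small table. The toy data below is hypothetical and merely stands in for the flights data, and the sketch assumes a data.table version that provides &lt;code&gt;groupingsets()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

# Hypothetical toy data standing in for the flights table
toy &amp;lt;- data.table(
  month = c(1L, 1L, 2L, 2L, 2L),
  origin = c(&amp;quot;EWR&amp;quot;, &amp;quot;JFK&amp;quot;, &amp;quot;EWR&amp;quot;, &amp;quot;EWR&amp;quot;, &amp;quot;JFK&amp;quot;),
  carrier = c(&amp;quot;UA&amp;quot;, &amp;quot;AA&amp;quot;, &amp;quot;UA&amp;quot;, &amp;quot;UA&amp;quot;, &amp;quot;B6&amp;quot;),
  distance = c(100, 200, 300, 400, 500)
)

# A single grouping set that contains all the dimensions
via_sets &amp;lt;- groupingsets(
  toy,
  j = .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
  sets = list(c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;))
)

# The plain group by on the same dimensions
via_by &amp;lt;- toy[
  ,
  .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;)
]

# Compare the two results, ignoring the order of rows
all.equal(via_sets, via_by, ignore.row.order = TRUE)&lt;/code&gt;&lt;/pre&gt;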
&lt;/div&gt;
&lt;div id=&#34;custom-subtotals&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Custom subtotals&lt;/h2&gt;
&lt;p&gt;To give only the subtotals for each of the dimensions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
  sets = list(
    c(&amp;quot;month&amp;quot;),
    c(&amp;quot;origin&amp;quot;),
    c(&amp;quot;carrier&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     month origin carrier distance
##  1:     1   &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt; 27188805
##  2:     2   &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt; 24975509
##  3:    NA    EWR    &amp;lt;NA&amp;gt; 18250178
##  4:    NA    LGA    &amp;lt;NA&amp;gt; 12277493
##  5:    NA    JFK    &amp;lt;NA&amp;gt; 21636643
##  6:    NA   &amp;lt;NA&amp;gt;      UA 13016872
##  7:    NA   &amp;lt;NA&amp;gt;      AA  7171819
##  8:    NA   &amp;lt;NA&amp;gt;      B6  9036256
##  9:    NA   &amp;lt;NA&amp;gt;      DL  8729015
## 10:    NA   &amp;lt;NA&amp;gt;      EV  4188259
## 11:    NA   &amp;lt;NA&amp;gt;      MQ  2439609
## 12:    NA   &amp;lt;NA&amp;gt;      US  1677108
## 13:    NA   &amp;lt;NA&amp;gt;      WN  1803605
## 14:    NA   &amp;lt;NA&amp;gt;      VX  1463964
## 15:    NA   &amp;lt;NA&amp;gt;      FL   431194
## 16:    NA   &amp;lt;NA&amp;gt;      AS   283436
## 17:    NA   &amp;lt;NA&amp;gt;      9E  1431961
## 18:    NA   &amp;lt;NA&amp;gt;      F9   174960
## 19:    NA   &amp;lt;NA&amp;gt;      HA   293997
## 20:    NA   &amp;lt;NA&amp;gt;      YV    21526
## 21:    NA   &amp;lt;NA&amp;gt;      OO      733
##     month origin carrier distance&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To give only the subtotals per combinations of 2 dimensions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
  sets = list(
    c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;),
    c(&amp;quot;month&amp;quot;, &amp;quot;carrier&amp;quot;),
    c(&amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     month origin carrier distance
##  1:     1    EWR    &amp;lt;NA&amp;gt;  9524521
##  2:     1    LGA    &amp;lt;NA&amp;gt;  6359510
##  3:     1    JFK    &amp;lt;NA&amp;gt; 11304774
##  4:     2    EWR    &amp;lt;NA&amp;gt;  8725657
##  5:     2    LGA    &amp;lt;NA&amp;gt;  5917983
##  6:     2    JFK    &amp;lt;NA&amp;gt; 10331869
##  7:     1   &amp;lt;NA&amp;gt;      UA  6777189
##  8:     1   &amp;lt;NA&amp;gt;      AA  3773186
##  9:     1   &amp;lt;NA&amp;gt;      B6  4699834
## 10:     1   &amp;lt;NA&amp;gt;      DL  4503241
## 11:     1   &amp;lt;NA&amp;gt;      EV  2178833
## 12:     1   &amp;lt;NA&amp;gt;      MQ  1284653
## 13:     1   &amp;lt;NA&amp;gt;      US   858820
## 14:     1   &amp;lt;NA&amp;gt;      WN   938403
## 15:     1   &amp;lt;NA&amp;gt;      VX   788439
## 16:     1   &amp;lt;NA&amp;gt;      FL   226658
## 17:     1   &amp;lt;NA&amp;gt;      AS   148924
## 18:     1   &amp;lt;NA&amp;gt;      9E   749305
## 19:     1   &amp;lt;NA&amp;gt;      F9    95580
## 20:     1   &amp;lt;NA&amp;gt;      HA   154473
## 21:     1   &amp;lt;NA&amp;gt;      YV    10534
## 22:     1   &amp;lt;NA&amp;gt;      OO      733
## 23:     2   &amp;lt;NA&amp;gt;      US   818288
## 24:     2   &amp;lt;NA&amp;gt;      UA  6239683
## 25:     2   &amp;lt;NA&amp;gt;      B6  4336422
## 26:     2   &amp;lt;NA&amp;gt;      AA  3398633
## 27:     2   &amp;lt;NA&amp;gt;      EV  2009426
## 28:     2   &amp;lt;NA&amp;gt;      FL   204536
## 29:     2   &amp;lt;NA&amp;gt;      MQ  1154956
## 30:     2   &amp;lt;NA&amp;gt;      DL  4225774
## 31:     2   &amp;lt;NA&amp;gt;      WN   865202
## 32:     2   &amp;lt;NA&amp;gt;      9E   682656
## 33:     2   &amp;lt;NA&amp;gt;      VX   675525
## 34:     2   &amp;lt;NA&amp;gt;      AS   134512
## 35:     2   &amp;lt;NA&amp;gt;      F9    79380
## 36:     2   &amp;lt;NA&amp;gt;      HA   139524
## 37:     2   &amp;lt;NA&amp;gt;      YV    10992
## 38:    NA    EWR      UA  9770500
## 39:    NA    LGA      UA  1411404
## 40:    NA    JFK      AA  3830482
## 41:    NA    JFK      B6  7062702
## 42:    NA    LGA      DL  3300693
## 43:    NA    EWR      B6   940582
## 44:    NA    LGA      EV   199172
## 45:    NA    LGA      AA  2551746
## 46:    NA    JFK      UA  1834968
## 47:    NA    LGA      B6  1032972
## 48:    NA    LGA      MQ  1720867
## 49:    NA    EWR      AA   789591
## 50:    NA    JFK      DL  4963047
## 51:    NA    EWR      MQ   293352
## 52:    NA    EWR      DL   465275
## 53:    NA    EWR      US   641427
## 54:    NA    EWR      EV  3940295
## 55:    NA    JFK      US   442107
## 56:    NA    LGA      WN   766591
## 57:    NA    JFK      VX  1463964
## 58:    NA    LGA      FL   431194
## 59:    NA    EWR      AS   283436
## 60:    NA    LGA      US   593574
## 61:    NA    JFK      MQ   425390
## 62:    NA    JFK      9E  1271194
## 63:    NA    LGA      F9   174960
## 64:    NA    EWR      WN  1037014
## 65:    NA    JFK      HA   293997
## 66:    NA    JFK      EV    48792
## 67:    NA    EWR      9E    88706
## 68:    NA    LGA      9E    72061
## 69:    NA    LGA      YV    21526
## 70:    NA    LGA      OO      733
##     month origin carrier distance&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;grand-total&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Grand total&lt;/h2&gt;
&lt;p&gt;To give only the grand total:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
  sets = list(
    character(0)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    month origin carrier distance
## 1:    NA   &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt; 52164314&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;cube-and-rollup-as-special-cases-of-grouping-sets&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Cube and rollup as special cases of grouping sets&lt;/h1&gt;
&lt;div id=&#34;implementation-of-cube&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Implementation of cube&lt;/h2&gt;
&lt;p&gt;We mentioned above that &lt;code&gt;cube()&lt;/code&gt; can be considered just a shortcut to a useful special case of &lt;code&gt;groupingsets()&lt;/code&gt;. Indeed, looking at the implementation of the data.table method &lt;code&gt;data.table:::cube.data.table&lt;/code&gt;, most of what it does is define &lt;code&gt;sets&lt;/code&gt; to represent the vector passed to &lt;code&gt;by&lt;/code&gt; and all of its possible subsets, and pass that to &lt;code&gt;groupingsets()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;function (x, j, by, .SDcols, id = FALSE, ...) {
  if (!is.data.table(x)) 
    stop(&amp;quot;Argument &amp;#39;x&amp;#39; must be a data.table object&amp;quot;)
  if (!is.character(by)) 
    stop(&amp;quot;Argument &amp;#39;by&amp;#39; must be a character vector of column names used in grouping.&amp;quot;)
  if (!is.logical(id)) 
    stop(&amp;quot;Argument &amp;#39;id&amp;#39; must be a logical scalar.&amp;quot;)
  n = length(by)
  keepBool = sapply(2L^(seq_len(n) - 1L), function(k) rep(c(FALSE, 
    TRUE), times = k, each = ((2L^n)/(2L * k))))
  sets = lapply((2L^n):1L, function(j) by[keepBool[j, ]])
  jj = substitute(j)
  groupingsets.data.table(x, by = by, sets = sets, .SDcols = .SDcols, 
    id = id, jj = jj)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means for example that&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cube(flights, sum(distance),  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      month origin carrier       V1
##   1:     1    EWR      UA  5084378
##   2:     1    LGA      UA   729667
##   3:     1    JFK      AA  2013434
##   4:     1    JFK      B6  3672655
##   5:     1    LGA      DL  1678965
##  ---                              
## 153:    NA   &amp;lt;NA&amp;gt;      F9   174960
## 154:    NA   &amp;lt;NA&amp;gt;      HA   293997
## 155:    NA   &amp;lt;NA&amp;gt;      YV    21526
## 156:    NA   &amp;lt;NA&amp;gt;      OO      733
## 157:    NA   &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt; 52164314&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Is equivalent to&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
  sets = list(
    c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
    c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;),
    c(&amp;quot;month&amp;quot;, &amp;quot;carrier&amp;quot;),
    c(&amp;quot;month&amp;quot;),
    c(&amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
    c(&amp;quot;origin&amp;quot;),
    c(&amp;quot;carrier&amp;quot;),
    character(0)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      month origin carrier distance
##   1:     1    EWR      UA  5084378
##   2:     1    LGA      UA   729667
##   3:     1    JFK      AA  2013434
##   4:     1    JFK      B6  3672655
##   5:     1    LGA      DL  1678965
##  ---                              
## 153:    NA   &amp;lt;NA&amp;gt;      F9   174960
## 154:    NA   &amp;lt;NA&amp;gt;      HA   293997
## 155:    NA   &amp;lt;NA&amp;gt;      YV    21526
## 156:    NA   &amp;lt;NA&amp;gt;      OO      733
## 157:    NA   &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt; 52164314&lt;/code&gt;&lt;/pre&gt;
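&lt;p&gt;To see where these sets come from, the two lines of the implementation that build &lt;code&gt;sets&lt;/code&gt; can be run on their own. The following is just a sketch in base R that assumes the &lt;code&gt;by&lt;/code&gt; vector from our example as input; the variable names are copied from the method body shown above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;by &amp;lt;- c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;)
n &amp;lt;- length(by)

# One logical column per element of `by`, one row per subset (2^n rows)
keepBool &amp;lt;- sapply(
  2L^(seq_len(n) - 1L),
  function(k) rep(c(FALSE, TRUE), times = k, each = ((2L^n) / (2L * k)))
)

# Walk the rows from last to first, so that the full vector comes first
# and the empty vector (the grand total) comes last
sets &amp;lt;- lapply((2L^n):1L, function(j) by[keepBool[j, ]])
str(sets)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this gives a list of eight character vectors, in the same order as the sets written out by hand in the &lt;code&gt;groupingsets()&lt;/code&gt; call above.&lt;/p&gt;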
&lt;/div&gt;
&lt;div id=&#34;implementation-of-rollup&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Implementation of rollup&lt;/h2&gt;
&lt;p&gt;The same can be said about &lt;code&gt;rollup()&lt;/code&gt;, another shortcut that can be useful. Instead of all possible subsets, it will create a list representing the vector passed to &lt;code&gt;by&lt;/code&gt; and its subsets “from right to left”, including the empty vector to get a grand total. Looking at the implementation of the data.table method &lt;code&gt;data.table:::rollup.data.table&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;function (x, j, by, .SDcols, id = FALSE, ...) {
  if (!is.data.table(x)) 
    stop(&amp;quot;Argument &amp;#39;x&amp;#39; must be a data.table object&amp;quot;)
  if (!is.character(by)) 
    stop(&amp;quot;Argument &amp;#39;by&amp;#39; must be a character vector of column names used in grouping.&amp;quot;)
  if (!is.logical(id)) 
    stop(&amp;quot;Argument &amp;#39;id&amp;#39; must be a logical scalar.&amp;quot;)
  sets = lapply(length(by):0L, function(i) by[0L:i])
  jj = substitute(j)
  groupingsets.data.table(x, by = by, sets = sets, .SDcols = .SDcols, 
    id = id, jj = jj)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rollup(flights, sum(distance),  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Is equivalent to&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;groupingsets(
  flights,
  j = .(distance = sum(distance)),
  by = c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
  sets = list(
    c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;),
    c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;),
    c(&amp;quot;month&amp;quot;),
    character(0)
  )
)&lt;/code&gt;&lt;/pre&gt;
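&lt;p&gt;As with &lt;code&gt;cube()&lt;/code&gt;, the line that builds &lt;code&gt;sets&lt;/code&gt; can be tried in isolation. This is only a small base R sketch that assumes the &lt;code&gt;by&lt;/code&gt; vector from our example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;by &amp;lt;- c(&amp;quot;month&amp;quot;, &amp;quot;origin&amp;quot;, &amp;quot;carrier&amp;quot;)

# by[0L:i] keeps the first i elements; the zero index is silently dropped,
# so i = 0 yields an empty character vector (the grand total)
sets &amp;lt;- lapply(length(by):0L, function(i) by[0L:i])
str(sets)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is a list of four character vectors that drops one column at a time from the right, matching the &lt;code&gt;sets&lt;/code&gt; written out by hand above.&lt;/p&gt;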
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.postgresql.org/docs/devel/queries-table-expressions.html#QUERIES-GROUPING-SETS&#34;&gt;Grouping sets, cube and rollup&lt;/a&gt; in PostgreSQL&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/bb522495(v%3dsql.105)&#34;&gt;MS SQL Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;And in &lt;a href=&#34;https://oracle-base.com/articles/misc/rollup-cube-grouping-functions-and-grouping-sets#grouping_sets&#34;&gt;Oracle&lt;/a&gt; documentation&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html&#34;&gt;Introduction to data.table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cloud.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html&#34;&gt;Efficient reshaping using data.tables&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Verbose data.table and uncovering hidden cedta&#39;s data table awareness decisions</title>
      <link>https://jozef.io/r911-datatable-cedta/</link>
      <pubDate>Sat, 16 Feb 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r911-datatable-cedta/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;When speed and memory efficiency are important, the data.table package is one of the ways to improve those aspects of our R code dramatically. Including data.table in a package also comes with the added benefit of only importing the methods package, which is part of base R. We must also however pay attention to correctly importing and using methods, as data.table handles data.frame subsetting operators in a special way. This post is mostly a lesson learned for my future self on how I did not pay attention and what I found out while investigating.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr-if-you-just-want-something-useful&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR if you just want something useful&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;options(datatable.verbose = TRUE)&lt;/code&gt; to see useful logging information&lt;/li&gt;
&lt;li&gt;If you are getting weird errors with subset methods, check whether data frame methods get called instead of the data table ones (e.g. by running &lt;code&gt;traceback()&lt;/code&gt; after the error occurs)&lt;/li&gt;
&lt;li&gt;If so, check if &lt;code&gt;data.table:::cedta()&lt;/code&gt; returns &lt;code&gt;FALSE&lt;/code&gt; for your package. And if it does, check if you import data.table in the NAMESPACE file of your package&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;a-somewhat-reproducible-example-of-the-issue&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A somewhat reproducible example of the issue&lt;/h1&gt;
&lt;p&gt;Imagine a very simple function that takes a data table and sums a column with a name provided via the &lt;code&gt;y&lt;/code&gt; argument, grouped by the column name provided via the &lt;code&gt;by&lt;/code&gt; argument. An oversimplified definition and example use with the &lt;code&gt;mtcars&lt;/code&gt; dataset could look as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sumData &amp;lt;- function(dt, y, by) dt[, sum(get(y)), by = by]

mtcarsdt &amp;lt;- data.table::as.data.table(datasets::mtcars)
sumData(mtcarsdt, &amp;quot;disp&amp;quot;, &amp;quot;gear&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    gear     V1
## 1:    4 1476.2
## 2:    3 4894.5
## 3:    5 1012.4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So far so good, everything works great. Now we put our awesome function into a nice package called &lt;code&gt;dtexample&lt;/code&gt;. Add some roxygen documentation, add data.table into Imports in our DESCRIPTION, try to install our package. All still works. Run R CMD check for good measure and get 0 errors, 0 warnings and 0 notes, like a boss!&lt;/p&gt;
&lt;p&gt;Now let’s see our function in action, from within the new package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dtexample::sumData(mtcarsdt, &amp;quot;disp&amp;quot;, &amp;quot;gear&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Error in get(y) : object &amp;#39;disp&amp;#39; not found &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Oops. Something went wrong. Debugging such an issue can be tricky, especially if this happened in a more realistic setting, such as writing the function across multiple days and having a more complicated function than a one-liner. Most often the issue is inside the actual code, especially when passing around more complicated quoted expressions into data table’s subsetting machinery.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;traceback-and-datatable.verbose-to-the-rescue&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Traceback and datatable.verbose to the rescue&lt;/h1&gt;
&lt;p&gt;Let us look at the &lt;code&gt;traceback()&lt;/code&gt; to get some insight into what is going on:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;traceback()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 5: get(y)
## 4: `[.data.frame`(x, i, j)
## 3: `[.data.table`(dt, , sum(get(y)), by = by) at sumData.R#12
## 2: dt[, sum(get(y)), by = by] at sumData.R#12
## 1: dtexample::sumData(dt, &amp;quot;disp&amp;quot;, &amp;quot;gear&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note line &lt;code&gt;4:&lt;/code&gt; of the traceback: despite the object being a data table (which is also confirmed by the third line of the traceback), the data frame method was called. It would also seem that this was deliberate on data table’s side. Let us turn on the &lt;code&gt;datatable.verbose&lt;/code&gt; option and see what it has to say:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(datatable.verbose = TRUE)
dtexample::sumData(mtcarsdt, &amp;quot;disp&amp;quot;, &amp;quot;gear&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## cedta decided &amp;#39;dtexample&amp;#39; wasn&amp;#39;t data.table aware. Here is call stack with [[1L]] applied:
## [[1]]
## dtexample::sumData
## 
## [[2]]
## `[`
## 
## [[3]]
## `[.data.table`
## 
## [[4]]
## cedta&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r911-01-datatable-cedta.gif&#34; alt=&#34;Traceback and cedta()&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Traceback and cedta()&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;so-what-is-this-cedta&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;So what is this &lt;code&gt;cedta()&lt;/code&gt;?&lt;/h1&gt;
&lt;p&gt;Looking at data table’s verbose output, we immediately notice this message:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;cedta decided ‘dtexample’ wasn’t data.table aware. Here is call stack with &lt;code&gt;[[1L]]&lt;/code&gt; applied:&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, what is this &lt;code&gt;cedta()&lt;/code&gt; and why is it making such decisions? Let us look at how we get from subsetting a data table to a function deciding that our package is not data table aware. Examining the first rows of the body of &lt;code&gt;data.table:::[.data.table&lt;/code&gt; we can see that the subset method first examines the output of &lt;code&gt;cedta()&lt;/code&gt; and, if its result is &lt;code&gt;FALSE&lt;/code&gt;, calls the data frame methods. This answers our question of why a data frame method was called:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;  if (!cedta()) {
    Nargs = nargs() - (!missing(drop))
    ans = if (Nargs &amp;lt; 3L) {
      `[.data.frame`(x, i)
    }
    else if (missing(drop)) 
      `[.data.frame`(x, i, j)
    else `[.data.frame`(x, i, j, drop)
    if (!missing(i) &amp;amp; is.data.table(ans)) 
      setkey(ans, NULL)
    return(ans)
  }&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now looking into &lt;code&gt;data.table:::cedta()&lt;/code&gt; itself we see that in case &lt;code&gt;topenv(parent.frame(n))&lt;/code&gt; is not a namespace, &lt;code&gt;cedta()&lt;/code&gt; happily returns &lt;code&gt;TRUE&lt;/code&gt;. This explains why our function worked when it was defined and run from the global environment. However, in case we are in the context of a namespace, our namespace must satisfy at least one of eight conditions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;  ans = nsname == &amp;quot;data.table&amp;quot; || 
  &amp;quot;data.table&amp;quot; %chin% names(getNamespaceImports(ns)) ||
  (nsname == &amp;quot;utils&amp;quot; &amp;amp;&amp;amp; exists(
    &amp;quot;debugger.look&amp;quot;,
    parent.frame(n + 1L)
  )) ||
  (nsname == &amp;quot;base&amp;quot; &amp;amp;&amp;amp; all(c(&amp;quot;FUN&amp;quot;, &amp;quot;X&amp;quot;) %chin% ls(parent.frame(n)))) ||
  (nsname %chin% cedta.pkgEvalsUserCode &amp;amp;&amp;amp; any(
    sapply(sys.calls(), function(x)
      is.name(x[[1L]]) &amp;amp;&amp;amp; (x[[1L]] == &amp;quot;eval&amp;quot; || x[[1L]] == &amp;quot;evalq&amp;quot;))
    )
  ) ||
  nsname %chin% cedta.override ||
  isTRUE(ns$.datatable.aware) ||
  tryCatch(
    &amp;quot;data.table&amp;quot; %chin% get(
      &amp;quot;.Depends&amp;quot;,
      paste(&amp;quot;package&amp;quot;, nsname, sep = &amp;quot;:&amp;quot;),
      inherits = FALSE
    ), error = function(e) FALSE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Out of which the most relevant for us is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;&amp;quot;data.table&amp;quot; %chin% names(getNamespaceImports(ns))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When I first saw this, I was like (probably more than 50% of the sentence self-censored):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No way. I could not possibly be so stupid to forget to import data table in the NAMESPACE! &lt;em&gt;(… of course I could)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, about a minute later, place &lt;code&gt;@import data.table&lt;/code&gt; into the roxygen tags, regenerate the NAMESPACE, re-install the package and all works great.&lt;/p&gt;
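&lt;p&gt;For illustration, here is a minimal sketch of what the fixed function definition could look like, together with a small helper that checks the condition quoted above for any installed package. The helper name &lt;code&gt;importsDataTable&lt;/code&gt; and the roxygen title are made up for this example and are not part of data.table; the helper uses base R’s &lt;code&gt;%in%&lt;/code&gt; in place of &lt;code&gt;%chin%&lt;/code&gt; and assumes the namespace of the given package can be loaded:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#&amp;#39; Sum a column by group
#&amp;#39; @import data.table
#&amp;#39; @export
sumData &amp;lt;- function(dt, y, by) dt[, sum(get(y)), by = by]

# Hypothetical helper: does the namespace of `pkg` import data.table?
importsDataTable &amp;lt;- function(pkg) {
  &amp;quot;data.table&amp;quot; %in% names(getNamespaceImports(asNamespace(pkg)))
}

# The stats package, for example, is not expected to import data.table
importsDataTable(&amp;quot;stats&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After regenerating the documentation, the NAMESPACE file should contain the line &lt;code&gt;import(data.table)&lt;/code&gt;, and &lt;code&gt;importsDataTable(&amp;quot;dtexample&amp;quot;)&lt;/code&gt; should return &lt;code&gt;TRUE&lt;/code&gt; for the re-installed package.&lt;/p&gt;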
&lt;/div&gt;
&lt;div id=&#34;how-could-i-possibly-fail-to-import-anything-from-data.table-and-find-out-earlier&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How could I possibly fail to import anything from data.table and find out earlier?&lt;/h1&gt;
&lt;p&gt;I think the reason (apart from plain forgetting the obvious) is a combination of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the subsetting operator is such second nature, that it just did not occur to me to import it with the &lt;code&gt;@importFrom&lt;/code&gt; tag and I rarely use &lt;code&gt;@import&lt;/code&gt; on entire packages&lt;/li&gt;
&lt;li&gt;&lt;code&gt;R CMD check&lt;/code&gt; was successful with no notes, warnings or errors, again because even though I usually use qualified calls relatively strictly, qualifying the subsetting operator would seem very unnatural. There was therefore no mention of &lt;code&gt;data.table::&lt;/code&gt; in the entire code and the checking procedure had nothing to complain about&lt;/li&gt;
&lt;li&gt;the data table method actually did dispatch correctly, so only after a closer look we see the data frame method kicking in. The first thing to investigate (most of the time correctly) is the actual implementation of what is going on with the expressions inside the subsetting operator, especially when passing around and evaluating quoted expressions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, if you ever see &lt;code&gt;cedta()&lt;/code&gt; making decisions about data table awareness, check your NAMESPACE. Maybe you have just missed the obvious as I did. Happy data tabling!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>R Markdown: 3 sources of reproducibility issues and options how to tackle them</title>
      <link>https://jozef.io/r910-rmarkdown-reproducibility/</link>
      <pubDate>Sat, 02 Feb 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r910-rmarkdown-reproducibility/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;R Markdown is a great tool to use for creating reports, presentations and even websites that contain evaluated and rendered code. This can help us immensely when presenting data science type of work to audiences, while still being able to version control the content creation process.&lt;/p&gt;
&lt;p&gt;One of the challenges that remain is the reproducibility of the rendered results. In this post, I will list a few sources of reproducibility issues I came across and how I tried to solve them. As an introductory disclaimer, this post is not an exhaustive guide but merely a retrospective on the issues I faced and how I tackled them when writing posts for this blog using blogdown.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For this post, we would consider an R Markdown document reproducible if we can be sure that it produces identical rendered output as long as the content of the Rmd document, the data used within it and the rendering function stay the same.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This sounds like a reasonable thing to ask, however, there are many ways in which this assumption can be broken. And they are not always trivial - that is, unless your name is Yihui Xie :)&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-lang=&#34;en&#34;&gt;
&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;
My guess is that you upgraded Pandoc first, saw the diffs, then updated the rmarkdown package (from a very old version), which now defaults the &lt;code&gt;html&lt;/code&gt; output format to &lt;code&gt;html4&lt;/code&gt; (which generates &lt;code&gt;&amp;amp;lt;div&amp;amp;gt;&lt;/code&gt;) instead of Pandoc&#39;s default &lt;code&gt;html5&lt;/code&gt; (which generates &lt;code&gt;&amp;amp;lt;section&amp;amp;gt;&lt;/code&gt;).
&lt;/p&gt;
— Yihui Xie (&lt;span class=&#34;citation&#34;&gt;@xieyihui&lt;/span&gt;) &lt;a href=&#34;https://twitter.com/xieyihui/status/1090693630476595200?ref_src=twsrc%5Etfw&#34;&gt;January 30, 2019&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;p&gt;We will try to categorize some of the reasons into groups.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#output-changes-caused-by-code-chunks-not-behaving-reproducibly&#34;&gt;Output changes caused by code chunks not behaving reproducibly&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#simple-examples-that-showcase-the-issue&#34;&gt;Simple examples that showcase the issue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#solution-1---remove-output-change-source-from-the-chunks&#34;&gt;Solution 1 - Remove output change source from the chunks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#solution-2---run-the-code-once-and-store-the-results&#34;&gt;Solution 2 - Run the code once and store the results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#output-changes-caused-by-different-package-versions&#34;&gt;Output changes caused by different package versions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#solution---package-version-management-e.g.with-packrat&#34;&gt;Solution - Package version management, e.g. with packrat&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#output-changes-caused-by-changed-system-dependencies&#34;&gt;Output changes caused by changed system dependencies&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#solution---to-containerization-and-beyond&#34;&gt;Solution - To containerization and beyond&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;output-changes-caused-by-code-chunks-not-behaving-reproducibly&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Output changes caused by code chunks not behaving reproducibly&lt;/h1&gt;
&lt;p&gt;The first group is the one that we have full control over, as it directly relates to the content of the code chunks in our R Markdown document.&lt;/p&gt;
&lt;div id=&#34;simple-examples-that-showcase-the-issue&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Simple examples that showcase the issue&lt;/h2&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Obviously, the output can change each time we run this chunk:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;```{r}
Sys.time()
```&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Another scenario is code chunks that make use of random number generation. If we render an Rmd that contains this code chunk multiple times we will get different results, unless we take precautions:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;```{r}
runif(5)
```&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Running timing (benchmarking) code is almost certain to produce different results each time it is run, even though the benchmarked code is identical:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;```{r}
system.time(runif(1e6))
```&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;solution-1---remove-output-change-source-from-the-chunks&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Solution 1 - Remove output change source from the chunks&lt;/h2&gt;
&lt;p&gt;The most obvious and clean way to tackle the issue is to change our code such that the source of variability is removed. For random number generation, this can be achieved by setting a seed, e.g. using &lt;code&gt;set.seed()&lt;/code&gt;. This solution can get more complex as the scope increases - if you are interested in reading more on the topic of reproducibility with RNG, &lt;a href=&#34;https://yihui.name/knitr/demo/cache/#reproducibility-with-rng&#34;&gt;look here&lt;/a&gt;.&lt;/p&gt;
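&lt;p&gt;As a minimal sketch of the idea, resetting the seed to the same value before each draw makes the generated numbers repeat. The seed value 42 is arbitrary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
first &amp;lt;- runif(5)

set.seed(42)
second &amp;lt;- runif(5)

# Both draws started from the same RNG state
identical(first, second)&lt;/code&gt;&lt;/pre&gt;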
&lt;/div&gt;
&lt;div id=&#34;solution-2---run-the-code-once-and-store-the-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Solution 2 - Run the code once and store the results&lt;/h2&gt;
&lt;p&gt;For some code, such as benchmarking, fixing the code such that the output does not change is very difficult in principle, therefore we must find a workaround that would ensure the results stay untouched. One approach is to run the code once, store the results and not run the code again on render. Some ways to do that:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;h3 id=&#34;using-the-cachetrue-chunk-option&#34;&gt;Using the cache=TRUE chunk option&lt;/h3&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In practice, this can be done nicely by using the &lt;code&gt;cache=TRUE&lt;/code&gt; option, which provides this behavior and also makes sure that the cache is updated automatically when the code chunk changes, so the correctness of results is ensured. Exceptions to this exist for some special cases, read the details in the &lt;a href=&#34;https://github.com/yihui/knitr/releases/download/doc/knitr-manual.pdf&#34;&gt;knitr manual&lt;/a&gt; for a deeper understanding. Here is an example chunk using that option:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;```{r cache=TRUE}
system.time(runif(1e6))
```&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;h3 id=&#34;storing-a-needed-representation-of-the-object-directly-in-the-rmd&#34;&gt;Storing a needed representation of the object directly in the Rmd&lt;/h3&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;One property of knitr’s caching that could be considered a downside is that the cache storage uses binary files, which while being a completely natural choice is not the best for version control. Especially when this is a concern and the code chunk outputs are small in size, other options may also be considered.&lt;/p&gt;
&lt;p&gt;One such example would be to save a needed representation of the result directly in the Rmd and use &lt;code&gt;eval=FALSE&lt;/code&gt; on the code chunk. This comes with trade-offs too, notably we must pay attention to changes in the chunk, as there would be no automated update similar to the one provided by knitr’s cache mechanism.&lt;/p&gt;
&lt;p&gt;As an example, we could rewrite the chunk above into two chunks like so - the first chunk shows the code in the output without running it, the second makes sure that the result that we stored gets shown (but the code does not):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;```{r eval=FALSE}
system.time(runif(1e6))
```

```{r echo=FALSE}
# This is pre-calculated and just shown to keep the output static

structure(
  c(0.081, 0.003, 0.084, 0, 0),
  class = &amp;quot;proc_time&amp;quot;,
  .Names = c(&amp;quot;user.self&amp;quot;, &amp;quot;sys.self&amp;quot;, &amp;quot;elapsed&amp;quot;, &amp;quot;user.child&amp;quot;, &amp;quot;sys.child&amp;quot;)
)
```&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The content of the second chunk can be obtained by using &lt;code&gt;dput(system.time(runif(1e6)))&lt;/code&gt;. Naturally, this may become quite impractical to use with bigger objects.&lt;/p&gt;
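&lt;p&gt;As a minimal sketch of that round trip, assuming nothing beyond base R, we can capture the text that &lt;code&gt;dput()&lt;/code&gt; prints and evaluate it to check that it recreates an equivalent object. The variable names are only illustrative and the timings will differ on every machine:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;
```r
# Time the expression once; the actual numbers vary by run and machine
timing = system.time(runif(1e6))

# dput() prints R code that recreates the object, captured here as text
timing_code = paste(capture.output(dput(timing)), collapse = "\n")

# Evaluating that text returns an object of the same class and shape
restored = eval(parse(text = timing_code))
class(restored)
names(restored)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The text stored in &lt;code&gt;timing_code&lt;/code&gt; is what we would paste into the second chunk.&lt;/p&gt;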
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;h3 id=&#34;storing-the-rendered-output-directly-in-the-rmd&#34;&gt;Storing the rendered output directly in the Rmd&lt;/h3&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Another variation of the above would be to not even bother with obtaining the representation of the output via evaluating a code chunk, but just placing the rendered output itself into the Rmd directly. This comes with the same downsides as the previous approach with some extras. Mainly, we need to create a format-specific output, meaning this approach can be considered only if the output format is fixed and will not change.&lt;/p&gt;
&lt;p&gt;For example, we can be reasonably sure that for a blogdown website the output will be HTML. In case we are ok with all those trade-offs, we can place the following into our .Rmd to represent the above code chunks in HTML:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;&amp;lt;pre&amp;gt;&amp;lt;code&amp;gt;
##    user  system elapsed 
##   0.081   0.003   0.084
&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;output-changes-caused-by-different-package-versions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Output changes caused by different package versions&lt;/h1&gt;
&lt;p&gt;The tricky issues start when our Rmd content is actually reproducible under our current local setup, however the rendered output changes with a different setup, often with changed package versions. A concrete real-life example is when an update of the highcharter package slightly tweaks the output:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r910-01-updated-highchart.png&#34; alt=&#34;Slightly different highchart representation&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Slightly different highchart representation&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;solution---package-version-management-e.g.with-packrat&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Solution - Package version management, e.g. with packrat&lt;/h2&gt;
&lt;p&gt;Solving issues with package versions is a broad topic, so we will only mention one that is relatively easy to use, especially with RStudio - &lt;a href=&#34;https://rstudio.github.io/packrat/&#34;&gt;Packrat&lt;/a&gt;. Packrat is an R package that works by creating a separate library of packages on a project basis.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We can create separate R projects for Rmd files with shared package dependencies, or even a separate project per Rmd. This will ensure that we always have the intended set of packages loaded and used for them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As with all of the above, apart from the extra overhead of using it, this approach can have its caveats as well. Using packrat to manage a blogdown site or a bookdown book likely means that all of the site posts or book chapters will use a shared package library, which may not be granular enough for all use cases.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r910-02-using-packrat.png&#34; alt=&#34;Using packrat with a project&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Using packrat with a project&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Care also has to be taken to make sure that the packrat managed libraries are used when rendering the content, i.e. by ensuring that the &lt;code&gt;packrat/init.R&lt;/code&gt; is sourced before the rendering happens.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;output-changes-caused-by-changed-system-dependencies&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Output changes caused by changed system dependencies&lt;/h1&gt;
&lt;p&gt;Going into deeper circles of the dreaded &lt;a href=&#34;https://en.wikipedia.org/wiki/Dependency_hell&#34;&gt;dependency hell&lt;/a&gt;, even if we manage our R packages with care, the system dependencies can still cause behavior that would change our output in an unintended way. In the case of R Markdown and knitr the most notable dependency of this type is &lt;a href=&#34;https://pandoc.org/&#34;&gt;Pandoc&lt;/a&gt;, the powerhouse behind R Markdown.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r910-03-why-oh-why.jpg&#34; alt=&#34;Why would this happen?&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Why would this happen?&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A way this dependency can change is for example when updating RStudio Server, which comes bundled with a certain version of Pandoc:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RStudio Server v1.1.453 comes with Pandoc 1.19.2.1&lt;/li&gt;
&lt;li&gt;RStudio Server v1.2.1234 comes with Pandoc 2.3.1&lt;/li&gt;
&lt;li&gt;You may have an even newer version of Pandoc installed if you got it separately. As of the date of writing this, the latest stable release of Pandoc is 2.5&lt;/li&gt;
&lt;/ul&gt;
&lt;div id=&#34;solution---to-containerization-and-beyond&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Solution - To containerization and beyond&lt;/h2&gt;
&lt;p&gt;Solving system dependencies is a tricky task made easier by a few tools but it is way out of the scope of this post, so we will only briefly list some options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Containerize your environment with an implementation of &lt;a href=&#34;https://en.wikipedia.org/wiki/Operating-system-level_virtualization&#34;&gt;Operating-system level virtualization&lt;/a&gt; such as the ever so popular &lt;a href=&#34;https://en.wikipedia.org/wiki/Docker_(software)&#34;&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Create a predefined VM setup, for example using &lt;a href=&#34;https://en.wikipedia.org/wiki/Comparison_of_open-source_configuration_management_software&#34;&gt;configuration management software&lt;/a&gt; such as &lt;a href=&#34;https://en.wikipedia.org/wiki/Ansible_(software)&#34;&gt;Ansible&lt;/a&gt; or &lt;a href=&#34;https://en.wikipedia.org/wiki/Puppet_(software)&#34;&gt;Puppet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Create R Markdown reports and presentations even better with these 3 practical tips</title>
      <link>https://jozef.io/r909-rmarkdown-tips/</link>
      <pubDate>Sat, 19 Jan 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r909-rmarkdown-tips/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Including R Markdown in the workflow for presenting and publishing analyses that use code in R or other languages is a great way to make presentations, dashboards or reports good looking, reproducible and version controllable.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post, we will look at three simple ways to improve that workflow even further with methods that are lesser known and can make producing results with R Markdown more efficient and reviewing them more interactive.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#live-preview-of-r-markdown-files-with-xaringans-infinite_moon_reader&#34;&gt;Live preview of R Markdown files with xaringan’s infinite_moon_reader()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#creating-beautiful-multi-format-reports-directly-from-r-scripts&#34;&gt;Creating beautiful, multi format reports directly from R scripts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#advanced-chunk-options-with-useful-effects&#34;&gt;Advanced chunk options with useful effects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#resources&#34;&gt;Resources&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;live-preview-of-r-markdown-files-with-xaringans-infinite_moon_reader&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Live preview of R Markdown files with xaringan’s infinite_moon_reader()&lt;/h1&gt;
&lt;p&gt;If you are familiar with &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/notebook.html&#34;&gt;R notebooks&lt;/a&gt;, you probably know that as you edit the notebook in RStudio and save, the preview will automatically update in the RStudio viewer. Similarly for blogdown users, the &lt;code&gt;serve_site()&lt;/code&gt; function provides live updates of the blog as the content is edited and saved.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;However, if you are producing presentation slides or a more complex html report with R Markdown, you are stuck with re-knitting every time you want to see the updated content in action. Enter the &lt;code&gt;infinite_moon_reader()&lt;/code&gt; function from the &lt;a href=&#34;https://github.com/yihui/xaringan&#34;&gt;xaringan&lt;/a&gt; package.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Even though the xaringan package focuses on creating slides with the remark.js JavaScript library, this function works to provide a live preview with any single-file html output, be it a report, slides such as ioslides, a shiny document or another format.&lt;/p&gt;
&lt;p&gt;If using RStudio, all you need to do to get the live preview is call the function and the default values of the arguments will take care of launching the live preview of the document currently active in the RStudio editor. As if this was not handy enough, the package comes with a premade RStudio addin, so you can get the same functionality just clicking in the IDE, or assigning a keyboard shortcut to it.&lt;/p&gt;
&lt;div id=&#34;it-is-really-that-easy&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;It is really that easy&lt;/h2&gt;
&lt;p&gt;Here is how long it takes to get it up and running, package installation included:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r909-01-inf-mr.gif&#34; alt=&#34;Infinite moon reader&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Infinite moon reader&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Kind of obviously, this functionality can become a huge time saver, especially if you are tweaking the design of your slides and want to see the results quickly without the need to click/call knit over and over again.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-beautiful-multi-format-reports-directly-from-r-scripts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Creating beautiful, multi format reports directly from R scripts&lt;/h1&gt;
&lt;p&gt;When creating R Markdown documents, the workflow often looks something like the following:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Create a new .Rmd file, edit the metadata&lt;/li&gt;
&lt;li&gt;Write some content&lt;/li&gt;
&lt;li&gt;Add code chunks, test&lt;/li&gt;
&lt;li&gt;Write some more content&lt;/li&gt;
&lt;li&gt;Add some more code chunks, test&lt;/li&gt;
&lt;li&gt;Rinse and repeat&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This works, but when your goal is to first create functioning code that you can run as-is and share with others, creating an R Markdown file from such a script with that approach can become a time consuming and error-prone process of copy-pasting the code into code chunks and maintaining it in two places in case you want to also keep the runnable script version.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In comes knitr’s &lt;code&gt;spin()&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The solution to the above problem is very simple once you are aware of it. You can use knitr’s &lt;code&gt;spin()&lt;/code&gt; function to produce a beautiful report directly from an R script, with advanced formatting and options still being available - via formatted comments and the function’s parameters.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This way we can keep the script fully runnable as comments do not interfere with running the code, and still be able to produce that nice output R Markdown is known for.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;a-quick-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A quick example&lt;/h2&gt;
&lt;p&gt;A working example is worth more than explanations here, so here we go. Just copy the following, save for example into &lt;code&gt;script.R&lt;/code&gt; and run &lt;code&gt;knitr::spin(&amp;quot;script.R&amp;quot;)&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#&amp;#39; # This is just an R script
#&amp;#39; ## Rendered to a html report with knitr::spin()
#&amp;#39; * just by adding comments we can make a really nice output

#&amp;#39;
#&amp;#39; &amp;gt; And the code runs just like normal, eg. via `Rscript` after all
#&amp;#39; __comments__ are just *comments*.
#&amp;#39;
#&amp;#39; ## The report begins here
#+
knitr::kable(head(mtcars))

#&amp;#39; ## A chart
#+ fig.width=8, fig.height=8
heatmap(cor(mtcars))

#&amp;#39; ## Some tips
#&amp;#39;
#&amp;#39; 1. Optional chunk options are written using `#+`
#&amp;#39; 1. You can write comments between `/*` and `*/`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, the result will look something like the following:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r909-02-spinned.png&#34; alt=&#34;Spinned it right round&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Spinned it right round&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;compile-report-and-rmarkdowns-render-vs.knitrs-spin&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Compile Report and RMarkdown’s &lt;code&gt;render()&lt;/code&gt; vs. knitr’s &lt;code&gt;spin()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;We can achieve similar results in RStudio by clicking on &lt;code&gt;File -&amp;gt; Compile Report...&lt;/code&gt;, which is equivalent to using &lt;code&gt;rmarkdown::render()&lt;/code&gt; on an R script file. This will call &lt;code&gt;spin()&lt;/code&gt; and add some metadata such as title, author and time to the output.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;So why bother with &lt;code&gt;spin()&lt;/code&gt; at all?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The default behavior of the functions mentioned above differs in some important ways. One difference for HTML output is that &lt;code&gt;render()&lt;/code&gt; will by default include inline base64 representations of fonts and JavaScript sources, increasing the output file size from less than 20 KB to more than 600 KB even with the smallest amount of content.&lt;/p&gt;
&lt;p&gt;This is why I personally like to call &lt;code&gt;knitr::spin()&lt;/code&gt; to keep the output at smaller sizes by default, without having to dig into the options passed to pandoc.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Regardless of the technical details, being able to produce good looking reports directly from R scripts can save a lot of time and error-prone copying, while keeping the content and runnable code in one place, instead of copy-pasting into code chunks of an R Markdown file.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is of course not to say that R Markdown files are not useful. To the contrary, they are great for many use cases. However, if the content is mostly code with some accompanying text, using &lt;code&gt;spin()&lt;/code&gt; can come in really handy.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;advanced-chunk-options-with-useful-effects&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Advanced chunk options with useful effects&lt;/h1&gt;
&lt;p&gt;When working with R Markdown the code chunk options provide helpful modifications to the chunk code’s behavior. The simple and widely used chunk options such as the following are well known, we mention them for a quick reference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;eval=FALSE&lt;/code&gt; - do not evaluate the code in the chunk at all&lt;/li&gt;
&lt;li&gt;&lt;code&gt;echo=FALSE&lt;/code&gt; - do not show the chunk code in the output file&lt;/li&gt;
&lt;li&gt;&lt;code&gt;include=FALSE&lt;/code&gt; - run the code, but show neither the code nor its output in the output file&lt;/li&gt;
&lt;li&gt;&lt;code&gt;message=FALSE&lt;/code&gt; - do not show messages in the output file&lt;/li&gt;
&lt;li&gt;&lt;code&gt;warning=FALSE&lt;/code&gt; - do not show warnings in the output file&lt;/li&gt;
&lt;li&gt;&lt;code&gt;error=TRUE&lt;/code&gt; - do not prevent rendering on error and show error messages in the output&lt;/li&gt;
&lt;/ul&gt;
&lt;div id=&#34;resultsasis-to-keep-content-generated-by-a-chunk-unprocessed&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;code&gt;results=&#39;asis&#39;&lt;/code&gt; to keep content generated by a chunk unprocessed&lt;/h2&gt;
&lt;p&gt;Especially when producing HTML output it may be helpful to create functions that produce output we want to include directly in the rendered document without any processing, such as HTML code produced by a pre-made function.&lt;/p&gt;
&lt;p&gt;One example I use often is a function &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/develop/R/makeHighChart.R&#34;&gt;&lt;code&gt;makeHighChart()&lt;/code&gt;&lt;/a&gt; that creates a lightweight JavaScript representation of a chart created via highcharter from an R object. The output of that function is HTML code that should be placed as-is into the output, for which the &lt;code&gt;results=&#39;asis&#39;&lt;/code&gt; chunk option is made:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This chunk uses the results=&amp;#39;asis&amp;#39; option like so:
# ```{r results=&amp;#39;asis&amp;#39;}
# The result is an interactive chart:
jhaddins::makeHighChart(
  highcharter::hcboxplot(mtcars$hp),
  chartname = &amp;quot;examplechart&amp;quot;,
  docat = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#examplechart&#39;).highcharts({
  title: {     
    text: null     
  },     
  yAxis: {     
    title: {     
      text: null     
    }     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    }     
  },     
  chart: {     
    type: &#34;bar&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      name: null,     
      data: [     
        {     
          name: null,     
          low: 52,     
          q1: 96,     
          median: 123,     
          q3: 180,     
          high: 264     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null     
    },     
    {     
      name: null,     
      data: [     
        {     
          name: null,     
          y: 335     
        }     
      ],     
      type: &#34;scatter&#34;,     
      linkedTo: null     
    }     
  ]     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;examplechart&#34;&gt;

&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This one does not use the results option, it is just
# ```{r}
# The result is not very useful printed HTML:
jhaddins::makeHighChart(
  highcharter::hcboxplot(mtcars$mpg),
  chartname = &amp;quot;examplechart&amp;quot;,
  docat = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;script type=&amp;quot;text/javascript&amp;quot;&amp;gt;
## $(function () {
##   $(&amp;#39;#examplechart&amp;#39;).highcharts({
##   title: {     
##     text: null     
##   },     
##   yAxis: {     
##     title: {     
##       text: null     
##     }     
##   },     
##   credits: {     
##     enabled: false     
##   },     
##   exporting: {     
##     enabled: false     
##   },     
##   plotOptions: {     
##     series: {     
##       label: {     
##         enabled: false     
##       },     
##       turboThreshold: 0,     
##       marker: {     
##         symbol: &amp;quot;circle&amp;quot;     
##       },     
##       showInLegend: false     
##     },     
##     treemap: {     
##       layoutAlgorithm: &amp;quot;squarified&amp;quot;     
##     }     
##   },     
##   chart: {     
##     type: &amp;quot;bar&amp;quot;     
##   },     
##   xAxis: {     
##     type: &amp;quot;category&amp;quot;,     
##     categories: &amp;quot;&amp;quot;     
##   },     
##   series: [     
##     {     
##       name: null,     
##       data: [     
##         {     
##           name: null,     
##           low: 10.4,     
##           q1: 15.35,     
##           median: 19.2,     
##           q3: 22.8,     
##           high: 33.9     
##         }     
##       ],     
##       type: &amp;quot;boxplot&amp;quot;,     
##       id: null     
##     }     
##   ]     
## }     
##   );
## });
## &amp;lt;/script&amp;gt;
## 
## &amp;lt;div id=&amp;quot;examplechart&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;class.outputsome_css_class-to-format-chunk-output-with-custom-css&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;code&gt;class.output=&amp;quot;some_css_class&amp;quot;&lt;/code&gt; to format chunk output with custom css&lt;/h2&gt;
&lt;p&gt;For HTML output, we may want to style it with our own CSS. This option allows us to use defined CSS classes to style the output produced by that chunk. This can be very convenient if we want to style some chunks differently in an elegant way. We can also provide multiple CSS classes in a character vector instead of just one.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;cachetrue-to-render-faster-and-more-reproducibly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;code&gt;cache=TRUE&lt;/code&gt; to render faster and more reproducibly&lt;/h2&gt;
&lt;p&gt;In case your documents contain calculations that take a lot of time, or just cause unnecessary pain when re-executed with each render, for example when including benchmarking results in posts, it is very convenient to cache the chunk results. This will not only make the rendering faster, but also ensure that the results of the same code will stay the same in the output, even if we re-render the document.&lt;/p&gt;
&lt;p&gt;Note that for keeping reproducibility when random number generation is included with caching results, it is advised to also include &lt;code&gt;knitr::opts_chunk$set(cache.extra = knitr::rand_seed)&lt;/code&gt; in the document. More details on that are &lt;a href=&#34;https://yihui.name/knitr/demo/cache/#reproducibility-with-rng&#34;&gt;available here&lt;/a&gt;.&lt;/p&gt;
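&lt;p&gt;To see why the state of the random number generator matters here, a small base R sketch may help. It does not call knitr at all. It only illustrates, under the assumption that a cached chunk should be considered stale whenever the state stored in &lt;code&gt;.Random.seed&lt;/code&gt; differs, how drawing random numbers changes that state and how &lt;code&gt;set.seed()&lt;/code&gt; restores reproducibility. The variable names are made up for the illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;
```r
# Fix the seed and remember the RNG state before drawing anything
set.seed(2019)
state_before = .Random.seed

# Drawing random numbers advances the RNG state
x = runif(3)
state_changed = !identical(state_before, .Random.seed)

# Resetting the same seed reproduces exactly the same draws
set.seed(2019)
y = runif(3)
same_draws = identical(x, y)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both &lt;code&gt;state_changed&lt;/code&gt; and &lt;code&gt;same_draws&lt;/code&gt; end up &lt;code&gt;TRUE&lt;/code&gt;, which is why a chunk that draws random numbers affects the results of the chunks that follow it.&lt;/p&gt;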
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;resources&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://yihui.name/knitr/demo/cache/&#34;&gt;Details on the cache feature&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://yihui.name/knitr/options/#chunk-options&#34;&gt;Complete reference for chunk options&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://rpubs.com/alobo/spintutorial&#34;&gt;Spin tutorial&lt;/a&gt; on rpubs.com&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf&#34;&gt;R Markdown reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://yihui.name/knitr/options/&#34;&gt;Details on chunk options&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Here&#39;s why 2019 is a great year to start with R: A story of 10 year old R code then and now</title>
      <link>https://jozef.io/r908-10-year-old-code/</link>
      <pubDate>Sat, 05 Jan 2019 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r908-10-year-old-code/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;It has been more than ten years since I wrote my first R code. And in those years, the R world has changed dramatically, and mostly to the better. I believe that the current time may be one of the best times to start working with R.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this new year’s post we will look at the R world 10 years ago and today, and provide links to many tools that helped it become a great language to solve and present everyday tasks with a welcoming community of users and developers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#my-first-exposure-to-r-more-than-ten-years-ago&#34;&gt;My first exposure to R, more than ten years ago&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-growth-of-r&#34;&gt;The growth of R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#r-now-versus-then---a-much-better-world&#34;&gt;R now versus then - A much better world&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#if-you-really-came-for-that-ugly-old-code&#34;&gt;If you really came for that ugly old code&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;my-first-exposure-to-r-more-than-ten-years-ago&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;My first exposure to R, more than ten years ago&lt;/h1&gt;
&lt;p&gt;The year was 2007 and I was studying probability and mathematical statistics at my &lt;a href=&#34;https://fmph.uniba.sk/en&#34;&gt;faculty&lt;/a&gt; when &lt;a href=&#34;https://www.researchgate.net/profile/Radoslav_Harman&#34;&gt;one&lt;/a&gt; of the professors introduced us to R - a free programming language that we could use to solve many statistical tasks, from simple matrix operations and fitting models, to data visualization. This sounded great, as many other solutions that were traditionally used such as &lt;a href=&#34;https://en.wikipedia.org/wiki/Stata&#34;&gt;Stata&lt;/a&gt; or &lt;a href=&#34;https://en.wikipedia.org/wiki/SPSS&#34;&gt;SPSS&lt;/a&gt; were not even free to use, let alone open source.&lt;/p&gt;
&lt;p&gt;Now to get a bit of context, my most recent exposures to programming at that time were using Borland’s &lt;a href=&#34;https://en.wikipedia.org/wiki/Delphi_(IDE)#Borland_Delphi_7&#34;&gt;Delphi 7&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/C%2B%2BBuilder&#34;&gt;C++ Builder&lt;/a&gt;, both mature IDEs with very pleasant and advanced user interfaces and features, where you could literally have a Windows application with a nice UI ready, compiled and running in an hour.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r908-01-delphi7.jpg&#34; alt=&#34;Delphi 7, released in 2002&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Delphi 7, released in 2002&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rgui-times&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Rgui times&lt;/h2&gt;
&lt;p&gt;When I first opened the RGui it felt, well, slightly underwhelming:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r908-03-rgui-r261.jpg&#34; alt=&#34;Rgui rocking R version 2.6.1, released 26th Nov 2007&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Rgui rocking R version 2.6.1, released 26th Nov 2007&lt;/p&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;But why did you not use RStudio?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Well, the &lt;a href=&#34;https://blog.rstudio.com/2011/02/28/rstudio-new-open-source-ide-for-r/&#34;&gt;first beta version of RStudio&lt;/a&gt; was released about 3 years later in February 2011. By the way, those RStudio blog posts from 2011 still have comment sections available below them and I really enjoy reading through them. Anyway, I was stuck with the Rgui and it was not a pleasant experience. At that time, I disliked that experience so much that I still wrote some of the code in Delphi or C++ Builder.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;which-currently-popular-packages-existed&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Which currently popular packages existed?&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;But dplyr syntax makes everything so easy, why not use that?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Looking at the &lt;a href=&#34;https://web.archive.org/web/20080105045216/https://cran.r-project.org&#34;&gt;CRAN snapshots from the beginning of 2008&lt;/a&gt;, the latest released R version at that time was &lt;a href=&#34;https://cran-archive.r-project.org/bin/windows/base/old/2.6.1/&#34;&gt;R-2.6.1&lt;/a&gt; and there were around 1200 packages available on CRAN. At the time of writing of this post the number of packages available on CRAN reached 13600.&lt;/p&gt;
&lt;p&gt;Looking at the &lt;a href=&#34;http://cranlogs.r-pkg.org/top/last-month/40&#34;&gt;top 40&lt;/a&gt; most downloaded packages in the past month, only two of those packages existed on CRAN at that time - &lt;a href=&#34;https://github.com/tidyverse/ggplot2&#34;&gt;ggplot2&lt;/a&gt; and &lt;a href=&#34;https://github.com/eddelbuettel/digest&#34;&gt;digest&lt;/a&gt; - no &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;summarize&lt;/code&gt; or &lt;code&gt;group_by&lt;/code&gt; for me back then.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;stackoverflow-github-and-twitter-communities&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;StackOverflow, GitHub and Twitter communities&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Why did you not just ask StackOverflow, Twitter or check GitHub?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;According to Wikipedia, StackOverflow was launched 15th September 2008 and GitHub on 10th of April 2008, so in the beginning of 2008 neither of today’s two giants even existed.&lt;/p&gt;
&lt;p&gt;Not that I was using Twitter at that time, but the first #rstats tweet I was able to find is from 4th April 2009:&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-lang=&#34;en&#34;&gt;
&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;
RT &lt;a href=&#34;https://twitter.com/chrisalbon?ref_src=twsrc%5Etfw&#34;&gt;&lt;span class=&#34;citation&#34;&gt;@ChrisAlbon&lt;/span&gt;&lt;/a&gt; &lt;a href=&#34;https://twitter.com/drewconway?ref_src=twsrc%5Etfw&#34;&gt;&lt;span class=&#34;citation&#34;&gt;@drewconway&lt;/span&gt;&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; is the official R statistical language hashtag. &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; (because &lt;a href=&#34;https://twitter.com/hashtag/R?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#R&lt;/a&gt; doesn&#39;t cut it)
&lt;/p&gt;
— brendan o&#39;connor (&lt;span class=&#34;citation&#34;&gt;@brendan642&lt;/span&gt;) &lt;a href=&#34;https://twitter.com/brendan642/status/1452719959?ref_src=twsrc%5Etfw&#34;&gt;April 4, 2009&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;p&gt;For comparison, R itself was &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-announce/2000/000127.html&#34;&gt;first released&lt;/a&gt; 29th of February 2000, a date easily remembered.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-growth-of-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The growth of R&lt;/h1&gt;
&lt;p&gt;There are many ways to look at the growth of a programming language, and this is not meant to be a comprehensive and objective growth assessment. Rather, I took a look at 2 metrics I found interesting that show some trends in the R world.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are interested in the topic of programming language popularity, there are indices such as &lt;a href=&#34;http://pypl.github.io/PYPL.html&#34;&gt;PYPL&lt;/a&gt; and &lt;a href=&#34;https://www.tiobe.com/tiobe-index/&#34;&gt;TIOBE&lt;/a&gt;, and of course they have their &lt;a href=&#34;https://techcrunch.com/2018/09/30/what-the-heck-is-going-on-with-measures-of-programming-language-popularity/&#34;&gt;criticisms&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;downloads-of-r-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Downloads of R packages&lt;/h2&gt;
&lt;p&gt;RStudio’s CRAN mirror provides a &lt;a href=&#34;https://github.com/metacran/cranlogs.app/blob/master/README.md&#34;&gt;REST API&lt;/a&gt; from which we can look at and visualize the number of monthly downloads of R packages from January 2013 to December 2018. The chart speaks for itself:&lt;/p&gt;
&lt;script type=&#34;text/javascript&#34;&gt; $(function () {
  $(&#39;#r908-01-monthly-r-downloads&#39;).highcharts({
  title: {     
    text: null     
  },     
  yAxis: {     
    title: {     
      text: null     
    }     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    }     
  },     
  series: [     
    {     
      data: [1111018, 1254236, 1585446, 1836591, 1748652, 1696828, 1835270, 1926914, 2615003, 2771729, 2822179, 2378941, 3107109, 3253037, 4130467, 4907718, 4029389, 5060976, 4823936, 5588692, 5396871, 6534368, 6682423, 5984781, 7695022, 7738873, 9937860, 11274461, 10511679, 11046909, 10525599, 11577427, 12848572, 13991958, 16414107, 14767617, 14946786, 15695514, 16110482, 15034673, 18135754, 15942551, 15159072, 14926215, 19778328, 23259892, 27638156, 21852630, 25833739, 28609044, 32522448, 32597694, 32389075, 30214297, 27995878, 28865282, 33786377, 37808497, 39820735, 31812256, 38244722, 33787943, 45966646, 51566293, 48369947, 44977277, 42285633, 41935159, 59734508, 74333034, 74764832, 58582203],     
      name: &#34;Monthly package downloads (RStudio&#39;s CRAN mirror)&#34;     
    }     
  ],     
  xAxis: {     
    categories: [&#34;2013-01&#34;, &#34;2013-02&#34;, &#34;2013-03&#34;, &#34;2013-04&#34;, &#34;2013-05&#34;, &#34;2013-06&#34;, &#34;2013-07&#34;, &#34;2013-08&#34;, &#34;2013-09&#34;, &#34;2013-10&#34;, &#34;2013-11&#34;, &#34;2013-12&#34;, &#34;2014-01&#34;, &#34;2014-02&#34;, &#34;2014-03&#34;, &#34;2014-04&#34;, &#34;2014-05&#34;, &#34;2014-06&#34;, &#34;2014-07&#34;, &#34;2014-08&#34;, &#34;2014-09&#34;, &#34;2014-10&#34;, &#34;2014-11&#34;, &#34;2014-12&#34;, &#34;2015-01&#34;, &#34;2015-02&#34;, &#34;2015-03&#34;, &#34;2015-04&#34;, &#34;2015-05&#34;, &#34;2015-06&#34;, &#34;2015-07&#34;, &#34;2015-08&#34;, &#34;2015-09&#34;, &#34;2015-10&#34;, &#34;2015-11&#34;, &#34;2015-12&#34;, &#34;2016-01&#34;, &#34;2016-02&#34;, &#34;2016-03&#34;, &#34;2016-04&#34;, &#34;2016-05&#34;, &#34;2016-06&#34;, &#34;2016-07&#34;, &#34;2016-08&#34;, &#34;2016-09&#34;, &#34;2016-10&#34;, &#34;2016-11&#34;, &#34;2016-12&#34;, &#34;2017-01&#34;, &#34;2017-02&#34;, &#34;2017-03&#34;, &#34;2017-04&#34;, &#34;2017-05&#34;, &#34;2017-06&#34;, &#34;2017-07&#34;, &#34;2017-08&#34;, &#34;2017-09&#34;, &#34;2017-10&#34;, &#34;2017-11&#34;, &#34;2017-12&#34;, &#34;2018-01&#34;, &#34;2018-02&#34;, &#34;2018-03&#34;, &#34;2018-04&#34;, &#34;2018-05&#34;, &#34;2018-06&#34;, &#34;2018-07&#34;, &#34;2018-08&#34;, &#34;2018-09&#34;, &#34;2018-10&#34;, &#34;2018-11&#34;, &#34;2018-12&#34;]     
  }     
}     
  );
}); &lt;/script&gt;
&lt;div id=&#34;r908-01-monthly-r-downloads&#34;&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;interest-on-stackoverflow&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Interest on StackOverflow&lt;/h2&gt;
&lt;p&gt;Another interesting point of view is the statistics on &lt;a href=&#34;https://stackoverflow.blog/2017/05/09/introducing-stack-overflow-trends/?_ga=2.141059506.1807840642.1545911617-1497094062.1532602192&#34;&gt;trends on StackOverflow&lt;/a&gt;, paraphrasing their blogpost:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When we see a rapid growth in the number of questions about a technology, it usually reflects a real change in what developers are using and learning.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And how does R look within the StackOverflow trends compared to other languages? Looks like the growth of R is so remarkable, even the data scientists at StackOverflow itself noticed and wrote a &lt;a href=&#34;https://stackoverflow.blog/2017/10/10/impressive-growth-r/&#34;&gt;blogpost about it&lt;/a&gt; in 2017:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;../img/r908-02-so-trends.svg&#34;&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;r-now-versus-then---a-much-better-world&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R now versus then - A much better world&lt;/h1&gt;
&lt;p&gt;Going back to that story of my first R code, I think time has made working with R much better than it was before in many ways. I will list just a few of the many reasons why, with links to relevant resources to follow:&lt;/p&gt;
&lt;div id=&#34;availability-of-free-information-and-support-is-great&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Availability of free information and support is great&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The amazing amount of &lt;strong&gt;free information&lt;/strong&gt; readily available such as (tidyverse oriented) &lt;a href=&#34;https://r4ds.had.co.nz/&#34;&gt;R for Data Science&lt;/a&gt;, or &lt;a href=&#34;http://adv-r.had.co.nz/&#34;&gt;Advanced R&lt;/a&gt; books make R more accessible to learn and use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Communities&lt;/strong&gt; of R users such as the one &lt;a href=&#34;https://stackoverflow.com/questions/tagged/r&#34;&gt;on StackOverflow&lt;/a&gt; make it easy to ask questions and get answers; the &lt;a href=&#34;https://twitter.com/hashtag/rstats&#34;&gt;#rstats&lt;/a&gt; hashtag on Twitter is a good way to interact with the community&lt;/li&gt;
&lt;li&gt;Many &lt;strong&gt;user and developer blogs&lt;/strong&gt; on &lt;a href=&#34;https://www.r-bloggers.com&#34;&gt;r-bloggers.com&lt;/a&gt; and curated selections of content on &lt;a href=&#34;https://rweekly.org/&#34;&gt;RWeekly.org&lt;/a&gt; can serve as an inspiration and overview of the news in the community&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;software-tools-that-make-working-with-r-efficient&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Software tools that make working with R efficient&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tools like &lt;strong&gt;&lt;a href=&#34;https://www.rstudio.com/products/RStudio/&#34;&gt;RStudio&lt;/a&gt;&lt;/strong&gt; make using R a much more pleasant experience compared to the original RGui, with many useful features and a Server version running in the browser&lt;/li&gt;
&lt;li&gt;Well documented R packages that make common &lt;strong&gt;data science tasks easier&lt;/strong&gt; and/or more performant such as the popular &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt; or &lt;a href=&#34;https://github.com/Rdatatable/data.table/wiki&#34;&gt;data.table&lt;/a&gt; make it easier to start&lt;/li&gt;
&lt;li&gt;R packages that &lt;strong&gt;support development, testing and documentation&lt;/strong&gt; such as &lt;a href=&#34;https://www.rstudio.com/products/rpackages/devtools/&#34;&gt;devtools&lt;/a&gt;, &lt;a href=&#34;https://testthat.r-lib.org/&#34;&gt;testthat&lt;/a&gt; and &lt;a href=&#34;https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html&#34;&gt;roxygen2&lt;/a&gt; make R code efficient to develop, test and document&lt;/li&gt;
&lt;li&gt;For portability, reproducibility and &lt;strong&gt;dependency management&lt;/strong&gt;, tools such as &lt;a href=&#34;http://rstudio.github.io/packrat/&#34;&gt;packrat&lt;/a&gt; can make life less painful&lt;/li&gt;
&lt;li&gt;Code &lt;strong&gt;repository managers&lt;/strong&gt; such as &lt;a href=&#34;https://github.com/&#34;&gt;GitHub&lt;/a&gt;, &lt;a href=&#34;https://gitlab.com/&#34;&gt;GitLab&lt;/a&gt; or others make it easy to share code, collaborate and even perform CI/CD tasks where necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;professionally-presenting-and-publishing-r-results-is-simple&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Professionally presenting and publishing R results is simple&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tools like &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/&#34;&gt;RMarkdown&lt;/a&gt;, &lt;a href=&#34;https://bookdown.org/yihui/bookdown/&#34;&gt;Bookdown&lt;/a&gt;, &lt;a href=&#34;https://bookdown.org/yihui/blogdown/&#34;&gt;Blogdown&lt;/a&gt; and others make it easy to &lt;strong&gt;publish the results of your work&lt;/strong&gt;, be it an interactive dashboard, a paper in pdf, a presentation, even a book or a blog (such as this one)&lt;/li&gt;
&lt;li&gt;Many packages for generating interactive charts, maps and animations such as &lt;a href=&#34;http://jkunst.com/highcharter/&#34;&gt;highcharter&lt;/a&gt;, &lt;a href=&#34;https://rstudio.github.io/leaflet/&#34;&gt;leaflet&lt;/a&gt; and more help create amazing data visualizations&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://shiny.rstudio.com/&#34;&gt;Shiny&lt;/a&gt; takes it to the next level allowing for advanced interactive web applications&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;mature-interfaces-to-programming-languages-file-formats-and-more&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Mature interfaces to programming languages, file formats and more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;R now has mature &lt;strong&gt;interfaces&lt;/strong&gt; to many programming languages, software libraries, database systems and file formats, just a few examples include &lt;a href=&#34;https://cran.r-project.org/package=Rcpp&#34;&gt;Rcpp&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/package=rJava&#34;&gt;rJava&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/package=httr&#34;&gt;httr&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/package=openxlsx&#34;&gt;openxlsx&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/package=XLConnect&#34;&gt;XLConnect&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/package=highcharter&#34;&gt;highcharter&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/package=jsonlite&#34;&gt;jsonlite&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/package=xml2&#34;&gt;xml2&lt;/a&gt;, &lt;a href=&#34;https://spark.rstudio.com/index.html&#34;&gt;sparklyr&lt;/a&gt; and &lt;a href=&#34;https://r-dbi.github.io/DBI/&#34;&gt;DBI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;guidance-on-packages-per-topic-on-cran&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Guidance on packages per topic on CRAN&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/views/&#34;&gt;CRAN task views&lt;/a&gt; provide guidance on &lt;strong&gt;R packages per topic&lt;/strong&gt;, such as &lt;a href=&#34;https://cran.r-project.org/web/views/WebTechnologies.html&#34;&gt;Web Technologies and Services&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/web/views/HighPerformanceComputing.html&#34;&gt;High-Performance and Parallel Computing&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/web/views/MachineLearning.html&#34;&gt;Machine Learning &amp;amp; Statistical Learning&lt;/a&gt; and many more&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;if-you-really-came-for-that-ugly-old-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;If you really came for that ugly old code&lt;/h1&gt;
&lt;p&gt;I hope this post motivated you to dive a bit deeper into the R world and check some of the many amazing contributions created by developers and users in the R community mentioned above.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But if you really feel like having a good laugh first, feel free to check some of the oldest R scripts I was able to find, unedited, on &lt;a href=&#34;https://gitlab.com/jozefhajnala/gists/tree/master/oldestR&#34;&gt;GitLab here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They date to somewhere around the end of 2007 or the beginning of 2008 and, for what it’s worth, should still be runnable.&lt;/p&gt;
&lt;blockquote class=&#34;xmas&#34;&gt;
Thank you for reading and&lt;br /&gt;have a happy new yeaR
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>5 amazing free tools that can help with publishing R results and blogging</title>
      <link>https://jozef.io/r907-christmas-praise/</link>
      <pubDate>Sat, 22 Dec 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r907-christmas-praise/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;It is Christmas time! And what better time than this to write about the great tools that are available to all who like R and would like to publish their R work or even blog about it. This post is meant as a praise to the tools that are helping me to write this blog and make it a very nice experience, allowing me to focus on the content.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post we will praise 5 free tools that can help anyone make blogging about R or publishing results of R work a pleasant experience.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#rstudio-r-markdown-to-prepare-the-content&#34;&gt;RStudio + R Markdown to prepare the content&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#blogdown-hugo-to-make-it-a-nice-blog&#34;&gt;Blogdown (+ Hugo) to make it a nice blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#gitlab-and-gitlab-pages-to-version-control-and-publish-it&#34;&gt;GitLab and GitLab Pages to version control and publish it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#highcharts-and-the-highcharter-r-package-for-interactive-charts&#34;&gt;Highcharts and the highcharter R package for interactive charts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#screentogif-for-moving-screen-captures&#34;&gt;ScreenToGif for moving screen captures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#resources&#34;&gt;Resources&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;rstudio-r-markdown-to-prepare-the-content&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;RStudio + R Markdown to prepare the content&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/rstudio/hex-stickers/raw/master/PNG/RStudio.png&#34; alt=&#34;RStudio Sticker&#34; class=&#34;leftsmall&#34;&gt;&lt;/p&gt;
&lt;p&gt;The first is probably the most obvious, but still worth mentioning. The RStudio IDE is a good productivity tool for all R-related work; however, it is the integration with R Markdown that makes it my default environment for writing the blog posts. I especially enjoy using RStudio Server, which makes it easy to have one aligned environment regardless of where you are (as long as there is an internet connection ;) and what computer you are using (as long as there is a recent version of a web browser).&lt;/p&gt;
&lt;p&gt;I often find myself even editing the css style sheets, HTML partials and JavaScript within RStudio itself. And probably the best thing about it is that in combination with Blogdown, you can see all those changes instantly as you make them in RStudio’s Viewer. Using &lt;a href=&#34;https://jozef.io/r905-rstudio-terminal/&#34;&gt;RStudio’s Terminal&lt;/a&gt; for the necessary git commands makes RStudio Server a unified tool with all that I need for almost all of the work.&lt;/p&gt;
&lt;p&gt;As an honorable mention, R Markdown would probably not be possible without the powerhouse behind it - &lt;a href=&#34;http://pandoc.org/&#34;&gt;Pandoc&lt;/a&gt;. Pandoc is an open-source document converter, widely used as a writing tool and as a basis for publishing workflows. R Markdown uses &lt;a href=&#34;https://github.com/yihui/knitr&#34;&gt;knitr&lt;/a&gt; to run the code and produce Markdown, which Pandoc then converts into the rendered outputs.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;blogdown-hugo-to-make-it-a-nice-blog&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Blogdown (+ Hugo) to make it a nice blog&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;https://bookdown.org/yihui/blogdown/images/logo.png&#34; alt=&#34;Blogdown Logo&#34; class=&#34;leftsmall&#34;&gt;&lt;/p&gt;
&lt;p&gt;To write and review the posts for this blog, I almost exclusively use the combination of the RStudio (Server) IDE and the Blogdown package by &lt;a href=&#34;https://yihui.name/en/&#34;&gt;Yihui Xie&lt;/a&gt;. As most of the readers probably know, both of those are free and very easy to set up. If you have never heard of Blogdown, it is an open-source R package to generate static websites based on R Markdown and Hugo.&lt;/p&gt;
&lt;p&gt;But that short explanation does not really do it justice, so you may want to check out &lt;a href=&#34;https://awesome-blogdown.com/&#34;&gt;Awesome Blogdown&lt;/a&gt; for a curated list of blogs built using blogdown. If you want to learn more, there is even a &lt;a href=&#34;https://bookdown.org/yihui/blogdown/&#34;&gt;free online book&lt;/a&gt; written by the authors of blogdown. In terms of the design, you can find hundreds of themes to choose from in the Hugo &lt;a href=&#34;https://themes.gohugo.io/&#34;&gt;theme gallery&lt;/a&gt;. This blog uses a customized &lt;a href=&#34;https://themes.gohugo.io/hugo-natrium-theme/&#34;&gt;natrium theme&lt;/a&gt;, a simple responsive blog theme for Hugo based on the &lt;a href=&#34;https://themes.gohugo.io/hugo-lithium-theme/&#34;&gt;Lithium theme&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;gitlab-and-gitlab-pages-to-version-control-and-publish-it&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;GitLab and GitLab Pages to version control and publish it&lt;/h1&gt;
&lt;p&gt;&lt;img src = &#34;https://i.ytimg.com/vi/TWqh9MtT4Bg/maxresdefault.jpg&#34; alt=&#34;GitLab Pages&#34; class=&#34;leftsmall&#34;&gt;&lt;/p&gt;
&lt;p&gt;GitLab pages enable us to create websites for our GitLab projects, groups, or user account using any static website generator, Hugo included. Since GitLab is my repository manager of choice, allowing for free private repositories, integration with the pages comes very naturally and easily. Essentially all that is needed to make it work is a &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; file similar to &lt;a href=&#34;https://gitlab.com/pages/hugo/blob/master/.gitlab-ci.yml&#34;&gt;this one&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There is even a full &lt;a href=&#34;https://gitlab.com/pages/hugo&#34;&gt;example of a Hugo page&lt;/a&gt; available, with a nice readme, to see what it may look like. For advanced use, it is also possible to connect your custom domain and TLS certificates and host the websites on your own GitLab instance. On GitLab.com, the hosting of the sites is free. To read more documentation and watch video tutorials, just &lt;a href=&#34;https://docs.gitlab.com/ee/user/project/pages/&#34;&gt;click here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;highcharts-and-the-highcharter-r-package-for-interactive-charts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Highcharts and the highcharter R package for interactive charts&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;https://api.highcharts.com/highcharts/mstile-310x310.png&#34; alt=&#34;Highcharts&#34; class=&#34;leftsmall&#34;&gt;&lt;/p&gt;
&lt;p&gt;I have been using &lt;a href=&#34;https://www.highcharts.com/&#34;&gt;highcharts&lt;/a&gt; for interactive charting for projects for years and was very excited when the first version of the &lt;a href=&#34;http://jkunst.com/highcharter/&#34;&gt;highcharter&lt;/a&gt; R package providing an interface between R and highcharts arrived on CRAN in 2016.&lt;/p&gt;
&lt;p&gt;For the relatively rare occurrences when I need a chart included in a blog post I happily use highcharter mainly thanks to the amazing variability and ease of use provided by the now very mature highcharts JavaScript library. For a taste, just look at the &lt;a href=&#34;https://www.highcharts.com/demo&#34;&gt;highcharts demo&lt;/a&gt;. And yes, they can do &lt;a href=&#34;https://www.highcharts.com/maps/demo&#34;&gt;pretty highmaps&lt;/a&gt; too.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;screentogif-for-moving-screen-captures&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;ScreenToGif for moving screen captures&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/NickeManarin/ScreenToGif-Website/raw/master/img/ms-icon-150x150.png&#34; alt=&#34;ScreenToGif&#34; class=&#34;leftsmall&#34;&gt;&lt;/p&gt;
&lt;p&gt;ScreenToGif is an open source tool that allows you to record a selected area of your screen, edit and save it as a gif or video. I find recording the screen and showing it as a gif one of the best ways to easily show examples without the need to record a video, which takes much more effort, and this tool does just that very conveniently. You can download it for free &lt;a href=&#34;http://www.screentogif.com/&#34;&gt;from its website&lt;/a&gt; and take a look at the code in the &lt;a href=&#34;https://github.com/NickeManarin/ScreenToGif&#34;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;resources&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/products/rstudio/download/#download&#34;&gt;RStudio desktop&lt;/a&gt; and &lt;a href=&#34;https://www.rstudio.com/products/rstudio/download-server/&#34;&gt;RStudio Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/rstudio/rmarkdown&#34;&gt;R Markdown&lt;/a&gt; on GitHub&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;Blogdown&lt;/a&gt; on GitHub&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/yihui/blogdown/gitlab-pages.html&#34;&gt;Blogdown and GitLab Pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gohugo.io/&#34;&gt;Hugo&lt;/a&gt;, one of the most popular open-source static site generators&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://themes.gohugo.io&#34;&gt;Hugo theme gallery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.gitlab.com/ee/user/project/pages/&#34;&gt;GitLab pages&lt;/a&gt;, a feature that allows you to publish static websites directly from a repository in GitLab&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.highcharts.com/&#34;&gt;Highcharts&lt;/a&gt; makes it easy for developers to set up interactive charts in their web pages&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://jkunst.com/highcharter/&#34;&gt;highcharter&lt;/a&gt; is an R wrapper for Highcharts JavaScript library and its modules&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.screentogif.com&#34;&gt;ScreenToGif&lt;/a&gt;, a screen, webcam and sketchboard recorder with an integrated editor.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote class=&#34;xmas&#34;&gt;
Thank you for reading and&lt;br /&gt; have a verry merry Christmas :o)
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How to sort data by one or more columns with base R, dplyr and data.table</title>
      <link>https://jozef.io/r008-sorting-data/</link>
      <pubDate>Sat, 08 Dec 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r008-sorting-data/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In this post in the &lt;a href=&#34;https://jozef.io/categories/rcase4base/&#34;&gt;R:case4base&lt;/a&gt; series we will examine sorting (ordering) data in base R. We will learn to sort our data based on one or multiple columns, with ascending or descending order and as always look at alternatives to base R, namely the tidyverse’s dplyr and data.table to show how we can achieve the same results.&lt;/p&gt;
&lt;p&gt;It is recommended to first have a look at the &lt;a href=&#34;https://jozef.io/r002-data-manipulation/&#34;&gt;post on subsetting&lt;/a&gt; to understand the concepts underlying the sorting process in more detail.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#how-to-use-this-article&#34;&gt;How to use this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#subsetting-as-a-mechanism-for-sorting-data&#34;&gt;Subsetting as a mechanism for sorting data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#sorting-data-by-contents-of-a-column&#34;&gt;Sorting data by contents of a column&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#sorting-by-multiple-vectors-with-different-order&#34;&gt;Sorting by multiple vectors with different order&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#using-the-tidyverse&#34;&gt;Using the tidyverse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-data.table&#34;&gt;Using data.table&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#quick-benchmarking&#34;&gt;Quick benchmarking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-want-the-code&#34;&gt;TL;DR - Just want the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;how-to-use-this-article&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How to use this article&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a &lt;code&gt;case4base&lt;/code&gt; after all!&lt;/li&gt;
&lt;li&gt;If you have no time for reading, you can &lt;a href=&#34;https://jozef.io/post/data/r008-sorting-data.r&#34;&gt;click here to get just the code with commentary&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;First, let’s read yearly data on the gross disposable income of households in the EU countries into R (&lt;a href=&#34;https://jozef.io/post/data/ESA2010_GDI.csv&#34;&gt;click here to download&lt;/a&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi &amp;lt;- read.csv(
  url(&amp;quot;https://jozef.io/post/data/ESA2010_GDI.csv&amp;quot;),
  stringsAsFactors = FALSE
)
head(gdi[, 1:6, drop = FALSE])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;subsetting-as-a-mechanism-for-sorting-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Subsetting as a mechanism for sorting data&lt;/h1&gt;
&lt;p&gt;Sorting a data frame is loosely coupled with subsetting. To get the rows of a data frame in the reverse of their current order, we can just subset the rows with an index that goes from the last row to the very first (or, safer, to the zeroth, since R ignores a zero index) like so:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi_reversed_rows &amp;lt;- gdi[nrow(gdi):0, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can take a very similar approach to reverse the order of the columns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi_reversed_cols &amp;lt;- gdi[, ncol(gdi):0]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or both rows and columns at the same time. We also add the &lt;code&gt;drop = FALSE&lt;/code&gt; for safety here as we omitted it in the 2 above examples for readability:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi_reversed &amp;lt;- gdi[nrow(gdi):0, ncol(gdi):0, drop = FALSE]
head(gdi_reversed)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     Y.2016  Y.2015    Y.2014    Y.2013    Y.2012    Y.2011    Y.2010
## 35      NA      NA        NA        NA        NA        NA        NA
## 34      NA 1631795 1438281.4 1268729.8 1081744.9  971545.3  807128.5
## 33  458641  447094  449119.3  437596.6  428131.2  420404.9  412363.1
## 32 1627136 1606745 1496128.0 1419380.0 1347970.0 1272065.0 1204442.0
## 31      NA      NA 1055733.5  980494.9  934077.3  872900.3  798916.7
## 30 1330854 1298475 1269177.0 1219699.0 1195227.0 1160813.0 1151812.0
##       Y.2009    Y.2008    Y.2007    Y.2006   Y.2005   Y.2004   Y.2003
## 35        NA        NA        NA        NA       NA       NA       NA
## 34  689431.6        NA        NA        NA       NA       NA       NA
## 33  404446.9  399834.1  389468.0  368868.0 352620.1 341709.9 337742.9
## 32 1150829.0 1105563.0 1021911.0  943515.0 975153.0 894892.0 854026.0
## 31  858678.9  909995.1  827339.5  681058.3 631210.9 536194.9 478645.8
## 30 1101109.0 1080225.0 1063178.0 1005630.0 966175.0 926670.0 893528.0
##      Y.2002   Y.2001   Y.2000   Y.1999   Y.1998   Y.1997   Y.1996   Y.1995
## 35       NA       NA       NA       NA       NA       NA       NA       NA
## 34       NA       NA       NA       NA       NA       NA       NA       NA
## 33 335845.6 336581.4 326269.3 312478.7 303239.5 296324.6 291208.4 287865.4
## 32 800130.0 727228.0 704697.0 660196.0 630865.0 582597.0 549694.0 522981.0
## 31 447572.6 400145.0 369181.0       NA       NA       NA       NA       NA
## 30 857352.0 829908.0 789615.0 737419.0 715396.0 691951.0 656455.0 618959.0
##           country
## 35         Serbia
## 34         Turkey
## 33    Switzerland
## 32         Norway
## 31        Iceland
## 30 United Kingdom&lt;/code&gt;&lt;/pre&gt;
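&lt;p&gt;As a small illustration of why ending the index at zero is the safer choice, here is a minimal sketch on made-up data (the data frames &lt;code&gt;toy&lt;/code&gt; and &lt;code&gt;empty&lt;/code&gt; are hypothetical and not part of the GDI dataset). R ignores a zero index, so nothing changes for a data frame that has rows, while for a data frame with no rows the sequence ending at 1 would ask for a first row that does not exist:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical toy data, used only for illustration
toy &amp;lt;- data.frame(id = 1:3, value = c(10, 20, 30))
empty &amp;lt;- toy[0, , drop = FALSE]

# The zero index is ignored, so both calls return the same reversed rows
identical(toy[nrow(toy):1, ], toy[nrow(toy):0, ])

# With no rows, 0:1 asks for row 1 and returns a single row of NAs
nrow(empty[nrow(empty):1, , drop = FALSE])

# whereas 0:0 keeps the data frame empty
nrow(empty[nrow(empty):0, , drop = FALSE])&lt;/code&gt;&lt;/pre&gt;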
&lt;/div&gt;
&lt;div id=&#34;sorting-data-by-contents-of-a-column&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Sorting data by contents of a column&lt;/h1&gt;
&lt;p&gt;To order the rows (countries) by GDI in 2016, we use the function &lt;code&gt;order&lt;/code&gt;, which finds the permutation that rearranges the values into ascending order, and save that permutation into a variable called &lt;code&gt;rowidx&lt;/code&gt;. Then we simply use &lt;code&gt;rowidx&lt;/code&gt; to subset the rows of &lt;code&gt;gdi&lt;/code&gt; in the order we want:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowidx &amp;lt;- order(gdi[, &amp;quot;Y.2016&amp;quot;])
rowidx&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 13  8 16 18 17 26 27  4  9 28 24 22  3 21 33 11  6 23 14 30 12 32  7
## [24] 29  5  2  1 10 15 19 20 25 31 34 35&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi_sorted &amp;lt;- gdi[rowidx, , drop = FALSE]

# We can of course do it in one go:
gdi_sorted &amp;lt;- gdi[order(gdi[, &amp;quot;Y.2016&amp;quot;]), , drop = FALSE]

# Look at the 2 relevant columns of the result 
gdi_sorted[, c(1, 23)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA&lt;/code&gt;&lt;/pre&gt;
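&lt;p&gt;To make the idea of a permutation concrete, here is a minimal sketch on a small made-up vector (&lt;code&gt;x&lt;/code&gt; is hypothetical and not part of the GDI data). It shows that &lt;code&gt;order&lt;/code&gt; returns positions rather than values, and that subsetting with those positions gives the sorted result, with &lt;code&gt;NA&lt;/code&gt; placed last by default:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A small illustrative vector, not taken from the gdi data
x &amp;lt;- c(30, 10, NA, 20)

# Positions of the elements in ascending order
idx &amp;lt;- order(x)
idx&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2 4 1 3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Subsetting by those positions sorts the values
x[idx]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 10 20 30 NA&lt;/code&gt;&lt;/pre&gt;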
&lt;p&gt;To sort in descending order, we can use &lt;code&gt;decreasing = TRUE&lt;/code&gt;. To see the &lt;code&gt;NA&lt;/code&gt;s first, we can use &lt;code&gt;na.last = FALSE&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowidx &amp;lt;- order(gdi[, &amp;quot;Y.2016&amp;quot;], decreasing = TRUE, na.last = FALSE)
gdi[rowidx, c(1, 23), drop = FALSE]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country     Y.2016
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;sorting-by-multiple-vectors-with-different-order&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Sorting by multiple vectors with different order&lt;/h1&gt;
&lt;p&gt;That looks good, but we may also want to order the rows that have &lt;code&gt;NA&lt;/code&gt; as GDI in 2016 alphabetically by country (or generalize even further). Using multiple vectors for ordering is also very simple:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowidx &amp;lt;- order(gdi[, &amp;quot;Y.2016&amp;quot;], gdi[, &amp;quot;country&amp;quot;])
gdi[rowidx, c(1, 23), drop = FALSE]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 15         Cyprus         NA
## 10         Greece         NA
## 19        Hungary         NA
## 31        Iceland         NA
## 20          Malta         NA
## 25        Romania         NA
## 35         Serbia         NA
## 34         Turkey         NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To order by multiple columns in different directions, for numeric vectors we can use a simple &lt;code&gt;-&lt;/code&gt;, since a negated numeric vector sorts in reverse order. To order our GDI dataset by GDI in 2016 descending and then by country alphabetically:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowidx &amp;lt;- order(-gdi[, &amp;quot;Y.2016&amp;quot;], gdi[, &amp;quot;country&amp;quot;])
gdi[rowidx, c(1, 23), drop = FALSE]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country     Y.2016
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00
## 15         Cyprus         NA
## 10         Greece         NA
## 19        Hungary         NA
## 31        Iceland         NA
## 20          Malta         NA
## 25        Romania         NA
## 35         Serbia         NA
## 34         Turkey         NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For non-numeric vectors, we can take advantage of the &lt;code&gt;xtfrm&lt;/code&gt; function, which returns a numeric vector that sorts in the same order as the vector provided to it. We then negate it with &lt;code&gt;-&lt;/code&gt; to get a vector that sorts in reverse order. To order our GDI dataset by GDI in 2016 ascending and then by country reverse-alphabetically:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowidx &amp;lt;- order(gdi[, &amp;quot;Y.2016&amp;quot;], -xtfrm(gdi[, &amp;quot;country&amp;quot;]))
gdi[rowidx, c(1, 23), drop = FALSE]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 34         Turkey         NA
## 35         Serbia         NA
## 25        Romania         NA
## 20          Malta         NA
## 31        Iceland         NA
## 19        Hungary         NA
## 10         Greece         NA
## 15         Cyprus         NA&lt;/code&gt;&lt;/pre&gt;
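&lt;p&gt;As a possible alternative to negating with &lt;code&gt;xtfrm&lt;/code&gt;, base R’s &lt;code&gt;order&lt;/code&gt; also accepts a vector for &lt;code&gt;decreasing&lt;/code&gt; when &lt;code&gt;method = &amp;quot;radix&amp;quot;&lt;/code&gt; is used, with one value per sorting key. The sketch below uses a small made-up data frame (&lt;code&gt;df&lt;/code&gt;, with hypothetical columns &lt;code&gt;value&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt;) rather than the GDI data. Note that the radix method compares strings in the C locale, so for mixed-case or non-ASCII text the result may differ from the &lt;code&gt;xtfrm&lt;/code&gt; approach:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Illustrative data only, not taken from the gdi data
df &amp;lt;- data.frame(
  value = c(2, 1, NA, 1, NA),
  name  = c(&amp;quot;b&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;d&amp;quot;, &amp;quot;e&amp;quot;),
  stringsAsFactors = FALSE
)

# value ascending, name descending: one `decreasing` flag per key
idx &amp;lt;- order(
  df$value, df$name,
  decreasing = c(FALSE, TRUE),
  method = &amp;quot;radix&amp;quot;
)
df[idx, ]

# For this simple lower-case ASCII data we would expect
# the same permutation as with the xtfrm approach
identical(idx, order(df$value, -xtfrm(df$name)))&lt;/code&gt;&lt;/pre&gt;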
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;div id=&#34;using-the-tidyverse&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the tidyverse&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;dplyr&lt;/code&gt; package provides user-friendly functions that are especially convenient in an interactive setting where we know the column names up front, so we can take advantage of non-standard evaluation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
gdi %&amp;gt;% 
  arrange(Y.2016, desc(country)) %&amp;gt;% 
  select(1, 23)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country     Y.2016
## 1         Croatia       0.00
## 2         Estonia   12548.30
## 3          Latvia   15737.79
## 4      Luxembourg   20155.80
## 5       Lithuania   24743.49
## 6        Slovenia   24756.63
## 7        Slovakia   48882.91
## 8        Bulgaria   60237.00
## 9         Ireland   97318.90
## 10        Finland  126590.00
## 11       Portugal  128789.39
## 12        Austria  214980.60
## 13        Belgium  243825.50
## 14    Netherlands  357383.00
## 15    Switzerland  458641.00
## 16          Spain  698701.00
## 17        Denmark 1091542.00
## 18         Poland 1136916.00
## 19          Italy 1142273.40
## 20 United Kingdom 1330854.00
## 21         France 1425435.00
## 22         Norway 1627136.00
## 23        Germany 2019917.00
## 24         Sweden 2402587.00
## 25 Czech Republic 2523229.00
## 26   Euro area 19 6736686.43
## 27          EU 28 9454683.60
## 28         Turkey         NA
## 29         Serbia         NA
## 30        Romania         NA
## 31          Malta         NA
## 32        Iceland         NA
## 33        Hungary         NA
## 34         Greece         NA
## 35         Cyprus         NA&lt;/code&gt;&lt;/pre&gt;
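&lt;p&gt;If we need to provide the names of the columns as character strings instead, we can use &lt;code&gt;arrange_at&lt;/code&gt;. Note that we sort by the secondary key (&lt;code&gt;country&lt;/code&gt;) first and by the primary key (&lt;code&gt;Y.2016&lt;/code&gt;) second, which relies on the second sort keeping tied rows in their existing order:&lt;/p&gt;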
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi %&amp;gt;% 
  arrange_at(&amp;quot;country&amp;quot;, desc) %&amp;gt;%
  arrange_at(&amp;quot;Y.2016&amp;quot;) %&amp;gt;%
  select(1, 23)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country     Y.2016
## 1         Croatia       0.00
## 2         Estonia   12548.30
## 3          Latvia   15737.79
## 4      Luxembourg   20155.80
## 5       Lithuania   24743.49
## 6        Slovenia   24756.63
## 7        Slovakia   48882.91
## 8        Bulgaria   60237.00
## 9         Ireland   97318.90
## 10        Finland  126590.00
## 11       Portugal  128789.39
## 12        Austria  214980.60
## 13        Belgium  243825.50
## 14    Netherlands  357383.00
## 15    Switzerland  458641.00
## 16          Spain  698701.00
## 17        Denmark 1091542.00
## 18         Poland 1136916.00
## 19          Italy 1142273.40
## 20 United Kingdom 1330854.00
## 21         France 1425435.00
## 22         Norway 1627136.00
## 23        Germany 2019917.00
## 24         Sweden 2402587.00
## 25 Czech Republic 2523229.00
## 26   Euro area 19 6736686.43
## 27          EU 28 9454683.60
## 28         Turkey         NA
## 29         Serbia         NA
## 30        Romania         NA
## 31          Malta         NA
## 32        Iceland         NA
## 33        Hungary         NA
## 34         Greece         NA
## 35         Cyprus         NA&lt;/code&gt;&lt;/pre&gt;
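&lt;p&gt;In more recent dplyr releases the scoped verbs such as &lt;code&gt;arrange_at&lt;/code&gt; are documented as superseded. One possible alternative, sketched here under the assumption that the installed dplyr version supports the &lt;code&gt;.data&lt;/code&gt; pronoun inside &lt;code&gt;arrange&lt;/code&gt;, keeps both keys in a single call. The data frame &lt;code&gt;df&lt;/code&gt; and the variables &lt;code&gt;primary&lt;/code&gt; and &lt;code&gt;secondary&lt;/code&gt; are made up for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

# Illustrative data only, not taken from the gdi data
df &amp;lt;- data.frame(
  value = c(2, 1, NA, 1),
  name  = c(&amp;quot;b&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;d&amp;quot;),
  stringsAsFactors = FALSE
)

# Column names held in character variables
primary   &amp;lt;- &amp;quot;value&amp;quot;
secondary &amp;lt;- &amp;quot;name&amp;quot;

# value ascending, then name descending, in one arrange call
df %&amp;gt;%
  arrange(.data[[primary]], desc(.data[[secondary]]))&lt;/code&gt;&lt;/pre&gt;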
&lt;/div&gt;
&lt;div id=&#34;using-data.table&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using data.table&lt;/h2&gt;
&lt;p&gt;There are multiple ways to achieve the desired results with data.table. The one syntactically closest to base R is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
gdidt &amp;lt;- as.data.table(gdi)
gdidt[order(Y.2016, -country), c(1, 23)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            country     Y.2016
##  1:        Croatia       0.00
##  2:        Estonia   12548.30
##  3:         Latvia   15737.79
##  4:     Luxembourg   20155.80
##  5:      Lithuania   24743.49
##  6:       Slovenia   24756.63
##  7:       Slovakia   48882.91
##  8:       Bulgaria   60237.00
##  9:        Ireland   97318.90
## 10:        Finland  126590.00
## 11:       Portugal  128789.39
## 12:        Austria  214980.60
## 13:        Belgium  243825.50
## 14:    Netherlands  357383.00
## 15:    Switzerland  458641.00
## 16:          Spain  698701.00
## 17:        Denmark 1091542.00
## 18:         Poland 1136916.00
## 19:          Italy 1142273.40
## 20: United Kingdom 1330854.00
## 21:         France 1425435.00
## 22:         Norway 1627136.00
## 23:        Germany 2019917.00
## 24:         Sweden 2402587.00
## 25: Czech Republic 2523229.00
## 26:   Euro area 19 6736686.43
## 27:          EU 28 9454683.60
## 28:         Turkey         NA
## 29:         Serbia         NA
## 30:        Romania         NA
## 31:          Malta         NA
## 32:        Iceland         NA
## 33:        Hungary         NA
## 34:         Greece         NA
## 35:         Cyprus         NA
##            country     Y.2016&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another option is to take advantage of the &lt;code&gt;setorderv&lt;/code&gt; function provided by data.table. The important distinction is that this sorts the existing data.table in place, changing the source object. The other methods used above leave the source object untouched:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This will sort the gdidt by reference - changing the input object
setorderv(gdidt, c(&amp;quot;Y.2016&amp;quot;, &amp;quot;country&amp;quot;), c(1, -1), na.last = TRUE)
# So we now just subset the (already sorted) gdidt
gdidt[, c(1, 23)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            country     Y.2016
##  1:        Croatia       0.00
##  2:        Estonia   12548.30
##  3:         Latvia   15737.79
##  4:     Luxembourg   20155.80
##  5:      Lithuania   24743.49
##  6:       Slovenia   24756.63
##  7:       Slovakia   48882.91
##  8:       Bulgaria   60237.00
##  9:        Ireland   97318.90
## 10:        Finland  126590.00
## 11:       Portugal  128789.39
## 12:        Austria  214980.60
## 13:        Belgium  243825.50
## 14:    Netherlands  357383.00
## 15:    Switzerland  458641.00
## 16:          Spain  698701.00
## 17:        Denmark 1091542.00
## 18:         Poland 1136916.00
## 19:          Italy 1142273.40
## 20: United Kingdom 1330854.00
## 21:         France 1425435.00
## 22:         Norway 1627136.00
## 23:        Germany 2019917.00
## 24:         Sweden 2402587.00
## 25: Czech Republic 2523229.00
## 26:   Euro area 19 6736686.43
## 27:          EU 28 9454683.60
## 28:         Turkey         NA
## 29:         Serbia         NA
## 30:        Romania         NA
## 31:          Malta         NA
## 32:        Iceland         NA
## 33:        Hungary         NA
## 34:         Greece         NA
## 35:         Cyprus         NA
##            country     Y.2016&lt;/code&gt;&lt;/pre&gt;
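&lt;p&gt;If the source object should stay untouched, one option is to sort a copy instead. The following is a minimal sketch using data.table’s &lt;code&gt;copy&lt;/code&gt; on a small made-up table (&lt;code&gt;dt&lt;/code&gt; and its columns are hypothetical, not the GDI data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

# Illustrative data only, not taken from the gdi data
dt &amp;lt;- data.table(
  value = c(2, 1, NA, 1),
  name  = c(&amp;quot;b&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;d&amp;quot;)
)

# Sort a deep copy by reference, leaving dt itself as it was
sorted &amp;lt;- setorderv(copy(dt), c(&amp;quot;value&amp;quot;, &amp;quot;name&amp;quot;), c(1, -1), na.last = TRUE)

# The original keeps its row order, the copy is sorted
dt$name
sorted$name&lt;/code&gt;&lt;/pre&gt;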
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-benchmarking&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick benchmarking&lt;/h1&gt;
&lt;p&gt;For a quick overview, let’s look at a basic benchmark without package loading overhead for each of the mentioned methods. To do the benchmarking, we will use a very slightly modified &lt;code&gt;flights&lt;/code&gt; data frame provided by Hadley Wickham’s &lt;a href=&#34;https://cran.r-project.org/package=nycflights13&#34;&gt;nycflights13&lt;/a&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bench &amp;lt;- microbenchmark::microbenchmark(times = 100,
  base_order   = {flights[order(flights[, &amp;quot;flight&amp;quot;], -xtfrm(flights[, &amp;quot;carrier&amp;quot;])), ] },
  dt_oder      = {flightsdt[order(flight, -carrier), ] },
  dplyr_nse    = {flights %&amp;gt;% arrange(flight, desc(carrier)) },
  dplyr_scoped = {flights %&amp;gt;% arrange_at(&amp;quot;carrier&amp;quot;, desc) %&amp;gt;% arrange_at(&amp;quot;flight&amp;quot;) }
)&lt;/code&gt;&lt;/pre&gt;
&lt;script type=&#34;text/javascript&#34;&gt; $(function () {
  $(&#39;#r008-01-bench-boxplot&#39;).highcharts({
  title: {     
    text: &#34;Sorting 2 columns, 336 776 rows&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;time (milliseconds)&#34;     
    },     
    min: 0     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    boxplot: {     
      fillColor: &#34;#C9E4FF&#34;,     
      lineWidth: 1,     
      medianWidth: 2,     
      stemDashStyle: &#34;dot&#34;,     
      stemWidth: 1,     
      whiskerLength: &#34;40%&#34;,     
      whiskerWidth: 1.5     
    }     
  },     
  chart: {     
    type: &#34;column&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      name: null,     
      data: [     
        {     
          name: &#34;base_order&#34;,     
          low: 1302,     
          q1: 1339,     
          median: 1377,     
          q3: 1424,     
          high: 1541     
        },     
        {     
          name: &#34;dt_oder&#34;,     
          low: 69,     
          q1: 74.5,     
          median: 79,     
          q3: 85,     
          high: 100     
        },     
        {     
          name: &#34;dplyr_nse&#34;,     
          low: 218,     
          q1: 228.5,     
          median: 235,     
          q3: 247.5,     
          high: 268     
        },     
        {     
          name: &#34;dplyr_scoped&#34;,     
          low: 308,     
          q1: 321.5,     
          median: 345.5,     
          q3: 428.5,     
          high: 490     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null,     
      color: &#34;blue&#34;     
    }     
  ]     
}     
  );
}); &lt;/script&gt;
&lt;div id=&#34;r008-01-bench-boxplot&#34;&gt;

&lt;/div&gt;
&lt;p&gt;Under our particular circumstances, base R’s method seems to be the slowest of the options with data.table being the fastest.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-want-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just want the code&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;No time for reading? &lt;a href=&#34;https://jozef.io/post/data/r008-sorting-data.r&#34;&gt;Click here to get just the code with commentary&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/r002-data-manipulation/&#34;&gt;post on subsetting&lt;/a&gt; on this blog&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/order&#34;&gt;documentation for base::order&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://dplyr.tidyverse.org/reference/arrange.html&#34;&gt;dplyr’s arrange&lt;/a&gt; function reference&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html&#34;&gt;introduction to data.table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://ec.europa.eu/eurostat/web/sector-accounts/data/annual-data&#34;&gt;original eurostat data source&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How to work with strings in base R - An overview of 20&#43; methods for daily use</title>
      <link>https://jozef.io/r007-string-manipulation/</link>
      <pubDate>Sat, 24 Nov 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r007-string-manipulation/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In this post in the &lt;a href=&#34;https://jozef.io/categories/rcase4base/&#34;&gt;R:case4base&lt;/a&gt; series we will look at string manipulation with base R, and provide an overview of a wide range of functions for our string working needs.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We will use simple examples to learn to perform basic string operations, concatenate strings, work with substrings, switch cases, quote, find and replace within strings and more. Some interesting bonuses will also be included.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As always, some popular alternatives to base R will also be suggested and many useful references provided for further reading.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#quick-overview-of-the-very-basics&#34;&gt;Quick overview of the very basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#string-concatenation&#34;&gt;String concatenation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#string-manipulation-and-properties&#34;&gt;String manipulation and properties&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#basic-pattern-matching-and-replacement&#34;&gt;Basic pattern matching and replacement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bonuses&#34;&gt;Bonuses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-want-the-code&#34;&gt;TL;DR - Just want the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-overview-of-the-very-basics&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick overview of the very basics&lt;/h1&gt;
&lt;p&gt;This post aims to serve as an overview of the functionality provided by base R to work with strings. Note that the term “string” is used somewhat loosely and refers to both character vectors and character strings. In R documentation, references to &lt;code&gt;character string&lt;/code&gt; refer to character vectors of length 1.&lt;/p&gt;
&lt;p&gt;Also, since this is an overview, we will not examine the details of the functions, but rather list examples with simple, intuitive explanations at the cost of some technical precision.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# String constants can be assigned using
# double quotes 
a &amp;lt;- &amp;quot;this is a character string&amp;quot;
# or single quotes 
b &amp;lt;- &amp;#39;this is a character string, too&amp;#39;
# To use literal quotes, we can escape with `\`: 
c &amp;lt;- &amp;quot;this is \&amp;quot;it\&amp;quot;&amp;quot;

# To make a character vector with multiple elements:
d &amp;lt;- c(&amp;quot;this&amp;quot;, &amp;quot;vector&amp;quot;, &amp;quot;has&amp;quot;, &amp;quot;five&amp;quot;, &amp;quot;elements&amp;quot;)

# To get the length of a character vector
# (how many elements are in a character vector)
length(d)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 5&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# To get the number of characters in elements of a vector
# (&amp;quot;how many characters in each of the elements&amp;quot;)
nchar(d)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4 6 3 4 8&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# To create a missing character value
NA_character_&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] NA&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# To test if an object is a character vector
is.character(&amp;quot;is this a character vector?&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# To convert other objects to character vectors
# Can surprise the unwary
as.character(c(
  42,
  Sys.time(),
  factor(&amp;quot;A&amp;quot;, levels = LETTERS)
))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;42&amp;quot;         &amp;quot;1543050000&amp;quot; &amp;quot;1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
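&lt;p&gt;The surprise above comes from &lt;code&gt;c()&lt;/code&gt;, which coerces all three values to a plain numeric vector before &lt;code&gt;as.character&lt;/code&gt; ever sees them, so the date-time becomes a number of seconds and the factor becomes its integer code. A minimal sketch of converting the objects on their own instead, which keeps the factor label:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Converting the factor directly returns its label
as.character(factor(&amp;quot;A&amp;quot;, levels = LETTERS))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;A&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Keeping the objects in a list avoids the coercion done by c()
# and lets us convert them one by one
vapply(
  list(42, factor(&amp;quot;A&amp;quot;, levels = LETTERS)),
  as.character,
  character(1)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;42&amp;quot; &amp;quot;A&amp;quot;&lt;/code&gt;&lt;/pre&gt;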
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One of the ways to output a vector is `cat`
cat(&amp;quot;Show me this&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Show me this&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# To include line breaks use `&amp;quot;\n&amp;quot;`
# To include tabs use `&amp;quot;\t&amp;quot;`:
cat(&amp;quot;Break\ta\ta\nline&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Break    a   a
## line&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# When in doubt about an object
# str or summary may help
weirdList &amp;lt;- list(
  &amp;quot;What is this?&amp;quot;,
  Sys.time(),
  b = 5L,
  c = c(&amp;quot;one&amp;quot;, 2),
  d = factor(c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)),
  e = NA_character_,
  f = NA_integer_
)

str(weirdList)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## List of 7
##  $  : chr &amp;quot;What is this?&amp;quot;
##  $  : POSIXct[1:1], format: &amp;quot;2018-11-24 09:00:00&amp;quot;
##  $ b: int 5
##  $ c: chr [1:2] &amp;quot;one&amp;quot; &amp;quot;2&amp;quot;
##  $ d: Factor w/ 2 levels &amp;quot;blue&amp;quot;,&amp;quot;red&amp;quot;: 2 1
##  $ e: chr NA
##  $ f: int NA&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(weirdList)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Length Class   Mode     
##   1      -none-  character
##   1      POSIXct numeric  
## b 1      -none-  numeric  
## c 2      -none-  character
## d 2      factor  numeric  
## e 1      -none-  character
## f 1      -none-  numeric&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;string-concatenation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;String concatenation&lt;/h1&gt;
&lt;p&gt;String concatenation is the process of “joining” two strings together and is one of the most common operations.&lt;/p&gt;
&lt;div id=&#34;simple-concatenation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Simple concatenation&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# We will use these vectors for our examples:
1:3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 1 2 3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;month.name&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;January&amp;quot;   &amp;quot;February&amp;quot;  &amp;quot;March&amp;quot;     &amp;quot;April&amp;quot;     &amp;quot;May&amp;quot;      
##  [6] &amp;quot;June&amp;quot;      &amp;quot;July&amp;quot;      &amp;quot;August&amp;quot;    &amp;quot;September&amp;quot; &amp;quot;October&amp;quot;  
## [11] &amp;quot;November&amp;quot;  &amp;quot;December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Use paste to concatenate
# R recycles 1:3 4 times to fit the length of month.name
paste(1:3, month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;1 January&amp;quot;   &amp;quot;2 February&amp;quot;  &amp;quot;3 March&amp;quot;     &amp;quot;1 April&amp;quot;     &amp;quot;2 May&amp;quot;      
##  [6] &amp;quot;3 June&amp;quot;      &amp;quot;1 July&amp;quot;      &amp;quot;2 August&amp;quot;    &amp;quot;3 September&amp;quot; &amp;quot;1 October&amp;quot;  
## [11] &amp;quot;2 November&amp;quot;  &amp;quot;3 December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Specify the sep argument to 
# separate the elements differently
paste(1:3, month.name, sep = &amp;quot;: &amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;1: January&amp;quot;   &amp;quot;2: February&amp;quot;  &amp;quot;3: March&amp;quot;     &amp;quot;1: April&amp;quot;    
##  [5] &amp;quot;2: May&amp;quot;       &amp;quot;3: June&amp;quot;      &amp;quot;1: July&amp;quot;      &amp;quot;2: August&amp;quot;   
##  [9] &amp;quot;3: September&amp;quot; &amp;quot;1: October&amp;quot;   &amp;quot;2: November&amp;quot;  &amp;quot;3: December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A shorthand for sep = &amp;quot;&amp;quot;
paste0(1:3, month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;1January&amp;quot;   &amp;quot;2February&amp;quot;  &amp;quot;3March&amp;quot;     &amp;quot;1April&amp;quot;     &amp;quot;2May&amp;quot;      
##  [6] &amp;quot;3June&amp;quot;      &amp;quot;1July&amp;quot;      &amp;quot;2August&amp;quot;    &amp;quot;3September&amp;quot; &amp;quot;1October&amp;quot;  
## [11] &amp;quot;2November&amp;quot;  &amp;quot;3December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Alternatively, sprintf is very useful
sprintf(&amp;quot;%s: %s&amp;quot;, 1:3, month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;1: January&amp;quot;   &amp;quot;2: February&amp;quot;  &amp;quot;3: March&amp;quot;     &amp;quot;1: April&amp;quot;    
##  [5] &amp;quot;2: May&amp;quot;       &amp;quot;3: June&amp;quot;      &amp;quot;1: July&amp;quot;      &amp;quot;2: August&amp;quot;   
##  [9] &amp;quot;3: September&amp;quot; &amp;quot;1: October&amp;quot;   &amp;quot;2: November&amp;quot;  &amp;quot;3: December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;concatenate-a-vector-into-a-single-character-string&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concatenate a vector into a single character string&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Provide the collapse argument to paste
# to get a character string (length 1 vector):
paste(1:3, month.name, sep = &amp;quot;: &amp;quot;, collapse = &amp;quot;, &amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1: January, 2: February, 3: March, 1: April, 2: May, 3: June, 1: July, 2: August, 3: September, 1: October, 2: November, 3: December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Or, use toString
toString(paste(1:3, month.name, sep = &amp;quot;: &amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1: January, 2: February, 3: March, 1: April, 2: May, 3: June, 1: July, 2: August, 3: September, 1: October, 2: November, 3: December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;string-manipulation-and-properties&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;String manipulation and properties&lt;/h1&gt;
&lt;div id=&#34;string-lengths&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;String lengths&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# How many elements does a vector have?
length(month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 12&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# To get the number of characters in elements of a vector
# (&amp;quot;how many characters in each of the elements?&amp;quot;)
nchar(month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 7 8 5 5 3 4 4 6 9 7 8 8&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Are the elements non-empty strings?
nzchar(month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;switching-to-upperlower-case&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Switching to upper/lower case&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Switch to all lower case
tolower(month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;january&amp;quot;   &amp;quot;february&amp;quot;  &amp;quot;march&amp;quot;     &amp;quot;april&amp;quot;     &amp;quot;may&amp;quot;      
##  [6] &amp;quot;june&amp;quot;      &amp;quot;july&amp;quot;      &amp;quot;august&amp;quot;    &amp;quot;september&amp;quot; &amp;quot;october&amp;quot;  
## [11] &amp;quot;november&amp;quot;  &amp;quot;december&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Switch to all upper case
toupper(month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;JANUARY&amp;quot;   &amp;quot;FEBRUARY&amp;quot;  &amp;quot;MARCH&amp;quot;     &amp;quot;APRIL&amp;quot;     &amp;quot;MAY&amp;quot;      
##  [6] &amp;quot;JUNE&amp;quot;      &amp;quot;JULY&amp;quot;      &amp;quot;AUGUST&amp;quot;    &amp;quot;SEPTEMBER&amp;quot; &amp;quot;OCTOBER&amp;quot;  
## [11] &amp;quot;NOVEMBER&amp;quot;  &amp;quot;DECEMBER&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Casefold is a wrapper for S-PLUS compatibility
casefold(month.name, upper = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;january&amp;quot;   &amp;quot;february&amp;quot;  &amp;quot;march&amp;quot;     &amp;quot;april&amp;quot;     &amp;quot;may&amp;quot;      
##  [6] &amp;quot;june&amp;quot;      &amp;quot;july&amp;quot;      &amp;quot;august&amp;quot;    &amp;quot;september&amp;quot; &amp;quot;october&amp;quot;  
## [11] &amp;quot;november&amp;quot;  &amp;quot;december&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;casefold(month.name, upper = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;JANUARY&amp;quot;   &amp;quot;FEBRUARY&amp;quot;  &amp;quot;MARCH&amp;quot;     &amp;quot;APRIL&amp;quot;     &amp;quot;MAY&amp;quot;      
##  [6] &amp;quot;JUNE&amp;quot;      &amp;quot;JULY&amp;quot;      &amp;quot;AUGUST&amp;quot;    &amp;quot;SEPTEMBER&amp;quot; &amp;quot;OCTOBER&amp;quot;  
## [11] &amp;quot;NOVEMBER&amp;quot;  &amp;quot;DECEMBER&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Also, custom translation:
chartr(&amp;quot;OIZEASGTC&amp;quot;, &amp;quot;01234567(&amp;quot; , toupper(month.name))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;J4NU4RY&amp;quot;   &amp;quot;F3BRU4RY&amp;quot;  &amp;quot;M4R(H&amp;quot;     &amp;quot;4PR1L&amp;quot;     &amp;quot;M4Y&amp;quot;      
##  [6] &amp;quot;JUN3&amp;quot;      &amp;quot;JULY&amp;quot;      &amp;quot;4U6U57&amp;quot;    &amp;quot;53P73MB3R&amp;quot; &amp;quot;0(70B3R&amp;quot;  
## [11] &amp;quot;N0V3MB3R&amp;quot;  &amp;quot;D3(3MB3R&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;removing-white-spaces&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Removing white spaces&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Remove all leading and trailing whitespaces
trimws(&amp;quot; This has trailing spaces.  &amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;This has trailing spaces.&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Remove leading whitespaces
trimws(&amp;quot; This has trailing spaces.  &amp;quot;, which = &amp;quot;left&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;This has trailing spaces.  &amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Remove trailing whitespaces
trimws(&amp;quot; This has trailing spaces.  &amp;quot;, which = &amp;quot;right&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot; This has trailing spaces.&amp;quot;&lt;/code&gt;&lt;/pre&gt;
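&lt;p&gt;In R 3.6.0 and later, &lt;code&gt;trimws()&lt;/code&gt; also accepts a &lt;code&gt;whitespace&lt;/code&gt; argument, a regular expression describing what should be treated as whitespace. A brief sketch with a made-up input string:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Trim dots as well as spaces from both ends (requires R &amp;gt;= 3.6.0)
trimws(&amp;quot;.. padded value ..&amp;quot;, whitespace = &amp;quot;[. ]&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;padded value&amp;quot;&lt;/code&gt;&lt;/pre&gt;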
&lt;/div&gt;
&lt;div id=&#34;encoding-conversion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Encoding conversion&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Convert a character vector between encodings
iconv(&amp;quot;šibrinkuje&amp;quot;, &amp;quot;UTF-8&amp;quot;, &amp;quot;ASCII&amp;quot;, &amp;quot;?&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;??ibrinkuje&amp;quot;&lt;/code&gt;&lt;/pre&gt;
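&lt;p&gt;The two question marks above appear because the first character occupies two bytes in UTF-8 and each non-convertible byte is substituted separately. To see which bytes could not be converted, one option is &lt;code&gt;sub = &amp;quot;byte&amp;quot;&lt;/code&gt;. A small sketch, writing the same word with a Unicode escape so that the input is UTF-8 regardless of the session locale:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Show non-convertible bytes as hex codes instead of &amp;quot;?&amp;quot;
iconv(&amp;quot;\u0161ibrinkuje&amp;quot;, &amp;quot;UTF-8&amp;quot;, &amp;quot;ASCII&amp;quot;, sub = &amp;quot;byte&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;&amp;lt;c5&amp;gt;&amp;lt;a1&amp;gt;ibrinkuje&amp;quot;&lt;/code&gt;&lt;/pre&gt;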
&lt;/div&gt;
&lt;div id=&#34;quoting&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Quoting&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Quoting text for fancier printing:
sQuote(month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;&amp;#39;January&amp;#39;&amp;quot;   &amp;quot;&amp;#39;February&amp;#39;&amp;quot;  &amp;quot;&amp;#39;March&amp;#39;&amp;quot;     &amp;quot;&amp;#39;April&amp;#39;&amp;quot;     &amp;quot;&amp;#39;May&amp;#39;&amp;quot;      
##  [6] &amp;quot;&amp;#39;June&amp;#39;&amp;quot;      &amp;quot;&amp;#39;July&amp;#39;&amp;quot;      &amp;quot;&amp;#39;August&amp;#39;&amp;quot;    &amp;quot;&amp;#39;September&amp;#39;&amp;quot; &amp;quot;&amp;#39;October&amp;#39;&amp;quot;  
## [11] &amp;quot;&amp;#39;November&amp;#39;&amp;quot;  &amp;quot;&amp;#39;December&amp;#39;&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dQuote(month.name)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;\&amp;quot;January\&amp;quot;&amp;quot;   &amp;quot;\&amp;quot;February\&amp;quot;&amp;quot;  &amp;quot;\&amp;quot;March\&amp;quot;&amp;quot;     &amp;quot;\&amp;quot;April\&amp;quot;&amp;quot;    
##  [5] &amp;quot;\&amp;quot;May\&amp;quot;&amp;quot;       &amp;quot;\&amp;quot;June\&amp;quot;&amp;quot;      &amp;quot;\&amp;quot;July\&amp;quot;&amp;quot;      &amp;quot;\&amp;quot;August\&amp;quot;&amp;quot;   
##  [9] &amp;quot;\&amp;quot;September\&amp;quot;&amp;quot; &amp;quot;\&amp;quot;October\&amp;quot;&amp;quot;   &amp;quot;\&amp;quot;November\&amp;quot;&amp;quot;  &amp;quot;\&amp;quot;December\&amp;quot;&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not to be confused with quoting strings for passing to OS shell
system(paste(&amp;quot;echo&amp;quot;, shQuote(&amp;quot;Weird\nstuff&amp;quot;)))

# Also not to be confused with quoting expressions
str(quote(1 + 1))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  language 1 + 1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;retrieving-and-working-with-substrings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Retrieving and working with substrings&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the first three characters from all the month.names
substr(month.name, 1, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;Jan&amp;quot; &amp;quot;Feb&amp;quot; &amp;quot;Mar&amp;quot; &amp;quot;Apr&amp;quot; &amp;quot;May&amp;quot; &amp;quot;Jun&amp;quot; &amp;quot;Jul&amp;quot; &amp;quot;Aug&amp;quot; &amp;quot;Sep&amp;quot; &amp;quot;Oct&amp;quot; &amp;quot;Nov&amp;quot;
## [12] &amp;quot;Dec&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the last three characters from all the month.names
substr(month.name, nchar(month.name) - 2, nchar(month.name))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;ary&amp;quot; &amp;quot;ary&amp;quot; &amp;quot;rch&amp;quot; &amp;quot;ril&amp;quot; &amp;quot;May&amp;quot; &amp;quot;une&amp;quot; &amp;quot;uly&amp;quot; &amp;quot;ust&amp;quot; &amp;quot;ber&amp;quot; &amp;quot;ber&amp;quot; &amp;quot;ber&amp;quot;
## [12] &amp;quot;ber&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Wrapper around substr for S compatibility:
substring(month.name, 1, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;Jan&amp;quot; &amp;quot;Feb&amp;quot; &amp;quot;Mar&amp;quot; &amp;quot;Apr&amp;quot; &amp;quot;May&amp;quot; &amp;quot;Jun&amp;quot; &amp;quot;Jul&amp;quot; &amp;quot;Aug&amp;quot; &amp;quot;Sep&amp;quot; &amp;quot;Oct&amp;quot; &amp;quot;Nov&amp;quot;
## [12] &amp;quot;Dec&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Check whether elements start with a string
startsWith(month.name, &amp;quot;J&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [12] FALSE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Check whether elements end with a string
endsWith(month.name, &amp;quot;ember&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [12]  TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Trim character strings to specified display widths.
strtrim(month.name, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;Jan&amp;quot; &amp;quot;Feb&amp;quot; &amp;quot;Mar&amp;quot; &amp;quot;Apr&amp;quot; &amp;quot;May&amp;quot; &amp;quot;Jun&amp;quot; &amp;quot;Jul&amp;quot; &amp;quot;Aug&amp;quot; &amp;quot;Sep&amp;quot; &amp;quot;Oct&amp;quot; &amp;quot;Nov&amp;quot;
## [12] &amp;quot;Dec&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Abbreviate strings to at least minlength characters
abbreviate(month.name, minlength = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   January  February     March     April       May      June      July 
##     &amp;quot;Jnr&amp;quot;     &amp;quot;Fbr&amp;quot;     &amp;quot;Mrc&amp;quot;     &amp;quot;Apr&amp;quot;     &amp;quot;May&amp;quot;     &amp;quot;Jun&amp;quot;     &amp;quot;Jly&amp;quot; 
##    August September   October  November  December 
##     &amp;quot;Ags&amp;quot;     &amp;quot;Spt&amp;quot;     &amp;quot;Oct&amp;quot;     &amp;quot;Nvm&amp;quot;     &amp;quot;Dcm&amp;quot;&lt;/code&gt;&lt;/pre&gt;
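&lt;p&gt;Two further uses of these functions may be worth a quick sketch: &lt;code&gt;substring()&lt;/code&gt; recycles its start and stop positions, and &lt;code&gt;substr()&lt;/code&gt; can appear on the left-hand side of an assignment. The variable &lt;code&gt;x&lt;/code&gt; below is only for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Split a string into single characters by vectorizing over positions
substring(&amp;quot;March&amp;quot;, 1:5, 1:5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;M&amp;quot; &amp;quot;a&amp;quot; &amp;quot;r&amp;quot; &amp;quot;c&amp;quot; &amp;quot;h&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Replace part of each string in place: lower-case the first letter
x &amp;lt;- month.name[1:3]
substr(x, 1, 1) &amp;lt;- tolower(substr(x, 1, 1))
x&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;january&amp;quot;  &amp;quot;february&amp;quot; &amp;quot;march&amp;quot;&lt;/code&gt;&lt;/pre&gt;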
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;basic-pattern-matching-and-replacement&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basic pattern matching and replacement&lt;/h1&gt;
&lt;p&gt;Pattern matching and replacement using regular expressions is an extremely powerful feature; however, covering regular expressions themselves is out of scope for this overview.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Check the &lt;a href=&#34;#references&#34;&gt;references&lt;/a&gt; for better resources if you are interested. A lot more useful detail can also be found in R’s documentation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The following is just to show very basic use and list useful functions.&lt;/p&gt;
&lt;div id=&#34;replace-substring-with-other-strings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Replace substring with other strings&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myStrings &amp;lt;- paste(1:3, month.name, sep = &amp;quot;. &amp;quot;)

# Replace all ones with zeros:
# fixed will match the first argument as is
gsub(&amp;quot;1&amp;quot;, &amp;quot;0&amp;quot;, myStrings, fixed = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;0. January&amp;quot;   &amp;quot;2. February&amp;quot;  &amp;quot;3. March&amp;quot;     &amp;quot;0. April&amp;quot;    
##  [5] &amp;quot;2. May&amp;quot;       &amp;quot;3. June&amp;quot;      &amp;quot;0. July&amp;quot;      &amp;quot;2. August&amp;quot;   
##  [9] &amp;quot;3. September&amp;quot; &amp;quot;0. October&amp;quot;   &amp;quot;2. November&amp;quot;  &amp;quot;3. December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Replace only the first &amp;quot;a&amp;quot; in each element with &amp;quot;A&amp;quot;
sub(&amp;quot;a&amp;quot;, &amp;quot;A&amp;quot;, myStrings, fixed = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;1. JAnuary&amp;quot;   &amp;quot;2. FebruAry&amp;quot;  &amp;quot;3. MArch&amp;quot;     &amp;quot;1. April&amp;quot;    
##  [5] &amp;quot;2. MAy&amp;quot;       &amp;quot;3. June&amp;quot;      &amp;quot;1. July&amp;quot;      &amp;quot;2. August&amp;quot;   
##  [9] &amp;quot;3. September&amp;quot; &amp;quot;1. October&amp;quot;   &amp;quot;2. November&amp;quot;  &amp;quot;3. December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Replace any digit with 0
# note that the fixed argument is now FALSE (default)
gsub(&amp;quot;[0-9]&amp;quot;, &amp;quot;0&amp;quot;, myStrings)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;0. January&amp;quot;   &amp;quot;0. February&amp;quot;  &amp;quot;0. March&amp;quot;     &amp;quot;0. April&amp;quot;    
##  [5] &amp;quot;0. May&amp;quot;       &amp;quot;0. June&amp;quot;      &amp;quot;0. July&amp;quot;      &amp;quot;0. August&amp;quot;   
##  [9] &amp;quot;0. September&amp;quot; &amp;quot;0. October&amp;quot;   &amp;quot;0. November&amp;quot;  &amp;quot;0. December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Replace literal dots with 0
gsub(&amp;quot;.&amp;quot;, &amp;quot;0&amp;quot;, myStrings, fixed = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;10 January&amp;quot;   &amp;quot;20 February&amp;quot;  &amp;quot;30 March&amp;quot;     &amp;quot;10 April&amp;quot;    
##  [5] &amp;quot;20 May&amp;quot;       &amp;quot;30 June&amp;quot;      &amp;quot;10 July&amp;quot;      &amp;quot;20 August&amp;quot;   
##  [9] &amp;quot;30 September&amp;quot; &amp;quot;10 October&amp;quot;   &amp;quot;20 November&amp;quot;  &amp;quot;30 December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This will replace all characters (except &amp;quot;\n&amp;quot;) with zeros
gsub(&amp;quot;.&amp;quot;, &amp;quot;0&amp;quot;, myStrings)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;0000000000&amp;quot;   &amp;quot;00000000000&amp;quot;  &amp;quot;00000000&amp;quot;     &amp;quot;00000000&amp;quot;    
##  [5] &amp;quot;000000&amp;quot;       &amp;quot;0000000&amp;quot;      &amp;quot;0000000&amp;quot;      &amp;quot;000000000&amp;quot;   
##  [9] &amp;quot;000000000000&amp;quot; &amp;quot;0000000000&amp;quot;   &amp;quot;00000000000&amp;quot;  &amp;quot;00000000000&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Also replace literal dots but without &amp;quot;fixed = TRUE&amp;quot;
# by escaping &amp;quot;.&amp;quot; using &amp;quot;\\.&amp;quot; instead.
# This will treat &amp;quot;.&amp;quot; literally instead of its special meaning
gsub(&amp;quot;\\.&amp;quot;, &amp;quot;0&amp;quot;, myStrings)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;10 January&amp;quot;   &amp;quot;20 February&amp;quot;  &amp;quot;30 March&amp;quot;     &amp;quot;10 April&amp;quot;    
##  [5] &amp;quot;20 May&amp;quot;       &amp;quot;30 June&amp;quot;      &amp;quot;10 July&amp;quot;      &amp;quot;20 August&amp;quot;   
##  [9] &amp;quot;30 September&amp;quot; &amp;quot;10 October&amp;quot;   &amp;quot;20 November&amp;quot;  &amp;quot;30 December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
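&lt;p&gt;Matching is case-sensitive by default. As a small additional sketch, the &lt;code&gt;ignore.case&lt;/code&gt; argument makes the pattern match both lower and upper case letters (it cannot be combined with &lt;code&gt;fixed = TRUE&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Replace both &amp;quot;a&amp;quot; and &amp;quot;A&amp;quot; with an underscore
gsub(&amp;quot;a&amp;quot;, &amp;quot;_&amp;quot;, month.name[1:4], ignore.case = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;J_nu_ry&amp;quot;  &amp;quot;Febru_ry&amp;quot; &amp;quot;M_rch&amp;quot;    &amp;quot;_pril&amp;quot;&lt;/code&gt;&lt;/pre&gt;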
&lt;/div&gt;
&lt;div id=&#34;check-if-a-pattern-is-present-within-elements-of-a-character-vector&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Check if a pattern is present within elements of a character vector&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myStrings &amp;lt;- paste(1:3, month.name, sep = &amp;quot;. &amp;quot;)

# Is a pattern present (returns a logical vector)?
grepl(&amp;quot;ember&amp;quot;, myStrings)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [12]  TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# In which elements is a pattern present (returns indices)?
grep(&amp;quot;ember&amp;quot;, myStrings)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  9 11 12&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# In which elements is a pattern present (returns the values)?
grep(&amp;quot;ember&amp;quot;, myStrings, value = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;3. September&amp;quot; &amp;quot;2. November&amp;quot;  &amp;quot;3. December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
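&lt;p&gt;Because &lt;code&gt;grepl()&lt;/code&gt; returns a logical vector, its result can be combined with other conditions and used directly for subsetting. A minimal sketch reusing &lt;code&gt;myStrings&lt;/code&gt; from above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myStrings &amp;lt;- paste(1:3, month.name, sep = &amp;quot;. &amp;quot;)

# Keep elements that contain &amp;quot;ember&amp;quot; and also start with &amp;quot;3&amp;quot;
hasEmber &amp;lt;- grepl(&amp;quot;ember&amp;quot;, myStrings, fixed = TRUE)
myStrings[hasEmber &amp;amp; startsWith(myStrings, &amp;quot;3&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;3. September&amp;quot; &amp;quot;3. December&amp;quot;&lt;/code&gt;&lt;/pre&gt;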
&lt;/div&gt;
&lt;div id=&#34;check-where-the-matches-are-within-the-elements-of-a-character-vector&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Check where the matches are within the elements of a character vector&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myStrings &amp;lt;- paste(1:3, month.name, sep = &amp;quot;. &amp;quot;)

# Where is the first &amp;quot;a&amp;quot; located in each of the elements?
# If the pattern is not found in an element, returns -1
regexpr(&amp;quot;a&amp;quot;, myStrings)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  5  9  5 -1  5 -1 -1 -1 -1 -1 -1 -1
## attr(,&amp;quot;match.length&amp;quot;)
##  [1]  1  1  1 -1  1 -1 -1 -1 -1 -1 -1 -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Where are all the &amp;quot;a&amp;quot; located in each of the elements?
# If pattern not found in that element, returns -1
gregexpr(&amp;quot;a&amp;quot;, myStrings)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## [1] 5 8
## attr(,&amp;quot;match.length&amp;quot;)
## [1] 1 1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[2]]
## [1] 9
## attr(,&amp;quot;match.length&amp;quot;)
## [1] 1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[3]]
## [1] 5
## attr(,&amp;quot;match.length&amp;quot;)
## [1] 1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[4]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[5]]
## [1] 5
## attr(,&amp;quot;match.length&amp;quot;)
## [1] 1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[6]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[7]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[8]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[9]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[10]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[11]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE
## 
## [[12]]
## [1] -1
## attr(,&amp;quot;match.length&amp;quot;)
## [1] -1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Where are all the &amp;quot;a&amp;quot; located in the first element?
gregexpr(&amp;quot;a&amp;quot;, myStrings[1])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## [1] 5 8
## attr(,&amp;quot;match.length&amp;quot;)
## [1] 1 1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# or also 
gregexpr(&amp;quot;a&amp;quot;, myStrings)[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 5 8
## attr(,&amp;quot;match.length&amp;quot;)
## [1] 1 1
## attr(,&amp;quot;useBytes&amp;quot;)
## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We skip &lt;code&gt;regexec&lt;/code&gt; here as parenthesized sub-expressions are very much out of scope of this post.&lt;/p&gt;
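&lt;p&gt;One practical use of the positions returned by &lt;code&gt;gregexpr()&lt;/code&gt; is counting how many times a pattern occurs in each element. The following is a minimal sketch; the &lt;code&gt;matches&lt;/code&gt; name is only illustrative:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Count the lower case &amp;quot;a&amp;quot; characters in each month name.
# gregexpr() returns -1 when there is no match, so we count
# only the positive positions
matches &amp;lt;- gregexpr(&amp;quot;a&amp;quot;, month.name, fixed = TRUE)
vapply(matches, function(m) sum(m &amp;gt; 0), integer(1))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 2 1 1 0 1 0 0 0 0 0 0 0&lt;/code&gt;&lt;/pre&gt;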
&lt;/div&gt;
&lt;div id=&#34;extract-the-matching-substrings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extract the matching substrings&lt;/h2&gt;
&lt;p&gt;The above &lt;code&gt;regexpr()&lt;/code&gt; and &lt;code&gt;gregexpr()&lt;/code&gt; tell us where the patterns we are looking for are located. It is often useful to extract the actual substrings that are at those locations and &lt;code&gt;regmatches()&lt;/code&gt; does that for us:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myStrings &amp;lt;- paste(1:3, month.name, sep = &amp;quot;. &amp;quot;)

# Find substrings that start with 1 or 2 and end
# in &amp;quot;ber&amp;quot; within myStrings
regmatches(
  myStrings,
  regexpr(&amp;quot;^[1-2].*ber$&amp;quot;, myStrings)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1. October&amp;quot;  &amp;quot;2. November&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Alternatively, the same as a list of the same
# length as myStrings
regmatches(
  myStrings,
  gregexpr(&amp;quot;^[1-2].*ber$&amp;quot;, myStrings)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## character(0)
## 
## [[9]]
## character(0)
## 
## [[10]]
## [1] &amp;quot;1. October&amp;quot;
## 
## [[11]]
## [1] &amp;quot;2. November&amp;quot;
## 
## [[12]]
## character(0)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# We can also get the non-matched substrings
# using invert = TRUE
regmatches(
  myStrings,
  regexpr(&amp;quot;^[1-2].*ber$&amp;quot;, myStrings),
  invert = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## [1] &amp;quot;1. January&amp;quot;
## 
## [[2]]
## [1] &amp;quot;2. February&amp;quot;
## 
## [[3]]
## [1] &amp;quot;3. March&amp;quot;
## 
## [[4]]
## [1] &amp;quot;1. April&amp;quot;
## 
## [[5]]
## [1] &amp;quot;2. May&amp;quot;
## 
## [[6]]
## [1] &amp;quot;3. June&amp;quot;
## 
## [[7]]
## [1] &amp;quot;1. July&amp;quot;
## 
## [[8]]
## [1] &amp;quot;2. August&amp;quot;
## 
## [[9]]
## [1] &amp;quot;3. September&amp;quot;
## 
## [[10]]
## [1] &amp;quot;&amp;quot; &amp;quot;&amp;quot;
## 
## [[11]]
## [1] &amp;quot;&amp;quot; &amp;quot;&amp;quot;
## 
## [[12]]
## [1] &amp;quot;3. December&amp;quot;&lt;/code&gt;&lt;/pre&gt;
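&lt;p&gt;Combining &lt;code&gt;regmatches()&lt;/code&gt; with &lt;code&gt;gregexpr()&lt;/code&gt; is most useful when one element can contain several matches. A short sketch with a made-up input string:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Extract every run of digits from a single string
x &amp;lt;- &amp;quot;Orders 12, 7 and 305&amp;quot;
regmatches(x, gregexpr(&amp;quot;[0-9]+&amp;quot;, x))[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;12&amp;quot;  &amp;quot;7&amp;quot;   &amp;quot;305&amp;quot;&lt;/code&gt;&lt;/pre&gt;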
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;bonuses&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Bonuses&lt;/h1&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r007-01-strings.gif&#34; alt=&#34;Strings&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Strings&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The Levenshtein distance between strings
adist(c(&amp;quot;lazy&amp;quot;, &amp;quot;lasso&amp;quot;, &amp;quot;lassie&amp;quot;), c(&amp;quot;lazy&amp;quot;, &amp;quot;lazier&amp;quot;, &amp;quot;laser&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      [,1] [,2] [,3]
## [1,]    0    3    3
## [2,]    3    4    2
## [3,]    4    3    3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Repeat elements of a character vector a given number of times 
strrep(c(&amp;quot;:)&amp;quot;, &amp;quot;:P &amp;quot;, &amp;quot;;) &amp;quot;), 1:3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;:)&amp;quot;        &amp;quot;:P :P &amp;quot;    &amp;quot;;) ;) ;) &amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Convert strings to integers of a given base
strtoi(c(&amp;quot;101010&amp;quot;, &amp;quot;11111000101&amp;quot;), base =  2L)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]   42 1989&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;strtoi(c(&amp;quot;2A&amp;quot;, &amp;quot;7C5&amp;quot;), base = 16L)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]   42 1989&lt;/code&gt;&lt;/pre&gt;
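&lt;p&gt;Going the other way, from integers back to a hexadecimal string, is not covered by &lt;code&gt;strtoi()&lt;/code&gt;. One option, shown here as a small sketch, is &lt;code&gt;sprintf()&lt;/code&gt; with the &lt;code&gt;%X&lt;/code&gt; format:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Format integers as upper case hexadecimal strings
sprintf(&amp;quot;%X&amp;quot;, c(42L, 1989L))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;2A&amp;quot;  &amp;quot;7C5&amp;quot;&lt;/code&gt;&lt;/pre&gt;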
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Symbolic Number Coding
cors &amp;lt;- lapply(split(iris, iris$Species), function(x) cor(x[, 1:4]))
lapply(cors, symnum, abbr.colnames = 6)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $setosa
##              Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1                          
## Sepal.Width  ,      1                   
## Petal.Length               1            
## Petal.Width                .      1     
## attr(,&amp;quot;legend&amp;quot;)
## [1] 0 &amp;#39; &amp;#39; 0.3 &amp;#39;.&amp;#39; 0.6 &amp;#39;,&amp;#39; 0.8 &amp;#39;+&amp;#39; 0.9 &amp;#39;*&amp;#39; 0.95 &amp;#39;B&amp;#39; 1
## 
## $versicolor
##              Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1                          
## Sepal.Width  .      1                   
## Petal.Length ,      .      1            
## Petal.Width  .      ,      ,      1     
## attr(,&amp;quot;legend&amp;quot;)
## [1] 0 &amp;#39; &amp;#39; 0.3 &amp;#39;.&amp;#39; 0.6 &amp;#39;,&amp;#39; 0.8 &amp;#39;+&amp;#39; 0.9 &amp;#39;*&amp;#39; 0.95 &amp;#39;B&amp;#39; 1
## 
## $virginica
##              Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1                          
## Sepal.Width  .      1                   
## Petal.Length +      .      1            
## Petal.Width         .      .      1     
## attr(,&amp;quot;legend&amp;quot;)
## [1] 0 &amp;#39; &amp;#39; 0.3 &amp;#39;.&amp;#39; 0.6 &amp;#39;,&amp;#39; 0.8 &amp;#39;+&amp;#39; 0.9 &amp;#39;*&amp;#39; 0.95 &amp;#39;B&amp;#39; 1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;div id=&#34;using-the-tidyverses-stringr-and-glue&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the tidyverse’s stringr and glue&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://github.com/tidyverse/stringr&#34;&gt;Stringr&lt;/a&gt; is built on top of &lt;code&gt;stringi&lt;/code&gt; and focuses on the most important and commonly used string manipulation functions, whereas &lt;code&gt;stringi&lt;/code&gt; provides a comprehensive set covering almost anything you can imagine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://github.com/tidyverse/glue&#34;&gt;glue&lt;/a&gt; strings to data in R. Small, fast, dependency free interpreted string literals.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
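&lt;p&gt;As a rough orientation, a few of the base R functions shown in this post have close counterparts in &lt;code&gt;stringr&lt;/code&gt;. The sketch below assumes the package is installed and only runs if it is available:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Only run the comparison if stringr is installed
if (requireNamespace(&amp;quot;stringr&amp;quot;, quietly = TRUE)) {
  x &amp;lt;- month.name[1:3]
  print(stringr::str_length(x))                   # similar to nchar(x)
  print(stringr::str_to_upper(x))                 # similar to toupper(x)
  print(stringr::str_detect(x, &amp;quot;ary&amp;quot;))            # similar to grepl(&amp;quot;ary&amp;quot;, x)
  print(stringr::str_replace_all(x, &amp;quot;a&amp;quot;, &amp;quot;A&amp;quot;))    # similar to gsub(&amp;quot;a&amp;quot;, &amp;quot;A&amp;quot;, x)
}&lt;/code&gt;&lt;/pre&gt;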
&lt;/div&gt;
&lt;div id=&#34;using-stringi&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using stringi&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/gagolews/stringi&#34;&gt;Stringi&lt;/a&gt; is an R package for very fast, correct, consistent, and convenient string/text processing in each locale and any native character encoding.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-want-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just want the code&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;No time for reading? &lt;a href=&#34;https://jozef.io/post/data/r007-string-manipulation.r&#34;&gt;Click here to get just the code with commentary&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/grep&#34;&gt;Pattern Matching And Replacement&lt;/a&gt; R documentation&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/regex&#34;&gt;Regular Expressions As Used In R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/rdpeng/rprogdatascience/regular-expressions.html&#34;&gt;Regular Expressions&lt;/a&gt; in R Programming for Data Science&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=NvHjYOilOf8&#34;&gt;Regular Expressions (video)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.gastonsanchez.com/r4strings/&#34;&gt;Handling Strings with R&lt;/a&gt; by Gaston Sanchez&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf&#34;&gt;Cheat Sheet&lt;/a&gt; for basic regular expressions in R&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>4 ways to be more efficient using RStudio&#39;s Code Snippets, with 11 ready to use examples</title>
      <link>https://jozef.io/r906-rstudio-snippets/</link>
      <pubDate>Sat, 10 Nov 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r906-rstudio-snippets/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In this post we will look at yet another productivity increasing feature of the RStudio IDE - Code Snippets. Code Snippets let us easily insert and potentially execute predefined pieces of code and work not just for R code, but many other languages as well.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post we will cover 4 different ways to increase productivity using Code Snippets and provide 11 real-life examples of their use that you can take advantage of instantly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;how-do-code-snippets-work&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How do Code Snippets work&lt;/h1&gt;
&lt;div id=&#34;using-viewing-and-editing-snippets&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using, viewing and editing snippets&lt;/h2&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;In RStudio, we can browse and define snippets under &lt;code&gt;Tools -&amp;gt; Global Options... -&amp;gt; Code -&amp;gt; Edit Snippets&lt;/code&gt; window&lt;/li&gt;
&lt;li&gt;When typing code, the snippet will appear as an auto-complete option (similar to function names) if we type the first few letters of its name&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;Shift+Tab&lt;/code&gt; to insert the snippet immediately, or pick the snippet from the auto-complete list (by clicking it, or by scrolling to it and pressing &lt;code&gt;Tab&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that as there is no auto-completion when editing R Markdown documents, we need to use the &lt;code&gt;Shift+Tab&lt;/code&gt; method exclusively in that case.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;sharing-and-exporting-snippets&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sharing and exporting Snippets&lt;/h2&gt;
&lt;p&gt;Once we customize some snippets they are automatically saved to our &lt;code&gt;~/.R/snippets&lt;/code&gt; directory, one file per language. We can use these files to share our snippets with others, but also to edit them directly in the respective file, without the need to click through the RStudio menus. RStudio will automatically load the Snippets from the &lt;code&gt;.snippets&lt;/code&gt; files first, for languages that have the file present.&lt;/p&gt;
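&lt;p&gt;To check from R which snippet files currently exist, something like the following sketch can be used. It assumes the &lt;code&gt;~/.R/snippets&lt;/code&gt; location mentioned above; depending on the RStudio version, the files may be stored in a different directory:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Assumed location of the snippet files, adjust if your setup differs
snippet_dir &amp;lt;- path.expand(&amp;quot;~/.R/snippets&amp;quot;)

if (dir.exists(snippet_dir)) {
  # One file per language, e.g. r.snippets
  list.files(snippet_dir, pattern = &amp;quot;\\.snippets$&amp;quot;)
} else {
  message(&amp;quot;No snippets directory found at &amp;quot;, snippet_dir)
}&lt;/code&gt;&lt;/pre&gt;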
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;four-common-use-case-scenarios&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Four common use-case scenarios&lt;/h1&gt;
&lt;div id=&#34;automatically-insert-boilerplate-or-template-style-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;1. Automatically insert boilerplate or template-style code&lt;/h2&gt;
&lt;p&gt;The first and probably most frequent use of the Code Snippets feature is to quickly insert predefined pieces of code that require a lot of typing with little alteration, a.k.a. boilerplate code. A good illustration is a snippet covering a &lt;code&gt;tryCatch&lt;/code&gt; block:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snippet tryc
    ${1:variable} &amp;lt;- tryCatch({
        ${2}
    }, warning = function(w) {
        message(sprintf(&amp;quot;Warning in %s: %s&amp;quot;, deparse(w[[&amp;quot;call&amp;quot;]]), w[[&amp;quot;message&amp;quot;]]))
        ${3}
    }, error = function(e) {
        message(sprintf(&amp;quot;Error in %s: %s&amp;quot;, deparse(e[[&amp;quot;call&amp;quot;]]), e[[&amp;quot;message&amp;quot;]]))
        ${4}
    }, finally = {
        ${5}
    })&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that the snippet definition is indented using &lt;code&gt;&amp;lt;tab&amp;gt;&lt;/code&gt; instead of spaces.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After defining this Snippet and running it we will automatically get a good template for the block and we can focus on writing the important parts:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;../img/r906-01-snippet-trycatch.gif&#34; /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The numbered sections prefixed with &lt;code&gt;$&lt;/code&gt; such as &lt;code&gt;${2}&lt;/code&gt; let us define sections to which the cursor will jump after pressing &lt;code&gt;Tab&lt;/code&gt;. We can also use &lt;code&gt;${1:predefinedvalue}&lt;/code&gt; to predefine a value for the sections.&lt;/p&gt;
&lt;/blockquote&gt;
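&lt;p&gt;For illustration, this is roughly what the expanded &lt;code&gt;tryc&lt;/code&gt; template could look like once the sections are filled in. The &lt;code&gt;parse_int()&lt;/code&gt; helper and the values returned from the handlers are hypothetical and only serve as an example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper wrapping the filled-in template
parse_int &amp;lt;- function(x) {
  tryCatch({
    as.integer(x)
  }, warning = function(w) {
    message(sprintf(&amp;quot;Warning in %s: %s&amp;quot;, deparse(w[[&amp;quot;call&amp;quot;]]), w[[&amp;quot;message&amp;quot;]]))
    NA_integer_  # returned when the conversion raises a warning
  }, error = function(e) {
    message(sprintf(&amp;quot;Error in %s: %s&amp;quot;, deparse(e[[&amp;quot;call&amp;quot;]]), e[[&amp;quot;message&amp;quot;]]))
    NULL         # returned when the conversion fails with an error
  }, finally = {
    message(&amp;quot;Conversion attempted&amp;quot;)
  })
}

parse_int(&amp;quot;42&amp;quot;)         # converts cleanly
parse_int(&amp;quot;forty-two&amp;quot;)  # triggers the warning handler&lt;/code&gt;&lt;/pre&gt;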
&lt;p&gt;Another example of this type of use may be a &lt;code&gt;testthat&lt;/code&gt; block that quickly prepares a unit-testing file:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snippet tt
    context(&amp;quot;${1}&amp;quot;)

    # ${2} ----------
    test_that(
      &amp;quot;${2}&amp;quot;,
      expect_${3}(${4})
    )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;pre-fill-code-to-be-ran-quickly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;2. Pre-fill code to be run quickly&lt;/h2&gt;
&lt;p&gt;The second scenario where Code Snippets come in really handy is the console, when we want to run a block of code that we execute often. One such example is attaching the packages we use in a particular context. For example, when developing an R package, the following may be handy:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snippet dd
    &amp;quot;library(&amp;#39;devtools&amp;#39;); library(&amp;#39;testthat&amp;#39;); library(&amp;#39;pryr&amp;#39;)&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this snippet, after typing &lt;code&gt;dd&lt;/code&gt; and pressing &lt;code&gt;Shift+Tab&lt;/code&gt; in the console, the &lt;code&gt;library&lt;/code&gt; statements will appear and we can just press Enter to run them and attach the mentioned packages. We can of course make separate snippets, for example for attaching the packages we use for interactive data analysis and plotting. This is one way to keep our &lt;code&gt;.Rprofile&lt;/code&gt; clean and still have packages easily available when needed.&lt;/p&gt;
&lt;p&gt;Another example for this scenario is to quickly run a benchmark comparing two or more pieces of code and visualize the results with a boxplot to get an overview:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snippet mm
    bench &amp;lt;- microbenchmark::microbenchmark(
        times = ${1:1:100},
        ${2:one} = ${3},
        ${4:two} = ${5}
        )
    if (requireNamespace(&amp;quot;highcharter&amp;quot;)) {
      highcharter::hcboxplot(bench[[&amp;quot;time&amp;quot;]], bench[[&amp;quot;expr&amp;quot;]], outliers = FALSE)
    } else {
      boxplot(bench, outline = FALSE)
    }&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;execute-code-combined-with-rstudioapi&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;3. Execute code combined with &lt;code&gt;rstudioapi&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The one scenario where RStudio really shines is combining multiple features it offers. We can neatly combine the use of snippets, &lt;code&gt;rstudioapi&lt;/code&gt; and the Terminal feature that we &lt;a href=&#34;https://jozef.io/r905-rstudio-terminal/&#34;&gt;discussed previously&lt;/a&gt; for an amazing variety of productivity boosts.&lt;/p&gt;
&lt;p&gt;One practical example, convenient when writing a blogdown site, is to serve a preview of the blog in a separate session via the Terminal and view the site in the RStudio Viewer, all in one go. This is especially handy with RStudio Server, where serving the site in the same session can make the IDE slow:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snippet ss
    `r eval({
      nocon &amp;lt;- function(link = &amp;#39;http://127.0.0.1:9999&amp;#39;) {
        inherits(suppressWarnings(try({
            con &amp;lt;- url(link, open = &amp;#39;rb&amp;#39;)
            close(con)
        }, silent = TRUE)), &amp;#39;try-error&amp;#39;)
      }
      termId &amp;lt;- NULL
      if (nocon()) {
        termId &amp;lt;- rstudioapi::terminalExecute(
          &amp;#39;R -q -e \&amp;quot;blogdown::serve_site(port = 9999,  browser = FALSE)\&amp;quot;&amp;#39;,
          show = FALSE
        )
        while (nocon() &amp;amp;&amp;amp; !identical(rstudioapi::terminalExitCode(termId), 1L)) {
            Sys.sleep(0.25)
            cat(&amp;quot;.&amp;quot;)
        }
      }
      if (!is.null(termId) &amp;amp;&amp;amp; identical(rstudioapi::terminalExitCode(termId), 1L)) {
        cat(rstudioapi::terminalBuffer(termId), sep = &amp;quot;\n&amp;quot;)
      } else {
        rstudioapi::viewer(&amp;#39;http://127.0.0.1:9999&amp;#39;)
      }
    })`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After pressing &lt;code&gt;ss&lt;/code&gt; and &lt;code&gt;Shift+Tab&lt;/code&gt;, the site will be served in a separate R Session and previewed in the viewer.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Using &lt;code&gt;eval(expression)&lt;/code&gt; as above lets us execute R code in snippets. This gives a lot of flexibility, even more so when combined with &lt;code&gt;eval(parse(text = &amp;quot;code as character string&amp;quot;))&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;execute-code-and-paste-result-at-cursor&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;4. Execute code and paste result at cursor&lt;/h2&gt;
&lt;p&gt;The fourth option is to inject text following the cursor using &lt;code&gt;$$&lt;/code&gt;. A simple but potentially powerful use of this feature is to pass commands to be executed via base R’s &lt;code&gt;system&lt;/code&gt; and get the results directly at our cursor:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snippet $$
    `r eval(parse(text = &amp;quot;system(&amp;#39;$$&amp;#39;, intern = TRUE)&amp;quot;))`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the above, when typing &lt;code&gt;$$ls&lt;/code&gt; into the editor and pressing &lt;code&gt;Shift+Tab&lt;/code&gt;, we will see the list of files present in our working directory placed at our cursor.&lt;/p&gt;
&lt;p&gt;Another handy use of this feature is to be able to quickly get a reproducible object definition by deparsing it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snippet $$
    `r paste(&amp;quot;$$ &amp;lt;-&amp;quot;, deparse(eval(parse(text=&amp;quot;$$&amp;quot;)), width.cutoff = 500L))`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;../img/r906-02-snippet-dollars.gif&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-snippets&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the snippets&lt;/h1&gt;
&lt;p&gt;The promised 11 potentially helpful &lt;a href=&#34;https://gitlab.com/jozefhajnala/gists/tree/master/rstudio/snippets&#34;&gt;snippets can be found here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;resources&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/204463668-Code-Snippets&#34;&gt;Code Snippets&lt;/a&gt; by J.J. Allaire at the RStudio support&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/r905-rstudio-terminal/&#34;&gt;4 ways to be more productive, using RStudio’s terminal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How to perform merges (joins) on two or more data frames with base R, tidyverse and data.table</title>
      <link>https://jozef.io/r006-merge/</link>
      <pubDate>Sat, 27 Oct 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r006-merge/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In this post in the &lt;a href=&#34;https://jozef.io/categories/rcase4base/&#34;&gt;R:case4base&lt;/a&gt; series we will look at one of the most common operations on multiple data frames - merge, also known as JOIN in SQL terms.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We will learn how to do the 4 basic types of join - inner, left, right and full join with base R and show how to perform the same with tidyverse’s dplyr and data.table’s methods. A quick benchmark will also be included.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r006-01-joins-anim.gif&#34; alt=&#34;Joins&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Joins&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#merging-joining-two-data-frames-with-base-r&#34;&gt;Merging (joining) two data frames with base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-arguments-of-merge&#34;&gt;The arguments of merge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#merging-multiple-data-frames&#34;&gt;Merging multiple data frames&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#quick-benchmarking&#34;&gt;Quick benchmarking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-want-the-code&#34;&gt;TL;DR - Just want the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;merging-joining-two-data-frames-with-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Merging (joining) two data frames with base R&lt;/h1&gt;
&lt;p&gt;To showcase the merging, we will use a very slightly modified dataset provided by Hadley Wickham’s &lt;a href=&#34;https://cran.r-project.org/package=nycflights13&#34;&gt;nycflights13&lt;/a&gt; package, mainly the &lt;code&gt;flights&lt;/code&gt; and &lt;code&gt;weather&lt;/code&gt; data frames. Let’s get right into it and simply show how to perform the different types of joins with base R.&lt;/p&gt;
&lt;p&gt;First, we prepare the data and store the columns we will merge by (join on) into &lt;code&gt;mergeCols&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dataurl &amp;lt;- &amp;quot;https://jozef.io/post/data/&amp;quot;
weather &amp;lt;- readRDS(url(paste0(dataurl, &amp;quot;r006/weather.rds&amp;quot;)))
flights &amp;lt;- readRDS(url(paste0(dataurl, &amp;quot;r006/flights.rds&amp;quot;)))

mergeCols &amp;lt;- c(&amp;quot;time_hour&amp;quot;, &amp;quot;origin&amp;quot;)

head(flights)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute           time_hour
## 1     15 2013-01-01 05:00:00
## 2     29 2013-01-01 05:00:00
## 3     40 2013-01-01 05:00:00
## 4     45 2013-01-01 05:00:00
## 5      0 2013-01-01 06:00:00
## 6     58 2013-01-01 05:00:00&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(weather)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   origin year month day hour  temp  dewp humid wind_dir wind_speed
## 1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702
## 2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546
## 3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780
## 4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858
## 5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858
## 6    EWR 2013     1   1    6 37.94 28.04 67.21      240   11.50780
##   wind_gust precip pressure visib           time_hour
## 1        NA      0   1012.0    10 2013-01-01 01:00:00
## 2        NA      0   1012.3    10 2013-01-01 02:00:00
## 3        NA      0   1012.5    10 2013-01-01 03:00:00
## 4        NA      0   1012.2    10 2013-01-01 04:00:00
## 5        NA      0   1011.9    10 2013-01-01 05:00:00
## 6        NA      0   1012.4    10 2013-01-01 06:00:00&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we show how to perform the 4 merges (joins):&lt;/p&gt;
&lt;div id=&#34;inner-join&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Inner join&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inner &amp;lt;- merge(flights, weather, by = mergeCols)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;left-outer-join&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Left (outer) join&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;left  &amp;lt;- merge(flights, weather, by = mergeCols, all.x = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;right-outer-join&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Right (outer) join&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;right &amp;lt;- merge(flights, weather, by = mergeCols, all.y = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;full-outer-join&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Full (outer) join&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;full &amp;lt;- merge(flights, weather, by = mergeCols, all = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;other-join-types&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other join types&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Cross Join (Cartesian product)
cross &amp;lt;- merge(flights, weather, by = NULL)

# Natural Join
natural &amp;lt;- merge(flights, weather)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-arguments-of-merge&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The arguments of merge&lt;/h1&gt;
&lt;p&gt;The key arguments of the base &lt;code&gt;merge&lt;/code&gt; method for data frames are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x, y&lt;/code&gt; - the 2 data frames to be merged&lt;/li&gt;
&lt;li&gt;&lt;code&gt;by&lt;/code&gt; - names of the columns to merge on. If the column names are different in the two data frames to merge, we can specify &lt;code&gt;by.x&lt;/code&gt; and &lt;code&gt;by.y&lt;/code&gt; with the names of the columns in the respective data frames. The &lt;code&gt;by&lt;/code&gt; argument can also be specified by number, logical vector or left unspecified, in which case it defaults to the intersection of the names of the two data frames. From a best-practice perspective, it is advisable to always specify the argument explicitly, ideally by column names.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;all&lt;/code&gt;, &lt;code&gt;all.x&lt;/code&gt;, &lt;code&gt;all.y&lt;/code&gt; - default to &lt;code&gt;FALSE&lt;/code&gt; and can be used to specify the type of join we want to perform:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;all = FALSE&lt;/code&gt; (the default) - gives an inner join - combines the rows in the two data frames that match on the &lt;code&gt;by&lt;/code&gt; columns&lt;/li&gt;
&lt;li&gt;&lt;code&gt;all.x = TRUE&lt;/code&gt; - gives a left (outer) join - in addition to the rows returned for &lt;code&gt;all = FALSE&lt;/code&gt;, keeps the rows of &lt;code&gt;x&lt;/code&gt; that have no matching row in &lt;code&gt;y&lt;/code&gt;, filling the columns coming from &lt;code&gt;y&lt;/code&gt; with &lt;code&gt;NA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;all.y = TRUE&lt;/code&gt; - gives a right (outer) join - in addition to the rows returned for &lt;code&gt;all = FALSE&lt;/code&gt;, keeps the rows of &lt;code&gt;y&lt;/code&gt; that have no matching row in &lt;code&gt;x&lt;/code&gt;, filling the columns coming from &lt;code&gt;x&lt;/code&gt; with &lt;code&gt;NA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;all = TRUE&lt;/code&gt; - gives a full (outer) join. This is a shorthand for &lt;code&gt;all.x = TRUE&lt;/code&gt; and &lt;code&gt;all.y = TRUE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other arguments include&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sort&lt;/code&gt; - if &lt;code&gt;TRUE&lt;/code&gt; (default), results are sorted on the &lt;code&gt;by&lt;/code&gt; columns&lt;/li&gt;
&lt;li&gt;&lt;code&gt;suffixes&lt;/code&gt; - length 2 character vector, specifying the suffixes to be used for making the names of columns in the result which are not used for merging unique&lt;/li&gt;
&lt;li&gt;&lt;code&gt;incomparables&lt;/code&gt; - for single-column merging only, a vector of values that cannot be matched. Any value in &lt;code&gt;x&lt;/code&gt; matching a value in this vector is assigned the &lt;code&gt;nomatch&lt;/code&gt; value (which can be passed using &lt;code&gt;...&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
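&lt;p&gt;As a minimal sketch of how &lt;code&gt;by.x&lt;/code&gt;, &lt;code&gt;by.y&lt;/code&gt; and &lt;code&gt;suffixes&lt;/code&gt; work together, consider two small made-up data frames. The names &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;customers&lt;/code&gt; and their columns are hypothetical and not part of the flights data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical toy data, for illustration only
orders &amp;lt;- data.frame(
  customer_id = c(1L, 2L, 4L),
  amount      = c(10, 20, 40)
)
customers &amp;lt;- data.frame(
  id     = c(1L, 2L, 3L),
  amount = c(100, 200, 300)  # deliberately clashes with orders$amount
)

# The key columns are named differently, so we use by.x and by.y;
# suffixes makes the two non-key &amp;quot;amount&amp;quot; columns unique
res &amp;lt;- merge(
  orders, customers,
  by.x = &amp;quot;customer_id&amp;quot;, by.y = &amp;quot;id&amp;quot;,
  all.x = TRUE,
  suffixes = c(&amp;quot;_order&amp;quot;, &amp;quot;_customer&amp;quot;)
)
res&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result keeps the key column under its name from &lt;code&gt;x&lt;/code&gt; (&lt;code&gt;customer_id&lt;/code&gt;), the two clashing columns become &lt;code&gt;amount_order&lt;/code&gt; and &lt;code&gt;amount_customer&lt;/code&gt;, and the order for customer 4, which has no match in &lt;code&gt;customers&lt;/code&gt;, gets &lt;code&gt;NA&lt;/code&gt; in &lt;code&gt;amount_customer&lt;/code&gt;.&lt;/p&gt;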
&lt;/div&gt;
&lt;div id=&#34;merging-multiple-data-frames&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Merging multiple data frames&lt;/h1&gt;
&lt;p&gt;For this example, we use all the data frames included in the &lt;code&gt;nycflights13&lt;/code&gt; package, slightly updated such that they can be merged with the default value for &lt;code&gt;by&lt;/code&gt;, purely for this exercise, and store them in a list called &lt;code&gt;flightsList&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;flightsList &amp;lt;- readRDS(url(paste0(dataurl, &amp;quot;r006/nycflights13-list.rds&amp;quot;)))
lapply(flightsList, function(x) c(toString(dim(x)), toString(names(x))))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $flights
## [1] &amp;quot;336776, 19&amp;quot;                                                                                                                                                                     
## [2] &amp;quot;year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour&amp;quot;
## 
## $weather
## [1] &amp;quot;26115, 15&amp;quot;                                                                                                             
## [2] &amp;quot;origin, year, month, day, hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib, time_hour&amp;quot;
## 
## $airlines
## [1] &amp;quot;16, 2&amp;quot;         &amp;quot;carrier, name&amp;quot;
## 
## $airports
## [1] &amp;quot;1458, 8&amp;quot;                                           
## [2] &amp;quot;origin, airportname, lat, lon, alt, tz, dst, tzone&amp;quot;
## 
## $planes
## [1] &amp;quot;3322, 9&amp;quot;                                                                            
## [2] &amp;quot;tailnum, yearmanufactured, type, manufacturer, model, engines, seats, speed, engine&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;code&gt;merge&lt;/code&gt; is designed to work with 2 data frames, merging multiple data frames can of course be achieved by nesting the calls to merge:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;multiFull &amp;lt;- merge(merge(merge(merge(
  flightsList[[1L]],
  flightsList[[2L]], all = TRUE),
  flightsList[[3L]], all = TRUE),
  flightsList[[4L]], all = TRUE),
  flightsList[[5L]], all = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can however achieve this same goal much more elegantly, taking advantage of base R’s &lt;code&gt;Reduce&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# For Inner Join
multi_inner &amp;lt;- Reduce(
  function(x, y, ...) merge(x, y, ...), 
  flightsList
)

# For Full (Outer) Join
multi_full &amp;lt;- Reduce(
  function(x, y, ...) merge(x, y, all = TRUE, ...),
  flightsList
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that this example is oversimplified and the data was updated such that the default values for &lt;code&gt;by&lt;/code&gt; give meaningful joins. For example, in the original &lt;code&gt;planes&lt;/code&gt; data frame the column &lt;code&gt;year&lt;/code&gt; would have been matched onto the &lt;code&gt;year&lt;/code&gt; column of the &lt;code&gt;flights&lt;/code&gt; data frame, which is nonsensical as the years have different meanings in the two data frames. This is why we renamed the &lt;code&gt;year&lt;/code&gt; column in the &lt;code&gt;planes&lt;/code&gt; data frame to &lt;code&gt;yearmanufactured&lt;/code&gt; for the above example.&lt;/p&gt;
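&lt;p&gt;To see what &lt;code&gt;Reduce&lt;/code&gt; does here on a smaller scale, the following sketch merges three made-up data frames that share only an &lt;code&gt;id&lt;/code&gt; column. The list &lt;code&gt;dfs&lt;/code&gt; and the helper &lt;code&gt;merge_by_id&lt;/code&gt; are hypothetical names; unlike the calls above, &lt;code&gt;by&lt;/code&gt; is passed explicitly, so the result does not depend on which column names happen to overlap:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical toy data frames sharing only the &amp;quot;id&amp;quot; column
dfs &amp;lt;- list(
  a = data.frame(id = 1:3, x = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;)),
  b = data.frame(id = 2:4, y = c(TRUE, FALSE, TRUE)),
  c = data.frame(id = 3:5, z = c(0.3, 0.4, 0.5))
)

# Reduce applies the function to the first two elements, then to that
# result and the third element, i.e. merge(merge(dfs$a, dfs$b), dfs$c)
merge_by_id &amp;lt;- function(x, y) merge(x, y, by = &amp;quot;id&amp;quot;, all = TRUE)
toy_full  &amp;lt;- Reduce(merge_by_id, dfs)

# The same pattern without all = TRUE gives a multi-way inner join
toy_inner &amp;lt;- Reduce(function(x, y) merge(x, y, by = &amp;quot;id&amp;quot;), dfs)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here &lt;code&gt;toy_full&lt;/code&gt; contains the ids 1 to 5, while &lt;code&gt;toy_inner&lt;/code&gt; only retains id 3, the single value present in all three data frames.&lt;/p&gt;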
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;div id=&#34;using-the-tidyverse&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the tidyverse&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;dplyr&lt;/code&gt; package comes with a set of very user-friendly functions that seem quite self-explanatory:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
inner_dplyr &amp;lt;- inner_join(flights, weather, by = mergeCols)
left_dplyr  &amp;lt;- left_join(flights,  weather, by = mergeCols)
right_dplyr &amp;lt;- right_join(flights, weather, by = mergeCols)
full_dplyr  &amp;lt;- full_join(flights,  weather, by = mergeCols)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also use the “forward pipe” operator &lt;code&gt;%&amp;gt;%&lt;/code&gt; that becomes very convenient when merging multiple data frames:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inner_dplyr &amp;lt;- flights %&amp;gt;% inner_join(weather, by = mergeCols)
left_dplyr  &amp;lt;- flights %&amp;gt;% left_join(weather,  by = mergeCols)
right_dplyr &amp;lt;- flights %&amp;gt;% right_join(weather, by = mergeCols)
full_dplyr  &amp;lt;- flights %&amp;gt;% full_join(weather,  by = mergeCols)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-data.table&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using data.table&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;data.table&lt;/code&gt; package provides an S3 method for the &lt;code&gt;merge&lt;/code&gt; generic that has a very similar structure to the base method for data frames, meaning its use is very convenient for those familiar with that method. In fact the code is exactly the same as the base one for our example use.&lt;/p&gt;
&lt;blockquote&gt;
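&lt;p&gt;One important difference worth noting is that the default value of the &lt;code&gt;by&lt;/code&gt; argument is constructed differently with data.table: it is based on the key columns of the tables first and falls back to the common column names only when no keys are set.&lt;/p&gt;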
&lt;/blockquote&gt;
&lt;p&gt;We however provide it explicitly, therefore this difference does not directly affect our example:&lt;/p&gt;
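&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
# Convert the data frames to data.tables by reference so that keys can be set
setDT(weather)
setDT(flights)
setkeyv(weather, mergeCols)
setkeyv(flights, mergeCols)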

# Note that this is identical to the code for base 
# The data.table method is called automatically for objects of class data.table
inner_dt &amp;lt;- merge(flights, weather, by = mergeCols)
left_dt  &amp;lt;- merge(flights, weather, by = mergeCols, all.x = TRUE)
right_dt &amp;lt;- merge(flights, weather, by = mergeCols, all.y = TRUE)
full_dt  &amp;lt;- merge(flights, weather, by = mergeCols, all = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, we can write &lt;code&gt;data.table&lt;/code&gt; joins as subsets:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# X[Y, on = ...] keeps all rows of Y, unless nomatch = 0 drops the unmatched ones
inner_dt &amp;lt;- flights[weather, on = mergeCols, nomatch = 0]
left_dt  &amp;lt;- weather[flights, on = mergeCols]
right_dt &amp;lt;- flights[weather, on = mergeCols]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-benchmarking&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick benchmarking&lt;/h1&gt;
&lt;p&gt;For a quick overview, let’s look at a basic benchmark without package loading overhead for each of the mentioned packages:&lt;/p&gt;
&lt;div id=&#34;inner-join-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Inner join&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bench_inner &amp;lt;- microbenchmark::microbenchmark(times = 100,
  base        = base::merge.data.frame(flights, weather, by = mergeCols),
  base_nosort = base::merge.data.frame(flights, weather, by = mergeCols, sort = FALSE),
  dt_merge    = merge(flights, weather, by = mergeCols),
  dt_subset   = flights[weather, on = mergeCols, nomatch = 0], 
  dplyr       = inner_join(flights, weather, by = mergeCols),
  dplyr_pipe  = flights %&amp;gt;% inner_join(weather, by = mergeCols)
)&lt;/code&gt;&lt;/pre&gt;
&lt;script type=&#34;text/javascript&#34;&gt; $(function () {
  $(&#39;#r006-01-bench-inner-boxplot&#39;).highcharts({
  title: {     
    text: &#34;microbenchmark&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;time (milliseconds)&#34;     
    },     
    min: 0     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    boxplot: {     
      fillColor: &#34;#C9E4FF&#34;,     
      lineWidth: 1,     
      medianWidth: 2,     
      stemDashStyle: &#34;dot&#34;,     
      stemWidth: 1,     
      whiskerLength: &#34;40%&#34;,     
      whiskerWidth: 1.5     
    }     
  },     
  chart: {     
    type: &#34;column&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      g2: null,     
      data: [     
        {     
          name: &#34;base&#34;,     
          low: 2656,     
          q1: 2808.5,     
          median: 2886,     
          q3: 3056.5,     
          high: 3387     
        },     
        {     
          name: &#34;base_nosort&#34;,     
          low: 1582,     
          q1: 1731,     
          median: 1794.5,     
          q3: 1958,     
          high: 2273     
        },     
        {     
          name: &#34;dt_merge&#34;,     
          low: 66,     
          q1: 68,     
          median: 70,     
          q3: 74.5,     
          high: 80     
        },     
        {     
          name: &#34;dt_subset&#34;,     
          low: 55,     
          q1: 56.5,     
          median: 57.5,     
          q3: 60.5,     
          high: 65     
        },     
        {     
          name: &#34;dplyr&#34;,     
          low: 82,     
          q1: 85,     
          median: 88,     
          q3: 93.5,     
          high: 103     
        },     
        {     
          name: &#34;dplyr_pipe&#34;,     
          low: 83,     
          q1: 86,     
          median: 88,     
          q3: 91,     
          high: 98     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null,     
      color: &#34;blue&#34;,     
      name: &#34;microbenchmark&#34;     
    }     
  ]     
}     
  );
}); &lt;/script&gt;
&lt;div id=&#34;r006-01-bench-inner-boxplot&#34;&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;full-outer-join-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Full (outer) join&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bench_outer &amp;lt;- microbenchmark::microbenchmark(times = 100,
  base        = base::merge.data.frame(flights, weather, by = mergeCols, all = TRUE),
  base_nosort = base::merge.data.frame(flights, weather, by = mergeCols, all = TRUE, sort = FALSE),
  dt_merge    = merge(flights, weather, by = mergeCols, all = TRUE),
  dplyr       = full_join(flights, weather, by = mergeCols),
  dplyr_pipe  = flights %&amp;gt;% full_join(weather, by = mergeCols)
)&lt;/code&gt;&lt;/pre&gt;
&lt;script type=&#34;text/javascript&#34;&gt; $(function () {
  $(&#39;#r006-02-bench-full-boxplot&#39;).highcharts({
  title: {     
    text: &#34;microbenchmark&#34;     
  },     
  yAxis: {     
    title: {     
      text: &#34;time (milliseconds)&#34;     
    },     
    min: 0     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      marker: {     
        symbol: &#34;circle&#34;     
      },     
      showInLegend: false     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    boxplot: {     
      fillColor: &#34;#C9E4FF&#34;,     
      lineWidth: 1,     
      medianWidth: 2,     
      stemDashStyle: &#34;dot&#34;,     
      stemWidth: 1,     
      whiskerLength: &#34;40%&#34;,     
      whiskerWidth: 1.5     
    }     
  },     
  chart: {     
    type: &#34;column&#34;     
  },     
  xAxis: {     
    type: &#34;category&#34;,     
    categories: &#34;&#34;     
  },     
  series: [     
    {     
      g2: null,     
      data: [     
        {     
          name: &#34;base&#34;,     
          low: 2592,     
          q1: 2707,     
          median: 2786.5,     
          q3: 2896,     
          high: 3110     
        },     
        {     
          name: &#34;base_nosort&#34;,     
          low: 2106,     
          q1: 2256.5,     
          median: 2331.5,     
          q3: 2429,     
          high: 2659     
        },     
        {     
          name: &#34;dt_merge&#34;,     
          low: 121,     
          q1: 125,     
          median: 128.5,     
          q3: 202.5,     
          high: 316     
        },     
        {     
          name: &#34;dplyr&#34;,     
          low: 143,     
          q1: 150,     
          median: 153.5,     
          q3: 160,     
          high: 169     
        },     
        {     
          name: &#34;dplyr_pipe&#34;,     
          low: 144,     
          q1: 150,     
          median: 154,     
          q3: 161.5,     
          high: 174     
        }     
      ],     
      type: &#34;boxplot&#34;,     
      id: null,     
      color: &#34;blue&#34;,     
      name: &#34;microbenchmark&#34;     
    }     
  ]     
}     
  );
}); &lt;/script&gt;
&lt;div id=&#34;r006-02-bench-full-boxplot&#34;&gt;

&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Visualizing the results in this case shows that base R is far slower than the two alternatives, even with &lt;code&gt;sort = FALSE&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note: The benchmarks were run on a &lt;a href=&#34;https://www.digitalocean.com/docs/droplets/overview/&#34;&gt;standard droplet&lt;/a&gt; by DigitalOcean, with 2 GB of memory and 2 vCPUs.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-want-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just want the code&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;No time for reading? &lt;a href=&#34;https://jozef.io/post/data/r006-merge.r&#34;&gt;Click here to get just the code with commentary&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Animated &lt;a href=&#34;https://github.com/gadenbuie/tidyexplain/raw/master/images/inner-join.gif&#34;&gt;inner join&lt;/a&gt;, &lt;a href=&#34;https://github.com/gadenbuie/tidyexplain/raw/master/images/left-join.gif&#34;&gt;left join&lt;/a&gt;, &lt;a href=&#34;https://github.com/gadenbuie/tidyexplain/raw/master/images/right-join.gif&#34;&gt;right join&lt;/a&gt; and &lt;a href=&#34;https://video.twimg.com/tweet_video/DknFKJfU8AA9H_i.mp4&#34;&gt;full join&lt;/a&gt; by Garrick Aden-Buie for an easier understanding&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html&#34;&gt;Base merge help&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://dplyr.tidyverse.org/reference/join.html&#34;&gt;Join two tbls together with dplyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/data.table/versions/1.11.8/topics/merge&#34;&gt;Merge two data.tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://rpubs.com/williamsurles/293454&#34;&gt;Joining Data in R with dplyr&lt;/a&gt; by William Surles&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Join_(SQL)&#34;&gt;Join (SQL)&lt;/a&gt; Wikipedia page&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=nycflights13&#34;&gt;The nycflights13&lt;/a&gt; package on CRAN&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img alt=&#34;Coat of arms of Slovakia&#34; class=&#34;svk&#34; /&gt;
Exactly 100 years ago tomorrow, October 28th, 1918 the &lt;a href=&#34;https://en.wikipedia.org/wiki/First_Czechoslovak_Republic&#34;&gt;independence of Czechoslovakia&lt;/a&gt; was proclaimed by the Czechoslovak National Council, resulting in the creation of the &lt;a href=&#34;https://en.wikipedia.org/wiki/History_of_Czechoslovakia_(1918%E2%80%9338)&#34;&gt;first democratic state of Czechs and Slovaks&lt;/a&gt; in history.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>How to import a directory of csvs at once with base R and data.table. Can you guess which way is the fastest?</title>
      <link>https://jozef.io/r005-import-csvs/</link>
      <pubDate>Sat, 13 Oct 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r005-import-csvs/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Inspired by a recent post by Garrick on how to import a directory of csv files at once &lt;a href=&#34;https://www.gerkelab.com/blog/2018/09/import-directory-csv-purrr-readr/&#34;&gt;using purrr and readr&lt;/a&gt;, in this post we will try achieving the same using base R with no extra packages, and with data.table, another very popular package. As an added bonus, we will play a bit with benchmarking to see which of the methods is the fastest, including the tidyverse approach in the benchmark.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let us show how to import all csvs from a folder into a data frame, with nothing but base R&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To get the source data, download the zip file from &lt;a href=&#34;https://www.gerkelab.com/data/ie-general-referrals-by-hospital.zip&#34;&gt;this link&lt;/a&gt; and unzip it into a folder. We will refer to the folder path as &lt;code&gt;data_dir&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#quick-import-of-all-csvs-with-base-r&#34;&gt;Quick import of all csvs with base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#reconstructing-the-results-of-the-original-post&#34;&gt;Reconstructing the results of the original post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-want-the-code&#34;&gt;TL;DR - Just want the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#quick-benchmarking&#34;&gt;Quick benchmarking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-import-of-all-csvs-with-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick import of all csvs with base R&lt;/h1&gt;
&lt;p&gt;To import all .csv files from the &lt;code&gt;data_dir&lt;/code&gt; directory and place them into a single data frame called &lt;code&gt;result&lt;/code&gt;, all we have to do is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;filePaths &amp;lt;- list.files(data_dir, &amp;quot;\\.csv$&amp;quot;, full.names = TRUE)
result &amp;lt;- do.call(rbind, lapply(filePaths, read.csv))

# View part of the result
head(result)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Month_Year           Hospital_Name Hospital_ID
## 1     Aug-15                   AMNCH        1049
## 2     Aug-15                   AMNCH        1049
## 3     Aug-15                   AMNCH        1049
## 4     Aug-15 Bantry General Hospital         704
## 5     Aug-15 Bantry General Hospital         704
## 6     Aug-15 Bantry General Hospital         704
##           Hospital_Department     ReferralType TotalReferrals
## 1              Paediatric ENT General Referral              2
## 2 Paediatric Gastroenterology General Referral              4
## 3  Paediatric General Surgery General Referral              4
## 4            Gastroenterology General Referral             12
## 5            General Medicine General Referral             18
## 6             General Surgery General Referral             43&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;a-quick-explanation-of-the-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A quick explanation of the code:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;list.files&lt;/code&gt; - produces a character vector of the names of the files in the named directory, in our case &lt;code&gt;data_dir&lt;/code&gt;. We have also passed a &lt;code&gt;pattern&lt;/code&gt; argument &lt;code&gt;&amp;quot;\\.csv$&amp;quot;&lt;/code&gt; to make sure we only process files with .csv at the end of the name and &lt;code&gt;full.names = TRUE&lt;/code&gt; to get the file path and not just the name.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;read.csv&lt;/code&gt; - reads a file in table format and creates a data frame from its content&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lapply(X, FUN, ...)&lt;/code&gt; - gives us a list of data frames, one for each of the files found by &lt;code&gt;list.files&lt;/code&gt;. More generally, it returns a list of the same length as &lt;code&gt;X&lt;/code&gt;, each element of which is the result of applying &lt;code&gt;FUN&lt;/code&gt; to the corresponding element of &lt;code&gt;X&lt;/code&gt;. In our case &lt;code&gt;X&lt;/code&gt; is the vector of file paths in &lt;code&gt;data_dir&lt;/code&gt; (returned by &lt;code&gt;list.files&lt;/code&gt;) and &lt;code&gt;FUN&lt;/code&gt; is &lt;code&gt;read.csv&lt;/code&gt;, so we are applying &lt;code&gt;read.csv&lt;/code&gt; to each of the file paths&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rbind&lt;/code&gt; - in our case combines the rows of multiple data frames into one, similarly (even though a bit more rigidly) to &lt;code&gt;UNION&lt;/code&gt; in &lt;code&gt;SQL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;do.call&lt;/code&gt; - will combine all the data frames produced by &lt;code&gt;lapply&lt;/code&gt; into one using &lt;code&gt;rbind&lt;/code&gt;. More generally, it constructs and executes a function call from a name or a function and a list of arguments to be passed to it. In our case the function is &lt;code&gt;rbind&lt;/code&gt; and the list is the list of data frames containing the data loaded from the csvs, produced by &lt;code&gt;lapply&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
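&lt;p&gt;As a minimal sketch of how &lt;code&gt;lapply&lt;/code&gt;, &lt;code&gt;rbind&lt;/code&gt; and &lt;code&gt;do.call&lt;/code&gt; fit together, the following uses two small made-up data frames instead of files. The names &lt;code&gt;dfs&lt;/code&gt; and &lt;code&gt;combined&lt;/code&gt; and the column names are only illustrative and are not part of the original data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Two toy data frames with identical columns (illustrative data only)
dfs &amp;lt;- lapply(1:2, function(i) {
  data.frame(id = i, value = c(10, 20) * i)
})

# do.call(rbind, dfs) builds and runs the call rbind(dfs[[1]], dfs[[2]])
combined &amp;lt;- do.call(rbind, dfs)

# Compare with calling rbind directly on the two elements
identical(combined, rbind(dfs[[1]], dfs[[2]]))

# Number of rows in the combined data frame
nrow(combined)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The benefit of &lt;code&gt;do.call&lt;/code&gt; is that we do not need to know in advance how many data frames the list contains.&lt;/p&gt;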
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;reconstructing-the-results-of-the-original-post&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Reconstructing the results of the original post&lt;/h1&gt;
&lt;p&gt;To fully reconstruct the results from the original post, we need to do two extra operations&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add the source file names to the data frame&lt;/li&gt;
&lt;li&gt;Fix and reformat the dates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To do this, we will simply adjust the &lt;code&gt;FUN&lt;/code&gt; in the &lt;code&gt;lapply&lt;/code&gt; - in the above example, we have only used &lt;code&gt;read.csv&lt;/code&gt;. Below, we will make a small function to do the extra steps:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;filePaths &amp;lt;- list.files(data_dir, &amp;quot;\\.csv$&amp;quot;, full.names = TRUE)
result &amp;lt;- do.call(rbind, lapply(filePaths, function(path) {
    df &amp;lt;- read.csv(path, stringsAsFactors = FALSE)
    df[[&amp;quot;source&amp;quot;]] &amp;lt;- rep(path, nrow(df))
    df[[&amp;quot;Month_Year&amp;quot;]] &amp;lt;- as.Date(
      paste0(sub(&amp;quot;-20&amp;quot;, &amp;quot;-&amp;quot;, df[[&amp;quot;Month_Year&amp;quot;]], fixed = TRUE), &amp;quot;-01&amp;quot;),
      format = &amp;quot;%b-%y-%d&amp;quot;
    )
    df
}))

# View part of the result
head(result)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Month_Year           Hospital_Name Hospital_ID
## 1 2015-08-01                   AMNCH        1049
## 2 2015-08-01                   AMNCH        1049
## 3 2015-08-01                   AMNCH        1049
## 4 2015-08-01 Bantry General Hospital         704
## 5 2015-08-01 Bantry General Hospital         704
## 6 2015-08-01 Bantry General Hospital         704
##           Hospital_Department     ReferralType TotalReferrals
## 1              Paediatric ENT General Referral              2
## 2 Paediatric Gastroenterology General Referral              4
## 3  Paediatric General Surgery General Referral              4
## 4            Gastroenterology General Referral             12
## 5            General Medicine General Referral             18
## 6             General Surgery General Referral             43
##                                                                                          source
## 1 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 2 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 3 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 4 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 5 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 6 data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s look at the extra code in the &lt;code&gt;lapply&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Instead of just using &lt;code&gt;read.csv&lt;/code&gt;, we have defined our own little function that will do the extra work for each of the file paths, which are passed to the function as &lt;code&gt;path&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We read the data into a data frame called &lt;code&gt;df&lt;/code&gt; using &lt;code&gt;read.csv&lt;/code&gt;, and we specify &lt;code&gt;stringsAsFactors = FALSE&lt;/code&gt;, as the tidyverse packages do this by default, while the base R default was &lt;code&gt;TRUE&lt;/code&gt; before R 4.0.0&lt;/li&gt;
&lt;li&gt;We add a new column &lt;code&gt;source&lt;/code&gt; with the file name stored in &lt;code&gt;path&lt;/code&gt;, repeated as many times as &lt;code&gt;df&lt;/code&gt; has rows. This is a bit overkill here and could be done simpler, but it is quite robust and will also work with 0-row data frames&lt;/li&gt;
&lt;li&gt;We transform the &lt;code&gt;Month_Year&lt;/code&gt; into the requested date format with &lt;code&gt;as.Date&lt;/code&gt;. Note that the relatively ugly &lt;code&gt;sub()&lt;/code&gt; part is caused mostly by inconsistency in the source data itself&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;[[&lt;/code&gt; instead of &lt;code&gt;$&lt;/code&gt; is less pleasing to the eye, but we find it to be good practice, so we sacrifice a bit of readability&lt;/li&gt;
&lt;/ul&gt;
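&lt;p&gt;To make the date handling a bit more tangible, here is a small sketch that isolates the string clean-up step. The helper names &lt;code&gt;normalize_month_year&lt;/code&gt; and &lt;code&gt;normalize_month_year_strict&lt;/code&gt; and the sample values are hypothetical and only serve the illustration. Note also that &lt;code&gt;%b&lt;/code&gt; matches abbreviated month names in the current locale, so the parsing step assumes an English locale:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper mirroring the sub() and paste0() calls used above:
# drops the century from a 4-digit year and appends a day
normalize_month_year &amp;lt;- function(x) {
  paste0(sub(&amp;quot;-20&amp;quot;, &amp;quot;-&amp;quot;, x, fixed = TRUE), &amp;quot;-01&amp;quot;)
}

# Both spellings end up in the same shape: &amp;quot;Aug-15-01&amp;quot;
normalize_month_year(c(&amp;quot;Aug-15&amp;quot;, &amp;quot;Aug-2015&amp;quot;))

# Parse into Date objects (assumes English month abbreviations)
as.Date(normalize_month_year(c(&amp;quot;Aug-15&amp;quot;, &amp;quot;Aug-2015&amp;quot;)), format = &amp;quot;%b-%y-%d&amp;quot;)

# A stricter variant (an assumption, not from the original post) that only
# strips &amp;quot;20&amp;quot; when it is followed by two more digits at the end
normalize_month_year_strict &amp;lt;- function(x) {
  paste0(sub(&amp;quot;-20(\\d{2})$&amp;quot;, &amp;quot;-\\1&amp;quot;, x), &amp;quot;-01&amp;quot;)
}
normalize_month_year_strict(c(&amp;quot;Aug-2015&amp;quot;, &amp;quot;Jan-20&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The stricter variant only matters if the data ever contained a two-digit year of 20, which the fixed pattern would also strip.&lt;/p&gt;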
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;div id=&#34;using-data.table&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using data.table&lt;/h2&gt;
&lt;p&gt;Another popular package that can help us achieve the same is &lt;code&gt;data.table&lt;/code&gt;, so let’s have a look and reconstruct the results with data.table’s features:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
filePaths &amp;lt;- list.files(data_dir, &amp;quot;\\.csv$&amp;quot;, full.names = TRUE)
result &amp;lt;- lapply(filePaths, fread)
names(result) &amp;lt;- filePaths
result &amp;lt;- rbindlist(result, use.names = TRUE, idcol = &amp;quot;source&amp;quot;)
result[, Month_Year := as.Date(
  paste0(sub(&amp;quot;-20&amp;quot;, &amp;quot;-&amp;quot;, Month_Year, fixed = TRUE), &amp;quot;-01&amp;quot;),
  format = &amp;quot;%b-%y-%d&amp;quot;
)]


# View part of the result
head(result)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                                                                                           source
## 1: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 2: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 3: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 4: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 5: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
## 6: data/r005/ie-general-referrals-by-hospital//general-referrals-by-hospital-department-2015.csv
##    Month_Year           Hospital_Name Hospital_ID
## 1: 2015-08-01                   AMNCH        1049
## 2: 2015-08-01                   AMNCH        1049
## 3: 2015-08-01                   AMNCH        1049
## 4: 2015-08-01 Bantry General Hospital         704
## 5: 2015-08-01 Bantry General Hospital         704
## 6: 2015-08-01 Bantry General Hospital         704
##            Hospital_Department     ReferralType TotalReferrals
## 1:              Paediatric ENT General Referral              2
## 2: Paediatric Gastroenterology General Referral              4
## 3:  Paediatric General Surgery General Referral              4
## 4:            Gastroenterology General Referral             12
## 5:            General Medicine General Referral             18
## 6:             General Surgery General Referral             43&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;rbindlist&lt;/code&gt; does the same as &lt;code&gt;do.call(&amp;quot;rbind&amp;quot;, l)&lt;/code&gt; on data frames, but much faster&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fread&lt;/code&gt; is similar to &lt;code&gt;read.table&lt;/code&gt; (and &lt;code&gt;read.csv&lt;/code&gt;, which uses &lt;code&gt;read.table&lt;/code&gt;) but faster and more convenient&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&#39;:=&#39;()&lt;/code&gt; is the data.table syntax to add or update columns of a data.table by reference, here we use it to overwrite &lt;code&gt;Month_Year&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
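&lt;p&gt;The role of &lt;code&gt;idcol&lt;/code&gt; and &lt;code&gt;:=&lt;/code&gt; is perhaps easier to see on a tiny example. The sketch below assumes the data.table package is installed, and the list names &lt;code&gt;a.csv&lt;/code&gt; and &lt;code&gt;b.csv&lt;/code&gt; and the columns are made up to stand in for real file paths and data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)

# A named list of toy data.tables; the names stand in for file paths
tables &amp;lt;- list(
  &amp;quot;a.csv&amp;quot; = data.table(x = 1:2),
  &amp;quot;b.csv&amp;quot; = data.table(x = 3:4)
)

# idcol stores the list names in a new first column called &amp;quot;source&amp;quot;
stacked &amp;lt;- rbindlist(tables, use.names = TRUE, idcol = &amp;quot;source&amp;quot;)

# := modifies the data.table in place, here adding a new column
stacked[, x_doubled := x * 2L]

print(stacked)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is why naming the list elements with &lt;code&gt;names(result) &amp;lt;- filePaths&lt;/code&gt; before calling &lt;code&gt;rbindlist&lt;/code&gt; is enough to get the &lt;code&gt;source&lt;/code&gt; column.&lt;/p&gt;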
&lt;/div&gt;
&lt;div id=&#34;using-the-tidyverse&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the tidyverse&lt;/h2&gt;
&lt;p&gt;This is covered in much detail in the &lt;a href=&#34;https://www.gerkelab.com/blog/2018/09/import-directory-csv-purrr-readr/&#34;&gt;post that inspired this one&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-want-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just want the code&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;No time for reading? &lt;a href=&#34;https://jozef.io/post/data/r005-import-csvs.r&#34;&gt;Click here to get just the code with commentary&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-benchmarking&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick benchmarking&lt;/h1&gt;
&lt;p&gt;First off, we are mostly looking at this for the fun of reacting &lt;a href=&#34;https://twitter.com/_ColinFay/status/1046832479288676360&#34;&gt;to a Twitter discussion&lt;/a&gt;, so take it for what it’s worth. This is by no means what we would call proper benchmarking.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now that we have seen 3 ways to achieve the same goal, let’s look at speed. Note that we will be friendly to the &lt;code&gt;tidyverse&lt;/code&gt; and not attach the entire package as is done in the &lt;a href=&#34;https://www.gerkelab.com/blog/2018/09/import-directory-csv-purrr-readr/&#34;&gt;original post&lt;/a&gt;, but only those packages that we really need, to get a more appropriate benchmark.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;full-script-run-benchmark&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Full script run benchmark&lt;/h2&gt;
&lt;p&gt;First, we will perform an execution of an R script containing just the above code chunks (and &lt;a href=&#34;https://jozef.io/post/data/r005/benchmarking/tidyverse.R&#34;&gt;the tidyverse one&lt;/a&gt;) a thousand times. The timing will also include overhead for launching the process, but this effect is present for all three scenarios and the variance should be safely covered by the fact that we execute 1000 times:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;time for i in {1..1000}; 
do Rscript --vanilla data/r005/benchmarking/base.R &amp;amp;&amp;gt;/dev/null;
done&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;
time for i in {1..1000};
do Rscript --vanilla data/r005/benchmarking/datatable.R &amp;amp;&amp;gt;/dev/null;
done&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;
time for i in {1..1000};
do Rscript --vanilla data/r005/benchmarking/tidyverse.R &amp;amp;&amp;gt;/dev/null;
done&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Visualizing the results shows that base R is the clear winner here, largely due to package loading overhead. Any performance benefits of the other packages are not enough to catch up in this very small use case:&lt;/p&gt;
&lt;script type=&#34;text/javascript&#34;&gt;
$(function () {
  $(&#39;#r005-01-bench-bar&#39;).highcharts({
  title: {     
    text: null     
  },     
  yAxis: {     
    title: {     
      text: &#34;time_milisecs&#34;     
    },     
    type: &#34;linear&#34;     
  },     
  credits: {     
    enabled: false     
  },     
  exporting: {     
    enabled: false     
  },     
  plotOptions: {     
    series: {     
      label: {     
        enabled: false     
      },     
      turboThreshold: 0,     
      showInLegend: true     
    },     
    treemap: {     
      layoutAlgorithm: &#34;squarified&#34;     
    },     
    scatter: {     
      marker: {     
        symbol: &#34;circle&#34;     
      }     
    }     
  },     
  series: [     
    {     
      name: &#34;base&#34;,     
      data: [     
        {     
          package: &#34;base&#34;,     
          method: &#34;real&#34;,     
          time_milisecs: 229.485,     
          y: 229.485,     
          name: &#34;real&#34;     
        },     
        {     
          package: &#34;base&#34;,     
          method: &#34;user&#34;,     
          time_milisecs: 195.48,     
          y: 195.48,     
          name: &#34;user&#34;     
        },     
        {     
          package: &#34;base&#34;,     
          method: &#34;sys&#34;,     
          time_milisecs: 23.404,     
          y: 23.404,     
          name: &#34;sys&#34;     
        }     
      ],     
      type: &#34;column&#34;,     
      color: &#34;#C9E4FF&#34;     
    },     
    {     
      name: &#34;data.table&#34;,     
      data: [     
        {     
          package: &#34;data.table&#34;,     
          method: &#34;real&#34;,     
          time_milisecs: 399.008,     
          y: 399.008,     
          name: &#34;real&#34;     
        },     
        {     
          package: &#34;data.table&#34;,     
          method: &#34;user&#34;,     
          time_milisecs: 349.928,     
          y: 349.928,     
          name: &#34;user&#34;     
        },     
        {     
          package: &#34;data.table&#34;,     
          method: &#34;sys&#34;,     
          time_milisecs: 37.52,     
          y: 37.52,     
          name: &#34;sys&#34;     
        }     
      ],     
      type: &#34;column&#34;,     
      color: &#34;#C9D8FF&#34;     
    },     
    {     
      name: &#34;tidyverse&#34;,     
      data: [     
        {     
          package: &#34;tidyverse&#34;,     
          method: &#34;real&#34;,     
          time_milisecs: 1230.951,     
          y: 1230.951,     
          name: &#34;real&#34;     
        },     
        {     
          package: &#34;tidyverse&#34;,     
          method: &#34;user&#34;,     
          time_milisecs: 1147.252,     
          y: 1147.252,     
          name: &#34;user&#34;     
        },     
        {     
          package: &#34;tidyverse&#34;,     
          method: &#34;sys&#34;,     
          time_milisecs: 67.964,     
          y: 67.964,     
          name: &#34;sys&#34;     
        }     
      ],     
      type: &#34;column&#34;,     
      color: &#34;#8D98FF&#34;     
    }     
  ],     
  xAxis: {     
    type: &#34;category&#34;,     
    title: {     
      text: &#34;method&#34;     
    },     
    categories: null     
  }     
}     
  );
});
&lt;/script&gt;
&lt;div id=&#34;r005-01-bench-bar&#34;&gt;

&lt;/div&gt;
&lt;p&gt;If interested, you can look at the scripts run above:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/post/data/r005/benchmarking/base.R&#34;&gt;base.R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/post/data/r005/benchmarking/datatable.R&#34;&gt;datatable.R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://jozef.io/post/data/r005/benchmarking/tidyverse.R&#34;&gt;tidyverse.R&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
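&lt;p&gt;For readers who prefer to stay within R, the same idea of timing full script runs including process startup can be sketched with &lt;code&gt;system.time&lt;/code&gt; and &lt;code&gt;system2&lt;/code&gt;. This is not how the numbers above were produced. The stand-in script, the variable names and the small number of runs are assumptions made to keep the example quick and self-contained:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Write a trivial stand-in script to a temporary file (illustrative only)
script &amp;lt;- tempfile(fileext = &amp;quot;.R&amp;quot;)
writeLines(&amp;quot;invisible(sum(1:10))&amp;quot;, script)

# Use the Rscript binary that belongs to the running R installation
rscript &amp;lt;- file.path(R.home(&amp;quot;bin&amp;quot;), &amp;quot;Rscript&amp;quot;)

# Time a few full runs, including process startup, discarding all output
n_runs &amp;lt;- 5
timing &amp;lt;- system.time(
  for (i in seq_len(n_runs)) {
    system2(rscript, c(&amp;quot;--vanilla&amp;quot;, shQuote(script)), stdout = FALSE, stderr = FALSE)
  }
)

# Average elapsed seconds per run
timing[[&amp;quot;elapsed&amp;quot;]] / n_runs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replacing the stand-in script with one of the scripts listed above and increasing &lt;code&gt;n_runs&lt;/code&gt; would approximate the shell loops.&lt;/p&gt;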
&lt;/div&gt;
&lt;div id=&#34;benchmarking-without-package-loading-overhead&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Benchmarking without package loading overhead&lt;/h2&gt;
&lt;p&gt;We could argue that it is not fair to include the &lt;code&gt;library&lt;/code&gt; statements in the benchmark, as the overhead can be relatively big considering how small the actual action done by the code is, as we are only processing 4 small files. Here is a benchmark omitting the overhead and only executing the relevant code with the packages pre-loaded, using microbenchmark with 100 iterations:&lt;/p&gt;
&lt;script type=&#34;text/javascript&#34;&gt; $(function () {   $(&#39;#r005-02-bench-boxplot&#39;).highcharts({   title: {          text: &#34;microbenchmark&#34;        },        yAxis: {          title: {            text: &#34;time (milliseconds)&#34;          },          min: 0        },        credits: {          enabled: false        },        exporting: {          enabled: false        },        plotOptions: {          series: {            label: {              enabled: false            },            turboThreshold: 0,            marker: {              symbol: &#34;circle&#34;            },            showInLegend: false          },          treemap: {            layoutAlgorithm: &#34;squarified&#34;          },          boxplot: {            fillColor: &#34;#C9E4FF&#34;,            lineWidth: 1,            medianWidth: 2,            stemDashStyle: &#34;dot&#34;,            stemWidth: 1,            whiskerLength: &#34;40%&#34;,            whiskerWidth: 1.5          }        },        chart: {          type: &#34;column&#34;        },        xAxis: {          type: &#34;category&#34;,          categories: &#34;&#34;        },        series: [          {            g2: null,            data: [              {                name: &#34;base&#34;,                low: 106,                q1: 108,                median: 110,                q3: 112,                high: 116              },              {                name: &#34;data.table&#34;,                low: 35,                q1: 35,                median: 36,                q3: 36,                high: 37              },              {                name: &#34;tidyverse&#34;,                low: 58,                q1: 59,                median: 60,                q3: 64.5,                high: 70              }            ],            type: &#34;boxplot&#34;,            id: null,            color: &#34;blue&#34;,            name: &#34;microbenchmark&#34;          }        ]      }        ); }); 
&lt;/script&gt;
&lt;div id=&#34;r005-02-bench-boxplot&#34;&gt;

&lt;/div&gt;
&lt;p&gt;Visualizing the results in this case shows that data.table is the winner, with base R being the slowest of the options.&lt;/p&gt;
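&lt;p&gt;A benchmark of this kind could be set up roughly as follows. This is only a sketch under stated assumptions: it uses generated csv files in a temporary folder rather than the referral data, compares just the plain base R and data.table imports, and the names &lt;code&gt;bench_dir&lt;/code&gt;, &lt;code&gt;paths&lt;/code&gt; and &lt;code&gt;bench&lt;/code&gt; are illustrative. The data.table and microbenchmark packages need to be installed:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(data.table)
library(microbenchmark)

# Create a few small csv files in a temporary folder (made-up data)
bench_dir &amp;lt;- tempfile(&amp;quot;csvs&amp;quot;)
dir.create(bench_dir)
for (i in 1:4) {
  write.csv(
    data.frame(id = 1:100, value = rnorm(100)),
    file.path(bench_dir, paste0(&amp;quot;file&amp;quot;, i, &amp;quot;.csv&amp;quot;)),
    row.names = FALSE
  )
}
paths &amp;lt;- list.files(bench_dir, &amp;quot;\\.csv$&amp;quot;, full.names = TRUE)

# Packages are attached beforehand, so only the import code is timed
bench &amp;lt;- microbenchmark(
  base = do.call(rbind, lapply(paths, read.csv)),
  data.table = rbindlist(lapply(paths, fread)),
  times = 100
)
print(bench)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The absolute timings will depend on the machine and the size of the files, so they should not be expected to match the chart above.&lt;/p&gt;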
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.gerkelab.com/blog/2018/09/import-directory-csv-purrr-readr/&#34;&gt;the inspiration for this post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html&#34;&gt;introduction to data.table&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>4 ways to be more productive, using RStudio&#39;s terminal</title>
      <link>https://jozef.io/r905-rstudio-terminal/</link>
      <pubDate>Sat, 29 Sep 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r905-rstudio-terminal/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;RStudio version 1.1 introduced the &lt;code&gt;Terminal&lt;/code&gt; functionality, which does not seem to be getting enough deserved attention and love even though it is very well integrated with the rest of the IDE and can be extremely useful for several daily use-cases.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In this post we will try to cover 4 very common scenarios where the Terminal can be very useful and productive, and how to get the most out of it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r905-01-terminal.gif&#34; alt=&#34;RStudio Terminal Fun&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;RStudio Terminal Fun&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In short, the RStudio Terminal provides access to the system shell directly from the RStudio IDE, supporting xterm emulation, full-screen terminal applications, command line operations and more. It also has useful customizable keyboard shortcut bindings to make frequent usage more efficient and enables usage of multiple such Terminals simultaneously.&lt;/p&gt;
&lt;p&gt;The experience may vary based on each user’s setup. The experience described here comes mostly from using RStudio Server on a Linux-based system.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;four-common-use-cases&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Four common use-cases&lt;/h1&gt;
&lt;div id=&#34;execute-resource-heavy-r-code-in-the-terminal-quickly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;1. Execute resource-heavy R code in the Terminal quickly&lt;/h2&gt;
&lt;p&gt;A very common use case where the Terminal makes my life a lot easier is when I need to execute longer running or resource-heavy tasks in R. Using the RStudio IDE’s session for such tasks can be challenging because running them can slow the entire IDE down, sometimes so much that it is barely usable. We can easily prevent this by running such tasks in a separate R process within the Terminal. We could of course do this using &lt;code&gt;putty&lt;/code&gt; or other software, however doing it within RStudio brings:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;seamless keyboard shortcut integration between the editor window and the Terminal&lt;/li&gt;
&lt;li&gt;ability to use multiple Terminals easily&lt;/li&gt;
&lt;li&gt;no need to use other software&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To run commands in the Terminal, we simply follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Shift + Alt + R&lt;/code&gt; to open a new terminal&lt;/li&gt;
&lt;li&gt;launch &lt;code&gt;R&lt;/code&gt; in the Terminal&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl + 1&lt;/code&gt; to focus back to the editor window&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl + Alt + Enter&lt;/code&gt; to send commands to be executed directly to the Terminal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can also do this with multiple Terminals if we need to run multiple such “jobs”, and easily switch between Terminal windows using keyboard shortcuts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Ctrl + Alt + F11&lt;/code&gt; - Previous terminal&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ctrl + Alt + F12&lt;/code&gt; - Next terminal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that the shortcuts mentioned above are the defaults and are more than likely not Mac-relevant. If you are a Mac user, you can easily look up the equivalents, and you can change any of them to your liking.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;advanced-version-control-directly-within-rstudio&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;2. Advanced version control directly within RStudio&lt;/h2&gt;
&lt;p&gt;RStudio has a neat version control integration which is a very nice addition to the IDE, however there are some advanced version control operations that are not possible to handle there directly, &lt;code&gt;git rebase&lt;/code&gt; and &lt;code&gt;git push --force&lt;/code&gt; being just a couple of examples. Thanks to the Terminal, you can very easily do all those operations without ever leaving your RStudio IDE.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;serving-your-shiny-appblogdown-site-without-blocking-or-slowdown&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;3. Serving your Shiny app/Blogdown site without blocking or slowdown&lt;/h2&gt;
&lt;p&gt;My favourite use of the Terminal when writing this blog is to serve the site via the Terminal and see the changes I make live, without the IDE being slowed down and laggy, which often happens when serving the site directly from RStudio’s R session. A very similar point also applies when running a Shiny app from within RStudio. This simple use of the Terminal makes things more convenient for me.&lt;/p&gt;
&lt;p&gt;Running a Shiny app. We can use a pre-selected port to make viewing later easier:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Send to the terminal with Ctrl + Alt + Enter:
R -e &amp;#39;library(shiny); runApp(&amp;quot;appdir&amp;quot;, port = 9999, launch.browser = FALSE)&amp;#39;
# Then show in the viewer with Ctrl + Enter
rstudioapi::viewer(&amp;quot;http://127.0.0.1:9999&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, serving a Blogdown site:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Send to the terminal with Ctrl + Alt + Enter:
R -e &amp;#39;library(blogdown); blogdown::serve_site(port = 9999, browser = FALSE)&amp;#39;
# Then show in the viewer with Ctrl + Enter
rstudioapi::viewer(&amp;quot;http://127.0.0.1:9999&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, we can also use &lt;code&gt;rstudioapi&lt;/code&gt; to send commands to the Terminal:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;termId &amp;lt;- rstudioapi::terminalExecute(&amp;quot;R -e &amp;#39;getwd(); library(shiny); runApp(\&amp;quot;appdir\&amp;quot;, port = 9999, launch.browser = FALSE)&amp;#39;&amp;quot;)
# Then show in the viewer with Ctrl + Enter
rstudioapi::viewer(&amp;quot;http://127.0.0.1:9999&amp;quot;)
# When done, we can kill that terminal
rstudioapi::terminalKill(termId)&lt;/code&gt;&lt;/pre&gt;
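&lt;p&gt;Since the &lt;code&gt;rstudioapi&lt;/code&gt; terminal functions only work inside RStudio, it can be handy to guard such calls. The wrapper below is a hypothetical sketch, not part of &lt;code&gt;rstudioapi&lt;/code&gt; itself. The name &lt;code&gt;run_in_terminal&lt;/code&gt; and the example command are assumptions, and it simply returns &lt;code&gt;NULL&lt;/code&gt; when RStudio is not available:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical wrapper: run a shell command in a new RStudio Terminal,
# or return NULL when not running inside RStudio
run_in_terminal &amp;lt;- function(command) {
  if (!requireNamespace(&amp;quot;rstudioapi&amp;quot;, quietly = TRUE)) return(invisible(NULL))
  if (!rstudioapi::isAvailable()) return(invisible(NULL))
  rstudioapi::terminalExecute(command, show = FALSE)
}

term_id &amp;lt;- run_in_terminal(&amp;quot;Rscript --version&amp;quot;)

if (!is.null(term_id)) {
  # terminalExitCode() returns NULL while the command is still running
  while (is.null(rstudioapi::terminalExitCode(term_id))) Sys.sleep(0.1)
  # Clean up the terminal once the command has finished
  rstudioapi::terminalKill(term_id)
}&lt;/code&gt;&lt;/pre&gt;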
&lt;/div&gt;
&lt;div id=&#34;test-your-bash-python-and-much-more-conveniently&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;4. Test your bash, python and much more conveniently&lt;/h2&gt;
&lt;p&gt;Since the Terminal is really just system shell access, you can get very creative with its use. To me, the key here is the keyboard shortcut integration between the editor and the terminal.&lt;/p&gt;
&lt;p&gt;Here is a very basic example of using the Terminal to run Python code. Note that this (somewhat obviously) works around R and the need for an R to Python interface package:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Ctrl + Alt + Enter to send to the Terminal
# Launch python
python
# Run some python code
1 + 1
# When done with python, exit
exit()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Testing a random bash script:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Ctrl + Alt + Enter to send to the Terminal
echo &amp;quot;Run tmux, split window and run top&amp;quot;
tmux new -s &amp;quot;Fun&amp;quot;
tmux switch -t &amp;quot;Fun&amp;quot;
tmux split-window -h
tmux select-pane -t 0
top&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-notes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick notes&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;By default, the processes in the Terminal run as child processes of the main &lt;code&gt;rsession&lt;/code&gt; process, therefore restarting the R session will kill them. We can work around this using tools like &lt;code&gt;screen&lt;/code&gt; or &lt;code&gt;tmux&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;You can specify which shell new Terminal sessions open with under &lt;code&gt;Tools -&amp;gt; Global Options... -&amp;gt; Terminal&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The Terminal can be interfaced with using the &lt;code&gt;rstudioapi&lt;/code&gt; package functionality. Read the &lt;a href=&#34;https://cran.rstudio.com/web/packages/rstudioapi/vignettes/terminal.html&#34;&gt;Interacting with Terminals&lt;/a&gt; vignette to learn more.&lt;/li&gt;
&lt;li&gt;If the default keyboard shortcuts are not the most convenient for you, they can be updated and more added under the &lt;code&gt;Tools -&amp;gt; Modify Keyboard Shortcuts...&lt;/code&gt; menu in RStudio&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;resources&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/115010737148-Using-the-RStudio-Terminal&#34;&gt;Using the RStudio Terminal&lt;/a&gt;, a great guide by Gary Ritchie&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.rstudio.com/web/packages/rstudioapi/vignettes/terminal.html&#34;&gt;Interacting with Terminals&lt;/a&gt; vignette of the rstudioapi package&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/206382178?version=1.2.792&amp;amp;mode=server&#34;&gt;Customizing Keyboard Shortcuts&lt;/a&gt; in RStudio by Kevin Ushey&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.gnu.org/software/screen/&#34;&gt;Introduction to GNU Screen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://leanpub.com/the-tao-of-tmux/read&#34;&gt;The Tao of tmux&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img alt=&#34;Coat of arms of Slovakia&#34; class=&#34;svk&#34; /&gt;
The last Saturday of September 20 years ago &lt;a href=&#34;https://en.wikipedia.org/wiki/Slovak_parliamentary_election,_1998&#34;&gt;a key parliamentary election&lt;/a&gt; was held in Slovakia, resulting in the end of the reign of Vladimír Mečiar’s government and Slovakia being able to &lt;a href=&#34;https://www.theguardian.com/world/2004/apr/28/eu.politics2&#34;&gt;conduct crucial reforms and become a member of the EU and NATO&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>3 reasons to not write that new code, and how I failed at it</title>
      <link>https://jozef.io/r904-dont-write-that-code/</link>
      <pubDate>Sat, 15 Sep 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r904-dont-write-that-code/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;We all know that feeling. We have this great idea about a new project, feature, function, piece of code.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What&lt;/em&gt; do we want? &lt;strong&gt;Write that amazing new code&lt;/strong&gt;! &lt;br /&gt; &lt;em&gt;When&lt;/em&gt; do we want it? &lt;strong&gt;Right NOW&lt;/strong&gt;!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The aim of this post is to try and give you 3 good reasons to resist that urge and consider other options, be it in your business projects or your private projects, along with an example of how I failed and how I tried to remedy that failure, on a very small scale.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://i.imgur.com/7U5Qpii.jpg&#34; alt=&#34;He knows&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;He knows&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-3-reasons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The 3 reasons&lt;/h1&gt;
&lt;div id=&#34;new-code-takes-time-and-money&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;1. New code takes time (and money)&lt;/h2&gt;
&lt;p&gt;Writing new code is an investment. Time and money will be spent on design, implementation and code review. These initial investments are however only a minor part of the total cost of writing new code. The code must be well documented and maintained. The code must be integrated with other parts of the system. Last but not least, the code must be tested, and writing tests usually involves writing, well, more new code.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;new-code-means-new-bugs&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;2. New code means new bugs&lt;/h2&gt;
&lt;p&gt;Even with our best efforts and testing, bugs will be found and will need fixing. Numbers on this seem to vary a lot, but &lt;a href=&#34;https://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670&#34;&gt;Code Complete&lt;/a&gt; by Steve McConnell estimates an industry average of 15-50 bugs per 1 000 lines of code.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;we-write-what-we-know&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;3. We write what we know&lt;/h2&gt;
&lt;p&gt;Perhaps the most compelling reason to reconsider and resist the code-writing is not in the numbers and statistics, but in the simple realization that we usually write new code using our current knowledge.&lt;/p&gt;
&lt;p&gt;Pausing for a while and spending time investigating the current best practices and methods of solving the issue we are aiming to solve with our new code may not only save us and our business owners valuable resources, but also increase our knowledge base thanks to that investigation.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-it-to-practice-in-the-r-world&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Putting it to practice in the R world&lt;/h1&gt;
&lt;p&gt;So we have this brilliant new idea. Instead of starting to write that shiny new code, we can also start with:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://www.google.com&#34;&gt;Google&lt;/a&gt; - It is more than likely that someone has already stumbled upon this very same, or a very similar problem. How have they implemented it? What functionality have they used? What are the best practices and approaches to tackling similar issues?&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/questions/tagged/r&#34;&gt;Stackoverflow&lt;/a&gt; and &lt;a href=&#34;https://rseek.org&#34;&gt;Rseek&lt;/a&gt; for R solutions - Can we find solutions to our problem there? Are those solutions good? Can we build upon them?&lt;/li&gt;
&lt;li&gt;Evaluate the options - If we have found any, which of them are the most suitable for us? If stability and maintainability are a major concern, can we find a solution with as few dependencies as possible? If performance is a major concern, are benchmarks available (can we make them)?&lt;/li&gt;
&lt;li&gt;Propose a solution - After this research, do we still need to write the new functionality? If so, how much can we build on existing solutions? Are they easy to integrate?&lt;/li&gt;
&lt;li&gt;Do we care about dependencies? - The R world is special and one of the reasons for this is &lt;a href=&#34;https://cran.r-project.org&#34;&gt;CRAN&lt;/a&gt;. The number of packages available on CRAN has passed 13 000 and it is very convenient to just reach out and grab one more. This approach however has its caveats.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;a-simplest-example---learning-from-my-own-mistakes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A simplest example - learning from my own mistakes&lt;/h1&gt;
&lt;div id=&#34;how-i-did-it-wrong&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How I did it wrong&lt;/h2&gt;
&lt;p&gt;One of the first RStudio addins I have written for my own use was to &lt;a href=&#34;https://jozef.io/r101-addin-reproducibility/&#34;&gt;run a script open in RStudio with &lt;code&gt;R --vanilla&lt;/code&gt; via a keyboard shortcut&lt;/a&gt; and open a file with the script’s output in RStudio. If I had to guess, my &lt;em&gt;thought process&lt;/em&gt; was likely similar to the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I &lt;em&gt;will write a new function&lt;/em&gt; to serve as the addin binding&lt;/li&gt;
&lt;li&gt;I &lt;em&gt;will write a new function&lt;/em&gt; to serve as command executor for both Unix-like systems via &lt;code&gt;system&lt;/code&gt; and Windows via &lt;code&gt;shell&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;I &lt;em&gt;will write a new function&lt;/em&gt; to create the command to be executed by the above&lt;/li&gt;
&lt;li&gt;Maybe some utilities, like the ones converting &lt;code&gt;~&lt;/code&gt; to a full path, figure out integrating the 4 together, passing arguments, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, there I was, some time and &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/71e08c0eca50f6b6ea55bd25ba658113784306a2/R/makeCmd.R&#34;&gt;92 lines of code and doc later&lt;/a&gt;, with a new useful RStudio addin. Oh and yes, there were also &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/6f7564028eee9ea9d6d4d220ee1513b1de18b3f3/tests/testthat/test.makeCmd.R&#34;&gt;102 lines of test code&lt;/a&gt;, fixed a couple of times, too.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-could-i-do-it-better&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How could I do it better&lt;/h2&gt;
&lt;p&gt;After a second look a few months later when actually reviewing this supposedly &lt;em&gt;good&lt;/em&gt; functionality, I realized that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There is a base function called &lt;code&gt;system2&lt;/code&gt;, which seems like a much more user-friendly and easy to use version of &lt;code&gt;system&lt;/code&gt; (and &lt;code&gt;shell&lt;/code&gt;). With it, there is no real need to write system-specific code and, even though it is less configurable than &lt;code&gt;system&lt;/code&gt;, it is still perfectly sufficient for my purpose&lt;/li&gt;
&lt;li&gt;I do not actually need to make the command, as extra options can be passed to &lt;code&gt;system2&lt;/code&gt; as arguments, including redirecting output&lt;/li&gt;
&lt;li&gt;Oh, and I definitely do not need a function to convert &lt;code&gt;~&lt;/code&gt; to full path, there is &lt;code&gt;path.expand&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So after a quick rewrite, we end up with very similar functionality, only we suddenly need &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/master/R/makeCmd.R&#34;&gt;35 lines of code, doc included&lt;/a&gt; and the tests shrink to &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/master/tests/testthat/test.makeCmd.R&#34;&gt;10 lines&lt;/a&gt;, as there is only 1 function to test instead of 4. That is &lt;em&gt;less than a quarter&lt;/em&gt; of the original amount of code to be maintained and bug-fixed, with 0 new dependencies added.&lt;/p&gt;
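&lt;p&gt;To illustrate the idea, a minimal sketch of such a single function could look like the one below. Note that this is not the actual code from the linked repository: the function name &lt;code&gt;run_vanilla&lt;/code&gt; and its arguments are made up for this example, and it assumes that an &lt;code&gt;Rscript&lt;/code&gt; executable is present in R’s &lt;code&gt;bin&lt;/code&gt; directory:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper for illustration, not the code from the jhaddins package
run_vanilla &amp;lt;- function(script, out = paste0(script, &amp;quot;.Rout&amp;quot;)) {
  # path.expand takes care of a leading ~ for us
  script &amp;lt;- path.expand(script)
  out &amp;lt;- path.expand(out)
  # system2 takes the arguments as a character vector and redirects
  # the standard output into a file, with no system-specific code needed
  status &amp;lt;- system2(
    command = file.path(R.home(&amp;quot;bin&amp;quot;), &amp;quot;Rscript&amp;quot;),
    args = c(&amp;quot;--vanilla&amp;quot;, shQuote(script)),
    stdout = out
  )
  # Return the path to the output file, keep the exit status as an attribute
  structure(out, status = status)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling &lt;code&gt;run_vanilla(&amp;quot;~/myscript.R&amp;quot;)&lt;/code&gt; (with a script path of your own) would then run the script in a fresh process and return the path of the file holding its output.&lt;/p&gt;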
&lt;blockquote&gt;
&lt;p&gt;This was of course a very trivial example. Real life problems of real-life projects will be much more difficult to solve. However, as complexity scales, the potential amount of time and resources saved will also scale.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Good luck resisting that urge the next time it comes ;-)&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>cRafty tRicks - No more typing brackets!</title>
      <link>https://jozef.io/r903-tricks-bracketless/</link>
      <pubDate>Sat, 01 Sep 2018 13:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r903-tricks-bracketless/</guid>
      <description>


&lt;p&gt;Calling functions in R usually involves typing brackets. And since many of our actions in R involve calling a function, we will have to type a lot of brackets working with R. Often it would make our life a lot easier if we could omit the need to type brackets where convenient. We will do exactly that today.&lt;/p&gt;
&lt;div id=&#34;work-in-r-faster-with-custom-bracketless-commands&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Work in R faster with custom bracketless commands&lt;/h1&gt;
&lt;p&gt;A good starting example is, well, quitting R altogether. Usually, one may do:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;quit()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which will in turn likely get you an extra question regarding saving a workspace image. So you then finally type &lt;code&gt;n&lt;/code&gt; and are done with it. If you want to be a bit faster, you may do:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;q(&amp;quot;no&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Better, but still an awful lot of typing just to quit &lt;code&gt;R&lt;/code&gt;, especially when working in a terminal-like environment with multiple sessions.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let us be a bit craftier and make &lt;code&gt;R&lt;/code&gt; quit just by typing &lt;code&gt;qq&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To make a bracketless command, we will (mis)use the fact that typing an object name into R console and pressing enter will often invoke a print method specific for the class of that object.&lt;/p&gt;
&lt;p&gt;All we have to do to create our very first bracketless command is to create a custom print method for a funky class made for this single purpose. Then we make an object of that class and type its name to the console:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;qq &amp;lt;- structure(&amp;quot;no&amp;quot;, class = &amp;quot;quitter&amp;quot;)
print.quitter &amp;lt;- function(quitter) base::quit(&amp;quot;no&amp;quot;)

# This will quit your session NOT saving a workspace image!
qq&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r903-01-ooops.gif&#34; alt=&#34;Oops…I Did It Again&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Oops…I Did It Again&lt;/p&gt;
&lt;/div&gt;
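&lt;p&gt;If you would rather experiment with the mechanism without closing your session, the same trick works for harmless commands too. A minimal sketch, where the &lt;code&gt;wd&lt;/code&gt; object and the &lt;code&gt;wdprinter&lt;/code&gt; class are names made up purely for this illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical example: a bracketless command showing the working directory
wd &amp;lt;- structure(list(), class = &amp;quot;wdprinter&amp;quot;)
print.wdprinter &amp;lt;- function(x, ...) {
  message(&amp;quot; * working directory: &amp;quot;, getwd())
  # print methods conventionally return their argument invisibly
  invisible(x)
}

# Typing the object name at the top level calls print.wdprinter
wd&lt;/code&gt;&lt;/pre&gt;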
&lt;/div&gt;
&lt;div id=&#34;switching-debugging-modes-with-ease&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Switching debugging modes with ease&lt;/h1&gt;
&lt;p&gt;Quitting &lt;code&gt;R&lt;/code&gt; quickly is more useful than it may sound when using multiple sessions in a terminal environment, but we can use the above approach to create different useful shortcuts making our life much easier.&lt;/p&gt;
&lt;p&gt;One example I use very frequently is to change the &lt;code&gt;error&lt;/code&gt; option, which governs how &lt;code&gt;R&lt;/code&gt; behaves when encountering non-catastrophic errors such as those generated by &lt;code&gt;stop&lt;/code&gt;, etc.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I find setting the option to &lt;code&gt;options(error = utils::recover)&lt;/code&gt; very useful for debugging and at the same time very annoying when undesired.&lt;/li&gt;
&lt;li&gt;Typing &lt;code&gt;options(error = NULL)&lt;/code&gt; to change it back is however even more annoying. Or is it &lt;code&gt;options(&amp;quot;error&amp;quot;) = NULL&lt;/code&gt;? Or maybe even &lt;code&gt;options(error) = NULL&lt;/code&gt;?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In comes the &lt;code&gt;gg&lt;/code&gt; shortcut:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gg &amp;lt;- structure(FALSE, class = &amp;quot;debuggerclass&amp;quot;)
print.debuggerclass &amp;lt;- function(debugger) {
  if (!identical(getOption(&amp;quot;error&amp;quot;), as.call(list(utils::recover)))) {
    options(error = utils::recover)
    message(&amp;quot; * debugging is now ON - option error set to recover&amp;quot;)
  } else {
    options(error = NULL)
    message(&amp;quot; * debugging is now OFF - option error set to NULL&amp;quot;)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we switch between the options with ease:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# When in need of debugging
gg&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  * debugging is now ON - option error set to recover&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The option is now set to recover
getOption(&amp;quot;error&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (function () 
## {
##     if (.isMethodsDispatchOn()) {
##         tState &amp;lt;- tracingState(FALSE)
##         on.exit(tracingState(tState))
##     }
##     calls &amp;lt;- sys.calls()
##     from &amp;lt;- 0L
##     n &amp;lt;- length(calls)
##     if (identical(sys.function(n), recover)) 
##         n &amp;lt;- n - 1L
##     for (i in rev(seq_len(n))) {
##         calli &amp;lt;- calls[[i]]
##         fname &amp;lt;- calli[[1L]]
##         if (!is.na(match(deparse(fname)[1L], c(&amp;quot;methods::.doTrace&amp;quot;, 
##             &amp;quot;.doTrace&amp;quot;)))) {
##             from &amp;lt;- i - 1L
##             break
##         }
##     }
##     if (from == 0L) 
##         for (i in rev(seq_len(n))) {
##             calli &amp;lt;- calls[[i]]
##             fname &amp;lt;- calli[[1L]]
##             if (!is.name(fname) || is.na(match(as.character(fname), 
##                 c(&amp;quot;recover&amp;quot;, &amp;quot;stop&amp;quot;, &amp;quot;Stop&amp;quot;)))) {
##                 from &amp;lt;- i
##                 break
##             }
##         }
##     if (from &amp;gt; 0L) {
##         if (!interactive()) {
##             try(dump.frames())
##             cat(gettext(&amp;quot;recover called non-interactively; frames dumped, use debugger() to view\n&amp;quot;))
##             return(NULL)
##         }
##         else if (identical(getOption(&amp;quot;show.error.messages&amp;quot;), 
##             FALSE)) 
##             return(NULL)
##         calls &amp;lt;- limitedLabels(calls[1L:from])
##         repeat {
##             which &amp;lt;- menu(calls, title = &amp;quot;\nEnter a frame number, or 0 to exit  &amp;quot;)
##             if (which) 
##                 eval(substitute(browser(skipCalls = skip), list(skip = 7 - 
##                   which)), envir = sys.frame(which))
##             else break
##         }
##     }
##     else cat(gettext(&amp;quot;No suitable frames for recover()\n&amp;quot;))
## })()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# When done debugging
gg &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  * debugging is now OFF - option error set to NULL&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The option is now back to NULL
getOption(&amp;quot;error&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NULL&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;making-it-practical-and-a-bit-less-barbaric&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Making it practical (and a bit less barbaric)&lt;/h1&gt;
&lt;p&gt;Defining all the shortcuts in the way shown above every time is both tedious and ugly, making a mess in our global environment. We can therefore decrease the tedium and ugliness by:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adding the definitions into our &lt;code&gt;.Rprofile&lt;/code&gt; with a proper notice, which will run the definitions and make the shortcuts available every time we start R in the usual way&lt;/li&gt;
&lt;li&gt;Enclosing the definitions into a separate environment attached to the search path, potentially with a command to detach it easily&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Such an &lt;code&gt;.Rprofile&lt;/code&gt; can look similar to:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;message(&amp;quot;________________________________________&amp;quot;)
message(&amp;quot;|                                      |&amp;quot;)
message(&amp;quot;|      SOURCING CUSTOM .Rprofile       |&amp;quot;)
message(&amp;quot;|                                      |&amp;quot;)
message(&amp;quot;|  * qq =&amp;gt; quit(&amp;#39;no&amp;#39;)                  |&amp;quot;)
message(&amp;quot;|  * gg =&amp;gt; toggle error = recover/NULL |&amp;quot;)
message(&amp;quot;|  * dd =&amp;gt; detach this madness         |&amp;quot;)
message(&amp;quot;|______________________________________|&amp;quot;)
message(&amp;quot;\n&amp;quot;)

customCommands &amp;lt;- new.env()

assign(&amp;quot;qq&amp;quot;, structure(&amp;quot;no&amp;quot;, class = &amp;quot;quitterclass&amp;quot;), envir = customCommands)
assign(&amp;quot;print.quitterclass&amp;quot;, function(quitter) {
  message(&amp;quot; * quitting, not saving workspace&amp;quot;)
  base::quit(quitter[1L])
}, envir = customCommands)

assign(&amp;quot;gg&amp;quot;, structure(&amp;quot;&amp;quot;, class = &amp;quot;debuggerclass&amp;quot;), envir = customCommands)
assign(&amp;quot;print.debuggerclass&amp;quot;, function(debugger) {
  if (!identical(getOption(&amp;quot;error&amp;quot;), as.call(list(utils::recover)))) {
    options(error = utils::recover)
    message(&amp;quot; * debugging is now ON - option error set to recover&amp;quot;)
  } else {
    options(error = NULL)
    message(&amp;quot; * debugging is now OFF - option error set to NULL&amp;quot;)
  }
}, envir = customCommands)

assign(&amp;quot;dd&amp;quot;, structure(&amp;quot;&amp;quot;, class = &amp;quot;detacherclass&amp;quot;), envir = customCommands)
assign(&amp;quot;print.detacherclass&amp;quot;, function(detacher) {
  detach(customCommands, unload = TRUE, force = TRUE)
}, envir = customCommands)

attach(customCommands)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In terminal environments, shortcuts like this can be even more useful:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r903-02-terminal.gif&#34; alt=&#34;Tends to be more useful in the terminal&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Tends to be more useful in the terminal&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://csgillespie.github.io/efficientR/3-3-r-startup.html#rprofile&#34;&gt;Rprofile chapter&lt;/a&gt; of Efficient R programming&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/print.html&#34;&gt;Documentation on print&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html&#34;&gt;Documentation on options&lt;/a&gt; to set and examine a variety of global options.&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img alt=&#34;Coat of arms of Slovakia&#34; class=&#34;svk&#34; /&gt;
Today, September 1st 2018 the &lt;a href=&#34;https://en.wikipedia.org/wiki/Constitution_of_Slovakia&#34;&gt;Constitution of the Slovak Republic&lt;/a&gt; celebrates its 26th anniversary. Happy Birthday!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>R:case4base - code profiling with base R</title>
      <link>https://jozef.io/r004-profiling/</link>
      <pubDate>Sat, 18 Aug 2018 13:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r004-profiling/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In this summertime post in the &lt;a href=&#34;https://jozef.io/categories/rcase4base/&#34;&gt;case4base series&lt;/a&gt;, we will look at useful tools in base R, which let us profile our code without any extra packages needed to be installed. We will cover simple and easy to use speed profiling, more complex profiling of performance and memory and, as always, look at alternatives to base R as well, with a special shout out to profiling integration in RStudio.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#simple-time-profiling-with-system.time&#34;&gt;Simple time profiling with system.time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#profile-r-execution-with-rprof&#34;&gt;Profile R execution with Rprof&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#non-sampling-memory-use-profiling-with-rprofmem&#34;&gt;Non-sampling memory use profiling with Rprofmem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#profiling-integration-within-rstudio&#34;&gt;Profiling integration within RStudio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#background-profiling-with-base-r-via-an-rstudio-addin&#34;&gt;Background profiling with base R via an RStudio addin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;simple-time-profiling-with-system.time&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Simple time profiling with &lt;code&gt;system.time&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;Base function &lt;code&gt;system.time&lt;/code&gt; returns the difference between two &lt;code&gt;proc.time&lt;/code&gt; calls within which it evaluates an expression provided as argument. The simplest usage can be seen below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;system.time(runif(10^8))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    user  system elapsed 
##   4.376   0.448   4.836&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the purpose of processing the results, we can of course store and examine them within a variable, where we can see that the result is in fact a numeric vector with 5 elements and a &lt;code&gt;proc_time&lt;/code&gt; class. Its print method, &lt;code&gt;print.proc_time&lt;/code&gt;, prints a &lt;code&gt;summary&lt;/code&gt; of the vector. For most of our purposes, we would be interested in the “elapsed” element of the result, giving us the ‘real’ time that elapsed while the expression was being evaluated:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tm &amp;lt;- system.time(runif(10^8))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(tm)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class &amp;#39;proc_time&amp;#39;  Named num [1:5] 4.376 0.448 4.836 0 0
##   ..- attr(*, &amp;quot;names&amp;quot;)= chr [1:5] &amp;quot;user.self&amp;quot; &amp;quot;sys.self&amp;quot; &amp;quot;elapsed&amp;quot; &amp;quot;user.child&amp;quot; ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tm[&amp;quot;elapsed&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## elapsed 
##   4.836&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also very simply run multiple observations for an expression and investigate the results:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;expr &amp;lt;- rep(expression(runif(10^8)), 10L)
tm &amp;lt;- unlist(lapply(expr, function(x) system.time(eval(x))[&amp;quot;elapsed&amp;quot;]))
summary(tm)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.779   4.804   4.816   4.828   4.854   4.893&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With a little tweaking we can also run it in a separate process to not block our R session:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;script &amp;lt;- shQuote(paste(
  &amp;#39;expr &amp;lt;- rep(expression(runif(10^7)), 10L)&amp;#39;,
  &amp;#39;tm &amp;lt;- unlist(lapply(expr, function(x) system.time(eval(x))[&amp;quot;elapsed&amp;quot;]))&amp;#39;,
  &amp;#39;print(summary(tm))&amp;#39;,
  sep = &amp;#39;;&amp;#39;
))

system2(&amp;#39;Rscript&amp;#39;, args = c(&amp;#39;-e&amp;#39;, script), wait = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;profile-r-execution-with-rprof&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Profile R execution with &lt;code&gt;Rprof&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;The utils package included in the base R releases contains a very useful pair of functions for sampling-based profiling, where the state of the call stack is recorded at regular time intervals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;utils::Rprof()&lt;/code&gt; to enable the R profiling, run the code to be profiled and use &lt;code&gt;utils::Rprof(NULL)&lt;/code&gt; to disable profiling&lt;/li&gt;
&lt;li&gt;Afterwards, use &lt;code&gt;utils::summaryRprof()&lt;/code&gt; to investigate the results&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The simplest usage is really this straightforward:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Enable profiling
utils::Rprof()

# Run the code to be profiled
x &amp;lt;- lapply(10^(6:7),  runif)
y &amp;lt;- lapply(x, summary)
z &amp;lt;- sort(x[[2]])

# Disable profiling
utils::Rprof(NULL)

# Read the profiling results and view
res &amp;lt;- utils::summaryRprof()
res[[&amp;quot;by.self&amp;quot;]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The profiling can be customized with arguments such as &lt;code&gt;filename&lt;/code&gt;, which specifies the file the results will be written to (and also serves as the off switch if set to &lt;code&gt;NULL&lt;/code&gt; or &lt;code&gt;&amp;quot;&amp;quot;&lt;/code&gt;), and &lt;code&gt;interval&lt;/code&gt;, which governs the time between profiling samples. More can be found in the function’s help.&lt;/p&gt;
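&lt;p&gt;For instance, a sketch that writes the samples into a temporary file and takes a sample every 10 milliseconds instead of the default 20 could look as follows. The file path and the interval value are arbitrary choices made for this example. Keep in mind that &lt;code&gt;summaryRprof&lt;/code&gt; will fail if the profiled code finishes before a single sample is taken:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Write the profiling samples into a temporary file, sampling every 0.01 seconds
prof_file &amp;lt;- tempfile(fileext = &amp;quot;.out&amp;quot;)
utils::Rprof(filename = prof_file, interval = 0.01)

# Run the code to be profiled
x &amp;lt;- lapply(10^(6:7), runif)
z &amp;lt;- sort(x[[2]])

# Disable profiling
utils::Rprof(NULL)

# Read the results from our custom file
res &amp;lt;- utils::summaryRprof(filename = prof_file)
res[[&amp;quot;sample.interval&amp;quot;]]
head(res[[&amp;quot;by.total&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;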
&lt;p&gt;Perhaps the most interesting argument is &lt;code&gt;memory.profiling&lt;/code&gt; which if set to &lt;code&gt;TRUE&lt;/code&gt; will add memory information into the results file:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Enable profiling with memory profiling
utils::Rprof(filename = &amp;quot;ProfwMemory.out&amp;quot;, memory.profiling = TRUE)

# Run the code to be profiled
x &amp;lt;- lapply(10^(6:7),  runif)
y &amp;lt;- lapply(x, summary)
z &amp;lt;- sort(x[[2]])

# Disable profiling
utils::Rprof(NULL)

# Read the profiling results and view results in different ways
utils::summaryRprof(
  filename = &amp;quot;ProfwMemory.out&amp;quot;,
  memory = c(&amp;quot;stats&amp;quot;),
  lines = &amp;quot;show&amp;quot;
)

utils::summaryRprof(
  filename = &amp;quot;ProfwMemory.out&amp;quot;,
  memory = c(&amp;quot;both&amp;quot;)
)[[&amp;quot;by.self&amp;quot;]]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;non-sampling-memory-use-profiling-with-rprofmem&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Non-sampling memory use profiling with &lt;code&gt;Rprofmem&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;Base R also offers an option to profile memory use (if R is compiled with &lt;code&gt;R_MEMORY_PROFILING&lt;/code&gt; defined) using &lt;code&gt;Rprofmem&lt;/code&gt; - a pure memory use profiler. Results are written as simple text into a file, from which they can be read; the processing of the results could however use a bit more polishing here:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Enable memory profiling
utils::Rprofmem(&amp;quot;Rprofmem.out&amp;quot;, threshold = 10240)

# Run the code to be profiled
x &amp;lt;- runif(10^5)
y &amp;lt;- runif(10^6)
z &amp;lt;- runif(10^7)

# Disable profiling
utils::Rprofmem(NULL)

# Read the results
readLines(&amp;quot;Rprofmem.out&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If our concern is specifically the copying of (large) objects, which negatively impacts the memory requirements of our work, we can (provided that R is compiled with &lt;code&gt;--enable-memory-profiling&lt;/code&gt;) use &lt;code&gt;tracemem(object)&lt;/code&gt; to mark &lt;code&gt;object&lt;/code&gt; for tracking, so that a stack trace is printed whenever it is duplicated. &lt;code&gt;untracemem(object)&lt;/code&gt; untraces the object.&lt;/p&gt;
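&lt;p&gt;A minimal sketch of tracking duplication this way, guarded by &lt;code&gt;capabilities(&amp;quot;profmem&amp;quot;)&lt;/code&gt; so that it only runs when memory profiling is available in the current R build. The object names are arbitrary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;if (capabilities(&amp;quot;profmem&amp;quot;)) {
  x &amp;lt;- runif(10^6)
  # Mark x for tracking
  tracemem(x)
  # No copy is made yet, x and y still share the same memory
  y &amp;lt;- x
  # Modifying y forces a duplication, which tracemem reports
  y[1] &amp;lt;- 0
  # Stop tracking
  untracemem(x)
}&lt;/code&gt;&lt;/pre&gt;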
&lt;p&gt;For more details see the &lt;a href=&#34;#references&#34;&gt;references&lt;/a&gt; section.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;profiling-integration-within-rstudio&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Profiling integration within RStudio&lt;/h1&gt;
&lt;p&gt;Even though this does not really adhere to the &lt;code&gt;case4base&lt;/code&gt; rules, we still mention the RStudio profiling integration, which is done using the &lt;code&gt;profvis&lt;/code&gt; package and, if successful, works really well and provides informative graphical outputs. All we have to do is either select a chunk of code and click on &lt;code&gt;Profile -&amp;gt; Profile Selected Line(s)&lt;/code&gt;, or click on &lt;code&gt;Profile -&amp;gt; Start Profiling&lt;/code&gt;, run our code and then &lt;code&gt;Profile -&amp;gt; Stop profiling&lt;/code&gt;. RStudio should then automatically use &lt;code&gt;profvis&lt;/code&gt; to produce an interactive output that allows nice exploration of the results:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r004-01-rstudio-profvis.gif&#34; alt=&#34;RStudio+profvis&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;RStudio+profvis&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;background-profiling-with-base-r-via-an-rstudio-addin&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Background profiling with base R via an RStudio addin&lt;/h1&gt;
&lt;p&gt;We have also created and written about an RStudio addin that lets users profile R code selected in RStudio, with the advantage that the profiling runs asynchronously in a separate process, not blocking the current R session and also not requiring external packages such as profvis. You can &lt;a href=&#34;https://jozef.io/r105-async-profiler/&#34;&gt;read more about it and get it here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/proftools/index.html&#34;&gt;proftools&lt;/a&gt; and its &lt;a href=&#34;https://cran.r-project.org/web/packages/proftools/vignettes/proftools.pdf&#34;&gt;package vignette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/rstudio/profvis&#34;&gt;profvis::profvis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/microbenchmark/&#34;&gt;microbenchmark::microbenchmark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Profiling-R-code-for-speed&#34;&gt;Profiling R code for speed&lt;/a&gt; at Writing R Extensions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Profiling-R-code-for-memory-use&#34;&gt;Profiling R code for memory use&lt;/a&gt; at Writing R Extensions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.time.html&#34;&gt;system.time help&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://developer.r-project.org/memory-profiling.html&#34;&gt;Memory profiling&lt;/a&gt; in R&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>RStudio:addins part 5 - Profile your code on keypress in the background, with no dependencies</title>
      <link>https://jozef.io/r105-async-profiler/</link>
      <pubDate>Sat, 04 Aug 2018 12:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r105-async-profiler/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Profiling our code is a very useful tool to determine how well the code performs on different metrics.&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;the-addin-we-will-create-in-this-article-will-let-us-use-a-keyboard-shortcut-to-run-profiling-on-r-code-selected-in-rstudio-without-blocking-the-session-or-requiring-any-external-packages.&#34;&gt;The addin we will create in this article will let us use a keyboard shortcut to run profiling on R code selected in RStudio without blocking the session or requiring any external packages.&lt;/h4&gt;
&lt;/blockquote&gt;
&lt;p&gt;Specifically, for a very simple overview, it may be beneficial to look at the time needed for a set of expressions to compute, i.e. how fast the code is. Secondly, and especially important when computing on big datasets in-memory, we may look at the amount of memory utilized, i.e. how much RAM was used.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r105-01-run-profiler.gif&#34; alt=&#34;The addin in action&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;The addin in action&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#profiling-options-provided-by-base-r&#34;&gt;Profiling options provided by base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#asynchronous-execution-and-communication-of-the-results-with-the-session&#34;&gt;Asynchronous execution and communication of the results with the session&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#results-of-the-profiling&#34;&gt;Results of the profiling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-addin-formalities&#34;&gt;The addin formalities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-give-me-the-package&#34;&gt;TL;DR - Just give me the package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;profiling-options-provided-by-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Profiling options provided by base R&lt;/h1&gt;
&lt;p&gt;Without going into any detail at all, we have 2 very nice options to profile our code with base R:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;base::system.time(expr)&lt;/code&gt; - returns CPU and other times that &lt;code&gt;expr&lt;/code&gt; used&lt;/li&gt;
&lt;li&gt;&lt;code&gt;utils::Rprof&lt;/code&gt; - can serve as a switch to enable and disable profiling, with a variety of options, saving the results into a file on disk, by default &lt;code&gt;&amp;quot;Rprof.out&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
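&lt;p&gt;As a quick illustration of the first option, here is a minimal sketch; the sorting workload is arbitrary and chosen only for this example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Time an arbitrary workload; the result has class proc_time
timing &amp;lt;- system.time(
  for (i in 1:10) sort(runif(10^5))
)

# Wall-clock time in seconds
timing[[&amp;quot;elapsed&amp;quot;]]&lt;/code&gt;&lt;/pre&gt;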
&lt;p&gt;For the use of our addin, we will utilize the second approach, as we are interested not only in time spent, but also in memory utilization of the profiled expressions.&lt;/p&gt;
&lt;p&gt;After finishing the profiling, we will use &lt;code&gt;utils::summaryRprof&lt;/code&gt; to summarize the results provided to us by the &lt;code&gt;Rprof&lt;/code&gt; functionality mentioned above. To get an overview, we will examine only the total time the selected expressions took to execute and the maximum memory.&lt;/p&gt;
&lt;p&gt;The very simplistic implementation can look as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;profileExpression &amp;lt;- function(expr) {
  on.exit({
    # Make sure profiling is stopped before removing its output file
    utils::Rprof(NULL)
    unlink(&amp;quot;Rprof.out&amp;quot;)
  })

  if (!is.expression(expr)) {
    message(&amp;quot;expr must be an expression in profileExpression()&amp;quot;)
    return(data.frame(
      totalTime = numeric(0),
      maxMemory = numeric(0)
    ))
  }
  gc()
  utils::Rprof(
    memory.profiling = TRUE,
    interval = 0.01,
    append = FALSE
  )
  evalRes &amp;lt;- try(eval(expr), silent = TRUE)
  utils::Rprof(NULL)
  if (inherits(evalRes, &amp;quot;try-error&amp;quot;)) {
    return(data.frame(stringsAsFactors = FALSE,
                      totalTime = &amp;quot;EvalError&amp;quot;,
                      maxMemory = &amp;quot;EvalError&amp;quot;
    ))
  }
  res &amp;lt;- utils::summaryRprof(memory = &amp;quot;both&amp;quot;)
  data.frame(
    totalTime = max(res[[&amp;quot;by.total&amp;quot;]][[&amp;quot;total.time&amp;quot;]]),
    maxMemory = max(res[[&amp;quot;by.total&amp;quot;]][[&amp;quot;mem.total&amp;quot;]])
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we may be interested in more than one execution of the expressions to be profiled and the profiling will be running in the background, a wrapper executing the profiling itself multiple times may come in handy. Besides the number of times to execute, which is a very standard argument, we can also attempt to provide a time frame we want to invest into the profiling:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;multiProfile &amp;lt;- function(
  expr,
  times = 10L,
  maxtime = getOption(&amp;quot;jhaddins_profiler_maxtime&amp;quot;, default = NULL)
){
  if (!(is.integer(times) || is.integer(maxtime))) {
    message(&amp;quot;Times or maxtime must be integer in multiProfile()&amp;quot;)
    return(data.frame(
      totalTime = numeric(0),
      maxMemory = numeric(0)
    ))
  }

  first &amp;lt;- profileExpression(expr)
  if (!is.null(maxtime)) {
    if (is.numeric(first[[&amp;quot;totalTime&amp;quot;]])) {
      times &amp;lt;- floor(maxtime / first[[&amp;quot;totalTime&amp;quot;]])
    } else {
      message(&amp;quot;Eval failed, cannot compute times from maxtime.&amp;quot;)
      return(first)
    }
  }
  if (times &amp;lt;= 1L) {
    return(first)
  }
  rest &amp;lt;- do.call(
    rbind,
    lapply(rep(list(expr), times - 1L), profileExpression)
  )
  rbind(first, rest)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;asynchronous-execution-and-communication-of-the-results-with-the-session&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Asynchronous execution and communication of the results with the session&lt;/h1&gt;
&lt;p&gt;Since we are only using base R functionality without taking advantage of external packages that would help us execute the profiling asynchronously, we have 3 challenges:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Asynchronous execution of the profiling&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We can take advantage of base R’s convenient interface &lt;code&gt;system2&lt;/code&gt;, which allows us to invoke OS commands, with the option to run asynchronously by providing &lt;code&gt;wait = FALSE&lt;/code&gt; as an argument.&lt;/p&gt;
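&lt;p&gt;A minimal sketch of such an asynchronous call follows. The output file and the one-line script are made up for this example, and the quoting assumes a Unix-like shell:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical output file that the background process will create
outFile &amp;lt;- tempfile(fileext = &amp;quot;.txt&amp;quot;)

# R code to run in the separate process
script &amp;lt;- sprintf(
  &amp;quot;Sys.sleep(2); writeLines(&amp;#39;done&amp;#39;, &amp;#39;%s&amp;#39;)&amp;quot;,
  outFile
)

# wait = FALSE returns immediately, the script keeps running in the background
system2(
  command = file.path(R.home(&amp;quot;bin&amp;quot;), &amp;quot;Rscript&amp;quot;),
  args = c(&amp;quot;-e&amp;quot;, shQuote(script)),
  wait = FALSE
)
message(&amp;quot;Control is back while the script is still running&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The calling session can later check for the result, for example with &lt;code&gt;file.exists(outFile)&lt;/code&gt;.&lt;/p&gt;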
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Communicating the results between our R session and the one running via &lt;code&gt;system2&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To kill two birds with one stone, we can simply use the &lt;code&gt;rstudioapi&lt;/code&gt; to navigate to a created file, into which we will later write the profiling results using the asynchronously running process. This way we have the results immediately available within RStudio and we can keep working conveniently on the tasks at hand. Since our application is very simple, we also avoid complications with communication between the processes, for example via sockets.&lt;/p&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Contents of the workspace&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When selecting a code chunk to profile in RStudio, it will likely happen very soon that the execution of expressions included in the selected code will rely on the current state of the global environment (a.k.a. the workspace). We can therefore make our functionality more convenient by storing the contents of the global environment on disk and loading it before running the profiler in our asynchronous process.&lt;/p&gt;
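&lt;p&gt;The save-and-restore idea can be sketched in isolation as follows. To keep the example free of side effects, it uses a throwaway environment and a temporary file instead of the global environment and a file in the home directory; all names are illustrative:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Stand-in for the workspace of the interactive session
workspace &amp;lt;- new.env()
assign(&amp;quot;x&amp;quot;, 1:3, envir = workspace)

# Store all objects of the environment on disk
wsFile &amp;lt;- tempfile(fileext = &amp;quot;.RData&amp;quot;)
save(
  list = ls(all.names = TRUE, envir = workspace),
  file = wsFile,
  envir = workspace
)

# In the background process we would load them before profiling
restored &amp;lt;- new.env()
load(wsFile, envir = restored)
ls(restored)
unlink(wsFile)&lt;/code&gt;&lt;/pre&gt;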
&lt;p&gt;A simple example implementation of the thoughts above is once again presented below. Note that this implementation is very bare-bones and could use much polishing, which may happen sometime after publishing this article:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;runProfiler &amp;lt;- function(
  inpContext = rstudioapi::getActiveDocumentContext()
){
  force(inpContext)
  inpString &amp;lt;- inpContext[[&amp;quot;selection&amp;quot;]][[1L]][[&amp;quot;text&amp;quot;]]
  cat(inpString, file = file.path(&amp;quot;~/temp.R&amp;quot;))
  expr &amp;lt;- try(parse(&amp;quot;~/temp.R&amp;quot;), silent = TRUE)
  if (inherits(expr, &amp;quot;try-error&amp;quot;)) {
    message(&amp;quot;Selected text cannot be parsed, cannot profile.&amp;quot;)
    unlink(file.path(&amp;quot;~/temp.R&amp;quot;))
    return(1L)
  }
  save(
    list = ls(all.names = TRUE, envir = .GlobalEnv),
    file = &amp;quot;~/tmp.RData&amp;quot;,
    envir = .GlobalEnv
  )
  script &amp;lt;- paste(sep = &amp;quot;; &amp;quot;,
    &amp;quot;load(&amp;#39;~/tmp.RData&amp;#39;)&amp;quot;,
    &amp;quot;res &amp;lt;- jhaddins:::multiProfile(parse(&amp;#39;~/temp.R&amp;#39;))&amp;quot;,
    &amp;quot;jhaddins:::writeProfileDf(res)&amp;quot;,
    &amp;quot;unlink(&amp;#39;~/temp.R&amp;#39;)&amp;quot;,
    &amp;quot;unlink(&amp;#39;~/tmp.RData&amp;#39;)&amp;quot;
  )
  file.create(&amp;quot;~/tmp_prof.txt&amp;quot;)
  rstudioapi::navigateToFile(&amp;quot;~/tmp_prof.txt&amp;quot;)
  system2(
    command = &amp;#39;Rscript&amp;#39;,
    args = c(&amp;#39;-e&amp;#39;, shQuote(script)),
    wait = FALSE
  )
  message(&amp;quot;Profiler running in the background&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;results-of-the-profiling&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Results of the profiling&lt;/h1&gt;
&lt;p&gt;This simple functionality was developed with one main interest: knowing 2 very simple pieces of information, namely how fast the expressions executed and how much memory was utilized at most. This is why the results are extracted and written in an extremely simplistic way, as can be seen below:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r105-02-results.png&#34; alt=&#34;“quand il n’y a plus rien à retrancher”&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;“quand il n’y a plus rien à retrancher”&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Based on real-life usage we may still improve the presentation (a bit ;) in the future.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-addin-formalities&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The addin formalities&lt;/h1&gt;
&lt;p&gt;If you have been following this blog for a while, you can safely skip this part. A few things are needed to make our new addin available and easy to use:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Add the addin bindings into &lt;code&gt;inst/addins.dcf&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Name: runProfiler
Description: experimental, runProfiler
Binding: runProfiler
Interactive: false&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Re-install the package&lt;/li&gt;
&lt;li&gt;Assign a keyboard shortcut in the &lt;code&gt;Tools -&amp;gt; Addins -&amp;gt; Browse Addins... -&amp;gt; Keyboard Shortcuts...&lt;/code&gt; menu in RStudio:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r103-03-key-binding.gif&#34; alt=&#34;Assigning a keyboard shortcut to use the Addin&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Assigning a keyboard shortcut to use the Addin&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the package&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins.git&#34;&gt;&lt;code&gt;https://gitlab.com/jozefhajnala/jhaddins.git&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Profiling-R-code-for-speed&#34;&gt;Profiling R code for speed&lt;/a&gt; at Writing R Extensions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Profiling-R-code-for-memory-use&#34;&gt;Profiling R code for memory use&lt;/a&gt; at Writing R Extensions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.time.html&#34;&gt;system.time help&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/rstudio/profvis&#34;&gt;Profvis package&lt;/a&gt; with useful graphical overviews.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/microbenchmark/&#34;&gt;Microbenchmark package&lt;/a&gt; - infrastructure to accurately measure and compare the execution time of R expressions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf&#34;&gt;parallel package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/r-lib/callr&#34;&gt;callr package&lt;/a&gt; - to perform a computation in a separate R process&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>RStudio:addins part 4 - Unit testing coverage investigation and improvement, made easy</title>
      <link>https://jozef.io/r104-unit-testing-coverage/</link>
      <pubDate>Sat, 21 Jul 2018 14:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r104-unit-testing-coverage/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;A developer always pays his technical debts! And we have a debt to pay to the gods of coding best practices, as we have not presented many unit tests for our functions yet. Today we will show how to efficiently investigate and improve unit test coverage for our R code, with a focus on the functions governing our RStudio addins, which have their own specifics.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As a practical example, we will do a simple restructuring of one of our functions to increase its test coverage from a mere 34% to over 90%.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r104-01-testthat-so-pretty.gif&#34; alt=&#34;The pretty rewards for your tests&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;The pretty rewards for your tests&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#fly-through-of-unit-testing-in-r&#34;&gt;Fly-through of unit testing in R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#investigating-test-coverage-within-a-package&#34;&gt;Investigating test coverage within a package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#test-coverage-for-rstudio-addin-functions&#34;&gt;Test coverage for RStudio addin functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rewriting-an-addin-function-for-better-coverage&#34;&gt;Rewriting an addin function for better coverage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#testing-the-rewritten-function-and-gained-coverage&#34;&gt;Testing the rewritten function and gained coverage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-give-me-the-package&#34;&gt;TL;DR - Just give me the package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;fly-through-of-unit-testing-in-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Fly-through of unit testing in R&lt;/h1&gt;
&lt;p&gt;Much has been written on the importance of unit testing, so we will not spend more time on convincing the readers, but rather very quickly provide a few references in case the reader is new to unit testing with R. In the later parts of the article we assume that these basics are known.&lt;/p&gt;
&lt;p&gt;In a few words&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/r-lib/devtools&#34;&gt;&lt;code&gt;devtools&lt;/code&gt;&lt;/a&gt; - Makes package development easier by providing R functions that simplify common tasks&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/r-lib/testthat&#34;&gt;&lt;code&gt;testthat&lt;/code&gt;&lt;/a&gt; - Is the most popular unit testing package for R&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/r-lib/covr&#34;&gt;&lt;code&gt;covr&lt;/code&gt;&lt;/a&gt; - Helps track test coverage for R packages and view reports locally or (optionally) upload the results&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a quick-start guide to using &lt;code&gt;testthat&lt;/code&gt; within a package, visit the &lt;a href=&#34;http://r-pkgs.had.co.nz/tests.html&#34;&gt;Testing section&lt;/a&gt; of R packages by Hadley Wickham. I would also recommend checking out the &lt;a href=&#34;https://www.tidyverse.org/articles/2017/12/testthat-2-0-0/&#34;&gt;showcase on the 2.0.0&lt;/a&gt; release of testthat itself.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;investigating-test-coverage-within-a-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Investigating test coverage within a package&lt;/h1&gt;
&lt;p&gt;For the purpose of investigating the test coverage of a package we can use the &lt;code&gt;covr&lt;/code&gt; package. Within an R project, we can call the &lt;code&gt;package_coverage()&lt;/code&gt; function to get a nicely printed high-level overview, or we can provide a specific path to a package root directory and call it as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This looks much prettier in the R console ;)
covr::package_coverage(pkgPath)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## jhaddins Coverage: 59.05%&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R/viewSelection.R: 34.15%&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R/addRoxytag.R: 40.91%&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R/makeCmd.R: 92.86%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a deeper investigation, converting the results to a &lt;code&gt;data.frame&lt;/code&gt; might be very useful. The output below shows, for each group of code lines, the number of times the given expression was called while running our tests:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covResults &amp;lt;- covr::package_coverage(pkgPath)
as.data.frame(covResults)[, c(1:3, 5, 11)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##             filename         functions first_line last_line value
## 1     R/addRoxytag.R            roxyfy         10        12     6
## 2     R/addRoxytag.R            roxyfy         11        11     2
## 3     R/addRoxytag.R            roxyfy         13        15     4
## 4     R/addRoxytag.R            roxyfy         14        14     2
## 5     R/addRoxytag.R            roxyfy         16        16     2
## 6     R/addRoxytag.R            roxyfy         17        17     2
## 7     R/addRoxytag.R            roxyfy         18        18     2
## 8     R/addRoxytag.R        addRoxytag         29        29     0
## 9     R/addRoxytag.R        addRoxytag         30        37     0
## 10    R/addRoxytag.R        addRoxytag         32        34     0
## 11    R/addRoxytag.R        addRoxytag         38        38     0
## 12    R/addRoxytag.R    addRoxytagCode         44        44     0
## 13    R/addRoxytag.R    addRoxytagLink         50        50     0
## 14    R/addRoxytag.R     addRoxytagEqn         56        56     0
## 15       R/makeCmd.R           makeCmd         20        24     5
## 16       R/makeCmd.R           makeCmd         21        21     0
## 17       R/makeCmd.R           makeCmd         23        23     5
## 18       R/makeCmd.R           makeCmd         25        27     5
## 19       R/makeCmd.R           makeCmd         26        26     4
## 20       R/makeCmd.R           makeCmd         28        32     5
## 21       R/makeCmd.R           makeCmd         33        35     5
## 22       R/makeCmd.R           makeCmd         34        34     2
## 23       R/makeCmd.R           makeCmd         36        38     5
## 24       R/makeCmd.R           makeCmd         37        37     1
## 25       R/makeCmd.R           makeCmd         39        39     5
## 26       R/makeCmd.R      replaceTilde         48        50     1
## 27       R/makeCmd.R      replaceTilde         49        49     1
## 28       R/makeCmd.R      replaceTilde         51        51     1
## 29       R/makeCmd.R        executeCmd         61        61     5
## 30       R/makeCmd.R        executeCmd         62        66     5
## 31       R/makeCmd.R        executeCmd         68        72     3
## 32       R/makeCmd.R        executeCmd         69        69     0
## 33       R/makeCmd.R        executeCmd         71        71     3
## 34       R/makeCmd.R runCurrentRscript         90        90     1
## 35       R/makeCmd.R runCurrentRscript         91        91     1
## 36       R/makeCmd.R runCurrentRscript         92        96     1
## 37       R/makeCmd.R runCurrentRscript         93        95     1
## 38       R/makeCmd.R runCurrentRscript         94        94     0
## 39 R/viewSelection.R     viewSelection          7         7     0
## 40 R/viewSelection.R     viewSelection          8        12     0
## 41 R/viewSelection.R     viewSelection         10        10     0
## 42 R/viewSelection.R     viewSelection         13        13     0
## 43 R/viewSelection.R  getFromSysframes         24        24     6
## 44 R/viewSelection.R  getFromSysframes         25        25     3
## 45 R/viewSelection.R  getFromSysframes         26        26     3
## 46 R/viewSelection.R  getFromSysframes         28        28     3
## 47 R/viewSelection.R  getFromSysframes         29        29     3
## 48 R/viewSelection.R  getFromSysframes         30        30     3
## 49 R/viewSelection.R  getFromSysframes         31        31    92
## 50 R/viewSelection.R  getFromSysframes         32        32    92
## 51 R/viewSelection.R  getFromSysframes         33        33    92
## 52 R/viewSelection.R  getFromSysframes         34        34     2
## 53 R/viewSelection.R  getFromSysframes         37        37     1
## 54 R/viewSelection.R        viewObject         56        56     3
## 55 R/viewSelection.R        viewObject         57        57     3
## 56 R/viewSelection.R        viewObject         58        58     3
## 57 R/viewSelection.R        viewObject         61        61     0
## 58 R/viewSelection.R        viewObject         64        64     0
## 59 R/viewSelection.R        viewObject         65        65     0
## 60 R/viewSelection.R        viewObject         66        66     0
## 61 R/viewSelection.R        viewObject         69        69     0
## 62 R/viewSelection.R        viewObject         70        70     0
## 63 R/viewSelection.R        viewObject         71        71     0
## 64 R/viewSelection.R        viewObject         74        74     0
## 65 R/viewSelection.R        viewObject         76        76     0
## 66 R/viewSelection.R        viewObject         77        77     0
## 67 R/viewSelection.R        viewObject         79        79     0
## 68 R/viewSelection.R        viewObject         81        81     0
## 69 R/viewSelection.R        viewObject         82        82     0
## 70 R/viewSelection.R        viewObject         83        83     0
## 71 R/viewSelection.R        viewObject         88        88     0
## 72 R/viewSelection.R        viewObject         89        89     0
## 73 R/viewSelection.R        viewObject         91        91     0
## 74 R/viewSelection.R        viewObject         92        92     0
## 75 R/viewSelection.R        viewObject         93        93     0
## 76 R/viewSelection.R        viewObject         96        96     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling &lt;code&gt;covr::zero_coverage&lt;/code&gt; with a coverage object returned by &lt;code&gt;package_coverage&lt;/code&gt; will provide a data.frame with locations that have 0 test coverage. The nice thing about running it within RStudio is that it outputs the results on the Markers tab in RStudio, where we can easily investigate:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;zeroCov &amp;lt;- covr::zero_coverage(covResults)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r104-02-zero-coverage.gif&#34; alt=&#34;zero_coverage markers&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;zero_coverage markers&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;test-coverage-for-rstudio-addin-functions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Test coverage for RStudio addin functions&lt;/h1&gt;
&lt;p&gt;Investigating our code, let us focus on the results for the &lt;code&gt;viewSelection.R&lt;/code&gt;, which has a very weak 34% test coverage. We can analyze exactly which lines have no test coverage in a specific file:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;zeroCov[zeroCov$filename == &amp;quot;R/viewSelection.R&amp;quot;, &amp;quot;line&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  7  8  9 10 11 12 13 61 64 65 66 69 70 71 74 76 77 79 81 82 83 88 89
## [24] 91 92 93 96&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/bd5979cecfbf4b81a6f8c02e5593fe109baed172/R/viewSelection.R&#34;&gt;Looking at the code&lt;/a&gt;, we can see that the first chunk of lines, 7:13, represents the &lt;code&gt;viewSelection&lt;/code&gt; function, which just calls &lt;code&gt;lapply&lt;/code&gt; and invisibly returns &lt;code&gt;NULL&lt;/code&gt;.
The main weak spot, however, is the function &lt;code&gt;viewObject&lt;/code&gt;, of which we only test the early return in case an invalid &lt;code&gt;chr&lt;/code&gt; argument is provided. None of the other functionality is tested.&lt;/p&gt;
&lt;p&gt;The reason behind this is that when running the tests, RStudio functionality is not available and therefore we would not be able to test even the not-so-well designed return values, as they are almost always preceded by a call to &lt;code&gt;rstudioapi&lt;/code&gt; or other RStudio-related functionality such as the object viewer, because that is what they are designed to do. This means we must restructure the code in such a way that we contain the RStudio-dependent functionality to a necessary minimum, keeping a big majority of the code testable - only calling the side-effecting &lt;code&gt;rstudioapi&lt;/code&gt; when actually executing the addin functionality itself.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rewriting-an-addin-function-for-better-coverage&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Rewriting an addin function for better coverage&lt;/h1&gt;
&lt;p&gt;We will now show one potential way to solve this issue for the particular case of our &lt;code&gt;viewObject&lt;/code&gt; function.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The idea behind the solution is to only return the arguments for the call to the RStudio API related functionality, instead of executing them in the function itself - hence the rename to &lt;code&gt;getViewArgs&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This way we can test the function’s return value against the expected arguments and only execute them with &lt;code&gt;do.call&lt;/code&gt; in the addin execution wrapper itself. A picture may be worth a thousand words, so here is the diff with relevant changes:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r104-03-testability-refactor.png&#34; alt=&#34;Refactoring for testability&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Refactoring for testability&lt;/p&gt;
&lt;/div&gt;
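&lt;p&gt;For readers who prefer code to pictures, the same pattern can be sketched in a simplified, self-contained form. The function names &lt;code&gt;getPrintArgs&lt;/code&gt; and &lt;code&gt;printSelection&lt;/code&gt; are hypothetical stand-ins and not the actual package code; &lt;code&gt;print&lt;/code&gt; plays the role of the RStudio-dependent call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Testable part: only builds the arguments for do.call, no side effects
getPrintArgs &amp;lt;- function(chr) {
  if (!is.character(chr) || length(chr) != 1L) {
    return(NULL)
  }
  obj &amp;lt;- get0(chr, ifnotfound = NULL)
  if (is.null(obj)) {
    return(NULL)
  }
  list(what = &amp;quot;print&amp;quot;, args = list(x = obj))
}

# Thin wrapper: the only place where the side effect is executed
printSelection &amp;lt;- function(chr) {
  callArgs &amp;lt;- getPrintArgs(chr)
  if (!is.null(callArgs)) {
    do.call(callArgs[[&amp;quot;what&amp;quot;]], callArgs[[&amp;quot;args&amp;quot;]])
  }
  invisible(NULL)
}

# The return value can be compared against expectations directly
identical(
  getPrintArgs(&amp;quot;letters&amp;quot;),
  list(what = &amp;quot;print&amp;quot;, args = list(x = letters))
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only the thin wrapper remains untestable without the side effect, which keeps the untested surface small.&lt;/p&gt;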
&lt;/div&gt;
&lt;div id=&#34;testing-the-rewritten-function-and-gained-coverage&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Testing the rewritten function and gained coverage&lt;/h1&gt;
&lt;p&gt;Now that our return values are testable across the entire &lt;code&gt;getViewArgs&lt;/code&gt; function, we can easily write tests to cover the entire function. A couple of examples:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_that(&amp;quot;getViewArgs for function&amp;quot;
        , expect_equal(
            getViewArgs(&amp;quot;reshape&amp;quot;)
          , list(what = &amp;quot;View&amp;quot;, args = list(x = reshape, title = &amp;quot;reshape&amp;quot;))
          )
        )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_that(&amp;quot;getViewArgs for data.frame&amp;quot;
        , expect_equal(
            getViewArgs(&amp;quot;datasets::women&amp;quot;)
          , list(what = &amp;quot;View&amp;quot;,
                 args = list(x = data.frame(
                     height = c(58, 59, 60, 61, 62, 63, 64, 65,
                                66, 67, 68, 69, 70, 71, 72),
                     weight = c(115, 117, 120, 123, 126, 129, 132, 135,
                                139, 142, 146, 150, 154, 159, 164)
                     ),
                   title = &amp;quot;datasets::women&amp;quot;
                   )
            )
          )
        )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at the test coverage provided after our changes, we can see that we are at more than 90% coverage for &lt;code&gt;viewSelection.R&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This looks much prettier in the R console ;)
covResults &amp;lt;- covr::package_coverage(pkgPath)
covResults&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## jhaddins Coverage: 82.05%&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R/addRoxytag.R: 40.91%&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R/viewSelection.R: 90.57%&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R/makeCmd.R: 92.86%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at the lines that are not covered for &lt;code&gt;viewSelection.R&lt;/code&gt;, we can see that the uncovered lines are almost all within the &lt;code&gt;viewSelection&lt;/code&gt; function, which is responsible only for executing the addin itself; the remaining two are in &lt;code&gt;viewObject&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covResults &amp;lt;- as.data.frame(covResults)
covResults[covResults$filename == &amp;quot;R/viewSelection.R&amp;quot; &amp;amp;
             covResults$value == 0, c(1:3, 5, 11)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##             filename     functions first_line last_line value
## 59 R/viewSelection.R viewSelection          7         7     0
## 60 R/viewSelection.R viewSelection          8        11     0
## 61 R/viewSelection.R viewSelection         10        10     0
## 62 R/viewSelection.R viewSelection         12        12     0
## 74 R/viewSelection.R    viewObject         50        50     0
## 75 R/viewSelection.R    viewObject         51        51     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In an ideal world we would of course also want to automate testing of the addin execution itself by examining whether its effects in the RStudio IDE are as expected. That is, however, far beyond the scope of this post. For some addin functionality we can nevertheless test the side effects directly, such as when the addin should produce a file with certain content.&lt;/p&gt;
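&lt;p&gt;A side-effect test of that kind could look roughly like the sketch below. The &lt;code&gt;writeRoxyTag&lt;/code&gt; helper is hypothetical and only stands in for an addin function that writes a file; it is not part of the package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(testthat)

# Hypothetical stand-in for an addin helper that writes a file
writeRoxyTag &amp;lt;- function(path, tag = &amp;quot;@export&amp;quot;) {
  writeLines(paste(&amp;quot;#&amp;#39;&amp;quot;, tag), con = path)
  invisible(path)
}

test_that(&amp;quot;writeRoxyTag produces the expected file content&amp;quot;, {
  # Work in a temporary file and clean it up afterwards
  tmp &amp;lt;- tempfile(fileext = &amp;quot;.R&amp;quot;)
  on.exit(unlink(tmp), add = TRUE)

  writeRoxyTag(tmp)

  expect_true(file.exists(tmp))
  expect_identical(readLines(tmp), &amp;quot;#&amp;#39; @export&amp;quot;)
})&lt;/code&gt;&lt;/pre&gt;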
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the package&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;get the &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/tags/addins4-testcoverage&#34;&gt;status of the package after this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;or use &lt;code&gt;git clone&lt;/code&gt; from &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins.git&#34;&gt;&lt;code&gt;https://gitlab.com/jozefhajnala/jhaddins.git&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://testthat.r-lib.org/index.html&#34;&gt;Testthat - unit testing for R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://r-pkgs.had.co.nz/tests.html&#34;&gt;Testing chapter&lt;/a&gt; of R packages by Hadley Wickham&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://covr.r-lib.org/&#34;&gt;covr - Track test coverage for your R package&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>A primer in using Java from R - part 2</title>
      <link>https://jozef.io/r902-primer-java-from-r-2/</link>
      <pubDate>Sat, 07 Jul 2018 13:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r902-primer-java-from-r-2/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In this part of the primer we discuss creating and using custom .jar archives within our R scripts and packages, handling Java exceptions from R, and a quick performance comparison between the low-level and high-level interfaces provided by rJava.&lt;/p&gt;
&lt;p&gt;In the &lt;a href=&#34;https://jozef.io/r901-primer-java-from-r-1/&#34;&gt;first part&lt;/a&gt; we talked about using the rJava package to create objects, call methods and work with arrays. We also examined the various ways to call Java methods and how to call Java code from R directly by executing shell commands.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r901-01-r-java.gif&#34; alt=&#34;R &amp;lt;3 Java, or maybe not?&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;R &amp;lt;3 Java, or maybe not?&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#using-rjava-with-custom-built-classes&#34;&gt;Using rJava with custom built classes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#very-quick-look-at-performance&#34;&gt;Very quick look at performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#usage-of-jars-in-r-packages&#34;&gt;Usage of jars in R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#handling-java-exceptions-in-r&#34;&gt;Handling Java exceptions in R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;using-rjava-with-custom-built-classes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using rJava with custom built classes&lt;/h1&gt;
&lt;div id=&#34;preparing-a-.jar-archive-for-use&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Preparing a .jar archive for use&lt;/h2&gt;
&lt;p&gt;Getting back to our example with running the &lt;code&gt;main&lt;/code&gt; method of our &lt;code&gt;HelloWorldDummy&lt;/code&gt; class from the &lt;a href=&#34;https://jozef.io/r901-primer-java-from-r-1/&#34;&gt;first part of this primer&lt;/a&gt;, in practice we most likely want to actually create objects and invoke methods for such classes rather than simply call the main method.&lt;/p&gt;
&lt;p&gt;For our resources to be available to rJava, we need to create a .jar archive and add it to the class path. An example of the process can be as follows. Compile our code to create the class file, and jar it:&lt;/p&gt;
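&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ javac DummyJavaClassJustForFun/HelloWorldDummy.java
# Keep the package directory in the archive entry, so that the class
# can be resolved as DummyJavaClassJustForFun.HelloWorldDummy
$ jar cvf HelloWorldDummy.jar DummyJavaClassJustForFun/HelloWorldDummy.class&lt;/code&gt;&lt;/pre&gt;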
&lt;/div&gt;
&lt;div id=&#34;adding-the-.jar-file-to-the-class-path&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding the .jar file to the class path&lt;/h2&gt;
&lt;p&gt;Within R, attach rJava, initialize the JVM and investigate our current class path using &lt;code&gt;.jclassPath&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rJava)
.jinit()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.jclassPath()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we add our newly created .jar to the class path using &lt;code&gt;.jaddClassPath&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.jaddClassPath(paste0(jardir, &amp;quot;HelloWorldDummy.jar&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If this worked, we can see the added jar(s) in the class path if we call &lt;code&gt;.jclassPath()&lt;/code&gt; again.&lt;/p&gt;
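&lt;p&gt;A small sketch of such a check, assuming the jar sits in the current working directory (the &lt;code&gt;jardir&lt;/code&gt; value below is an assumption, adjust it to your setup):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rJava)
.jinit()

# Assumption: the jar was built in the current working directory
jardir &amp;lt;- getwd()
.jaddClassPath(file.path(jardir, &amp;quot;HelloWorldDummy.jar&amp;quot;))

# TRUE if an entry with the file name of our jar is on the class path
jarOnClassPath &amp;lt;- &amp;quot;HelloWorldDummy.jar&amp;quot; %in% basename(.jclassPath())
jarOnClassPath&lt;/code&gt;&lt;/pre&gt;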
&lt;/div&gt;
&lt;div id=&#34;creating-objects-investigating-methods-and-fields&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating objects, investigating methods and fields&lt;/h2&gt;
&lt;p&gt;Now that we have our .jar in the class path, we can create a new Java object from our class:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dummyObj &amp;lt;- .jnew(&amp;quot;DummyJavaClassJustForFun/HelloWorldDummy&amp;quot;)
str(dummyObj)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formal class &amp;#39;jobjRef&amp;#39; [package &amp;quot;rJava&amp;quot;] with 2 slots
##   ..@ jobj  :&amp;lt;externalptr&amp;gt; 
##   ..@ jclass: chr &amp;quot;DummyJavaClassJustForFun/HelloWorldDummy&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also investigate the available constructors, methods and fields for our class (or provide the object as argument, then its class will be queried):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.jconstructors&lt;/code&gt; returns a character vector with all constructors for a given class or object&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.jmethods&lt;/code&gt; returns a character vector with all methods for a given class or object&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.jfields&lt;/code&gt; returns a character vector with all fields (aka attributes) for a given class or object&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.DollarNames&lt;/code&gt; returns all fields and methods associated with the object. Method names are followed by ( or () depending on arity.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Requesting vectors of methods, constructors and fields by class
.jmethods(&amp;quot;DummyJavaClassJustForFun/HelloWorldDummy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()&amp;quot;              
##  [2] &amp;quot;public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])&amp;quot;      
##  [3] &amp;quot;public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException&amp;quot;   
##  [4] &amp;quot;public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException&amp;quot;
##  [5] &amp;quot;public final void java.lang.Object.wait() throws java.lang.InterruptedException&amp;quot;           
##  [6] &amp;quot;public boolean java.lang.Object.equals(java.lang.Object)&amp;quot;                                  
##  [7] &amp;quot;public java.lang.String java.lang.Object.toString()&amp;quot;                                       
##  [8] &amp;quot;public native int java.lang.Object.hashCode()&amp;quot;                                             
##  [9] &amp;quot;public final native java.lang.Class java.lang.Object.getClass()&amp;quot;                           
## [10] &amp;quot;public final native void java.lang.Object.notify()&amp;quot;                                        
## [11] &amp;quot;public final native void java.lang.Object.notifyAll()&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.jconstructors(&amp;quot;DummyJavaClassJustForFun/HelloWorldDummy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;public DummyJavaClassJustForFun.HelloWorldDummy()&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.jfields(&amp;quot;DummyJavaClassJustForFun/HelloWorldDummy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NULL&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Requesting vectors of methods, constructors and fields by object
.jmethods(dummyObj)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;public java.lang.String DummyJavaClassJustForFun.HelloWorldDummy.SayMyName()&amp;quot;              
##  [2] &amp;quot;public static void DummyJavaClassJustForFun.HelloWorldDummy.main(java.lang.String[])&amp;quot;      
##  [3] &amp;quot;public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException&amp;quot;   
##  [4] &amp;quot;public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException&amp;quot;
##  [5] &amp;quot;public final void java.lang.Object.wait() throws java.lang.InterruptedException&amp;quot;           
##  [6] &amp;quot;public boolean java.lang.Object.equals(java.lang.Object)&amp;quot;                                  
##  [7] &amp;quot;public java.lang.String java.lang.Object.toString()&amp;quot;                                       
##  [8] &amp;quot;public native int java.lang.Object.hashCode()&amp;quot;                                             
##  [9] &amp;quot;public final native java.lang.Class java.lang.Object.getClass()&amp;quot;                           
## [10] &amp;quot;public final native void java.lang.Object.notify()&amp;quot;                                        
## [11] &amp;quot;public final native void java.lang.Object.notifyAll()&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.jconstructors(dummyObj)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;public DummyJavaClassJustForFun.HelloWorldDummy()&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.jfields(dummyObj)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NULL&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;calling-methods-3-different-ways&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Calling methods 3 different ways&lt;/h2&gt;
&lt;p&gt;We can now invoke our &lt;code&gt;SayMyName&lt;/code&gt; method on this object in the three ways discussed in &lt;a href=&#34;https://jozef.io/r901-primer-java-from-r-1/&#34;&gt;the first part of this primer&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# low level
lres &amp;lt;- .jcall(dummyObj, &amp;quot;Ljava/lang/String;&amp;quot;, &amp;quot;SayMyName&amp;quot;)

# high level
hres &amp;lt;- J(dummyObj, method = &amp;quot;SayMyName&amp;quot;) 

# convenient $ shorthand
dres &amp;lt;- dummyObj$SayMyName() 

c(lres, hres, dres)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;My name is DummyJavaClassJustForFun.HelloWorldDummy&amp;quot;
## [2] &amp;quot;My name is DummyJavaClassJustForFun.HelloWorldDummy&amp;quot;
## [3] &amp;quot;My name is DummyJavaClassJustForFun.HelloWorldDummy&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;very-quick-look-at-performance&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Very quick look at performance&lt;/h1&gt;
&lt;p&gt;The low-level &lt;code&gt;.jcall&lt;/code&gt; interface is much faster, since &lt;code&gt;J&lt;/code&gt; has to use reflection to find the most suitable method. The &lt;code&gt;$&lt;/code&gt; shorthand seems to be the slowest, but it is also very convenient, as it supports code completion:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;microbenchmark::microbenchmark(times = 100
, .jcall(dummyObj, &amp;quot;Ljava/lang/String;&amp;quot;, &amp;quot;SayMyName&amp;quot;)
, J(dummyObj, &amp;quot;SayMyName&amp;quot;)
, dummyObj$SayMyName()
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Unit: microseconds
##                                                 expr      min       lq
##  .jcall(dummyObj, &amp;quot;Ljava/lang/String;&amp;quot;, &amp;quot;SayMyName&amp;quot;)   45.503   65.507
##                             J(dummyObj, &amp;quot;SayMyName&amp;quot;)  870.890  917.514
##                                 dummyObj$SayMyName() 1148.603 1217.089
##        mean    median       uq      max neval
##    95.20935   77.6195   84.445 1976.195   100
##  1091.08645  963.7035 1064.606 7603.580   100
##  1307.03536 1260.5855 1377.438 1731.829   100&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;usage-of-jars-in-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Usage of jars in R packages&lt;/h1&gt;
&lt;p&gt;To make using rJava within an R package easier, Simon Urbanek, the author of rJava, provides the convenience function &lt;code&gt;.jpackage&lt;/code&gt;, which initializes the JVM and registers the Java classes and native code contained in the package with it. A quick step-by-step guide to using .jars within a package is as follows:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;place our .jars into &lt;code&gt;inst/java/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;add &lt;code&gt;Depends: rJava&lt;/code&gt; and &lt;code&gt;SystemRequirements: Java&lt;/code&gt; into our &lt;code&gt;DESCRIPTION&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;add a call to &lt;code&gt;.jpackage(pkgname, lib.loc=libname)&lt;/code&gt; into our &lt;code&gt;.onLoad&lt;/code&gt; function (or &lt;code&gt;.First.lib&lt;/code&gt; in very old packages without a namespace), for example like so:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.onLoad &amp;lt;- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc = libname)
}&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;if possible, add the &lt;code&gt;.java&lt;/code&gt; source files into the &lt;code&gt;/java&lt;/code&gt; folder of our package&lt;/li&gt;
&lt;/ol&gt;
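&lt;p&gt;Because the contents of &lt;code&gt;inst/&lt;/code&gt; are copied to the top level of the installed package, the jars from step 1 end up in its &lt;code&gt;java/&lt;/code&gt; folder. A minimal sketch for listing them; &lt;code&gt;bundledJars&lt;/code&gt; is a hypothetical helper name and &lt;code&gt;mypkg&lt;/code&gt; a placeholder package name:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: list the .jar files shipped with an installed package
bundledJars &amp;lt;- function(pkg) {
  javaDir &amp;lt;- system.file(&amp;quot;java&amp;quot;, package = pkg)
  # system.file returns an empty string when the folder does not exist
  if (!nzchar(javaDir)) return(character(0))
  list.files(javaDir, pattern = &amp;quot;\\.jar$&amp;quot;, full.names = TRUE)
}

# Usage with a placeholder package name:
# bundledJars(&amp;quot;mypkg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;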
&lt;blockquote&gt;
&lt;p&gt;If you are interested in more detail than provided in this super-quick overview, Tobias Verbeke created a &lt;a href=&#34;https://cran.r-project.org/web/packages/helloJavaWorld/index.html&#34;&gt;Hello Java World! package&lt;/a&gt; with a &lt;a href=&#34;https://cran.r-project.org/web/packages/helloJavaWorld/vignettes/helloJavaWorld.pdf&#34;&gt;vignette&lt;/a&gt; providing a verbose step-by-step tutorial for interfacing to Java archives inside R packages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div id=&#34;setting-java.parameters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Setting java.parameters&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;.jpackage&lt;/code&gt; function calls &lt;code&gt;.jinit&lt;/code&gt; with the default &lt;code&gt;parameters = getOption(&amp;quot;java.parameters&amp;quot;)&lt;/code&gt;, so if we want to set some of the java parameters, we can do it for example like so:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.onLoad &amp;lt;- function(libname, pkgname) {
  options(java.parameters = c(&amp;quot;-Xmx1000m&amp;quot;))
  .jpackage(pkgname, lib.loc = libname)
}&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that the &lt;code&gt;options&lt;/code&gt; call needs to be made before the call to &lt;code&gt;.jpackage&lt;/code&gt;, as Java parameters can only be applied during JVM initialization. Consequently, this will only work if no other package has already initialized the JVM.&lt;/p&gt;
&lt;/blockquote&gt;
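&lt;p&gt;One way to check whether the setting took effect is to ask the JVM for its maximum heap size. A rough sketch using the standard &lt;code&gt;java.lang.Runtime&lt;/code&gt; class, run here in a fresh R session instead of inside a package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rJava)

# Must happen before the JVM is initialized
options(java.parameters = &amp;quot;-Xmx1000m&amp;quot;)
.jinit()

# Runtime.getRuntime() is static, so we call it on the class name
runtime &amp;lt;- .jcall(&amp;quot;java/lang/Runtime&amp;quot;, &amp;quot;Ljava/lang/Runtime;&amp;quot;, &amp;quot;getRuntime&amp;quot;)

# maxMemory() returns bytes as a Java long, hence the &amp;quot;J&amp;quot; signature
maxHeapMb &amp;lt;- .jcall(runtime, &amp;quot;J&amp;quot;, &amp;quot;maxMemory&amp;quot;) / 1024^2
maxHeapMb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The reported value is usually close to the requested limit, but it does not have to match it exactly. If another package initialized the JVM earlier in the session, the value will reflect that earlier configuration.&lt;/p&gt;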
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;handling-java-exceptions-in-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Handling Java exceptions in R&lt;/h1&gt;
&lt;p&gt;rJava maps Java exceptions to R conditions relayed by the &lt;code&gt;stop&lt;/code&gt; function, therefore we can use the standard R mechanisms such as &lt;code&gt;tryCatch&lt;/code&gt; to handle the exceptions.&lt;/p&gt;
&lt;p&gt;The R condition object (let us call it &lt;code&gt;e&lt;/code&gt;) is an S3 object (a list) that contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;message&lt;/code&gt; - a character string with the message of the exception&lt;/li&gt;
&lt;li&gt;&lt;code&gt;call&lt;/code&gt; - a &lt;code&gt;language&lt;/code&gt; object containing the call resulting in the exception&lt;/li&gt;
&lt;li&gt;&lt;code&gt;jobj&lt;/code&gt; - an &lt;code&gt;S4&lt;/code&gt; object containing the actual exception object, so we can for example investigate its class: &lt;code&gt;e[[&amp;quot;jobj&amp;quot;]]@jclass&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tryCatch(
  iOne &amp;lt;- .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1),
  error = function(e) {
    message(&amp;quot;\nLets look at the condition object:&amp;quot;)
    str(e)
    
    message(&amp;quot;\nClass of the jobj item:&amp;quot;)
    print(e[[&amp;quot;jobj&amp;quot;]]@jclass)
    
    message(&amp;quot;\nClasses of the condition object: &amp;quot;)
    class(e)
  }
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Lets look at the condition object:&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## List of 3
##  $ message: chr &amp;quot;java.lang.NoSuchMethodError: &amp;lt;init&amp;gt;&amp;quot;
##  $ call   : language .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1)
##  $ jobj   :Formal class &amp;#39;jobjRef&amp;#39; [package &amp;quot;rJava&amp;quot;] with 2 slots
##   .. ..@ jobj  :&amp;lt;externalptr&amp;gt; 
##   .. ..@ jclass: chr &amp;quot;java/lang/NoSuchMethodError&amp;quot;
##  - attr(*, &amp;quot;class&amp;quot;)= chr [1:9] &amp;quot;NoSuchMethodError&amp;quot; &amp;quot;IncompatibleClassChangeError&amp;quot; &amp;quot;LinkageError&amp;quot; &amp;quot;Error&amp;quot; ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Class of the jobj item:&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;java/lang/NoSuchMethodError&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Classes of the condition object:&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;NoSuchMethodError&amp;quot;            &amp;quot;IncompatibleClassChangeError&amp;quot;
## [3] &amp;quot;LinkageError&amp;quot;                 &amp;quot;Error&amp;quot;                       
## [5] &amp;quot;Throwable&amp;quot;                    &amp;quot;Object&amp;quot;                      
## [7] &amp;quot;Exception&amp;quot;                    &amp;quot;error&amp;quot;                       
## [9] &amp;quot;condition&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;code&gt;class(e)&lt;/code&gt; is a vector of simple Java class names, R code can register handlers for them directly, and we can handle different exception classes differently:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;withCallingHandlers(
  iOne &amp;lt;- .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1)
  , error = function(e) {
    message(&amp;quot;Meh, just a boring error&amp;quot;)
  }
  , NoSuchMethodError = function(e) {
    message(&amp;quot;We have a NoSuchMethodError&amp;quot;)
  }
  , IncompatibleClassChangeError = function(e) {
    message(&amp;quot;We also have a IncompatibleClassChangeError - lets recover&amp;quot;)
    recover()
    # recovering here and looking at 
    # 2: .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1)
    # we see that the issue is in 
    # str(list(...))
    # List of 1
    #  $ : num 1
    # We actually passed a numeric, not integer
    # To fix it, just do
    # .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1L)
  }
  , LinkageError = function(e) {
    message(&amp;quot;Ok, this is getting a bit overwhelming,
               lets smile and end here
               :o)&amp;quot;)
  }
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Meh, just a boring error&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## We have a NoSuchMethodError&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## We also have a IncompatibleClassChangeError - lets recover&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## recover called non-interactively; frames dumped, use debugger() to view&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Ok, this is getting a bit overwhelming,
##                lets smile and end here
##                :o)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error in .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1): java.lang.NoSuchMethodError: &amp;lt;init&amp;gt;&lt;/code&gt;&lt;/pre&gt;
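&lt;p&gt;If we want to actually recover from the exception rather than just observe it, &lt;code&gt;tryCatch&lt;/code&gt; accepts the same class-named handlers. A minimal sketch that retries with an integer argument, which is the fix suggested in the comments above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rJava)
.jinit()

iOne &amp;lt;- tryCatch(
  # Fails, because the numeric 1 does not match any Integer constructor
  .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1),
  NoSuchMethodError = function(e) {
    message(&amp;quot;No constructor for a numeric argument, retrying with an integer&amp;quot;)
    .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, 1L)
  }
)

# The recovered object can be used as usual
.jcall(iOne, &amp;quot;I&amp;quot;, &amp;quot;intValue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unlike &lt;code&gt;withCallingHandlers&lt;/code&gt;, which lets the error continue to propagate after the handlers have run, &lt;code&gt;tryCatch&lt;/code&gt; unwinds the failed call and returns the value of the handler it selected.&lt;/p&gt;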
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/helloJavaWorld/vignettes/helloJavaWorld.pdf&#34;&gt;Hello Java World! vignette&lt;/a&gt; - a tutorial for interfacing to Java archives inside R packages by Tobias Verbeke&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rforge.net/rJava/&#34;&gt;rJava basic crashcourse&lt;/a&gt; - at the rJava site on rforge, scroll down to the Documentation section&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/types.html#wp276&#34;&gt;The JNI Type Signatures&lt;/a&gt; - at Oracle JNI specs&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/rJava/rJava.pdf&#34;&gt;rJava documentation&lt;/a&gt; on CRAN&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://darrenjw.wordpress.com/2011/01/01/calling-java-code-from-r/&#34;&gt;Calling Java code from R&lt;/a&gt; by prof. Darren Wilkinson&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>A primer in using Java from R - part 1</title>
      <link>https://jozef.io/r901-primer-java-from-r-1/</link>
      <pubDate>Sat, 23 Jun 2018 13:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r901-primer-java-from-r-1/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This primer consists of two parts, and its goal is to provide a walk-through of using resources developed in Java from R. It is structured more as a “note-to-future-self” than a proper educational article, but I hope that some readers may still find it useful. It also lists a set of &lt;a href=&#34;#references&#34;&gt;references&lt;/a&gt; that I found very helpful, for which I thank the respective authors.&lt;/p&gt;
&lt;p&gt;The primer is split into 2 posts:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;In this first one we talk about using the rJava package to create objects, call methods and work with arrays. We also examine the various ways to call Java methods and how to call Java code from R directly by executing shell commands.&lt;/li&gt;
&lt;li&gt;In the &lt;a href=&#34;https://jozef.io/r902-primer-java-from-r-2/&#34;&gt;second one&lt;/a&gt; we discuss creating and using custom .jar archives within our R scripts and packages, handling Java exceptions from R, and a quick performance comparison between the low-level and high-level interfaces provided by rJava.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r901-01-r-java.gif&#34; alt=&#34;R &amp;lt;3 Java, or maybe not?&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;R &amp;lt;3 Java, or maybe not?&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#calling-java-from-r-directly&#34;&gt;Calling Java from R directly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-rjava-package---an-r-to-java-interface&#34;&gt;The rJava package - an R to Java interface&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#calling-java-methods-using-the-rjava-package&#34;&gt;Calling Java methods using the rJava package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#signatures-in-jni-notation&#34;&gt;Signatures in JNI notation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;calling-java-from-r-directly&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Calling Java from R directly&lt;/h1&gt;
&lt;p&gt;Calling Java resources from R directly can be achieved using R’s &lt;code&gt;system()&lt;/code&gt; function, which invokes the specified OS command. We can either use an already compiled Java class, or invoke the compilation via a &lt;code&gt;system()&lt;/code&gt; call from R as well. Of course, for any real-world use we will probably do the Java coding, compilation and jarring in a Java IDE and provide R with just the final .jar file(s). I nevertheless find it helpful to have a small example of the simplest complete case, for which the following is sufficient. Integrating pre-prepared .jars into R packages is covered in the &lt;a href=&#34;https://jozef.io/r902-primer-java-from-r-2/&#34;&gt;second part of this primer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let us show that by writing a very silly dummy class with just 2 methods:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;main&lt;/code&gt;, which prints “Hello, World.” followed by its first argument, or by the default “And that is it.” if no argument is provided&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SayMyName&lt;/code&gt;, which returns a string constructed from “My name is” and &lt;code&gt;getClass().getName()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This &lt;code&gt;HelloWorldDummy.java&lt;/code&gt; file can look as follows:&lt;/p&gt;
&lt;pre class=&#34;java&#34;&gt;&lt;code&gt;package DummyJavaClassJustForFun;

public class HelloWorldDummy {

  public String SayMyName() {
   return(&amp;quot;My name is &amp;quot; + getClass().getName());
  }
  
  public static void main(String[] args) {
    String stringArg = &amp;quot;And that is it.&amp;quot;;
    if (args.length &amp;gt; 0) {
      stringArg = args[0];
    }
    System.out.println(&amp;quot;Hello, World. &amp;quot; + stringArg);
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;compilation-and-execution-via-bash-commands&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Compilation and execution via bash commands&lt;/h2&gt;
&lt;p&gt;Now that we have our dummy class ready, we can put together the commands and test them by just executing via a shell, or for RStudio fans, we can test the commands via RStudio’s cool Terminal feature. First, the compilation command, which may look something like the following, assuming that we are in the correct working directory:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ javac DummyJavaClassJustForFun/HelloWorldDummy.java&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have the class compiled, we can execute the &lt;code&gt;main&lt;/code&gt; method, with and without the argument provided:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ java DummyJavaClassJustForFun/HelloWorldDummy
$ java DummyJavaClassJustForFun/HelloWorldDummy &amp;quot;I like winter&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In case we need to compile and run with more .jars that are in folder &lt;code&gt;jars/&lt;/code&gt;, we specify the folder using &lt;code&gt;-cp&lt;/code&gt; (class path):&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ javac -cp &amp;quot;jars/*&amp;quot; DummyJavaClassJustForFun/HelloWorldDummy.java
$ java -cp &amp;quot;jars/*:compile/src&amp;quot; DummyJavaClassJustForFun/HelloWorldDummy&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;compilation-and-execution-of-java-code-from-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Compilation and execution of Java code from R&lt;/h2&gt;
&lt;p&gt;Now that we have tested our commands, we can use R to do the compilation via the &lt;code&gt;system&lt;/code&gt; function. Do not forget to &lt;code&gt;cd&lt;/code&gt; into the correct directory within a single system call if needed:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;system(&amp;#39;cd data/; javac DummyJavaClassJustForFun/HelloWorldDummy.java&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After that we can also execute the &lt;code&gt;main&lt;/code&gt; method, and the &lt;code&gt;main&lt;/code&gt; method with one argument specified, just like we did it outside of R, once again using &lt;code&gt;cd&lt;/code&gt; to enter the proper working directory if needed:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;system(&amp;#39;cd data/; java DummyJavaClassJustForFun/HelloWorldDummy&amp;#39;)
system(&amp;#39;cd data/; java DummyJavaClassJustForFun/HelloWorldDummy &amp;quot;Also I like winter&amp;quot;&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
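&lt;p&gt;Note that &lt;code&gt;system&lt;/code&gt; only prints the program output by default. If we want to work with the output in R, we can capture it, for example with &lt;code&gt;system2&lt;/code&gt;. A sketch under the assumption that the class was compiled below &lt;code&gt;data/&lt;/code&gt; as above, using &lt;code&gt;-cp&lt;/code&gt; instead of &lt;code&gt;cd&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;out &amp;lt;- system2(
  &amp;quot;java&amp;quot;,
  args = c(
    &amp;quot;-cp&amp;quot;, &amp;quot;data&amp;quot;,
    &amp;quot;DummyJavaClassJustForFun.HelloWorldDummy&amp;quot;,
    # system2 does not quote arguments for us, so we do it explicitly
    shQuote(&amp;quot;Also I like winter&amp;quot;)
  ),
  stdout = TRUE  # return standard output as a character vector
)
out&lt;/code&gt;&lt;/pre&gt;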
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-rjava-package---an-r-to-java-interface&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The rJava package - an R to Java interface&lt;/h1&gt;
&lt;p&gt;The rJava package provides a low-level interface to the Java virtual machine. It allows creating objects, calling methods and accessing fields of the objects. It also provides functionality to easily include our Java resources in R packages.&lt;/p&gt;
&lt;p&gt;We can install it with the classic:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;rJava&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the system requirements: Java JDK 1.2 or higher, and JDK 1.4 or higher for JRI/REngine. After attaching the package, we also need to initialize a Java Virtual Machine (JVM):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Attach rJava and Init a JVM
library(rJava)
.jinit()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In case of issues with attaching the package using &lt;code&gt;library&lt;/code&gt;, one can refer to &lt;a href=&#34;https://stackoverflow.com/questions/37735108/r-error-onload-failed-in-loadnamespace-for-rjava&#34;&gt;this helpful StackOverflow thread&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;creating-java-objects-with-rjava&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating Java objects with rJava&lt;/h2&gt;
&lt;p&gt;We will now very quickly go through the basic uses of the package. The &lt;code&gt;.jnew&lt;/code&gt; function is used to create a new Java object. Note that the &lt;code&gt;class&lt;/code&gt; argument requires a fully qualified class name in Java Native Interface notation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Creating a new object of java.lang class String
sHello &amp;lt;- .jnew(class = &amp;quot;java/lang/String&amp;quot;, &amp;quot;Hello World!&amp;quot;)
# Creating a new object of java.lang class Integer
iOne &amp;lt;- .jnew(class = &amp;quot;java/lang/Integer&amp;quot;, &amp;quot;1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
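&lt;p&gt;As a small illustration of the notation, the class names used above are the usual dotted Java names with the dots replaced by slashes. The helper below is a hypothetical convenience function written only for this example; it is not part of rJava:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper (not an rJava function): convert a dotted Java
# class name to the slash-separated form used in the calls above
to_jni_class &amp;lt;- function(class_name) {
  gsub(&amp;quot;.&amp;quot;, &amp;quot;/&amp;quot;, class_name, fixed = TRUE)
}

to_jni_class(&amp;quot;java.lang.String&amp;quot;)
to_jni_class(&amp;quot;java.lang.Integer&amp;quot;)&lt;/code&gt;&lt;/pre&gt;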
&lt;/div&gt;
&lt;div id=&#34;working-with-arrays-via-rjava&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Working with arrays via rJava&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Creating new arrays
iArray &amp;lt;- .jarray(1L:2L)
.jevalArray(iArray)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 1 2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Using a list of 2 and lapply
# Integer Matrix int[2][2]
iMatrix &amp;lt;- .jarray(list(iArray, iArray), contents.class = &amp;quot;[I&amp;quot;)
lapply(iMatrix, .jevalArray)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] 1 2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Integer Matrix int[2][2]
square &amp;lt;- array(1:4, dim = c(2, 2))
square&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Using dispatch = TRUE to create the array 
# Using simplify = TRUE to return a nice R array
dSquare &amp;lt;- .jarray(square, dispatch = TRUE)
.jevalArray(dSquare, simplify = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Integer Tesseract int[2][2][2][2]
tesseract &amp;lt;- array(1L:16L, dim = c(2, 2, 2, 2))
tesseract&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## , , 1, 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]   13   15
## [2,]   14   16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Use dispatch = TRUE to create the array 
# Use simplify = TRUE to return a nice R array
# Interestingly, the round trip does not reproduce the original values
dTesseract &amp;lt;- .jarray(tesseract, dispatch = TRUE)
.jevalArray(dTesseract, simplify = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## , , 1, 1
## 
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    0
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    8
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]    9    0
## [2,]    0    0
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0   16&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;calling-java-methods-using-the-rjava-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Calling Java methods using the rJava package&lt;/h1&gt;
&lt;p&gt;rJava provides two levels of API:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fast, but inflexible low-level JNI-API in the form of the &lt;code&gt;.jcall&lt;/code&gt; function&lt;/li&gt;
&lt;li&gt;convenient (at the cost of performance) high-level reflection API based on the &lt;code&gt;$&lt;/code&gt; operator.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In practice, the rJava package gives us three ways to call Java methods, each with its own positives and negatives.&lt;/p&gt;
&lt;div id=&#34;the-low-level-way---.jcall&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The low-level way - &lt;code&gt;.jcall()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;.jcall(obj, returnSig = &amp;quot;V&amp;quot;, method, ...)&lt;/code&gt; calls a Java method with the supplied arguments the “low-level” way. A few important notes regarding its usage follow; for more, refer to the R help on &lt;code&gt;.jcall&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;requires exact match of argument and return types, doesn’t perform any lookup in the reflection tables&lt;/li&gt;
&lt;li&gt;passing sub-classes of the classes present in the method definition requires explicit casting using &lt;code&gt;.jcast&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;passing null arguments needs a proper class specification with &lt;code&gt;.jnull&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;a vector of length 1 corresponding to a native Java type is considered a scalar; to be safe, use &lt;code&gt;.jarray&lt;/code&gt; to pass a vector as an array&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Calling a Java method length on the object low-level way
.jcall(sHello, returnSig = &amp;quot;I&amp;quot;, &amp;quot;length&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 12&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Also we must be careful with the data types:

# This works
.jcall(sHello, returnSig = &amp;quot;C&amp;quot;, &amp;quot;charAt&amp;quot;, 5L)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 32&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This does not
.jcall(sHello, returnSig = &amp;quot;C&amp;quot;, &amp;quot;charAt&amp;quot;, 5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error in .jcall(sHello, returnSig = &amp;quot;C&amp;quot;, &amp;quot;charAt&amp;quot;, 5): method charAt with signature (D)C not found&lt;/code&gt;&lt;/pre&gt;
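&lt;p&gt;Both results above can be traced back to R’s own data types, and we can check this in base R without rJava. A plain &lt;code&gt;5&lt;/code&gt; is a double, which matches the &lt;code&gt;D&lt;/code&gt; in the error message, while &lt;code&gt;5L&lt;/code&gt; is an integer. The returned &lt;code&gt;32&lt;/code&gt; is the numeric code of the character at that position. The sketch below is only illustrative and the variable name is made up:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A number without the L suffix is a double, with it an integer
typeof(5)
typeof(5L)

# The char returned by charAt arrives as its numeric code,
# intToUtf8 converts such a code back to a character
space_code &amp;lt;- 32L
intToUtf8(space_code)

# Java indexes from 0 and R from 1, so charAt(5) is the 6th character
substr(&amp;quot;Hello World!&amp;quot;, 6L, 6L)&lt;/code&gt;&lt;/pre&gt;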
&lt;/div&gt;
&lt;div id=&#34;the-high-level-way---j&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The high-level way - &lt;code&gt;J()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;J(class, method, ...)&lt;/code&gt; is the high-level API for accessing Java. It is slower than &lt;code&gt;.jnew&lt;/code&gt; or &lt;code&gt;.jcall&lt;/code&gt; since it has to use reflection to find the most suitable method.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to call a method, the &lt;code&gt;method&lt;/code&gt; argument must be present as a character vector of length 1&lt;/li&gt;
&lt;li&gt;if &lt;code&gt;method&lt;/code&gt; is missing, &lt;code&gt;J&lt;/code&gt; creates a class name reference&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Calling a Java method length on the object high-level way
J(sHello, &amp;quot;length&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 12&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The high-level API does not help with charAt here either
J(sHello, &amp;quot;charAt&amp;quot;, 5L)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error in .jcall(o, &amp;quot;I&amp;quot;, &amp;quot;intValue&amp;quot;): method intValue with signature ()I not found&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;J(sHello, &amp;quot;charAt&amp;quot;, 5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error in .jcall(&amp;quot;RJavaTools&amp;quot;, &amp;quot;Ljava/lang/Object;&amp;quot;, &amp;quot;invokeMethod&amp;quot;, cl, : java.lang.NoSuchMethodException: No suitable method for the given parameters&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-high-level-way-with-convenience--&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The high-level way with convenience - &lt;code&gt;$&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Closely connected to the &lt;code&gt;J&lt;/code&gt; function, the &lt;code&gt;$&lt;/code&gt; operator for &lt;code&gt;jobjRef&lt;/code&gt; Java object references provides convenient access to object attributes and Java methods. rJava also implements the relevant methods for R’s completion generator.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;$&lt;/code&gt; returns either the value of the attribute or calls a method, depending on which name matches first&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$&amp;lt;-&lt;/code&gt; assigns a value to the corresponding Java attribute&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# And via the $ operator
sHello$length()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 12&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# But these still do not work
sHello$charAt(5L)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error in .jcall(o, &amp;quot;I&amp;quot;, &amp;quot;intValue&amp;quot;): method intValue with signature ()I not found&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sHello$charAt(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Error in .jcall(&amp;quot;RJavaTools&amp;quot;, &amp;quot;Ljava/lang/Object;&amp;quot;, &amp;quot;invokeMethod&amp;quot;, cl, : java.lang.NoSuchMethodException: No suitable method for the given parameters&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;examining-methods-and-fields&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Examining methods and fields&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;.DollarNames&lt;/code&gt; returns all fields and methods associated with the object. Method names are followed by &lt;code&gt;(&lt;/code&gt; or &lt;code&gt;()&lt;/code&gt; depending on arity:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# vector of all fields and methods associated with sHello
.DollarNames(sHello)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;CASE_INSENSITIVE_ORDER&amp;quot; &amp;quot;equals(&amp;quot;               
##  [3] &amp;quot;toString()&amp;quot;             &amp;quot;hashCode()&amp;quot;            
##  [5] &amp;quot;compareTo(&amp;quot;             &amp;quot;compareTo(&amp;quot;            
##  [7] &amp;quot;indexOf(&amp;quot;               &amp;quot;indexOf(&amp;quot;              
##  [9] &amp;quot;indexOf(&amp;quot;               &amp;quot;indexOf(&amp;quot;              
## [11] &amp;quot;valueOf(&amp;quot;               &amp;quot;valueOf(&amp;quot;              
## [13] &amp;quot;valueOf(&amp;quot;               &amp;quot;valueOf(&amp;quot;              
## [15] &amp;quot;valueOf(&amp;quot;               &amp;quot;valueOf(&amp;quot;              
## [17] &amp;quot;valueOf(&amp;quot;               &amp;quot;valueOf(&amp;quot;              
## [19] &amp;quot;valueOf(&amp;quot;               &amp;quot;length()&amp;quot;              
## [21] &amp;quot;isEmpty()&amp;quot;              &amp;quot;charAt(&amp;quot;               
## [23] &amp;quot;codePointAt(&amp;quot;           &amp;quot;codePointBefore(&amp;quot;      
## [25] &amp;quot;codePointCount(&amp;quot;        &amp;quot;offsetByCodePoints(&amp;quot;   
## [27] &amp;quot;getChars(&amp;quot;              &amp;quot;getBytes()&amp;quot;            
## [29] &amp;quot;getBytes(&amp;quot;              &amp;quot;getBytes(&amp;quot;             
## [31] &amp;quot;getBytes(&amp;quot;              &amp;quot;contentEquals(&amp;quot;        
## [33] &amp;quot;contentEquals(&amp;quot;         &amp;quot;equalsIgnoreCase(&amp;quot;     
## [35] &amp;quot;compareToIgnoreCase(&amp;quot;   &amp;quot;regionMatches(&amp;quot;        
## [37] &amp;quot;regionMatches(&amp;quot;         &amp;quot;startsWith(&amp;quot;           
## [39] &amp;quot;startsWith(&amp;quot;            &amp;quot;endsWith(&amp;quot;             
## [41] &amp;quot;lastIndexOf(&amp;quot;           &amp;quot;lastIndexOf(&amp;quot;          
## [43] &amp;quot;lastIndexOf(&amp;quot;           &amp;quot;lastIndexOf(&amp;quot;          
## [45] &amp;quot;substring(&amp;quot;             &amp;quot;substring(&amp;quot;            
## [47] &amp;quot;subSequence(&amp;quot;           &amp;quot;concat(&amp;quot;               
## [49] &amp;quot;replace(&amp;quot;               &amp;quot;replace(&amp;quot;              
## [51] &amp;quot;matches(&amp;quot;               &amp;quot;contains(&amp;quot;             
## [53] &amp;quot;replaceFirst(&amp;quot;          &amp;quot;replaceAll(&amp;quot;           
## [55] &amp;quot;split(&amp;quot;                 &amp;quot;split(&amp;quot;                
## [57] &amp;quot;join(&amp;quot;                  &amp;quot;join(&amp;quot;                 
## [59] &amp;quot;toLowerCase(&amp;quot;           &amp;quot;toLowerCase()&amp;quot;         
## [61] &amp;quot;toUpperCase()&amp;quot;          &amp;quot;toUpperCase(&amp;quot;          
## [63] &amp;quot;trim()&amp;quot;                 &amp;quot;toCharArray()&amp;quot;         
## [65] &amp;quot;format(&amp;quot;                &amp;quot;format(&amp;quot;               
## [67] &amp;quot;copyValueOf(&amp;quot;           &amp;quot;copyValueOf(&amp;quot;          
## [69] &amp;quot;intern()&amp;quot;               &amp;quot;wait(&amp;quot;                 
## [71] &amp;quot;wait(&amp;quot;                  &amp;quot;wait()&amp;quot;                
## [73] &amp;quot;getClass()&amp;quot;             &amp;quot;notify()&amp;quot;              
## [75] &amp;quot;notifyAll()&amp;quot;            &amp;quot;chars()&amp;quot;               
## [77] &amp;quot;codePoints()&amp;quot;&lt;/code&gt;&lt;/pre&gt;
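&lt;p&gt;Overloaded methods such as &lt;code&gt;indexOf&lt;/code&gt; appear to be listed once per overload, so the vector contains many repeats. A minimal sketch of reducing such output to distinct method names with base R follows. It uses a short excerpt of the output above, typed in by hand, so that it runs without rJava:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A hand-copied excerpt of the .DollarNames(sHello) output above
dollar_names &amp;lt;- c(
  &amp;quot;CASE_INSENSITIVE_ORDER&amp;quot;, &amp;quot;equals(&amp;quot;, &amp;quot;toString()&amp;quot;
, &amp;quot;indexOf(&amp;quot;, &amp;quot;indexOf(&amp;quot;, &amp;quot;length()&amp;quot;, &amp;quot;charAt(&amp;quot;
)

# Keep only the entries containing a parenthesis (the methods),
# strip everything from the parenthesis on and drop the repeats
is_method &amp;lt;- grepl(&amp;quot;(&amp;quot;, dollar_names, fixed = TRUE)
method_names &amp;lt;- unique(sub(&amp;quot;\\(.*$&amp;quot;, &amp;quot;&amp;quot;, dollar_names[is_method]))
method_names&lt;/code&gt;&lt;/pre&gt;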
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;signatures-in-jni-notation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Signatures in JNI notation&lt;/h1&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Java Type&lt;/th&gt;
&lt;th&gt;Signature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Z&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;byte&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;char&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;short&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;I&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;long&lt;/td&gt;
&lt;td&gt;J&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;double&lt;/td&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;type[]&lt;/td&gt;
&lt;td&gt;[ type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;method type&lt;/td&gt;
&lt;td&gt;( arg-types ) ret-type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;fully-qualified-class&lt;/td&gt;
&lt;td&gt;Lfully-qualified-class ;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;in-the-fully-qualified-class-row-of-the-table-above-note-the&#34;&gt;In the fully-qualified-class row of the table above note the&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;L&lt;/code&gt; prefix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;;&lt;/code&gt; suffix&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;for-example&#34;&gt;For example&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;the Java method: &lt;code&gt;long f (int n, String s, int[] arr);&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;has type signature: &lt;code&gt;(ILjava/lang/String;[I)J&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
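&lt;p&gt;The rules in the table can also be sketched as a pair of small base R functions. Both &lt;code&gt;jni_type&lt;/code&gt; and &lt;code&gt;jni_method&lt;/code&gt; are hypothetical helpers written for this illustration, not part of rJava. They only cover the cases listed in the table, plus &lt;code&gt;V&lt;/code&gt; for void, which is the default &lt;code&gt;returnSig&lt;/code&gt; of &lt;code&gt;.jcall&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: map one Java type name to its JNI signature
jni_type &amp;lt;- function(type) {
  primitives &amp;lt;- c(
    boolean = &amp;quot;Z&amp;quot;, byte = &amp;quot;B&amp;quot;, char = &amp;quot;C&amp;quot;, short = &amp;quot;S&amp;quot;, int = &amp;quot;I&amp;quot;
  , long = &amp;quot;J&amp;quot;, float = &amp;quot;F&amp;quot;, double = &amp;quot;D&amp;quot;, void = &amp;quot;V&amp;quot;
  )
  # Arrays: strip one trailing [] and prefix the element signature with [
  if (grepl(&amp;quot;\\[\\]$&amp;quot;, type)) {
    return(paste0(&amp;quot;[&amp;quot;, jni_type(sub(&amp;quot;\\[\\]$&amp;quot;, &amp;quot;&amp;quot;, type))))
  }
  if (type %in% names(primitives)) {
    return(unname(primitives[type]))
  }
  # Classes: L prefix, slashes instead of dots, ; suffix
  paste0(&amp;quot;L&amp;quot;, gsub(&amp;quot;.&amp;quot;, &amp;quot;/&amp;quot;, type, fixed = TRUE), &amp;quot;;&amp;quot;)
}

# Hypothetical helper: build a method signature ( arg-types ) ret-type
jni_method &amp;lt;- function(args, ret) {
  arg_sigs &amp;lt;- vapply(args, jni_type, character(1))
  paste0(&amp;quot;(&amp;quot;, paste(arg_sigs, collapse = &amp;quot;&amp;quot;), &amp;quot;)&amp;quot;, jni_type(ret))
}

# long f (int n, String s, int[] arr);
jni_method(c(&amp;quot;int&amp;quot;, &amp;quot;java.lang.String&amp;quot;, &amp;quot;int[]&amp;quot;), &amp;quot;long&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last call reproduces the &lt;code&gt;(ILjava/lang/String;[I)J&lt;/code&gt; signature from the example above.&lt;/p&gt;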
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rforge.net/rJava/&#34;&gt;rJava basic crashcourse&lt;/a&gt; - at the rJava site on rforge, scroll down to the Documentation section&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/types.html#wp276&#34;&gt;The JNI Type Signatures&lt;/a&gt; - at Oracle JNI specs&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/rJava/rJava.pdf&#34;&gt;rJava documentation&lt;/a&gt; on CRAN&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://darrenjw.wordpress.com/2011/01/01/calling-java-code-from-r/&#34;&gt;Calling Java code from R&lt;/a&gt; by prof. Darren Wilkinson&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Java_Native_Interface#Mapping_types&#34;&gt;Mapping of types between Java (JNI) and native code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/questions/37735108/r-error-onload-failed-in-loadnamespace-for-rjava&#34;&gt;Fixing issues with loading rJava&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>R:case4base - data aggregation with base R</title>
      <link>https://jozef.io/r003-aggregation/</link>
      <pubDate>Sat, 09 Jun 2018 13:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r003-aggregation/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the previous articles of the R:case4base series we discussed and learned&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how to &lt;a href=&#34;..\r001-reshape&#34;&gt;reshape data with base R&lt;/a&gt; to a form that is practical for our use and&lt;/li&gt;
&lt;li&gt;how to &lt;a href=&#34;..\r002-data-manipulation&#34;&gt;subset data&lt;/a&gt; to get the relevant parts of it with base R.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this one, we will look at aggregation techniques using base R’s &lt;code&gt;stats::aggregate&lt;/code&gt; generic function, focusing on the method for data frames. This will allow us to easily and safely create simple aggregations, but also provide a framework for completely custom aggregation functionality defined as separate functions that can be properly documented and unit tested.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#how-to-use-this-article&#34;&gt;How to use this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#simple-aggregations&#34;&gt;Simple aggregations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#grouping-by-more-variables-and-small-tweaks&#34;&gt;Grouping by more variables and small tweaks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-aggregate-as-a-framework-with-custom-aggregation-functions&#34;&gt;Using aggregate as a framework with custom aggregation functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#advanced-details-of-aggregate-use&#34;&gt;Advanced details of aggregate use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#aggregates-methods-for-other-object-classes&#34;&gt;Aggregate’s methods for other object classes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-want-the-code&#34;&gt;TL;DR - Just want the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#exercises&#34;&gt;Exercises&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#exercise-answers&#34;&gt;Exercise answers&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;how-to-use-this-article&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How to use this article&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a &lt;code&gt;case4base&lt;/code&gt; after all!&lt;/li&gt;
&lt;li&gt;If you have no time for reading, you can &lt;a href=&#34;https://jozef.io/post/data/r003-aggregation.r&#34;&gt;click here to get just the code with commentary&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;First, let’s read yearly data on the gross disposable income of households in the EU countries into R (&lt;a href=&#34;https://jozef.io/post/data/ESA2010_GDI.csv&#34;&gt;click here to download&lt;/a&gt;) and reshape it into a nice, long format to work with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi &amp;lt;- read.csv(
  stringsAsFactors = FALSE
, file = &amp;quot;https://jozef.io/post/data/ESA2010_pretty.csv&amp;quot;
)

gdi &amp;lt;- reshape(data = gdi
             , direction = &amp;quot;long&amp;quot; # we are going from wide to long
             , varying = 2:67     # columns that will be stacked into 1
             , idvar = &amp;quot;country&amp;quot;  # identifying the subject in rows
             )&lt;/code&gt;&lt;/pre&gt;
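&lt;p&gt;To see what such a call to &lt;code&gt;reshape&lt;/code&gt; does, here is a toy sketch on made-up data. The column names of the form &lt;code&gt;variable.year&lt;/code&gt; are an assumption made for this illustration; with names of that form, &lt;code&gt;reshape&lt;/code&gt; can guess both the variable names and the time values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Made-up wide data: one column per variable and year
toy_wide &amp;lt;- data.frame(
  country = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;)
, GrossSaving.2000 = c(1, 10)
, GrossSaving.2001 = c(3, 20)
, ConspC.2000 = c(5, 50)
, ConspC.2001 = c(7, 70)
, stringsAsFactors = FALSE
)

toy_long &amp;lt;- reshape(data = toy_wide
                  , direction = &amp;quot;long&amp;quot; # we are going from wide to long
                  , varying = 2:5      # columns that will be stacked
                  , idvar = &amp;quot;country&amp;quot;  # identifying the subject in rows
                  )

# One row per country and year, one column per variable
toy_long&lt;/code&gt;&lt;/pre&gt;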
&lt;p&gt;Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The goal of the article is therefore not to present these concrete results, but to focus on the technical aspects and usefulness of the presented methods.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;simple-aggregations&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Simple aggregations&lt;/h1&gt;
&lt;p&gt;In this section, we will show how to perform simple aggregations on data.frames. As the first example, let us look at the mean gross saving across the years per country:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggregate(x = gdi[&amp;quot;GrossSaving&amp;quot;]
        , by = list(country = gdi[[&amp;quot;country&amp;quot;]])
        , FUN = mean
        )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country GrossSaving
## 1         Austria  24724.6227
## 2         Belgium  28961.7136
## 3        Bulgaria  -1711.6136
## 4         Croatia          NA
## 5          Cyprus          NA
## 6  Czech Republic 208404.0000
## 7         Denmark  53667.7273
## 8         Estonia    487.1409
## 9           EU 28          NA
## 10   Euro area 19          NA
## 11        Finland   7656.7727
## 12         France 169311.6818
## 13        Germany 265215.6818
## 14         Greece   5289.8464
## 15        Hungary          NA
## 16        Iceland          NA
## 17        Ireland   5831.3136
## 18          Italy 135086.8591
## 19         Latvia    147.1718
## 20      Lithuania    394.4595
## 21     Luxembourg   2510.5136
## 22          Malta          NA
## 23    Netherlands  37810.7727
## 24         Norway 113559.5000
## 25         Poland  45032.8636
## 26       Portugal   9348.6191
## 27        Romania          NA
## 28         Serbia          NA
## 29       Slovakia   2470.1173
## 30       Slovenia   2346.7668
## 31          Spain          NA
## 32         Sweden 207348.7273
## 33    Switzerland  74211.0864
## 34         Turkey          NA
## 35 United Kingdom  79609.8636&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, we provided 3 arguments to &lt;code&gt;aggregate&lt;/code&gt; (specifically the &lt;code&gt;aggregate.data.frame&lt;/code&gt; method that gets called if the provided &lt;code&gt;x&lt;/code&gt; is a data frame):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt; - the data we want to aggregate, in our case the &lt;code&gt;GrossSaving&lt;/code&gt; column of the &lt;code&gt;gdi&lt;/code&gt; data.frame&lt;/li&gt;
&lt;li&gt;&lt;code&gt;by&lt;/code&gt; - a list of 1 element - &lt;code&gt;country&lt;/code&gt; which specifies how the data will be grouped&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FUN&lt;/code&gt; - function which will be used, in our case arithmetic &lt;code&gt;mean&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r003-01-aggregate.gif&#34; alt=&#34;Simple aggregate&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Simple aggregate&lt;/p&gt;
&lt;/div&gt;
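&lt;p&gt;The same three arguments can be tried out on a tiny made-up data frame, where the result is easy to verify by hand. The &lt;code&gt;toy&lt;/code&gt; data below is invented for this illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Made-up data: two countries with two observations each
toy &amp;lt;- data.frame(
  country = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;)
, GrossSaving = c(1, 3, 10, 20)
, stringsAsFactors = FALSE
)

# Mean of GrossSaving per country: (1 + 3) / 2 and (10 + 20) / 2
toy_means &amp;lt;- aggregate(x = toy[&amp;quot;GrossSaving&amp;quot;]
                     , by = list(country = toy[[&amp;quot;country&amp;quot;]])
                     , FUN = mean
                     )
toy_means&lt;/code&gt;&lt;/pre&gt;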
&lt;p&gt;We can also see in our results that for some countries, such as Croatia and Cyprus, we get &lt;code&gt;NA&lt;/code&gt; as a result. This is because numerical operations on vectors that contain even a single &lt;code&gt;NA&lt;/code&gt; value will usually return &lt;code&gt;NA&lt;/code&gt;. We can usually work around this by providing an extra &lt;code&gt;na.rm = TRUE&lt;/code&gt; argument to the function, which will strip the &lt;code&gt;NA&lt;/code&gt; values before computation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggregate(x = gdi[&amp;quot;GrossSaving&amp;quot;]
        , by = list(country = gdi[[&amp;quot;country&amp;quot;]])
        , FUN = mean
        , na.rm = TRUE
        )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           country  GrossSaving
## 1         Austria   24724.6227
## 2         Belgium   28961.7136
## 3        Bulgaria   -1711.6136
## 4         Croatia   18301.8727
## 5          Cyprus     438.6838
## 6  Czech Republic  208404.0000
## 7         Denmark   53667.7273
## 8         Estonia     487.1409
## 9           EU 28  924443.4983
## 10   Euro area 19  754148.9800
## 11        Finland    7656.7727
## 12         France  169311.6818
## 13        Germany  265215.6818
## 14         Greece    5289.8464
## 15        Hungary 1220273.5714
## 16        Iceland    -336.9933
## 17        Ireland    5831.3136
## 18          Italy  135086.8591
## 19         Latvia     147.1718
## 20      Lithuania     394.4595
## 21     Luxembourg    2510.5136
## 22          Malta          NaN
## 23    Netherlands   37810.7727
## 24         Norway  113559.5000
## 25         Poland   45032.8636
## 26       Portugal    9348.6191
## 27        Romania     271.2048
## 28         Serbia          NaN
## 29       Slovakia    2470.1173
## 30       Slovenia    2346.7668
## 31          Spain   57683.3333
## 32         Sweden  207348.7273
## 33    Switzerland   74211.0864
## 34         Turkey  129045.3843
## 35 United Kingdom   79609.8636&lt;/code&gt;&lt;/pre&gt;
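&lt;p&gt;The &lt;code&gt;NaN&lt;/code&gt; values for Malta and Serbia most likely mean that no values are left for those countries once the &lt;code&gt;NA&lt;/code&gt;s are stripped, since the mean of an empty vector is &lt;code&gt;NaN&lt;/code&gt;. A minimal sketch on made-up data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# With only NA values, nothing is left after removing them
mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Made-up data: country C has no non-missing values at all
toy_na &amp;lt;- data.frame(
  country = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;C&amp;quot;)
, GrossSaving = c(1, NA, NA, NA)
, stringsAsFactors = FALSE
)

toy_na_means &amp;lt;- aggregate(x = toy_na[&amp;quot;GrossSaving&amp;quot;]
                        , by = list(country = toy_na[[&amp;quot;country&amp;quot;]])
                        , FUN = mean
                        , na.rm = TRUE
                        )
toy_na_means&lt;/code&gt;&lt;/pre&gt;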
&lt;/div&gt;
&lt;div id=&#34;grouping-by-more-variables-and-small-tweaks&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Grouping by more variables and small tweaks&lt;/h1&gt;
&lt;p&gt;To make things even easier, we can use the fact that data.frames are also lists, and therefore replace &lt;code&gt;by = list(country = gdi[[&amp;quot;country&amp;quot;]])&lt;/code&gt; with the simpler and easier to read &lt;code&gt;by = gdi[&amp;quot;country&amp;quot;]&lt;/code&gt;. Be careful to use only single brackets &lt;code&gt;[]&lt;/code&gt; for the sub-setting to get the sub-list, as &lt;code&gt;gdi[[&amp;quot;country&amp;quot;]]&lt;/code&gt;, just like &lt;code&gt;gdi$country&lt;/code&gt;, would give us the vector of countries:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;is.list(list(country = gdi[[&amp;quot;country&amp;quot;]]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;is.list(gdi[&amp;quot;country&amp;quot;])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;is.list(gdi[[&amp;quot;country&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] FALSE&lt;/code&gt;&lt;/pre&gt;
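&lt;p&gt;A quick check on made-up data that both forms of the &lt;code&gt;by&lt;/code&gt; argument lead to the same aggregated values. The &lt;code&gt;toy_by&lt;/code&gt; data frame is invented for this illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;toy_by &amp;lt;- data.frame(
  country = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;)
, GrossSaving = c(1, 3, 10)
, stringsAsFactors = FALSE
)

# Grouping given as an explicitly constructed list
res_list &amp;lt;- aggregate(x = toy_by[&amp;quot;GrossSaving&amp;quot;]
                    , by = list(country = toy_by[[&amp;quot;country&amp;quot;]])
                    , FUN = mean
                    )

# Grouping given as a one-column data frame
res_df &amp;lt;- aggregate(x = toy_by[&amp;quot;GrossSaving&amp;quot;]
                  , by = toy_by[&amp;quot;country&amp;quot;]
                  , FUN = mean
                  )

# The aggregated values are the same in both cases
identical(res_list[[&amp;quot;GrossSaving&amp;quot;]], res_df[[&amp;quot;GrossSaving&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;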
&lt;p&gt;We can also group the data by more than one column, or by a column transformed in any way that fits our purposes. The only constraint is that each grouping element (each element of the &lt;code&gt;by&lt;/code&gt; argument) is as long as the variables in the data frame &lt;code&gt;x&lt;/code&gt;. And of course, we can also aggregate more than 1 column at the same time.&lt;/p&gt;
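&lt;p&gt;A minimal sketch of grouping by a transformed column, using made-up data with a numeric &lt;code&gt;time&lt;/code&gt; column holding years. The grouping vector is derived from that column, so it has one element per row of &lt;code&gt;x&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;toy_years &amp;lt;- data.frame(
  time = c(1998, 1999, 2001, 2005)
, GrossSaving = c(1, 3, 10, 20)
)

# Derive the decade from the first 3 digits of the year
decade &amp;lt;- paste0(substr(toy_years[[&amp;quot;time&amp;quot;]], 1L, 3L), &amp;quot;0s&amp;quot;)
decade

# The derived vector is as long as the data, so it can be used in by
toy_decades &amp;lt;- aggregate(x = toy_years[&amp;quot;GrossSaving&amp;quot;]
                       , by = list(decade = decade)
                       , FUN = mean
                       )
toy_decades&lt;/code&gt;&lt;/pre&gt;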
&lt;p&gt;As an example, let us&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calculate the mean not only for each country, but extend the grouping to decades&lt;/li&gt;
&lt;li&gt;calculate the mean for more variables, not just &lt;code&gt;&amp;quot;GrossSaving&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggregate(x = gdi[c(&amp;quot;ConspC&amp;quot;, &amp;quot;AGDIpC&amp;quot;, &amp;quot;GrossSaving&amp;quot;)]
        , by = list(decade = paste0(substr(gdi[[&amp;quot;time&amp;quot;]], 1L, 3L), &amp;quot;0s&amp;quot;)
                  , country = gdi[[&amp;quot;country&amp;quot;]]
                  )
        , FUN = mean
        , na.rm = TRUE
        )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     decade        country      ConspC      AGDIpC   GrossSaving
## 1    1990s        Austria   19434.288   22578.640   20956.42000
## 2    2000s        Austria   21943.003   25145.214   25327.26000
## 3    2010s        Austria   23375.659   26135.279   26555.28571
## 4    1990s        Belgium   18938.482   21987.036   25395.72000
## 5    2000s        Belgium   21081.202   24088.858   30272.62000
## 6    2010s        Belgium   22594.490   24889.807   29636.12857
## 7    1990s       Bulgaria    3449.757    3494.050     -68.02000
## 8    2000s       Bulgaria    5578.549    5084.613   -2892.99000
## 9    2010s       Bulgaria    7813.849    7535.431   -1197.92857
## 10   1990s        Croatia         NaN         NaN           NaN
## 11   2000s        Croatia   51543.151   54675.474   15148.93750
## 12   2010s        Croatia   52515.373   57510.730   26709.70000
## 13   1990s         Cyprus   12302.576   12804.322     324.48600
## 14   2000s         Cyprus   15719.001   16671.523     727.46300
## 15   2010s         Cyprus   15788.383   15973.670      52.55000
## 16   1990s Czech Republic  159897.720  177018.694  132593.40000
## 17   2000s Czech Republic  200904.419  220895.554  204166.20000
## 18   2010s Czech Republic  228702.080  250919.280  268608.42857
## 19   1990s        Denmark  182933.412  180861.862   25988.60000
## 20   2000s        Denmark  205686.057  202040.782   48128.80000
## 21   2010s        Denmark  216901.757  220419.361   81351.28571
## 22   1990s        Estonia    3871.640    4249.542     270.28000
## 23   2000s        Estonia    6470.004    6500.569     160.78000
## 24   2010s        Estonia    7915.886    8509.247    1108.27143
## 25   1990s          EU 28   15214.920   16667.920  721765.79000
## 26   2000s          EU 28   17108.231   18635.367  890431.08600
## 27   2010s          EU 28   18125.639   19647.553 1001986.61714
## 28   1990s   Euro area 19   17607.510   19749.530  602634.05000
## 29   2000s   Euro area 19   19134.583   21366.280  733063.78800
## 30   2010s   Euro area 19   19740.273   21802.314  805915.67286
## 31   1990s        Finland   17220.232   18555.942    5659.60000
## 32   2000s        Finland   21616.329   23135.134    7536.40000
## 33   2010s        Finland   24631.860   26265.441    9255.28571
## 34   1990s         France   17903.622   20437.534  127231.80000
## 35   2000s         France   20696.734   23594.476  169890.90000
## 36   2010s         France   22088.164   25010.304  198541.28571
## 37   1990s        Germany         NaN   22112.486  215917.60000
## 38   2000s        Germany         NaN   23846.360  256128.00000
## 39   2010s        Germany         NaN   25848.440  313411.00000
## 40   1990s         Greece   12037.446   13388.490    9475.84600
## 41   2000s         Greece   15757.594   16707.839    8902.80000
## 42   2010s         Greece   13893.707   13620.999   -2861.51571
## 43   1990s        Hungary 1266767.368 1457751.030  868855.20000
## 44   2000s        Hungary 1737385.357 1838395.760 1140422.10000
## 45   2010s        Hungary 1727120.847 1864085.342 1646208.00000
## 46   1990s        Iceland         NaN         NaN           NaN
## 47   2000s        Iceland 3665145.798 3246304.470    5208.11000
## 48   2010s        Iceland 3491617.010 3112374.812  -11427.20000
## 49   1990s        Ireland   14151.552   14664.146    2801.42000
## 50   2000s        Ireland   20927.056   21749.417    6089.36000
## 51   2010s        Ireland   21803.019   22959.619    7626.88571
## 52   1990s          Italy   17703.632   20908.074  140031.20000
## 53   2000s          Italy   19631.544   22234.294  143875.11000
## 54   2010s          Italy   18584.590   20404.033  119000.54286
## 55   1990s         Latvia    3268.952    3188.684     -92.53200
## 56   2000s         Latvia    5400.014    5542.577     412.48900
## 57   2010s         Latvia    7088.467    6945.319     -60.63571
## 58   1990s      Lithuania    3260.052    3348.234     235.99800
## 59   2000s      Lithuania    5823.319    5947.266     422.12200
## 60   2010s      Lithuania    7934.150    8047.637     468.12857
## 61   1990s     Luxembourg   27550.836   31879.426    1411.74000
## 62   2000s     Luxembourg   32355.940   37663.168    2240.48000
## 63   2010s     Luxembourg   33700.054   40006.649    3681.11429
## 64   1990s          Malta         NaN         NaN           NaN
## 65   2000s          Malta         NaN         NaN           NaN
## 66   2010s          Malta         NaN         NaN           NaN
## 67   1990s    Netherlands   18829.144   20298.764   32126.40000
## 68   2000s    Netherlands   22457.564   23556.095   34825.10000
## 69   2010s    Netherlands   23204.377   24568.551   46136.28571
## 70   1990s         Norway  198946.604  207770.146   54321.60000
## 71   2000s         Norway  258438.853  269837.239   93655.20000
## 72   2010s         Norway  315219.324  334970.457  184307.00000
## 73   1990s         Poland   15919.330   18408.280   55897.00000
## 74   2000s         Poland   21828.780   22907.936   51046.50000
## 75   2010s         Poland   28529.733   28772.321   28681.85714
## 76   1990s       Portugal   10704.874   11856.422    8673.37200
## 77   2000s       Portugal   12562.477   13609.388   10208.75700
## 78   2010s       Portugal   12298.499   13066.471    8602.17000
## 79   1990s        Romania    8152.276    8427.888       5.88000
## 80   2000s        Romania   13854.047   13125.680  -11695.86000
## 81   2010s        Romania   19617.068   20486.043   20437.41667
## 82   1990s         Serbia         NaN         NaN           NaN
## 83   2000s         Serbia         NaN         NaN           NaN
## 84   2010s         Serbia         NaN         NaN           NaN
## 85   1990s       Slovakia    5050.218    5647.868    1810.30200
## 86   2000s       Slovakia    6824.102    7211.535    2189.37300
## 87   2010s       Slovakia    8479.726    8932.469    3342.47714
## 88   1990s       Slovenia    8573.522    9491.256    1009.57000
## 89   2000s       Slovenia   10719.190   12155.871    2626.34200
## 90   2010s       Slovenia   11666.361   12996.130    2902.51429
## 91   1990s          Spain   13961.670   15200.950   38715.00000
## 92   2000s          Spain   15808.068   17234.555   55932.60000
## 93   2010s          Spain   15180.146   16500.587   62894.14286
## 94   1990s         Sweden  182324.968  183408.674   68070.60000
## 95   2000s         Sweden  223273.148  229308.996  162984.10000
## 96   2010s         Sweden  251975.783  273511.021  370211.14286
## 97   1990s    Switzerland   41046.446   44855.838   53641.70000
## 98   2000s    Switzerland   44297.743   49556.481   69279.00000
## 99   2010s    Switzerland   47207.609   54372.377   95949.34286
## 100  1990s         Turkey         NaN         NaN           NaN
## 101  2000s         Turkey         NaN         NaN   69969.50000
## 102  2010s         Turkey         NaN         NaN  138891.36500
## 103  1990s United Kingdom   14625.190   15250.624   74342.00000
## 104  2000s United Kingdom   18919.282   19157.107   73031.80000
## 105  2010s United Kingdom   19716.279   20172.029   92769.85714&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-aggregate-as-a-framework-with-custom-aggregation-functions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using aggregate as a framework with custom aggregation functions&lt;/h1&gt;
&lt;p&gt;Perhaps one of the most useful cases for &lt;code&gt;aggregate&lt;/code&gt; is using it as a supporting framework for custom aggregations, since the &lt;code&gt;FUN&lt;/code&gt; argument can be set to a function defined to suit specific purposes. This provides a very flexible environment where one can&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;implement the custom aggregation functions in the most suitable way for the purpose&lt;/li&gt;
&lt;li&gt;have unit testing for those functions&lt;/li&gt;
&lt;li&gt;have documentation and other aspects of the implementation in place&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can then use &lt;code&gt;aggregate&lt;/code&gt; as a reliable executor for such functionality, all using standard base R evaluation principles. An over-simplified example of the above approach could be the following:&lt;/p&gt;
&lt;p&gt;We define the aggregation function &lt;code&gt;dummyaggfun&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dummyaggfun &amp;lt;- function(v) {
  c(max = max(v)
  , min = min(v)
  , rng = max(v) - min(v)
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And apply the aggregation&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggregate(gdi[&amp;quot;GrossSaving&amp;quot;]
        , by = list(decade = paste0(substr(gdi[[&amp;quot;time&amp;quot;]], 1L, 3L), &amp;quot;0s&amp;quot;)
                  , country = gdi[[&amp;quot;country&amp;quot;]]
                  )
        , FUN = dummyaggfun
        )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     decade        country GrossSaving.max GrossSaving.min GrossSaving.rng
## 1    1990s        Austria        23226.80        19097.10         4129.70
## 2    2000s        Austria        31618.00        19897.90        11720.10
## 3    2010s        Austria        28755.60        25194.20         3561.40
## 4    1990s        Belgium        27350.10        24448.40         2901.70
## 5    2000s        Belgium        39041.60        25650.80        13390.80
## 6    2010s        Belgium        33126.40        27251.20         5875.20
## 7    1990s       Bulgaria          448.40         -483.00          931.40
## 8    2000s       Bulgaria         -758.60        -7200.60         6442.00
## 9    2010s       Bulgaria         2925.80        -4525.60         7451.40
## 10   1990s        Croatia              NA              NA              NA
## 11   2000s        Croatia              NA              NA              NA
## 12   2010s        Croatia              NA              NA              NA
## 13   1990s         Cyprus          545.50          185.62          359.88
## 14   2000s         Cyprus         1194.23          280.04          914.19
## 15   2010s         Cyprus              NA              NA              NA
## 16   1990s Czech Republic       145286.00       116646.00        28640.00
## 17   2000s Czech Republic       295156.00       156060.00       139096.00
## 18   2010s Czech Republic       293141.00       246605.00        46536.00
## 19   1990s        Denmark        42398.00         9694.00        32704.00
## 20   2000s        Denmark        72548.00        15456.00        57092.00
## 21   2010s        Denmark       111688.00        36971.00        74717.00
## 22   1990s        Estonia          401.20          200.20          201.00
## 23   2000s        Estonia         1115.10         -278.30         1393.40
## 24   2010s        Estonia         1415.10          839.50          575.60
## 25   1990s          EU 28              NA              NA              NA
## 26   2000s          EU 28      1077659.22       769059.51       308599.71
## 27   2010s          EU 28      1029579.38       976054.92        53524.46
## 28   1990s   Euro area 19              NA              NA              NA
## 29   2000s   Euro area 19       879005.73       596298.12       282707.61
## 30   2010s   Euro area 19       822350.15       781605.58        40744.57
## 31   1990s        Finland         6772.00         4436.00         2336.00
## 32   2000s        Finland        10986.00         6200.00         4786.00
## 33   2010s        Finland        10801.00         7534.00         3267.00
## 34   1990s         France       131350.00       119588.00        11762.00
## 35   2000s         France       206161.00       136627.00        69534.00
## 36   2010s         France       206511.00       191738.00        14773.00
## 37   1990s        Germany       217330.00       214836.00         2494.00
## 38   2000s        Germany       291363.00       216433.00        74930.00
## 39   2010s        Germany       345523.00       292290.00        53233.00
## 40   1990s         Greece        11398.81         8234.04         3164.77
## 41   2000s         Greece        11510.19         6390.06         5120.13
## 42   2010s         Greece         2897.71        -7727.10        10624.81
## 43   1990s        Hungary      1012178.00       710576.00       301602.00
## 44   2000s        Hungary      1614306.00       788873.00       825433.00
## 45   2010s        Hungary              NA              NA              NA
## 46   1990s        Iceland              NA              NA              NA
## 47   2000s        Iceland        88500.00       -52886.40       141386.40
## 48   2010s        Iceland              NA              NA              NA
## 49   1990s        Ireland         3219.60         2592.00          627.60
## 50   2000s        Ireland        11973.40         1384.10        10589.30
## 51   2010s        Ireland         9545.40         6374.60         3170.80
## 52   1990s          Italy       163452.00       116367.20        47084.80
## 53   2000s          Italy       156700.60       111087.70        45612.90
## 54   2010s          Italy       124778.90       104720.00        20058.90
## 55   1990s         Latvia           36.17         -206.97          243.14
## 56   2000s         Latvia         1922.58          -81.56         2004.14
## 57   2010s         Latvia          620.12         -555.45         1175.57
## 58   1990s      Lithuania          610.73          -78.62          689.35
## 59   2000s      Lithuania         1003.24         -719.82         1723.06
## 60   2010s      Lithuania         1516.18         -119.85         1636.03
## 61   1990s     Luxembourg         1488.10         1344.60          143.50
## 62   2000s     Luxembourg         2964.30         1584.80         1379.50
## 63   2010s     Luxembourg         4119.00         3192.70          926.30
## 64   1990s          Malta              NA              NA              NA
## 65   2000s          Malta              NA              NA              NA
## 66   2010s          Malta              NA              NA              NA
## 67   1990s    Netherlands        34110.00        28988.00         5122.00
## 68   2000s    Netherlands        47342.00        28712.00        18630.00
## 69   2010s    Netherlands        50314.00        40945.00         9369.00
## 70   1990s         Norway        66426.00        42704.00        23722.00
## 71   2000s         Norway       140538.00        51542.00        88996.00
## 72   2010s         Norway       253022.00       117285.00       135737.00
## 73   1990s         Poland        69410.00        43081.00        26329.00
## 74   2000s         Poland        84850.00        27414.00        57436.00
## 75   2010s         Poland        49574.00        14823.00        34751.00
## 76   1990s       Portugal         9717.60         7907.00         1810.60
## 77   2000s       Portugal        13217.79         8530.86         4686.93
## 78   2010s       Portugal        11929.76         6245.18         5684.58
## 79   1990s        Romania         2749.90        -2122.40         4872.30
## 80   2000s        Romania         1146.30       -24932.80        26079.10
## 81   2010s        Romania              NA              NA              NA
## 82   1990s         Serbia              NA              NA              NA
## 83   2000s         Serbia              NA              NA              NA
## 84   2010s         Serbia              NA              NA              NA
## 85   1990s       Slovakia         2073.54         1132.87          940.67
## 86   2000s       Slovakia         3119.07         1697.54         1421.53
## 87   2010s       Slovakia         4622.03         2627.62         1994.41
## 88   1990s       Slovenia         1214.62          731.94          482.68
## 89   2000s       Slovenia         3578.14         1587.45         1990.69
## 90   2010s       Slovenia         3175.60         2337.12          838.48
## 91   1990s          Spain              NA              NA              NA
## 92   2000s          Spain        93604.00        38368.00        55236.00
## 93   2010s          Spain        74681.00        53982.00        20699.00
## 94   1990s         Sweden       100539.00        50227.00        50312.00
## 95   2000s         Sweden       257867.00        85342.00       172525.00
## 96   2010s         Sweden       452834.00       280354.00       172480.00
## 97   1990s    Switzerland        56395.80        51875.60         4520.20
## 98   2000s    Switzerland        83474.60        59724.20        23750.40
## 99   2010s    Switzerland       104819.20        84475.00        20344.20
## 100  1990s         Turkey              NA              NA              NA
## 101  2000s         Turkey              NA              NA              NA
## 102  2010s         Turkey              NA              NA              NA
## 103  1990s United Kingdom        84296.00        55971.00        28325.00
## 104  2000s United Kingdom       102670.00        58101.00        44569.00
## 105  2010s United Kingdom       126386.00        68648.00        57738.00&lt;/code&gt;&lt;/pre&gt;
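&lt;p&gt;Because the aggregation logic lives in a plain function, it can be checked in isolation before it is handed to &lt;code&gt;aggregate&lt;/code&gt;. The following is only a minimal sketch using base R’s &lt;code&gt;stopifnot&lt;/code&gt;; the input vectors are made up for illustration and a dedicated testing package could be used instead. The second check also hints at why groups containing missing values show &lt;code&gt;NA&lt;/code&gt; in the output above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Same definition as above, repeated to keep the sketch self-contained
dummyaggfun &amp;lt;- function(v) {
  c(max = max(v)
  , min = min(v)
  , rng = max(v) - min(v)
  )
}

# A known input should give a known named vector
stopifnot(identical(
  dummyaggfun(c(1, 5, 3)),
  c(max = 5, min = 1, rng = 4)
))

# A missing value propagates to all three results
stopifnot(all(is.na(dummyaggfun(c(1, NA)))))&lt;/code&gt;&lt;/pre&gt;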
&lt;/div&gt;
&lt;div id=&#34;advanced-details-of-aggregate-use&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Advanced details of aggregate use&lt;/h1&gt;
&lt;p&gt;Examining the code of &lt;code&gt;aggregate.data.frame&lt;/code&gt; will give us a good picture of how the function operates. This could be roughly described in the following way, abstracting from the defensive programming aspects and details and focusing on the functionality itself:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;create &lt;code&gt;grp&lt;/code&gt; - group labels that are (most likely) numbers stored as &lt;code&gt;character&lt;/code&gt; by factorizing the elements of &lt;code&gt;by&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;create &lt;code&gt;y&lt;/code&gt; - a &lt;code&gt;data.frame&lt;/code&gt; with the data grouping resulting from processing &lt;code&gt;by&lt;/code&gt;, to which the results will be bound&lt;/li&gt;
&lt;li&gt;take the input data &lt;code&gt;x&lt;/code&gt; (coerced to a &lt;code&gt;data.frame&lt;/code&gt;) and column by column &lt;code&gt;split&lt;/code&gt; the data into groups according to &lt;code&gt;grp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;apply &lt;code&gt;FUN&lt;/code&gt; (that was retrieved by &lt;code&gt;match.fun&lt;/code&gt;) on the results of the &lt;code&gt;split&lt;/code&gt;, assign the results into &lt;code&gt;z&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;bind the &lt;code&gt;y&lt;/code&gt; that has the group labels with &lt;code&gt;z&lt;/code&gt; that has the results&lt;/li&gt;
&lt;/ol&gt;
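&lt;p&gt;The steps above can be imitated on a toy example. This is a heavily simplified sketch and not the actual source of &lt;code&gt;aggregate.data.frame&lt;/code&gt;: it assumes a single grouping variable and a &lt;code&gt;FUN&lt;/code&gt; returning one value per group, and the data are made up for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy inputs; all values here are illustrative
x   &amp;lt;- data.frame(value = c(1, 2, 3, 4, 8))
by  &amp;lt;- list(group = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;a&amp;quot;))
FUN &amp;lt;- match.fun(mean)

# 1. group labels: numbers stored as character
grp &amp;lt;- as.character(as.integer(factor(by[[&amp;quot;group&amp;quot;]])))

# 2. one row per distinct group, to which the results get attached
y &amp;lt;- as.data.frame(by, stringsAsFactors = FALSE)
y &amp;lt;- y[match(sort(unique(grp)), grp), , drop = FALSE]

# 3. and 4. split each column of x by grp and apply FUN to every piece
z &amp;lt;- lapply(x, function(column) {
  unlist(lapply(split(column, grp), FUN), use.names = FALSE)
})

# 5. attach the results to the group labels
for (i in seq_along(z)) y[[names(z)[i]]] &amp;lt;- z[[i]]
row.names(y) &amp;lt;- NULL

# Compare the hand-made result with the real thing
y
aggregate(x, by = by, FUN = mean)&lt;/code&gt;&lt;/pre&gt;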
&lt;blockquote&gt;
&lt;h4 id=&#34;providing-the-fun-argument&#34;&gt;Providing the &lt;code&gt;FUN&lt;/code&gt; argument&lt;/h4&gt;
&lt;p&gt;One specific detail should be noted: providing &lt;code&gt;FUN&lt;/code&gt; as a character string (name of the function, e.g. &lt;code&gt;FUN = &amp;quot;mean&amp;quot;&lt;/code&gt;) will trigger the non-standard evaluation part of the code in &lt;code&gt;match.fun&lt;/code&gt;, which we may like to avoid.
This is easily achieved by providing the &lt;code&gt;FUN&lt;/code&gt; argument with the function directly, not via the function’s name (e.g. &lt;code&gt;FUN = mean&lt;/code&gt;), as in that case &lt;code&gt;match.fun&lt;/code&gt; just returns the provided &lt;code&gt;FUN&lt;/code&gt; without further changes.&lt;/p&gt;
&lt;/blockquote&gt;
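&lt;p&gt;The two ways of providing &lt;code&gt;FUN&lt;/code&gt; can be compared directly. This is a small sketch; the variable names are illustrative. It assumes that no other object named &lt;code&gt;mean&lt;/code&gt; masks the base function in the session, which is exactly the situation where the lookup by name could give a different result.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Passing the function itself: match.fun returns it unchanged
byFunction &amp;lt;- match.fun(mean)
identical(byFunction, mean)

# Passing the name: match.fun has to look the function up first
byName &amp;lt;- match.fun(&amp;quot;mean&amp;quot;)
is.function(byName)&lt;/code&gt;&lt;/pre&gt;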
&lt;blockquote&gt;
&lt;h4 id=&#34;argument-structure-of-fun&#34;&gt;Argument structure of &lt;code&gt;FUN&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;The value returned from &lt;code&gt;split&lt;/code&gt; is a list of vectors containing the values for the groups. The &lt;code&gt;FUN&lt;/code&gt; is provided with the elements of that list via &lt;code&gt;lapply&lt;/code&gt;, which are vectors. This is helpful for the setup of the custom &lt;code&gt;FUN&lt;/code&gt;.
We can also take advantage of the &lt;code&gt;...&lt;/code&gt; concept and dedicate a part of the &lt;code&gt;FUN&lt;/code&gt; code to process more provided arguments.&lt;/p&gt;
&lt;/blockquote&gt;
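&lt;p&gt;As a sketch of the &lt;code&gt;...&lt;/code&gt; approach, the hypothetical &lt;code&gt;rangeWidth&lt;/code&gt; function below receives one vector per group and forwards any additional arguments given to &lt;code&gt;aggregate&lt;/code&gt;, here &lt;code&gt;na.rm = TRUE&lt;/code&gt;, to &lt;code&gt;max&lt;/code&gt; and &lt;code&gt;min&lt;/code&gt;. The data are made up for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy data with one missing value in group &amp;quot;a&amp;quot;
x  &amp;lt;- data.frame(value = c(1, NA, 3, 4))
by &amp;lt;- list(group = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;))

# v is the vector of values for one group; ... is passed on as-is
rangeWidth &amp;lt;- function(v, ...) {
  max(v, ...) - min(v, ...)
}

# Without extra arguments the NA propagates into group &amp;quot;a&amp;quot;
withNA &amp;lt;- aggregate(x, by = by, FUN = rangeWidth)

# Extra arguments to aggregate are handed over to FUN
withoutNA &amp;lt;- aggregate(x, by = by, FUN = rangeWidth, na.rm = TRUE)

withNA
withoutNA&lt;/code&gt;&lt;/pre&gt;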
&lt;/div&gt;
&lt;div id=&#34;aggregates-methods-for-other-object-classes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Aggregate’s methods for other object classes&lt;/h1&gt;
&lt;p&gt;So far we have mostly used the &lt;code&gt;aggregate.data.frame&lt;/code&gt; method. However, &lt;code&gt;aggregate&lt;/code&gt; is a generic function with methods for multiple classes of objects. Here is a very quick overview:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;aggregate.default&lt;/code&gt; - the default method, which uses the time series method if &lt;code&gt;x&lt;/code&gt; is a time series, and otherwise coerces &lt;code&gt;x&lt;/code&gt; to a &lt;code&gt;data.frame&lt;/code&gt; and calls the &lt;code&gt;data.frame&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aggregate.ts&lt;/code&gt; - the time series method, which is further discussed in R’s help on &lt;code&gt;?aggregate&lt;/code&gt;. Investigating the code is also very advisable.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aggregate.formula&lt;/code&gt; - the formula method, is a standard formula interface to &lt;code&gt;aggregate.data.frame&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aggregate.data.frame&lt;/code&gt; - the method discussed in this article&lt;/li&gt;
&lt;/ul&gt;
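&lt;p&gt;A brief sketch of how the dispatch looks in practice, using made-up data: the formula interface should give the same group means as the &lt;code&gt;data.frame&lt;/code&gt; method, while the time series method aggregates to a lower frequency instead of using &lt;code&gt;by&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy data; names and values are illustrative
df &amp;lt;- data.frame(
  group = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;),
  value = c(1, 2, 3, 4)
)

# Formula method: value aggregated by group
viaFormula &amp;lt;- aggregate(value ~ group, data = df, FUN = mean)

# data.frame method: the same aggregation spelled out
viaDataFrame &amp;lt;- aggregate(df[&amp;quot;value&amp;quot;], by = df[&amp;quot;group&amp;quot;], FUN = mean)

# Time series method: quarterly values summed into yearly totals
quarterly &amp;lt;- ts(1:8, frequency = 4, start = c(2000, 1))
yearly &amp;lt;- aggregate(quarterly, nfrequency = 1, FUN = sum)

viaFormula
viaDataFrame
yearly&lt;/code&gt;&lt;/pre&gt;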
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://www.milanor.net/blog/aggregation-dplyr-summarise-summarise_each/&#34;&gt;dplyr::summarize&lt;/a&gt; and friends&lt;/li&gt;
&lt;li&gt;using &lt;a href=&#34;https://datascienceplus.com/efficient-aggregation-using-data-table/&#34;&gt;data.table&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-want-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just want the code&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;No time for reading? &lt;a href=&#34;https://jozef.io/post/data/r003-aggregation.r&#34;&gt;Click here to get just the code with commentary&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;exercises&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exercises&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Looking at the &lt;code&gt;aggregate(state.x77, list(Region = state.region), mean)&lt;/code&gt; example in &lt;code&gt;?aggregate&lt;/code&gt;, how does R know how to match the states to the regions? Would the example still work if the data in &lt;code&gt;state.x77&lt;/code&gt; were sorted differently?&lt;/li&gt;
&lt;li&gt;What is the difference between &lt;code&gt;aggregate(x = gdi[&amp;quot;GrossSaving&amp;quot;], by = gdi[&amp;quot;country&amp;quot;], FUN = mean)&lt;/code&gt; and &lt;code&gt;aggregate(x = gdi[[&amp;quot;GrossSaving&amp;quot;]], by = gdi[&amp;quot;country&amp;quot;], FUN = mean)&lt;/code&gt;? What is the issue with the latter? Looking at the code of &lt;code&gt;aggregate.data.frame&lt;/code&gt;, why does the latter still work?&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/stats/versions/3.5.0/topics/aggregate&#34;&gt;aggregate&lt;/a&gt; at rdocumentation.org&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.0/topics/split&#34;&gt;split&lt;/a&gt; at rdocumentation.org&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/questions/3057341/how-to-use-rs-ellipsis-feature-when-writing-your-own-function&#34;&gt;discussion on &lt;code&gt;...&lt;/code&gt; (ellipsis)&lt;/a&gt; on stack overflow&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://ec.europa.eu/eurostat/web/sector-accounts/data/annual-data&#34;&gt;original eurostat data source&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;exercise-answers&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exercise answers&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://jozef.io/post/data/r003-aggregation.r&#34;&gt;At the bottom of the code for the article&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>RStudio:addins part 3 - View objects, files, functions and more with 1 keypress</title>
      <link>https://jozef.io/r103-keypress-viewer/</link>
      <pubDate>Sat, 26 May 2018 14:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r103-keypress-viewer/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In this post in the RStudio:addins series we will try to make our work more efficient with an addin for better inspection of objects, functions and files within RStudio. RStudio already has a very useful &lt;code&gt;View&lt;/code&gt; function and a &lt;code&gt;Go To Function / File&lt;/code&gt; feature with F2 as the default keyboard shortcut. And yes, I know I promised automatic generation of &lt;code&gt;@importFrom&lt;/code&gt; roxygen tags in the &lt;a href=&#34;../r102-addin-roxytags&#34;&gt;previous post&lt;/a&gt;. Unfortunately, we will have to wait a bit longer for that one, but I believe this one more than makes up for it in usefulness.&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;the-addin-we-will-create-in-this-article-will-let-us-use-rstudio-to-view-and-inspect-a-wide-range-of-objects-functions-and-files-with-1-keypress.&#34;&gt;The addin we will create in this article will let us use RStudio to View and inspect a wide range of objects, functions and files with 1 keypress.&lt;/h4&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r103-01-showcase.gif&#34; alt=&#34;The addins in action&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;The addins in action&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#retrieving-objects-from-sys.frames&#34;&gt;Retrieving objects from sys.frames&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#viewing-files-objects-functions-and-more-efficiently&#34;&gt;Viewing files, objects, functions and more efficiently&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-addin-function-updating-the-.dcf-file-and-key-binding&#34;&gt;The addin function, updating the .dcf file and key binding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-addin-in-action&#34;&gt;The addin in action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-give-me-the-package&#34;&gt;TL;DR - Just give me the package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;retrieving-objects-from-sys.frames&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Retrieving objects from sys.frames&lt;/h1&gt;
&lt;p&gt;As a first step, we need to be able to retrieve the value of the object we are looking for based on a character string from a frame within the currently present &lt;code&gt;sys.frames()&lt;/code&gt; for our session. This may get tricky, as it is not sufficient to only look at parent frames, because we may easily have multiple sets of “parallel” call stacks, especially when executing addins.&lt;/p&gt;
&lt;p&gt;An example can be seen in the following screenshot, where we have a &lt;code&gt;browser()&lt;/code&gt; call executed during the Addin execution itself. We can see that our current frame is 18 and browsing through its parent would get us to frames &lt;code&gt;17 -&amp;gt; 16 -&amp;gt; 15 -&amp;gt; 14 -&amp;gt; 0&lt;/code&gt; (&lt;code&gt;0&lt;/code&gt; being the &lt;code&gt;.GlobalEnv&lt;/code&gt;). The object we are looking for is however most likely in one of the other frames (&lt;code&gt;9&lt;/code&gt; in this particular case):&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r103-02-frames.png&#34; alt=&#34;Example of sys.frames&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Example of sys.frames&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;getFromSysframes &amp;lt;- function(x) {
  if (!(is.character(x) &amp;amp;&amp;amp; length(x) == 1 &amp;amp;&amp;amp; nchar(x) &amp;gt; 0)) {
    warning(&amp;quot;Expecting a non-empty character of length 1. Returning NULL.&amp;quot;)
    return(invisible(NULL))
  }
  validframes &amp;lt;- c(sys.frames()[-sys.nframe()], .GlobalEnv)
  res &amp;lt;- NULL
  for (i in validframes) {
    inherits &amp;lt;- identical(i, .GlobalEnv)
    res &amp;lt;- get0(x, i, inherits = inherits)
    if (!is.null(res)) {
      return(res)
    }
  }
  return(invisible(res))
}&lt;/code&gt;&lt;/pre&gt;
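&lt;p&gt;To see why searching the frames matters, consider the following stripped-down sketch. The helper &lt;code&gt;findInFrames&lt;/code&gt; and the two wrapper functions are hypothetical and exist only for this illustration. The object lives in the frame of &lt;code&gt;outerFun&lt;/code&gt;, so ordinary lexical scoping from &lt;code&gt;innerFun&lt;/code&gt; would not find it, while walking through the active frames does.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Simplified lookup: search every active frame except our own
findInFrames &amp;lt;- function(name) {
  frames &amp;lt;- sys.frames()
  own &amp;lt;- sys.nframe()
  for (fr in as.list(frames)[-own]) {
    val &amp;lt;- get0(name, envir = fr, inherits = FALSE)
    if (!is.null(val)) {
      return(val)
    }
  }
  NULL
}

# localValue exists only in the frame of outerFun
outerFun &amp;lt;- function() {
  localValue &amp;lt;- 42L
  innerFun()
}

# innerFun is defined at top level, so it cannot see localValue lexically
innerFun &amp;lt;- function() {
  list(
    lexical = exists(&amp;quot;localValue&amp;quot;),
    viaFrames = findInFrames(&amp;quot;localValue&amp;quot;)
  )
}

res &amp;lt;- outerFun()
res&lt;/code&gt;&lt;/pre&gt;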
&lt;/div&gt;
&lt;div id=&#34;viewing-files-objects-functions-and-more-efficiently&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Viewing files, objects, functions and more efficiently&lt;/h1&gt;
&lt;p&gt;As a second step, we write a function to actually view our object in RStudio. We have quite some flexibility here, so as a first shot we can do the following:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Open a file if the selection (or the selection with quotes added) is a path to an existing file. This is useful for viewing our scripts, data files, etc. even if they are not quoted, such as the links in your &lt;code&gt;Rmd&lt;/code&gt; files&lt;/li&gt;
&lt;li&gt;Attempt to retrieve the object by the name and if found, try to use &lt;code&gt;View&lt;/code&gt; to view it&lt;/li&gt;
&lt;li&gt;If we did not find the object, we can optionally still try to retrieve the value by evaluating the provided character string. This carries some pitfalls, but is very useful for example for
&lt;ul&gt;
&lt;li&gt;viewing elements of lists, vectors, etc. where we need to evaluate &lt;code&gt;[&lt;/code&gt;, &lt;code&gt;[[&lt;/code&gt; or &lt;code&gt;$&lt;/code&gt; to do so.&lt;/li&gt;
&lt;li&gt;viewing operation results directly in the viewer, as opposed to writing them out into the console, useful for example for wide matrices that (subjectively) look better in the RStudio viewer, compared to the console output&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;If the &lt;code&gt;View&lt;/code&gt; fails, we can still show useful information by trying to View its structure, enabling us to inspect objects that cannot be coerced to a &lt;code&gt;data.frame&lt;/code&gt; and therefore would fail to be viewed.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;viewObject &amp;lt;- function(chr,
                       tryEval = getOption(&amp;quot;jhaddins_view_tryeval&amp;quot;,
                                           default = TRUE)
                       ) {

  if (!(is.character(chr) &amp;amp;&amp;amp; length(chr) == 1 &amp;amp;&amp;amp; nchar(chr) &amp;gt; 0)) {
    message(&amp;quot;Invalid input, expecting a non-empty character of length 1&amp;quot;)
    return(invisible(1L))
  }

  ViewWrap &amp;lt;- get(&amp;quot;View&amp;quot;, envir = as.environment(&amp;quot;package:utils&amp;quot;))

  # maybe it is an unquoted filename - if so, open it
  if (file.exists(chr)) {
    rstudioapi::navigateToFile(chr)
    return(invisible(0L))
  }
  # or maybe it is a quoted filename - if so, open it
  if (file.exists(gsub(&amp;quot;\&amp;quot;&amp;quot;, &amp;quot;&amp;quot;, chr, fixed = TRUE))) {
    rstudioapi::navigateToFile(gsub(&amp;quot;\&amp;quot;&amp;quot;, &amp;quot;&amp;quot;, chr, fixed = TRUE))
    return(invisible(0L))
  }

  obj &amp;lt;- getFromSysframes(chr)

  if (is.null(obj)) {
    if (isTRUE(tryEval)) {
      # object not found, try evaluating
      try(obj &amp;lt;- eval(parse(text = chr)), silent = TRUE)
    }
    if (is.null(obj)) {
      message(sprintf(&amp;quot;Object %s not found&amp;quot;, chr))
      return(invisible(1L))
    }
  }

  # try to View capturing output for potential errors
  Viewout &amp;lt;- utils::capture.output(ViewWrap(obj, title = chr))
  if (length(Viewout) &amp;gt; 0 &amp;amp;&amp;amp; any(grepl(&amp;quot;Error&amp;quot;, Viewout))) {
    # could not view, try to at least View the str of the object
    strcmd &amp;lt;- sprintf(&amp;quot;str(%s)&amp;quot;, chr)
    message(paste(Viewout,&amp;quot;| trying to View&amp;quot;, strcmd))
    ViewWrap(utils::capture.output(utils::str(obj)), title = strcmd)
  }

  return(invisible(0L))
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function can of course be improved and updated in many ways, for example using the &lt;code&gt;summary&lt;/code&gt; method instead of &lt;code&gt;str&lt;/code&gt; for selected object classes, or showing contents of &lt;code&gt;.csv&lt;/code&gt; (or other data) files already read into a &lt;code&gt;data.frame&lt;/code&gt;.&lt;/p&gt;
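&lt;p&gt;The evaluation fallback from the third step can be looked at in isolation. In this sketch the &lt;code&gt;settings&lt;/code&gt; list is made up for illustration: a selection such as &lt;code&gt;settings$threshold&lt;/code&gt; is not the name of any object, so a lookup by name returns &lt;code&gt;NULL&lt;/code&gt; and only evaluating the text retrieves the value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A hypothetical object and a selected text that refers to one element
settings &amp;lt;- list(threshold = 0.5)
chr &amp;lt;- &amp;quot;settings$threshold&amp;quot;

# Lookup by name fails, there is no object with this exact name
byName &amp;lt;- get0(chr)

# Evaluating the text succeeds; try() keeps parsing errors silent
byEval &amp;lt;- NULL
try(byEval &amp;lt;- eval(parse(text = chr)), silent = TRUE)

is.null(byName)
byEval&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that evaluating arbitrary selected text also runs any side effects it contains, which is one of the pitfalls mentioned above and a reason to keep the behaviour optional.&lt;/p&gt;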
&lt;/div&gt;
&lt;div id=&#34;the-addin-function-updating-the-.dcf-file-and-key-binding&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The addin function, updating the .dcf file and key binding&lt;/h1&gt;
&lt;p&gt;If you followed the previous posts in the series, you most likely already know what is coming up next. First, we need a function serving as a binding for the addin that will execute our &lt;code&gt;viewObject&lt;/code&gt; function on the active document’s selections:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;viewSelection &amp;lt;- function() {
  context &amp;lt;- rstudioapi::getActiveDocumentContext()
  lapply(X = context[[&amp;quot;selection&amp;quot;]]
         , FUN = function(thisSel) {
           viewObject(thisSel[[&amp;quot;text&amp;quot;]])
         }
  )
  return(invisible(NULL))
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Secondly, we update the &lt;code&gt;inst/rstudio/addins.dcf&lt;/code&gt; file by adding the binding for the newly created addin:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Name: viewSelection
Description: Tries to use View to View the object defined by a text selected in RStudio
Binding: viewSelection
Interactive: false&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we re-install the package and assign the keyboard shortcut in the &lt;code&gt;Tools -&amp;gt; Addins -&amp;gt; Browse Addins... -&amp;gt; Keyboard Shortcuts...&lt;/code&gt; menu. Personally I assigned a single &lt;code&gt;F4&lt;/code&gt; keystroke for this, as I use it very often:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r103-03-key-binding.gif&#34; alt=&#34;Assigning a keyboard shortcut to use the Addin&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Assigning a keyboard shortcut to use the Addin&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-addin-in-action&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The addin in action&lt;/h1&gt;
&lt;p&gt;Now, let’s view a few files, a &lt;code&gt;data.frame&lt;/code&gt;, a &lt;code&gt;function&lt;/code&gt; and a &lt;code&gt;try-error&lt;/code&gt; class object just by pressing &lt;code&gt;F4&lt;/code&gt;.
&lt;img src=&#34;../img/r103-01-showcase.gif&#34; alt=&#34;The addins in action&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the package&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;get the &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/tags/addins3-viewer&#34;&gt;status of the package after this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;or use &lt;code&gt;git clone&lt;/code&gt; from &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins.git&#34;&gt;&lt;code&gt;https://gitlab.com/jozefhajnala/jhaddins.git&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://adv-r.had.co.nz/Environments.html&#34;&gt;Environments chapter&lt;/a&gt; of Advanced R&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/205175388-Using-the-Data-Viewer&#34;&gt;Using RStudio’s Data Viewer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>RStudio:addins part 2 - roxygen documentation formatting made easy</title>
      <link>https://jozef.io/r102-addin-roxytags/</link>
      <pubDate>Sat, 12 May 2018 14:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r102-addin-roxytags/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Code documentation is extremely important if you want to share the code with anyone else, future you included. In this second post in the RStudio:addins series we will pay a part of our technical debt from the &lt;a href=&#34;../r101-addin-reproducibility&#34;&gt;previous article&lt;/a&gt; and document our R functions conveniently using a new addin we will build for this purpose.&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;the-addin-we-will-create-in-this-article-will-let-us-create-well-formatted-roxygen-documentation-easily-by-using-keyboard-shortcuts-to-add-useful-tags-such-as-code-or-link-around-selected-text-in-rstudio.&#34;&gt;The addin we will create in this article will let us create well formatted roxygen documentation easily by using keyboard shortcuts to add useful tags such as &lt;code&gt;\code{}&lt;/code&gt; or &lt;code&gt;\link{}&lt;/code&gt; around selected text in RStudio.&lt;/h4&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#quick-intro-to-documentation-with-roxygen2&#34;&gt;Quick intro to documentation with roxygen2&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#documenting-your-first-function&#34;&gt;Documenting your first function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#generating-and-viewing-the-documentation&#34;&gt;Generating and viewing the documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-real-life-example&#34;&gt;Real-life example&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#our-addins-to-make-documenting-a-breeze&#34;&gt;Our addins to make documenting a breeze&lt;/a&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#add-a-selected-tag-around-a-character-string&#34;&gt;Add a selected tag around a character string&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#apply-the-tag-on-a-selection-in-an-active-document-in-rstudio&#34;&gt;Apply the tag on a selection in an active document in RStudio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#wrappers-around-addroxytag-to-be-used-as-addin-for-some-useful-tags&#34;&gt;Wrappers around &lt;code&gt;addRoxytag&lt;/code&gt; to be used as addin for some useful tags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#add-the-addin-bindings-into-addins.dcf-and-assign-keyboard-shortcuts&#34;&gt;Add the addin bindings into &lt;code&gt;addins.dcf&lt;/code&gt; and assign keyboard shortcuts&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-addins-in-action&#34;&gt;The addins in action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-is-next---even-more-automated-documentation&#34;&gt;What is next - even more automated documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-give-me-the-package&#34;&gt;TL;DR - Just give me the package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;quick-intro-to-documentation-with-roxygen2&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Quick intro to documentation with roxygen2&lt;/h1&gt;
&lt;div id=&#34;documenting-your-first-function&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;1. Documenting your first function&lt;/h2&gt;
&lt;p&gt;To help us generate documentation easily we will be using the &lt;a href=&#34;https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html&#34;&gt;roxygen2&lt;/a&gt; package. You can install it using &lt;code&gt;install.packages(&amp;quot;roxygen2&amp;quot;)&lt;/code&gt;. Roxygen2 works with in-code tags and will generate R’s documentation format &lt;code&gt;.Rd&lt;/code&gt; files, create a &lt;code&gt;NAMESPACE&lt;/code&gt;, and manage the &lt;code&gt;Collate&lt;/code&gt; field in &lt;code&gt;DESCRIPTION&lt;/code&gt; (not relevant to us at this point) automatically for our package.&lt;/p&gt;
&lt;p&gt;Documenting a function works in 2 simple steps:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r102-01-add-roxy-skeleton.gif&#34; alt=&#34;Documenting a function&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Documenting a function&lt;/p&gt;
&lt;/div&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Inserting a skeleton - Do this by placing your cursor anywhere in the function you want to document and clicking &lt;code&gt;Code Tools -&amp;gt; Insert Roxygen Skeleton&lt;/code&gt; (default keyboard shortcut &lt;code&gt;Ctrl+Shift+Alt+R&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Populating the skeleton with relevant information. A few important tags are:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;#&#39; @param&lt;/code&gt; - describing the arguments of the function&lt;/li&gt;
&lt;li&gt;&lt;code&gt;#&#39; @return&lt;/code&gt; - describing what the function returns&lt;/li&gt;
&lt;li&gt;&lt;code&gt;#&#39; @importFrom package function&lt;/code&gt; - in case your function uses a function from a different package; roxygen2 will then automatically add the import to the &lt;code&gt;NAMESPACE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;#&#39; @export&lt;/code&gt; - in case you want the function to be exported (mainly for use by other packages)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;#&#39; @examples&lt;/code&gt; - showing how to use the function in practice&lt;/li&gt;
&lt;/ul&gt;
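&lt;p&gt;As a minimal sketch of how these tags fit together, here is a small function documented with roxygen comments. The function &lt;code&gt;addOne&lt;/code&gt; and its argument are hypothetical and serve only as an illustration; they are not part of the package we are building:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#&amp;#39; Add one to a number (hypothetical example function)
#&amp;#39; @param x numeric vector, the values to increment
#&amp;#39; @return numeric vector of the same length as x, each element increased by 1
#&amp;#39; @export
#&amp;#39; @examples
#&amp;#39; addOne(1:3)
addOne &amp;lt;- function(x) {
  # fail early on non-numeric input
  stopifnot(is.numeric(x))
  x + 1
}&lt;/code&gt;&lt;/pre&gt;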
&lt;/div&gt;
&lt;div id=&#34;generating-and-viewing-the-documentation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;2. Generating and viewing the documentation&lt;/h2&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r102-02-generate-doc.gif&#34; alt=&#34;Generating and viewing the documentation&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Generating and viewing the documentation&lt;/p&gt;
&lt;/div&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Generating the documentation files using &lt;code&gt;roxygen2::roxygenise()&lt;/code&gt; or &lt;code&gt;devtools::document()&lt;/code&gt; (default keyboard shortcut &lt;code&gt;Ctrl+Shift+D&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Re-installing the package (default keyboard shortcut &lt;code&gt;Ctrl+Shift+B&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Viewing the documentation for a function using &lt;code&gt;?functionname&lt;/code&gt;, e.g. &lt;code&gt;?mean&lt;/code&gt;, or placing the cursor on a function name and pressing &lt;code&gt;F1&lt;/code&gt; in RStudio - this will open the &lt;code&gt;Help&lt;/code&gt; pane with the help for that function&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;a-real-life-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;3. A real-life example&lt;/h2&gt;
&lt;p&gt;Let us now document &lt;code&gt;runCurrentRscript&lt;/code&gt; a little bit:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#&amp;#39; runCurrentRscript
#&amp;#39; @description Wrapper around executeCmd with default arguments for easy use as an RStudio addin
#&amp;#39; @param path character(1) string, specifying the path of the file to be used as Rscript argument (ideally a path to an R script)
#&amp;#39; @param outputFile character(1) string, specifying the name of the file, into which the output produced by running the Rscript will be written
#&amp;#39; @param suffix character(1) string, specifying additional suffix to pass to the command
#&amp;#39; @importFrom rstudioapi getActiveDocumentContext
#&amp;#39; @importFrom rstudioapi navigateToFile
#&amp;#39; @seealso executeCmd
#&amp;#39; @return side-effects
runCurrentRscript &amp;lt;- function(
  path = replaceTilde(rstudioapi::getActiveDocumentContext()[[&amp;quot;path&amp;quot;]])
, outputFile = &amp;quot;output.txt&amp;quot;
, suffix = &amp;quot;2&amp;gt;&amp;amp;1&amp;quot;) {
  cmd &amp;lt;- makeCmd(path, outputFile = outputFile, suffix = suffix)
  executeCmd(cmd)
  if (!is.null(outputFile) &amp;amp;&amp;amp; file.exists(outputFile)) {
    rstudioapi::navigateToFile(outputFile)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see by looking at &lt;code&gt;?runCurrentRscript&lt;/code&gt; versus &lt;code&gt;?mean&lt;/code&gt;, our documentation does not quite look up to par with documentation for other functions:
&lt;img src=&#34;../img/r102-03-compare.png&#34; alt=&#34;comparing the view of documentation for base::mean and runCurrentRscript&#34; /&gt;&lt;/p&gt;
&lt;p&gt;What is missing, if we set aside the richness of the content, is the use of markup commands (tags) for formatting and linking our documentation. Some very useful tags are, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;\code{}&lt;/code&gt;, &lt;code&gt;\strong{}&lt;/code&gt;, &lt;code&gt;\emph{}&lt;/code&gt; for font style&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\link{}&lt;/code&gt;, &lt;code&gt;\href{}&lt;/code&gt;, &lt;code&gt;\url{}&lt;/code&gt; for linking to other parts of the documentation or external resources&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\enumerate{}&lt;/code&gt;, &lt;code&gt;\itemize{}&lt;/code&gt;, &lt;code&gt;\tabular{}&lt;/code&gt; for using lists and tables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\eqn{}&lt;/code&gt;, &lt;code&gt;\deqn{}&lt;/code&gt; for mathematical expressions such as equations etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;For the full list of options regarding text formatting, linking and more see &lt;a href=&#34;https://cran.r-project.org/doc/manuals/R-exts.html#Rd-format&#34;&gt;Writing R Extensions’ Rd format chapter&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
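&lt;p&gt;To illustrate, a roxygen block using a few of these markup commands might look like the following sketch. The function &lt;code&gt;rescale&lt;/code&gt; is a made-up example and not part of our package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#&amp;#39; Rescale a numeric vector to the unit interval (hypothetical example)
#&amp;#39; @description Computes \eqn{(x - min(x)) / (max(x) - min(x))} for each element of \code{x}.
#&amp;#39; @param x \code{numeric} vector with at least two distinct values
#&amp;#39; @return \code{numeric} vector with values between \strong{0} and \strong{1}
#&amp;#39; @seealso \code{\link{range}}, \code{\link{scale}}
rescale &amp;lt;- function(x) {
  # range() returns c(min, max)
  rng &amp;lt;- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}&lt;/code&gt;&lt;/pre&gt;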
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;our-addins-to-make-documenting-a-breeze&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Our addins to make documenting a breeze&lt;/h1&gt;
&lt;p&gt;As you can imagine, typing the markup commands in full all the time is quite tedious. The goal of our new addin will therefore be to make this process efficient using keyboard shortcuts - just select some text and our addin will place the desired tags around it. For now, we will settle for simple one-line tags.&lt;/p&gt;
&lt;div id=&#34;add-a-selected-tag-around-a-character-string&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;1. Add a selected tag around a character string&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;roxyfy &amp;lt;- function(str, tag = NULL, splitLines = TRUE) {
  if (is.null(tag)) {
    return(str)
  }
  if (!isTRUE(splitLines)) {
    return(paste0(&amp;quot;\\&amp;quot;, tag, &amp;quot;{&amp;quot;, str, &amp;quot;}&amp;quot;))
  }
  str &amp;lt;- unlist(strsplit(str, &amp;quot;\n&amp;quot;))
  str &amp;lt;- paste0(&amp;quot;\\&amp;quot;, tag, &amp;quot;{&amp;quot;, str, &amp;quot;}&amp;quot;)
  paste(str, collapse = &amp;quot;\n&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;
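&lt;p&gt;A quick usage sketch shows what &lt;code&gt;roxyfy&lt;/code&gt; returns for single-line and multi-line input. The function definition is repeated here only so that the snippet runs on its own, and the input strings are arbitrary examples:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;roxyfy &amp;lt;- function(str, tag = NULL, splitLines = TRUE) {
  if (is.null(tag)) {
    return(str)
  }
  if (!isTRUE(splitLines)) {
    return(paste0(&amp;quot;\\&amp;quot;, tag, &amp;quot;{&amp;quot;, str, &amp;quot;}&amp;quot;))
  }
  str &amp;lt;- unlist(strsplit(str, &amp;quot;\n&amp;quot;))
  str &amp;lt;- paste0(&amp;quot;\\&amp;quot;, tag, &amp;quot;{&amp;quot;, str, &amp;quot;}&amp;quot;)
  paste(str, collapse = &amp;quot;\n&amp;quot;)
}

# cat() prints the strings as they would appear in the document
cat(roxyfy(&amp;quot;mean&amp;quot;, tag = &amp;quot;code&amp;quot;))
# \code{mean}

# by default, each line of a multi-line string is tagged separately
cat(roxyfy(&amp;quot;mean\nmedian&amp;quot;, tag = &amp;quot;link&amp;quot;))
# \link{mean}
# \link{median}

# with splitLines = FALSE, the whole string is wrapped in a single tag
cat(roxyfy(&amp;quot;mean\nmedian&amp;quot;, tag = &amp;quot;link&amp;quot;, splitLines = FALSE))
# \link{mean
# median}&lt;/code&gt;&lt;/pre&gt;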
&lt;/div&gt;
&lt;div id=&#34;apply-the-tag-on-a-selection-in-an-active-document-in-rstudio&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;2. Apply the tag on a selection in an active document in RStudio&lt;/h2&gt;
&lt;p&gt;We will make the functionality available for multi-selections as well by &lt;code&gt;lapply-ing&lt;/code&gt; over the &lt;code&gt;selection&lt;/code&gt; elements retrieved from the active document in RStudio.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;addRoxytag &amp;lt;- function(tag = NULL) {
  context &amp;lt;- rstudioapi::getActiveDocumentContext()
  lapply(X = context[[&amp;quot;selection&amp;quot;]]
       , FUN = function(thisSel, contextid) {
           rstudioapi::modifyRange(location = thisSel[[&amp;quot;range&amp;quot;]]
                                 , roxyfy(thisSel[[&amp;quot;text&amp;quot;]], tag)
                                 , id = contextid)
         }
       , contextid = context[[&amp;quot;id&amp;quot;]]
       )
  return(invisible(NULL))
}&lt;/code&gt;&lt;/pre&gt;
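&lt;p&gt;Since &lt;code&gt;addRoxytag&lt;/code&gt; needs a running RStudio session, the following sketch mimics the idea outside of RStudio. It assumes a hand-made list that resembles the &lt;code&gt;selection&lt;/code&gt; element of the document context (the real object also carries a &lt;code&gt;range&lt;/code&gt; for each selection), and &lt;code&gt;wrapTag&lt;/code&gt; is a hypothetical, simplified stand-in for &lt;code&gt;roxyfy&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical stand-in for roxyfy, wraps text into \tag{}
wrapTag &amp;lt;- function(str, tag) {
  paste0(&amp;quot;\\&amp;quot;, tag, &amp;quot;{&amp;quot;, str, &amp;quot;}&amp;quot;)
}

# hand-made mock of a context with two selections (an assumption, for illustration)
mockContext &amp;lt;- list(
  id = &amp;quot;mock-document&amp;quot;
, selection = list(
    list(text = &amp;quot;data.frame&amp;quot;)
  , list(text = &amp;quot;lapply&amp;quot;)
  )
)

# apply the tag to each selection, the same way addRoxytag iterates via lapply
tagged &amp;lt;- lapply(
  X = mockContext[[&amp;quot;selection&amp;quot;]]
, FUN = function(thisSel, tag) wrapTag(thisSel[[&amp;quot;text&amp;quot;]], tag)
, tag = &amp;quot;code&amp;quot;
)
unlist(tagged)
# [1] &amp;quot;\\code{data.frame}&amp;quot; &amp;quot;\\code{lapply}&amp;quot;&lt;/code&gt;&lt;/pre&gt;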
&lt;/div&gt;
&lt;div id=&#34;wrappers-around-addroxytag-to-be-used-as-addin-for-some-useful-tags&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;3. Wrappers around &lt;code&gt;addRoxytag&lt;/code&gt; to be used as addin for some useful tags&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;addRoxytagCode &amp;lt;- function() {
  addRoxytag(tag = &amp;quot;code&amp;quot;)
}

addRoxytagLink &amp;lt;- function() {
  addRoxytag(tag = &amp;quot;link&amp;quot;)
}

addRoxytagEqn &amp;lt;- function() {
  addRoxytag(tag = &amp;quot;eqn&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;add-the-addin-bindings-into-addins.dcf-and-assign-keyboard-shortcuts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;4. Add the addin bindings into &lt;code&gt;addins.dcf&lt;/code&gt; and assign keyboard shortcuts&lt;/h2&gt;
&lt;p&gt;As the final step, we need to add the bindings for our new addins to the &lt;code&gt;inst/rstudio/addins.dcf&lt;/code&gt; file and re-install the package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Name: addRoxytagCode
Description: Adds roxygen tag code to current selections in the active RStudio document
Binding: addRoxytagCode
Interactive: false

Name: addRoxytagLink
Description: Adds roxygen tag link to current selections in the active RStudio document
Binding: addRoxytagLink
Interactive: false

Name: addRoxytagEqn
Description: Adds roxygen tag eqn to current selections in the active RStudio document
Binding: addRoxytagEqn
Interactive: false&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r102-04-assign-shortcuts.gif&#34; alt=&#34;assigning keyboard shortcuts to addins&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;assigning keyboard shortcuts to addins&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-addins-in-action&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The addins in action&lt;/h1&gt;
&lt;p&gt;And now, let’s just select the text we want to format and watch our addins do the work for us! Then document the package, re-install it and view the improved help for our functions:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r102-05-addins-in-action.gif&#34; alt=&#34;The addins in action&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;The addins in action&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-next---even-more-automated-documentation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;What is next - even more automated documentation&lt;/h1&gt;
&lt;p&gt;Next time we will try to enrich our addins for generating documentation by adding the following functionalities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;automatic generation of &lt;code&gt;@importFrom&lt;/code&gt; tags by inspecting the function code&lt;/li&gt;
&lt;li&gt;allowing for more complex tags such as &lt;code&gt;itemize&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the package&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Get the &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/tags/addins2-roxytagsMadeFast&#34;&gt;status of the package after this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;or use &lt;code&gt;git clone&lt;/code&gt; from
&lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins.git&#34;&gt;&lt;code&gt;https://gitlab.com/jozefhajnala/jhaddins.git&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html&#34;&gt;Generating Rd files&lt;/a&gt; with Roxygen2&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/roxygen2/vignettes/formatting.html&#34;&gt;Formatting reference sheet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/doc/manuals/R-exts.html#Rd-format&#34;&gt;Writing R Extensions - Rd format&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>RStudio:addins part 1 - code reproducibility testing</title>
      <link>https://jozef.io/r101-addin-reproducibility/</link>
      <pubDate>Sat, 05 May 2018 14:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r101-addin-reproducibility/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is the first post in the RStudio:addins series. The aim of the series is to walk the readers through creating an R package that will contain functionality for integrating useful addins into the RStudio IDE. At the end of this first article, your RStudio will be 1 useful addin richer.&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;the-addin-we-will-create-in-this-article-will-let-us-run-a-script-open-in-rstudio-in-r-vanilla-mode-via-a-keyboard-shortcut-and-open-a-file-with-the-scripts-output-in-rstudio.&#34;&gt;The addin we will create in this article will let us run a script open in RStudio in R vanilla mode via a keyboard shortcut and open a file with the script’s output in RStudio.&lt;/h4&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is useful for testing whether your script is reproducible by users that do not have the same start-up options as you (e.g. preloaded environment, site file, etc.), making it a good tool to test your scripts before sharing them.&lt;/p&gt;
&lt;p&gt;If you want to get straight to the code, you can find it at &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins.git&#34;&gt;&lt;code&gt;https://gitlab.com/jozefhajnala/jhaddins.git&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prerequisites-and-recommendations&#34;&gt;Prerequisites and recommendations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-1---creating-a-package&#34;&gt;Step 1 - Creating a package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-2---writing-the-first-functions&#34;&gt;Step 2 - Writing the first functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-3---setting-up-an-addin&#34;&gt;Step 3 - Setting up an addin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-4---updating-our-description-and-namespace&#34;&gt;Step 4 - Updating our &lt;code&gt;DESCRIPTION&lt;/code&gt; and &lt;code&gt;NAMESPACE&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-is-next---always-paying-our-technical-debts&#34;&gt;What is next - Always paying our (technical) debts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#wrapping-up&#34;&gt;Wrapping up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-give-me-the-package&#34;&gt;TL;DR - Just give me the package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;prerequisites-and-recommendations&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Prerequisites and recommendations&lt;/h1&gt;
&lt;p&gt;To make the most use of the series, you will need the following:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;http://www.cran.r-project.org/&#34;&gt;R&lt;/a&gt;, ideally version 3.4.3 or more recent, 64bit&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/products/RStudio/&#34;&gt;RStudio&lt;/a&gt; IDE, ideally version 1.1.383 or more recent&lt;/li&gt;
&lt;li&gt;Also recommended&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://git-scm.com/downloads&#34;&gt;git&lt;/a&gt;, for version control&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tortoisegit.org/&#34;&gt;TortoiseGit&lt;/a&gt;, convenient shell interface to git for those using Windows, with pretty icons and all&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Recommended R packages (install with &lt;code&gt;install.packages(&amp;quot;packagename&amp;quot;)&lt;/code&gt;, or via RStudio’s Packages tab):&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;devtools&lt;/code&gt; - makes your development life easier&lt;/li&gt;
&lt;li&gt;&lt;code&gt;testthat&lt;/code&gt; - provides a framework for unit testing integrated into RStudio&lt;/li&gt;
&lt;li&gt;&lt;code&gt;roxygen2&lt;/code&gt; - makes code documentation easy&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;step-1---creating-a-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Step 1 - Creating a package&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Use &lt;code&gt;devtools::create&lt;/code&gt; to create a package (note that we will update more &lt;code&gt;DESCRIPTION&lt;/code&gt; fields later; you can also choose any path you like, and it will be reflected in the name of the package)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::create(
  path = &amp;quot;jhaddins&amp;quot;
, description = list(&amp;quot;License&amp;quot; = &amp;quot;GPL-3&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;In RStudio or elsewhere navigate to the &lt;code&gt;jhaddins&lt;/code&gt; folder and open the project &lt;code&gt;jhaddins.Rproj&lt;/code&gt; (or the name of your project if you chose a different path)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run the first check and install the package&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::check()   # Ctrl+Shift+E or Check button on RStudio&amp;#39;s build tab
devtools::install() # Ctrl+Shift+B or Install button on RStudio&amp;#39;s build tab&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Optionally, initialize git for version control&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::use_git() # in devtools 2.0.0 and later, use usethis::use_git() instead&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2---writing-the-first-functions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Step 2 - Writing the first functions&lt;/h1&gt;
&lt;p&gt;We will now write some functions into a file called &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/master/R/makeCmd.R&#34;&gt;makeCmd.R&lt;/a&gt; that will let us run the desired functionality:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;makeCmd&lt;/code&gt; to create a command executable via &lt;code&gt;system&lt;/code&gt; or &lt;code&gt;shell&lt;/code&gt;, with defaults set up for executing an R file specified by &lt;code&gt;path&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;makeCmd &amp;lt;- function(path
                  , command = &amp;quot;Rscript&amp;quot;
                  , opts = &amp;quot;--vanilla&amp;quot;
                  , outputFile = NULL
                  , suffix = NULL
                  , addRhome = TRUE) {
  if (Sys.info()[&amp;quot;sysname&amp;quot;] == &amp;quot;Windows&amp;quot;) {
    qType &amp;lt;- &amp;quot;cmd2&amp;quot;
  } else {
    qType &amp;lt;- &amp;quot;sh&amp;quot;
  }
  if (isTRUE(addRhome)) {
    command &amp;lt;- file.path(R.home(&amp;quot;bin&amp;quot;), command)
  }
  cmd &amp;lt;- paste(
    shQuote(command, type = qType)
  , shQuote(opts, type = qType)
  , shQuote(path, type = qType)
  )
  if (!is.null(outputFile)) {
    cmd &amp;lt;- paste(cmd, &amp;quot;&amp;gt;&amp;quot;, shQuote(outputFile))
  }
  if (!is.null(suffix)) {
    cmd &amp;lt;- paste(cmd, suffix)
  }
  cmd
}&lt;/code&gt;&lt;/pre&gt;
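&lt;p&gt;To see what kind of string is being assembled, the following sketch builds a command by hand with &lt;code&gt;shQuote&lt;/code&gt; and &lt;code&gt;paste&lt;/code&gt;. It fixes the quoting type to &lt;code&gt;&amp;quot;sh&amp;quot;&lt;/code&gt; so that the result is the same on every platform, and the file names are made up for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# quote each part separately so that spaces in paths do not break the command
cmd &amp;lt;- paste(
  shQuote(&amp;quot;Rscript&amp;quot;, type = &amp;quot;sh&amp;quot;)
, shQuote(&amp;quot;--vanilla&amp;quot;, type = &amp;quot;sh&amp;quot;)
, shQuote(&amp;quot;my script.R&amp;quot;, type = &amp;quot;sh&amp;quot;)
)

# redirect standard output to a file and append a suffix redirecting standard error
cmd &amp;lt;- paste(cmd, &amp;quot;&amp;gt;&amp;quot;, shQuote(&amp;quot;output.txt&amp;quot;, type = &amp;quot;sh&amp;quot;), &amp;quot;2&amp;gt;&amp;amp;1&amp;quot;)
cat(cmd)
# &amp;#39;Rscript&amp;#39; &amp;#39;--vanilla&amp;#39; &amp;#39;my script.R&amp;#39; &amp;gt; &amp;#39;output.txt&amp;#39; 2&amp;gt;&amp;amp;1&lt;/code&gt;&lt;/pre&gt;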
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;executeCmd&lt;/code&gt; to execute a command&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;executeCmd &amp;lt;- function(cmd, intern = FALSE) {
  sysName &amp;lt;- Sys.info()[&amp;quot;sysname&amp;quot;]
  stopifnot(
    is.character(cmd)
  , length(cmd) == 1
  , sysName %in% c(&amp;quot;Windows&amp;quot;, &amp;quot;Linux&amp;quot;)
  )

  if (sysName == &amp;quot;Windows&amp;quot;) {
    shell(cmd, intern = intern)
  } else {
    system(cmd, intern = intern)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;replaceTilde&lt;/code&gt; to replace a leading &lt;code&gt;~&lt;/code&gt; in a path with the home directory, for Linux purposes&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replaceTilde &amp;lt;- function(path) {
  if (substr(path, 1, 1) == &amp;quot;~&amp;quot;) {
    path &amp;lt;- sub(&amp;quot;~&amp;quot;, Sys.getenv(&amp;quot;HOME&amp;quot;), path, fixed = TRUE)
  }
  file.path(path)
}&lt;/code&gt;&lt;/pre&gt;
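&lt;p&gt;A short usage sketch, with the definition repeated so that the snippet runs on its own. The paths are made up, and the call to base R’s &lt;code&gt;path.expand&lt;/code&gt; is only meant as a pointer to a related function, which may behave differently across platforms:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replaceTilde &amp;lt;- function(path) {
  if (substr(path, 1, 1) == &amp;quot;~&amp;quot;) {
    path &amp;lt;- sub(&amp;quot;~&amp;quot;, Sys.getenv(&amp;quot;HOME&amp;quot;), path, fixed = TRUE)
  }
  file.path(path)
}

# a leading tilde is replaced by the value of the HOME environment variable
replaceTilde(&amp;quot;~/scripts/analysis.R&amp;quot;)

# paths without a leading tilde are returned unchanged
replaceTilde(&amp;quot;/tmp/analysis.R&amp;quot;)

# related base R function for expanding a leading tilde
path.expand(&amp;quot;~/scripts/analysis.R&amp;quot;)&lt;/code&gt;&lt;/pre&gt;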
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;And finally the function which will be used for the addin execution - &lt;code&gt;runCurrentRscript&lt;/code&gt; to retrieve the path to the currently active file in RStudio, run it, write the output to a file &lt;code&gt;output.txt&lt;/code&gt; and open the file with output.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;runCurrentRscript &amp;lt;- function(
  path = replaceTilde(rstudioapi::getActiveDocumentContext()[[&amp;quot;path&amp;quot;]])
, outputFile = &amp;quot;output.txt&amp;quot;) {
  cmd &amp;lt;- makeCmd(path, outputFile = outputFile)
  executeCmd(cmd)
  if (!is.null(outputFile) &amp;amp;&amp;amp; file.exists(outputFile)) {
    rstudioapi::navigateToFile(outputFile)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3---setting-up-an-addin&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Step 3 - Setting up an addin&lt;/h1&gt;
&lt;p&gt;Now that we have all our functions ready, all we have to do is create a file &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/master/inst/rstudio/addins.dcf&#34;&gt;addins.dcf&lt;/a&gt; under the &lt;code&gt;inst/rstudio&lt;/code&gt; folder of our package. We specify the &lt;code&gt;Name&lt;/code&gt; of the addin, write a nice &lt;code&gt;Description&lt;/code&gt; of what it does and most importantly specify the &lt;code&gt;Binding&lt;/code&gt; to the function we want to call:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r101-addin-01.gif&#34; alt=&#34;creating addins.dcf under inst/rstudio&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;creating addins.dcf under inst/rstudio&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Name: runCurrentRscript
Description: Executes the currently open R script file via Rscript with --vanilla option
Binding: runCurrentRscript
Interactive: false&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can rebuild and install our package and in RStudio’s menu navigate to &lt;code&gt;Tools -&amp;gt; Addins -&amp;gt; Browse Addins...&lt;/code&gt;, and there it is - our first addin. For the best experience, we can click the &lt;code&gt;Keyboard Shortcuts...&lt;/code&gt; button and assign a keyboard shortcut to our addin for easy use.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;../img/r101-addin-02.gif&#34; alt=&#34;setting an RStudio addin keyboard shortcut&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;setting an RStudio addin keyboard shortcut&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Now just open an R script, hit our shortcut and voilà, our script gets executed via Rscript in vanilla mode.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4---updating-our-description-and-namespace&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Step 4 - Updating our &lt;code&gt;DESCRIPTION&lt;/code&gt; and &lt;code&gt;NAMESPACE&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;As our last steps, we should&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Update our &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/master/DESCRIPTION&#34;&gt;DESCRIPTION&lt;/a&gt; file with &lt;code&gt;rstudioapi&lt;/code&gt; as &lt;code&gt;Imports&lt;/code&gt;, as we will be needing it before using our package:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Package: jhaddins
Title: JH&amp;#39;s RStudio Addins
Version: 0.0.0.9000
Authors@R: person(&amp;quot;Jozef&amp;quot;, &amp;quot;Hajnala&amp;quot;, email = &amp;quot;jozef.hajnala@gmail.com&amp;quot;, role = c(&amp;quot;aut&amp;quot;, &amp;quot;cre&amp;quot;))
Description: Useful addins to make RStudio even better.
Depends: R (&amp;gt;= 3.0.1)
Imports: rstudioapi (&amp;gt;= 0.7)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Update our &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/blob/master/NAMESPACE&#34;&gt;NAMESPACE&lt;/a&gt; by importing the functions from other packages that we are using, namely:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;importFrom(rstudioapi, navigateToFile)
importFrom(rstudioapi, getActiveDocumentContext)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can finally rebuild and install our package again and run a &lt;code&gt;CHECK&lt;/code&gt; to see that we have no errors, warnings or notes telling us something is wrong. Make sure to use the &lt;code&gt;document = FALSE&lt;/code&gt; argument for now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::install() # Ctrl+Shift+B or Install button on RStudio&amp;#39;s build tab
devtools::check(document = FALSE)   # Ctrl+Shift+E or Check button on RStudio&amp;#39;s build tab&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-next---always-paying-our-technical-debts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;What is next - Always paying our (technical) debts&lt;/h1&gt;
&lt;p&gt;In the next post of the series, we will pay our debt of&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;missing documentation for our functions, which will help us generate updates to our &lt;code&gt;NAMESPACE&lt;/code&gt; automatically and give us nice documentation, so that we can read about our functions using &lt;code&gt;?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;and unit tests to help us sleep better knowing that our functions get tested!&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Wrapping up&lt;/h1&gt;
&lt;p&gt;We can quickly create an RStudio addin by:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Creating an R package&lt;/li&gt;
&lt;li&gt;Writing a function in that package&lt;/li&gt;
&lt;li&gt;Creating an &lt;code&gt;addins.dcf&lt;/code&gt; file in the &lt;code&gt;inst/rstudio&lt;/code&gt; folder of our package&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-give-me-the-package&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just give me the package&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Get the &lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins/tags/addins1-codeReproducibilityTesting&#34;&gt;status of the package after this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;or use &lt;code&gt;git clone&lt;/code&gt; from
&lt;a href=&#34;https://gitlab.com/jozefhajnala/jhaddins.git&#34;&gt;&lt;code&gt;https://gitlab.com/jozefhajnala/jhaddins.git&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://raw.githubusercontent.com/rstudio/cheatsheets/master/rstudio-ide.pdf&#34;&gt;RStudio IDE cheat sheet&lt;/a&gt; (4.4MB, pdf)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://rviews.rstudio.com/2016/11/11/easy-tricks-you-mightve-missed/&#34;&gt;RStudio IDE tricks&lt;/a&gt; you might have missed&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/resources/webinars/understanding-add-ins/&#34;&gt;Understanding Addins&lt;/a&gt; - A fantastic webinar, where you can learn how to write and setup addins step-by-step&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>R:case4base - data subsetting and manipulation with base R</title>
      <link>https://jozef.io/r002-data-manipulation/</link>
      <pubDate>Sat, 21 Apr 2018 00:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r002-data-manipulation/</guid>
      <description>


&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#how-to-use-this-article&#34;&gt;How to use this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#selecting-subsetting-relevant-data-from-a-data.frame&#34;&gt;Selecting (subsetting) relevant data from a data.frame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#constructing-meaningful-subsets-simply-and-safely&#34;&gt;Constructing meaningful subsets simply and safely&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#more-ways-to-provide-subset-indices&#34;&gt;More ways to provide subset indices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-want-the-code&#34;&gt;TL;DR - Just want the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#exercises&#34;&gt;Exercises&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#exercise-answers&#34;&gt;Exercise answers&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the &lt;a href=&#34;..\r001-reshape&#34;&gt;previous article&lt;/a&gt; we discussed and learned how to &lt;a href=&#34;..\r001-reshape&#34;&gt;reshape data with base R&lt;/a&gt; to a form that is practical for our use. In this one, we will look at basic data manipulation techniques, namely obtaining relevant subsets of our data. The key will be safety and avoiding complication and confusion as much as possible. This is why we:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;try to avoid using &lt;code&gt;subset&lt;/code&gt;, as this function is implemented via non-standard evaluation.&lt;/li&gt;
&lt;li&gt;also skip &lt;code&gt;$&lt;/code&gt;, as it uses partial matching and cannot be used directly with column names stored in variables.&lt;/li&gt;
&lt;li&gt;do not go into more detail on the &lt;code&gt;list&lt;/code&gt; properties of &lt;code&gt;data.frames&lt;/code&gt; here, as the topic could get confusing. If you would like to go into more detail, we strongly recommend a thorough read of the &lt;a href=&#34;http://adv-r.had.co.nz/Subsetting.html&#34;&gt;subsetting&lt;/a&gt; chapter of Hadley Wickham’s &lt;a href=&#34;http://adv-r.had.co.nz/&#34;&gt;Advanced R&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
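&lt;p&gt;To illustrate the point about &lt;code&gt;$&lt;/code&gt;, here is a minimal sketch using a small made-up &lt;code&gt;data.frame&lt;/code&gt;. The names &lt;code&gt;df&lt;/code&gt; and &lt;code&gt;col&lt;/code&gt; are purely illustrative and not part of the data used in this article:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy data, made up for illustration only
df &amp;lt;- data.frame(country = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;), Y.2016 = c(1, 2), stringsAsFactors = FALSE)
col &amp;lt;- &amp;quot;Y.2016&amp;quot;

partial   &amp;lt;- df$cou    # partial matching returns the country column
by_dollar &amp;lt;- df$col    # NULL, $ looks for a column literally named &amp;quot;col&amp;quot;
by_name   &amp;lt;- df[, col] # the values of Y.2016, as intended&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With square brackets, the column name can safely come from a variable, which is why we stick to them in the rest of the article.&lt;/p&gt;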
&lt;/div&gt;
&lt;div id=&#34;how-to-use-this-article&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How to use this article&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a &lt;code&gt;case4base&lt;/code&gt; after all!&lt;/li&gt;
&lt;li&gt;If you have no time for reading, you can &lt;a href=&#34;https://jozef.io/post/data/r002-data-manipulation.r&#34;&gt;click here to get just the code with commentary&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;First, let’s read yearly data on the gross disposable income of households in the EU countries into R (&lt;a href=&#34;https://jozef.io/post/data/ESA2010_GDI.csv&#34;&gt;click here to download&lt;/a&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi &amp;lt;- read.csv(
  stringsAsFactors = FALSE
, url(&amp;quot;https://jozef.io/post/data/ESA2010_GDI.csv&amp;quot;)
              )
head(gdi[, 1:6, drop = FALSE])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;selecting-subsetting-relevant-data-from-a-data.frame&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Selecting (subsetting) relevant data from a &lt;code&gt;data.frame&lt;/code&gt;&lt;/h1&gt;
&lt;p&gt;In this section, we will try to show how to subset with as little hassle as possible while preserving the maximum safety in your operations. We shall go into more detail later in the article. The standard approach to subsetting &lt;code&gt;data.frames&lt;/code&gt; can be summarised as:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;dataframe_name[row_subset, col_subset, drop = FALSE]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dataframe_name&lt;/code&gt; is the name of the &lt;code&gt;data.frame&lt;/code&gt; we are subsetting&lt;/li&gt;
&lt;li&gt;&lt;code&gt;row_subset&lt;/code&gt; is a vector specifying the subset of rows&lt;/li&gt;
&lt;li&gt;&lt;code&gt;col_subset&lt;/code&gt; is a vector specifying the subset of columns&lt;/li&gt;
&lt;li&gt;&lt;code&gt;drop = FALSE&lt;/code&gt; makes sure the result does not get simplified when this is not intended. This should always be used, unless we specifically want to simplify the result (e.g. to a vector for indexing)&lt;/li&gt;
&lt;/ul&gt;
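&lt;p&gt;As a minimal sketch of what &lt;code&gt;drop = FALSE&lt;/code&gt; changes, consider a small made-up &lt;code&gt;data.frame&lt;/code&gt; (the object &lt;code&gt;toy&lt;/code&gt; and its values are assumptions for illustration, not the Eurostat data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy data, made up for illustration only
toy &amp;lt;- data.frame(
  country = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;C&amp;quot;)
, Y.2015 = c(1, 2, 3)
, Y.2016 = c(4, 5, 6)
, stringsAsFactors = FALSE
)

simplified &amp;lt;- toy[, &amp;quot;Y.2016&amp;quot;]               # simplified to a numeric vector
kept       &amp;lt;- toy[, &amp;quot;Y.2016&amp;quot;, drop = FALSE] # still a data.frame with 1 column

is.data.frame(simplified)
is.data.frame(kept)
dim(kept)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Selecting a single column without &lt;code&gt;drop = FALSE&lt;/code&gt; silently changes the type of the result, which can break code that expects a &lt;code&gt;data.frame&lt;/code&gt; further down the line.&lt;/p&gt;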
&lt;/div&gt;
&lt;div id=&#34;constructing-meaningful-subsets-simply-and-safely&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Constructing meaningful subsets simply and safely&lt;/h1&gt;
&lt;p&gt;In practice, most of the time we will of course not select rows and/or columns with positions known a priori, but based on more variable conditions. For this purpose, the advised way is to construct logical vectors.&lt;/p&gt;
&lt;p&gt;Let us now subset the rows of our data to get the countries that have a known (not &lt;code&gt;NA&lt;/code&gt;) value in the year 2016 that is less than 1 million:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowidx &amp;lt;- !is.na(gdi[, &amp;quot;Y.2016&amp;quot;]) &amp;amp; gdi[, &amp;quot;Y.2016&amp;quot;] &amp;lt; 1000000
gdi[rowidx, c(1, 23), drop = FALSE]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        country    Y.2016
## 3      Belgium 243825.50
## 4     Bulgaria  60237.00
## 8      Estonia  12548.30
## 9      Ireland  97318.90
## 11       Spain 698701.00
## 13     Croatia      0.00
## 16      Latvia  15737.79
## 17   Lithuania  24743.49
## 18  Luxembourg  20155.80
## 21 Netherlands 357383.00
## 22     Austria 214980.60
## 24    Portugal 128789.39
## 26    Slovenia  24756.63
## 27    Slovakia  48882.91
## 28     Finland 126590.00
## 33 Switzerland 458641.00&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that when creating the &lt;code&gt;rowidx&lt;/code&gt; we omitted the &lt;code&gt;drop = FALSE&lt;/code&gt; despite the aforementioned best practice. This is because in this particular case we consciously welcome the result being simplified to a vector, as its use is only as an index for subsetting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;more-ways-to-provide-subset-indices&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;More ways to provide subset indices&lt;/h1&gt;
&lt;p&gt;Subsetting can be done in a few ways. We will now use each of them to select the first two and the 27th rows and the first, 22nd and 23rd columns, giving us the GDI for EU28, Euro Area 19 and Slovakia in the years 2015 and 2016:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Logical vectors: &lt;code&gt;TRUE&lt;/code&gt; for rows/columns to keep, &lt;code&gt;FALSE&lt;/code&gt; for those to omit&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;st1 &amp;lt;- gdi[c(TRUE, TRUE, rep(FALSE, 24), TRUE, rep(FALSE, 8))
         , c(TRUE, rep(FALSE, 20), rep(TRUE, 2))
         , drop = FALSE
         ]&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Numeric vectors of row/column numbers to subset&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;st2 &amp;lt;- gdi[c(1:2, 27) 
         , c(1, 22:23)
         , drop = FALSE
         ]&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Negative numeric vectors of row/column numbers to omit&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;st3 &amp;lt;- gdi[c(-3:-26, -28:-35)
         , c(-2:-21)
         , drop = FALSE
         ]&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Character vectors of row/column names to subset&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;st4 &amp;lt;- gdi[c(&amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;27&amp;quot;) # we do not have very meaningful rownames
         , c(&amp;quot;country&amp;quot;, &amp;quot;Y.2015&amp;quot;, &amp;quot;Y.2016&amp;quot;)
         , drop = FALSE
         ]
st4&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         country     Y.2015     Y.2016
## 1         EU 28 9439578.39 9454683.60
## 2  Euro area 19 6598231.27 6736686.43
## 27     Slovakia   47464.71   48882.91&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;All of the above give identical results&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;identical(st1, st2) &amp;amp;&amp;amp; identical(st2, st3) &amp;amp;&amp;amp; identical(st3, st4)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;tips&#34;&gt;Tips&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The above methods also work and are safe for matrices&lt;/li&gt;
&lt;li&gt;Negative and positive numeric vectors cannot be combined&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
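&lt;p&gt;Both tips can be tried out on a small made-up matrix. This is only a sketch, the object &lt;code&gt;m&lt;/code&gt; is an assumption for illustration and &lt;code&gt;tryCatch&lt;/code&gt; is used just to capture the error instead of stopping the script:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy matrix, made up for illustration only
m &amp;lt;- matrix(1:6, nrow = 3, dimnames = list(NULL, c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;)))

row_kept       &amp;lt;- m[2, , drop = FALSE] # still a matrix with 1 row and 2 columns
row_simplified &amp;lt;- m[2, ]               # simplified to a named vector

# combining negative and positive indices is an error
mixed &amp;lt;- tryCatch(m[c(-1, 2), ], error = function(e) &amp;quot;error&amp;quot;)
mixed&lt;/code&gt;&lt;/pre&gt;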
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/select&#34;&gt;dplyr::select&lt;/a&gt; and &lt;a href=&#34;https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/filter&#34;&gt;dplyr::filter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Using &lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html&#34;&gt;data.table&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-want-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just want the code&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;No time for reading? &lt;a href=&#34;https://jozef.io/post/data/r002-data-manipulation.r&#34;&gt;Click here to get just the code with commentary&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;exercises&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exercises&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;What is the difference between &lt;code&gt;gdi[3, 3]&lt;/code&gt; and &lt;code&gt;gdi[3, 3, drop = FALSE]&lt;/code&gt; ?&lt;/li&gt;
&lt;li&gt;What is the difference between &lt;code&gt;gdi[-3, 3]&lt;/code&gt; and &lt;code&gt;gdi[3, -3]&lt;/code&gt; ? What about &lt;code&gt;gdi[-3, 3, drop = FALSE]&lt;/code&gt; ?&lt;/li&gt;
&lt;li&gt;Why can we not omit the first part of the &amp;amp; in &lt;code&gt;rowidx &amp;lt;- !is.na(gdi[, &amp;quot;Y.2016&amp;quot;]) &amp;amp; gdi[, &amp;quot;Y.2016&amp;quot;] &amp;lt; 1000000&lt;/code&gt;? What would happen if we just did &lt;code&gt;rowidx &amp;lt;- gdi[, &amp;quot;Y.2016&amp;quot;] &amp;lt; 1000000&lt;/code&gt; ?&lt;/li&gt;
&lt;li&gt;Bonus question 1: Why is &lt;code&gt;identical(gdi[, &amp;quot;Y.2016&amp;quot;, drop = FALSE], gdi[&amp;quot;Y.2016&amp;quot;])&lt;/code&gt; &lt;code&gt;TRUE&lt;/code&gt; ?&lt;/li&gt;
&lt;li&gt;Bonus question 2: Why is &lt;code&gt;identical(gdi[, &amp;quot;Y.2016&amp;quot;], gdi[[&amp;quot;Y.2016&amp;quot;]])&lt;/code&gt; &lt;code&gt;TRUE&lt;/code&gt; ?&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Advanced R’s chapter on &lt;a href=&#34;http://adv-r.had.co.nz/Subsetting.html&#34;&gt;subsetting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;and on &lt;a href=&#34;http://adv-r.had.co.nz/Subsetting.html#data-types&#34;&gt;data types&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://ec.europa.eu/eurostat/web/sector-accounts/data/annual-data&#34;&gt;original eurostat data source&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;exercise-answers&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exercise answers&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://jozef.io/post/data/r002-data-manipulation.r&#34;&gt;At the bottom of the code for the article&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>R:case4base - reshape data with base R</title>
      <link>https://jozef.io/r001-reshape/</link>
      <pubDate>Sat, 07 Apr 2018 00:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r001-reshape/</guid>
      <description>


&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#how-to-use-this-article&#34;&gt;How to use this article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#basic-wide-to-long-reshape&#34;&gt;Basic wide to long reshape&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#basic-long-to-wide-reshape&#34;&gt;Basic long to wide reshape&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#advanced-reshape&#34;&gt;Advanced reshape&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternatives-to-base-r&#34;&gt;Alternatives to base R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tldr---just-want-the-code&#34;&gt;TL;DR - Just want the code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#exercises&#34;&gt;Exercises&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#exercise-answers&#34;&gt;Exercise answers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#discuss-the-article&#34;&gt;Discuss the article&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is the first post in the R:case4base series. The aim of the series is to elaborate on very useful features of base R that are lesser known and often substituted by custom functionality from external packages.&lt;/p&gt;
&lt;p&gt;The simplest, yet probably one of the most common use cases would be to change the data from what is called “wide” shape to “long” shape. Base R offers a very good function for this very purpose. Meet &lt;code&gt;stats::reshape&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-to-use-this-article&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How to use this article&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a &lt;code&gt;case4base&lt;/code&gt; after all!&lt;/li&gt;
&lt;li&gt;If you have no time for reading, you can &lt;a href=&#34;https://jozef.io/post/data/r001-reshape.r&#34;&gt;click here to get just the code with commentary&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;basic-wide-to-long-reshape&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basic wide to long reshape&lt;/h1&gt;
&lt;p&gt;First, let’s read yearly data on the gross disposable income of households in the EU countries into R (&lt;a href=&#34;https://jozef.io/post/data/ESA2010_GDI.csv&#34;&gt;click here to download&lt;/a&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi &amp;lt;- read.csv(
  stringsAsFactors = FALSE
, url(&amp;quot;https://jozef.io/post/data/ESA2010_GDI.csv&amp;quot;)
              )
head(gdi[, 1:7])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0
##      Y.2000
## 1 6425313.4
## 2 4598956.1
## 3  161753.6
## 4   17676.4
## 5 1359309.0
## 6  639955.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To reshape our &lt;code&gt;data.frame&lt;/code&gt; from wide to long, all we have to do is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi_long &amp;lt;- reshape(data = gdi         # data.frame in wide format to be reshaped
                  , direction = &amp;quot;long&amp;quot; # we are going from wide to long
                  , varying = 2:23     # columns that will be stacked into 1
                  )

head(gdi_long)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##               country time        Y id
## 1.1995          EU 28 1995       NA  1
## 2.1995   Euro area 19 1995       NA  2
## 3.1995        Belgium 1995 140734.1  3
## 4.1995       Bulgaria 1995   1036.0  4
## 5.1995 Czech Republic 1995 894042.0  5
## 6.1995        Denmark 1995 566363.0  6&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before we get into the clean-up that will make the output &lt;code&gt;data.frame&lt;/code&gt; nice and tidy, let us first take a look at the arguments of the function that we have already used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;data&lt;/code&gt; - almost obviously, this is the &lt;code&gt;data.frame&lt;/code&gt; we want to reshape&lt;/li&gt;
&lt;li&gt;&lt;code&gt;varying&lt;/code&gt; - names or indices of columns which we want to stack on each other into a single column&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;h4 id=&#34;tip&#34;&gt;Tip&lt;/h4&gt;
&lt;p&gt;We can see that R automatically recognizes the &lt;code&gt;Y&lt;/code&gt; and the years that get translated into the &lt;code&gt;time&lt;/code&gt; column. This is because the column names are in a format that reshape can guess automatically: &lt;code&gt;[string].[integer]&lt;/code&gt;, in our case &lt;code&gt;&amp;quot;Y.1996&amp;quot;&lt;/code&gt;, &lt;code&gt;&amp;quot;Y.1997&amp;quot;&lt;/code&gt;, etc.
Keeping this naming convention for your column names before reshaping has a lot of benefits. If your names have a different character between the &lt;code&gt;[string]&lt;/code&gt; and the &lt;code&gt;[integer]&lt;/code&gt; (for example &lt;code&gt;&amp;quot;something_1996&amp;quot;&lt;/code&gt;, &lt;code&gt;&amp;quot;something_1997&amp;quot;&lt;/code&gt;), you can specify this character with the &lt;code&gt;sep&lt;/code&gt; argument (e.g. &lt;code&gt;sep = &amp;quot;_&amp;quot;&lt;/code&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
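&lt;p&gt;A minimal sketch of the &lt;code&gt;sep&lt;/code&gt; argument in action, using a tiny made-up &lt;code&gt;data.frame&lt;/code&gt; (the object &lt;code&gt;toy_wide&lt;/code&gt; and its values are assumptions for illustration only):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy data, made up for illustration only
toy_wide &amp;lt;- data.frame(
  country = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;)
, something_1996 = c(1, 2)
, something_1997 = c(3, 4)
, stringsAsFactors = FALSE
)

toy_long &amp;lt;- reshape(data = toy_wide
                  , direction = &amp;quot;long&amp;quot;
                  , varying = 2:3  # columns that will be stacked into 1
                  , sep = &amp;quot;_&amp;quot;      # character between [string] and [integer]
                  )
toy_long&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The names are split on the underscore, so the stacked values should end up in a column called &lt;code&gt;something&lt;/code&gt; and the years in the &lt;code&gt;time&lt;/code&gt; column, just like with the default separator.&lt;/p&gt;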
&lt;p&gt;Now looking back at the reshaped &lt;code&gt;gdi_long&lt;/code&gt;, we see that the reshape worked, however there are 4 improvements that we can make by providing the function with more arguments:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;the &lt;code&gt;id&lt;/code&gt; column, which is not particularly useful this way&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;Y&lt;/code&gt; column, which does have the correct data, however we would perhaps like to call it something a bit more descriptive&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;time&lt;/code&gt; column, which could be named differently&lt;/li&gt;
&lt;li&gt;we may want to update the values in the &lt;code&gt;time&lt;/code&gt; column to something custom&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi_long_full &amp;lt;- reshape(data = gdi         # data.frame in wide format to be reshaped
                       , direction = &amp;quot;long&amp;quot; # still going from wide to long
                       , varying = 2:23     # columns that will be stacked into 1
                       , idvar = &amp;quot;country&amp;quot;  # what identifies the rows?
                       , v.names = &amp;quot;GDI&amp;quot;    # how will the column with values be called
                       , timevar = &amp;quot;year&amp;quot;   # how will the time column be called
                       , times = 1995:2016  # what are the values for the timevar column
                       )
head(gdi_long_full)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                            country year      GDI
## EU 28.1995                   EU 28 1995       NA
## Euro area 19.1995     Euro area 19 1995       NA
## Belgium.1995               Belgium 1995 140734.1
## Bulgaria.1995             Bulgaria 1995   1036.0
## Czech Republic.1995 Czech Republic 1995 894042.0
## Denmark.1995               Denmark 1995 566363.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We easily see the solution to our 4 improvements:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;specify &lt;code&gt;idvar = &amp;quot;country&amp;quot;&lt;/code&gt; argument, as this column identifies the subjects in the rows&lt;/li&gt;
&lt;li&gt;specify &lt;code&gt;v.names = &amp;quot;GDI&amp;quot;&lt;/code&gt; argument, as this will rename the column with values (our values are gross disposable income)&lt;/li&gt;
&lt;li&gt;specify &lt;code&gt;timevar = &amp;quot;year&amp;quot;&lt;/code&gt; argument, as our time is actually years (the data is measured on a yearly basis)&lt;/li&gt;
&lt;li&gt;specify &lt;code&gt;times = 1995:2016&lt;/code&gt; argument, this is shown just for completeness, we could for example do &lt;code&gt;times = -21:0&lt;/code&gt; if we want the years to be measured relative to 2016 instead of as actual years&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;basic-long-to-wide-reshape&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basic long to wide reshape&lt;/h1&gt;
&lt;p&gt;Now that we have the wide to long reshape done, the reshape from long to wide format is a formality. It works exactly the same way, we just switch the arguments around a bit:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gdi_wide &amp;lt;- reshape(gdi_long_full      # data.frame in long format to be reshaped  
                  , direction = &amp;quot;wide&amp;quot; # going from long to wide this time
                  , idvar = &amp;quot;country&amp;quot;  # identifying the subject in rows
                  , timevar = &amp;quot;year&amp;quot;   # column with values that will change to columns
                  , v.names = &amp;quot;GDI&amp;quot;    # column with the values
                  )
head(gdi_wide[, 1:7, drop = FALSE])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                            country GDI.1995  GDI.1996  GDI.1997  GDI.1998
## EU 28.1995                   EU 28       NA        NA        NA        NA
## Euro area 19.1995     Euro area 19       NA        NA        NA        NA
## Belgium.1995               Belgium 140734.1  141599.4  145023.2  149705.2
## Bulgaria.1995             Bulgaria   1036.0    1468.1   12367.4   14921.1
## Czech Republic.1995 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0
## Denmark.1995               Denmark 566363.0  578102.0  591416.0  621236.0
##                      GDI.1999  GDI.2000
## EU 28.1995          5982392.8 6425313.4
## Euro area 19.1995   4393727.3 4598956.1
## Belgium.1995         153804.0  161753.6
## Bulgaria.1995         16052.8   17676.4
## Czech Republic.1995 1280040.0 1359309.0
## Denmark.1995         614893.0  639955.0&lt;/code&gt;&lt;/pre&gt;
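&lt;p&gt;Note that the row names of the result still carry the suffix from the long format (e.g. &lt;code&gt;Belgium.1995&lt;/code&gt;). A minimal sketch of a full round trip on a tiny made-up &lt;code&gt;data.frame&lt;/code&gt;, with the row names reset at the end, could look as follows (the objects &lt;code&gt;toy&lt;/code&gt;, &lt;code&gt;toy_long&lt;/code&gt; and &lt;code&gt;toy_back&lt;/code&gt; are assumptions for illustration only):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy data, made up for illustration only
toy &amp;lt;- data.frame(
  country = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;)
, Y.2015 = c(1, 2)
, Y.2016 = c(3, 4)
, stringsAsFactors = FALSE
)

# wide to long, same arguments as used for gdi_long_full
toy_long &amp;lt;- reshape(data = toy
                  , direction = &amp;quot;long&amp;quot;
                  , varying = 2:3
                  , idvar = &amp;quot;country&amp;quot;
                  , v.names = &amp;quot;GDI&amp;quot;
                  , timevar = &amp;quot;year&amp;quot;
                  , times = 2015:2016
                  )

# long to wide, same arguments as used for gdi_wide
toy_back &amp;lt;- reshape(toy_long
                  , direction = &amp;quot;wide&amp;quot;
                  , idvar = &amp;quot;country&amp;quot;
                  , timevar = &amp;quot;year&amp;quot;
                  , v.names = &amp;quot;GDI&amp;quot;
                  )

rownames(toy_back) &amp;lt;- NULL # replace e.g. &amp;quot;A.2015&amp;quot; with plain 1, 2, ...
toy_back&lt;/code&gt;&lt;/pre&gt;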
&lt;/div&gt;
&lt;div id=&#34;advanced-reshape&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Advanced reshape&lt;/h1&gt;
&lt;p&gt;Let us now examine a bit more advanced reshape with some more data. First, we will look at the generic setup. We now have data not just for the GDI, but for 3 measurements in the columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ConspC&lt;/code&gt; - in columns &lt;code&gt;X1995ConspC .. X2016ConspC&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AGDIpC&lt;/code&gt; - in columns &lt;code&gt;X1995AGDIpC .. X2016AGDIpC&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GrossSaving&lt;/code&gt; - in columns &lt;code&gt;X1995GrossSaving .. X2016GrossSaving&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;more_notpretty &amp;lt;- read.csv(
  stringsAsFactors = FALSE
, file = &amp;quot;https://jozef.io/post/data/ESA2010_not_pretty.csv&amp;quot;
)
head(more_notpretty[, 1:5, drop = FALSE])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          country X1995ConspC X1996ConspC X1997ConspC X1998ConspC
## 1          EU 28          NA          NA          NA          NA
## 2   Euro area 19          NA          NA          NA          NA
## 3        Belgium    18168.83    18634.68    18867.78    19334.14
## 4       Bulgaria          NA     3777.06     3163.05     3326.24
## 5 Czech Republic   148721.29   159428.17   162742.83   161855.85
## 6        Denmark   176096.32   179576.05   182940.60   187630.27&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since these data do not have column names that R would be able to guess automatically, we will have to provide quite a few arguments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;varying&lt;/code&gt; as a list of vectors, each specifying the columns for one varying variable&lt;/li&gt;
&lt;li&gt;&lt;code&gt;v.names&lt;/code&gt; as a vector of names for those variables&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;more_notpretty_long &amp;lt;- reshape(data = more_notpretty
                             , direction = &amp;quot;long&amp;quot;
                             , varying = list(2:23
                                            , 24:45
                                            , 46:67
                                            )
                             , timevar = &amp;quot;year&amp;quot;
                             , times = 1995:2016
                             , idvar = &amp;quot;country&amp;quot;
                             , v.names = c(&amp;quot;ConspC&amp;quot;
                                         , &amp;quot;AGDIpC&amp;quot;
                                         , &amp;quot;GrossSaving&amp;quot;
                                         )
                             )
head(more_notpretty_long)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                            country year    ConspC    AGDIpC GrossSaving
## EU 28.1995                   EU 28 1995        NA        NA          NA
## Euro area 19.1995     Euro area 19 1995        NA        NA          NA
## Belgium.1995               Belgium 1995  18168.83  21577.92     27350.1
## Bulgaria.1995             Bulgaria 1995        NA        NA       448.4
## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46    116646.0
## Denmark.1995               Denmark 1995 176096.32 179741.30     42398.0&lt;/code&gt;&lt;/pre&gt;
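&lt;p&gt;To make the pairing between the elements of &lt;code&gt;varying&lt;/code&gt; and &lt;code&gt;v.names&lt;/code&gt; easier to follow, here is the same idea sketched on a tiny made-up &lt;code&gt;data.frame&lt;/code&gt; with just two measurements and two years (the object &lt;code&gt;toy_np&lt;/code&gt; and the measurement names &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; are assumptions for illustration only):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy data, made up for illustration only
toy_np &amp;lt;- data.frame(
  country = c(&amp;quot;P&amp;quot;, &amp;quot;Q&amp;quot;)
, X2015A = c(1, 2)
, X2016A = c(3, 4)
, X2015B = c(10, 20)
, X2016B = c(30, 40)
, stringsAsFactors = FALSE
)

toy_np_long &amp;lt;- reshape(data = toy_np
                     , direction = &amp;quot;long&amp;quot;
                     , varying = list(2:3 # columns stacked into A
                                    , 4:5 # columns stacked into B
                                    )
                     , v.names = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;) # same order as the list above
                     , timevar = &amp;quot;year&amp;quot;
                     , times = 2015:2016
                     , idvar = &amp;quot;country&amp;quot;
                     )
toy_np_long&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first element of the &lt;code&gt;varying&lt;/code&gt; list goes with the first name in &lt;code&gt;v.names&lt;/code&gt;, the second with the second, and within each element the columns need to be ordered consistently with &lt;code&gt;times&lt;/code&gt;.&lt;/p&gt;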
&lt;p&gt;Now let us showcase how easy the reshape is if we adhere to R’s favourite column naming with the same data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;more_pretty &amp;lt;- read.csv(
  stringsAsFactors = FALSE
, file = &amp;quot;https://jozef.io/post/data/ESA2010_pretty.csv&amp;quot;
)
head(more_pretty[, 1:5, drop = FALSE])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          country ConspC.1995 ConspC.1996 ConspC.1997 ConspC.1998
## 1          EU 28          NA          NA          NA          NA
## 2   Euro area 19          NA          NA          NA          NA
## 3        Belgium    18168.83    18634.68    18867.78    19334.14
## 4       Bulgaria          NA     3777.06     3163.05     3326.24
## 5 Czech Republic   148721.29   159428.17   162742.83   161855.85
## 6        Denmark   176096.32   179576.05   182940.60   187630.27&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We tell R only the information it strictly needs, same as with the simple reshape:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;more_pretty_long &amp;lt;- reshape(data = more_pretty
                           , direction = &amp;quot;long&amp;quot;
                           , varying = 2:67
                           , idvar = &amp;quot;country&amp;quot;
                           )
head(more_pretty_long)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                            country time    ConspC    AGDIpC GrossSaving
## EU 28.1995                   EU 28 1995        NA        NA          NA
## Euro area 19.1995     Euro area 19 1995        NA        NA          NA
## Belgium.1995               Belgium 1995  18168.83  21577.92     27350.1
## Bulgaria.1995             Bulgaria 1995        NA        NA       448.4
## Czech Republic.1995 Czech Republic 1995 148721.29 166316.46    116646.0
## Denmark.1995               Denmark 1995 176096.32 179741.30     42398.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That was really easy and we got the desired result!&lt;/p&gt;
&lt;p&gt;Now as the very last example, we may want to get the data into an even longer form, by also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;considering the actual variables we are measuring as &lt;code&gt;varying&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;turning their names into &lt;code&gt;times&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;using &lt;code&gt;measurement&lt;/code&gt; as the name for &lt;code&gt;timevar&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;more_longer &amp;lt;- reshape(data = more_pretty_long
                    , direction = &amp;quot;long&amp;quot;
                    , varying = 3:5
                    , timevar = &amp;quot;measurement&amp;quot;
                    , times = names(more_pretty_long[, 3:5])
                    , v.names = &amp;quot;Value&amp;quot;
                    )
head(more_longer)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                 country time measurement     Value id
## 1.ConspC          EU 28 1995      ConspC        NA  1
## 2.ConspC   Euro area 19 1995      ConspC        NA  2
## 3.ConspC        Belgium 1995      ConspC  18168.83  3
## 4.ConspC       Bulgaria 1995      ConspC        NA  4
## 5.ConspC Czech Republic 1995      ConspC 148721.29  5
## 6.ConspC        Denmark 1995      ConspC 176096.32  6&lt;/code&gt;&lt;/pre&gt;
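&lt;p&gt;The two steps above can also be tried on a tiny data set. The following is only a minimal sketch: the &lt;code&gt;toy&lt;/code&gt; data frame, its column names and its values are made up for illustration and are not part of the Eurostat data. It assumes the wide columns are named in the &lt;code&gt;variable.year&lt;/code&gt; pattern, which &lt;code&gt;reshape&lt;/code&gt; splits at the dot by default.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical toy data: one row per country, one column per variable and year
toy &amp;lt;- data.frame(country = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;)
                , gdp.2000 = c(1, 2)
                , pop.2000 = c(10, 20)
                , gdp.2001 = c(3, 4)
                , pop.2001 = c(30, 40)
                , stringsAsFactors = FALSE
                )

# Step 1: wide to long, the years move from the column names into time
toy_long &amp;lt;- reshape(data = toy
                  , direction = &amp;quot;long&amp;quot;
                  , varying = 2:5
                  , idvar = &amp;quot;country&amp;quot;
                  )

# Step 2: even longer, the variable names move into measurement
toy_longer &amp;lt;- reshape(data = toy_long
                    , direction = &amp;quot;long&amp;quot;
                    , varying = 3:4
                    , timevar = &amp;quot;measurement&amp;quot;
                    , times = names(toy_long)[3:4]
                    , v.names = &amp;quot;Value&amp;quot;
                    )

names(toy_long)
names(toy_longer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the first step there should be one row per country and year, after the second one row per country, year and measured variable.&lt;/p&gt;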
&lt;/div&gt;
&lt;div id=&#34;alternatives-to-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternatives to base R&lt;/h1&gt;
&lt;p&gt;There are many alternatives to the base functionality, each with its own pros and cons. Here is a selection of three, in no particular order:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;melt&lt;/code&gt; and &lt;code&gt;dcast&lt;/code&gt; from the &lt;a href=&#34;https://cran.r-project.org/web/packages/reshape2/index.html&#34;&gt;reshape2 package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gather&lt;/code&gt; and &lt;code&gt;spread&lt;/code&gt; from the &lt;a href=&#34;https://cran.r-project.org/web/packages/tidyr/index.html&#34;&gt;tidyr package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;melt&lt;/code&gt; and &lt;code&gt;dcast&lt;/code&gt; from the &lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/index.html&#34;&gt;data.table package&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;tldr---just-want-the-code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR - Just want the code&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;No time for reading? &lt;a href=&#34;https://jozef.io/post/data/r001-reshape.r&#34;&gt;Click here to get just the code with commentary&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;exercises&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exercises&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;At the beginning of the article, our data had countries in rows and yearly data in columns. Reshape the data so that the countries are in columns and the years are in rows.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reshape(reshape(gdi_long_full))&lt;/code&gt; gives us a data.frame equivalent to &lt;code&gt;gdi_long_full&lt;/code&gt;, even though we call the function twice with no extra arguments, just the data. What kind of sorcery is this? Why don’t we need to provide at least the &lt;code&gt;direction&lt;/code&gt;, or the &lt;code&gt;varying&lt;/code&gt; arguments?&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;http://ec.europa.eu/eurostat/web/sector-accounts/data/annual-data&#34;&gt;Original Eurostat data source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/reshape&#34;&gt;stats::reshape at rdocumentation.org&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;exercise-answers&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exercise answers&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://jozef.io/post/data/r001-reshape.r&#34;&gt;At the bottom of the code for the article&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>R:case4base - about the series</title>
      <link>https://jozef.io/r000-about-case4base/</link>
      <pubDate>Sat, 24 Mar 2018 00:00:00 +0000</pubDate>
      
      <guid>https://jozef.io/r000-about-case4base/</guid>
      <description>


&lt;div id=&#34;contents&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Contents&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;#what-is-does-this-series-offer&#34;&gt;What does this series offer?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-is-considered-base-r&#34;&gt;What is considered base R?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#discuss-the-article&#34;&gt;Discuss the article&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-does-this-series-offer&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;What does this series offer?&lt;/h1&gt;
&lt;p&gt;This is the introduction to the &lt;code&gt;R:case4base&lt;/code&gt; series. The aim of the series is to elaborate on very useful features of base R that are lesser known and often replaced by custom functionality from external packages.
The motivation behind the series is to provide useful, easy-to-read information on using these features, from the basics to the more advanced topics related to them.&lt;/p&gt;
&lt;p&gt;Usually, one article in the series will&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cover one such feature&lt;/li&gt;
&lt;li&gt;follow a learning path that starts with the basics and continues with more advanced topics, using examples and simple explanations, at the cost of rigor&lt;/li&gt;
&lt;li&gt;come with an accompanying piece of fully portable R code that can be downloaded and played with, with no additional setup or packages needed&lt;/li&gt;
&lt;li&gt;come with a few exercises for those wanting to examine the code a bit more&lt;/li&gt;
&lt;li&gt;provide a list of references for further reading&lt;/li&gt;
&lt;li&gt;provide a list of alternatives to the base functionality in no particular order&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-considered-base-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;What is considered base R?&lt;/h1&gt;
&lt;p&gt;The list of packages considered &lt;code&gt;base&lt;/code&gt;, along with some basic information about them, can be retrieved by calling the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;installed.packages(priority = &amp;quot;base&amp;quot;)[, c(&amp;quot;Depends&amp;quot;, &amp;quot;Imports&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           Depends Imports                     
## base      NA      NA                          
## compiler  NA      NA                          
## datasets  NA      NA                          
## graphics  NA      &amp;quot;grDevices&amp;quot;                 
## grDevices NA      NA                          
## grid      NA      &amp;quot;grDevices, utils&amp;quot;          
## methods   NA      &amp;quot;utils, stats&amp;quot;              
## parallel  NA      &amp;quot;tools, compiler&amp;quot;           
## splines   NA      &amp;quot;graphics, stats&amp;quot;           
## stats     NA      &amp;quot;utils, grDevices, graphics&amp;quot;
## stats4    NA      &amp;quot;graphics, methods, stats&amp;quot;  
## tcltk     NA      &amp;quot;utils&amp;quot;                     
## tools     NA      NA                          
## utils     NA      NA&lt;/code&gt;&lt;/pre&gt;
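&lt;p&gt;To check whether a particular function belongs to one of these packages, we can look at the environment it was defined in. The following is a small sketch assuming a standard R installation. The &lt;code&gt;base_pkgs&lt;/code&gt; and &lt;code&gt;fun_pkgs&lt;/code&gt; names are introduced here only for illustration, and the approach works for regular functions only, not for primitives such as &lt;code&gt;sum&lt;/code&gt;, which have no environment.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Names of all installed packages with priority base
base_pkgs &amp;lt;- rownames(installed.packages(priority = &amp;quot;base&amp;quot;))

# The package in which each function is defined
fun_pkgs &amp;lt;- c(reshape = environmentName(environment(reshape))
            , head = environmentName(environment(head))
            )
fun_pkgs

# Are those packages considered base R?
fun_pkgs %in% base_pkgs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here &lt;code&gt;reshape&lt;/code&gt; is reported as coming from &lt;code&gt;stats&lt;/code&gt; and &lt;code&gt;head&lt;/code&gt; from &lt;code&gt;utils&lt;/code&gt;, both of which appear in the list above.&lt;/p&gt;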
&lt;/div&gt;
</description>
    </item>
    
    
    
    
    
    
  </channel>
</rss>
