The {sparklyr} package lets us connect to and use Apache Spark for high-performance, highly parallelized, and distributed computations. We can also use Spark's capabilities to improve and streamline our data processing pipelines, as Spark supports reading and writing from many popular sources such as Parquet and ORC, and from most database systems via JDBC drivers. In this post, we will explore using R to load data from relational database management systems such as MySQL, Oracle, and MS SQL Server into Spark (and optionally into R), and show how such processes can be simplified.
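As a quick illustration, here is a minimal sketch of what such a load could look like, assuming a local Spark installation and a hypothetical MySQL database; the connection details (URL, table name, credentials) are illustrative only:

```r
library(sparklyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Read a table from an RDBMS into Spark via JDBC
# (the JDBC driver .jar must be on Spark's classpath;
#  the url, dbtable, and credentials below are hypothetical)
tbl_spark <- spark_read_jdbc(
  sc,
  name = "flights_spark",
  options = list(
    url = "jdbc:mysql://localhost:3306/testdb",
    driver = "com.mysql.cj.jdbc.Driver",
    dbtable = "flights",
    user = "testuser",
    password = "pass"
  )
)

# Optionally collect the data from Spark into an R data frame
flights_r <- dplyr::collect(tbl_spark)

spark_disconnect(sc)
```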
In this part of the primer we discuss creating and using custom .jar archives within our R scripts and packages, handling Java exceptions from R, and take a quick look at a performance comparison between the low- and high-level interfaces provided by rJava. In the first part we talked about using the rJava package to create objects, call methods, and work with arrays; we also examined the various ways to call Java methods and how to call Java code from R directly via execution of shell commands.
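To set the scene, a minimal sketch of these topics follows. The myutils.jar archive is a hypothetical placeholder for a custom .jar; the low- versus high-level calls and the exception handling use standard rJava functions:

```r
library(rJava)

.jinit()                        # initialize the JVM
.jaddClassPath("myutils.jar")   # add a custom .jar to the classpath (hypothetical file)

# Low-level interface: .jcall with an explicit return-type signature
v1 <- .jcall("java/lang/System", "S", "getProperty", "java.version")

# High-level interface: J() resolves the same call via reflection
v2 <- J("java.lang.System")$getProperty("java.version")

# Handling a Java exception from R: rJava surfaces it as an R condition,
# so we can catch it with tryCatch. This constructor throws a
# NumberFormatException on purpose:
res <- tryCatch(
  .jnew("java/lang/Integer", "not-a-number"),
  error = function(e) {
    message("Caught Java exception: ", conditionMessage(e))
    NULL
  }
)
```

The low-level `.jcall` avoids reflection at the cost of spelling out JNI signatures, which is where the performance difference against the convenient `J()` interface comes from.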
This primer consists of two parts, and its goal is to provide a walk-through of using resources developed in Java from R. It is structured more as a "note to future self" than a proper educational article, but I hope that some readers may still find it useful. It will also list a set of references that I found very helpful, for which I thank the respective authors.