In the previous parts of this series, we have shown how to write functions as combinations of dplyr verbs and as SQL query generators that can be executed by Spark, and how to use the lower-level API to invoke methods on Java object references from R. In this fifth part, we will look in more detail at sparklyr’s invoke() API, investigate the methods available for different classes of objects using the Java reflection API, and look under the hood of the sparklyr interface mechanism with invoke logging.
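As a minimal sketch of what this looks like (assuming a local Spark connection; the copied table is only for illustration), invoke() can be chained with Java reflection calls to list the methods available on the Java object underlying a Spark DataFrame:

```r
library(sparklyr)
library(dplyr)

# Assuming a local Spark instance; adjust master to your setup
sc <- spark_connect(master = "local")
tbl_mtcars <- copy_to(sc, mtcars, overwrite = TRUE)

# Use invoke() together with the Java reflection API to list methods
# available on the Java object underlying the Spark DataFrame
tbl_mtcars %>%
  spark_dataframe() %>%
  invoke("getClass") %>%
  invoke("getMethods") %>%
  head()
```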
In the previous parts of this series, we have shown how to write functions as both combinations of dplyr verbs and SQL query generators that can be executed by Spark, how to execute them with DBI, and how to achieve lazy SQL statements that only get executed when needed. In this fourth part, we will look at how to write R functions that interface with Spark via a lower-level invocation API, which lets us use all of the functionality exposed by the Scala Spark APIs.
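For a flavour of that lower-level API, here is a hedged sketch assuming a local Spark connection: invoke() calls methods on an existing Java object reference, while invoke_new() instantiates a Java object from R:

```r
library(sparklyr)

# Assuming a local Spark instance
sc <- spark_connect(master = "local")

# Call methods on the SparkContext Java object reference
spark_context(sc) %>%
  invoke("getConf") %>%
  invoke("toDebugString") %>%
  cat()

# Instantiate a Java object and call its methods
invoke_new(sc, "java.math.BigInteger", "1000000") %>%
  invoke("pow", 2L) %>%
  invoke("longValue")
```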
In the previous part of this series, we looked at writing R functions that can be executed directly by Spark without serialization overhead, focusing on writing functions as combinations of dplyr verbs, and investigated how the SQL is generated and the Spark plans are created. In this third part, we will look at how to write R functions that generate SQL queries that can be executed by Spark, how to execute them with DBI, and how to achieve lazy SQL statements that only get executed when needed.
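As a rough sketch of the two approaches (assuming a local Spark connection and an example table copied to Spark), DBI::dbGetQuery() executes a query eagerly, while dplyr::tbl() combined with dplyr::sql() stays lazy until the results are requested, for example with collect():

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Assuming a local Spark instance with an example table
sc <- spark_connect(master = "local")
copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)

# Eager: the SQL is sent and the results collected immediately
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars_spark GROUP BY cyl")

# Lazy: the SQL only gets executed once the results are needed
lazy_res <- tbl(sc, sql("SELECT cyl, mpg FROM mtcars_spark"))
collect(lazy_res)
```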
In the first part of this series, we looked at how the sparklyr interface communicates with the Spark instance and what this means for performance with regard to arbitrarily defined R functions. We also examined how Apache Arrow can increase the performance of data transfers between the R session and the Spark instance. In this second part, we will look at how to write R functions that can be executed directly by Spark without the serialization overhead discussed in the previous installment.
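As a brief, hedged illustration (assuming a local Spark connection), a function written purely as a combination of dplyr verbs is translated to Spark SQL and executed by Spark itself, so no R code has to be serialized to the cluster; show_query() reveals the generated SQL:

```r
library(sparklyr)
library(dplyr)

# Assuming a local Spark instance
sc <- spark_connect(master = "local")
tbl_mtcars <- copy_to(sc, mtcars, overwrite = TRUE)

# A function built only from dplyr verbs, translated to SQL and run by Spark
mean_mpg_by_cyl <- function(tbl) {
  tbl %>%
    group_by(cyl) %>%
    summarise(avg_mpg = mean(mpg, na.rm = TRUE))
}

mean_mpg_by_cyl(tbl_mtcars) %>% show_query()  # inspect the generated SQL
mean_mpg_by_cyl(tbl_mtcars) %>% collect()     # execute and return results to R
```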
Apache Spark is a popular open-source analytics engine for big data processing and, thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. This series of articles will attempt to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages. In this first part, we will examine how the sparklyr interface communicates with the Spark instance and what this means for performance with regard to arbitrarily defined R functions.
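To make the performance point concrete, the sketch below (assuming a local Spark connection; the function and conversion are only illustrative) runs an arbitrary R function on the Spark workers with spark_apply(), which requires the data to be serialized between the Spark and R processes:

```r
library(sparklyr)

# Assuming a local Spark instance
sc <- spark_connect(master = "local")
tbl_mtcars <- copy_to(sc, mtcars, overwrite = TRUE)

# An arbitrary R function executed on the workers via spark_apply();
# the rows must be serialized between the Spark and R processes
spark_apply(tbl_mtcars, function(df) {
  df$kml <- df$mpg * 0.425144  # miles per gallon to kilometres per litre
  df
})
```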
In the previous post, we focused on setting up declarative Jenkins pipelines, with an emphasis on parametrizing builds and using environment variables across pipeline stages. In this post, we look at various tips that can be useful when automating R application testing and continuous integration, with regard to orchestrating parallelization, combining sources from multiple git repositories and ensuring proper access rights for the Jenkins agent. There are numerous ways to achieve parallel computation in the context of an R application, some of them native to R itself.
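As a hedged sketch of one option native to R (using only the built-in parallel package; the cluster size and toy computation are illustrative), work can be spread across local worker processes like this:

```r
library(parallel)

# Use one less than the number of detected cores, but at least one
n_cores <- max(1L, detectCores() - 1L)
cl <- makeCluster(n_cores)

# Run an illustrative computation on the worker processes
res <- parLapply(cl, 1:8, function(i) i^2)
stopCluster(cl)

unlist(res)
```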
Jenkins is a popular open-source tool that helps teams with automation and the implementation of continuous integration and deployment pipelines, comparable to, for example, Atlassian’s Bamboo, GitLab CI or, to some extent, Travis. In this post, we share some practical lessons learned when integrating R applications with Jenkins for continuous integration and regression testing, with runner nodes configured via declarative pipelines defined in a Jenkinsfile.
Recently I was involved in a task that included reading and writing quite large amounts of data, totaling more than 1 TB worth of csvs, without the standard big data infrastructure. After trying multiple approaches, the one that made this possible was using data.table’s reading and writing facilities, fread() and fwrite(). This motivated me to benchmark data.table’s fread() and see how it compares to other options, such as tidyverse’s readr and base R, for reading tabular data from text files such as csvs.
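A rough sketch of such a benchmark might look as follows; the file path is a placeholder and the set of readers and number of repetitions are illustrative, not the exact setup from the post:

```r
library(data.table)
library(readr)
library(microbenchmark)

csv_path <- "large_file.csv"  # placeholder path to a large csv

microbenchmark(
  data_table = data.table::fread(csv_path),
  readr      = readr::read_csv(csv_path),
  base       = utils::read.csv(csv_path),
  times = 3
)
```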
As pointed out by a recent read the R source post on the R-hub website, reading the actual code, not just the documentation, is a great way to learn more about programming and implementation details. But there is one more activity that gives even more hands-on experience and understanding of the code in practice. In this post, we provide tips on how to interactively debug R code step by step and investigate the values of objects in the middle of function execution.
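As a small illustration of the base R tooling involved (the function here is a made-up example), debugonce() flags a function so that its next call opens the interactive browser, where we can step through the body line by line and inspect objects:

```r
# A toy function to step through
summarise_vector <- function(x) {
  x_clean <- x[!is.na(x)]
  avg <- mean(x_clean)
  avg
}

# Flag the function for one-off debugging; the next call opens the browser,
# where `n` steps to the next line, `ls()` lists objects in the frame,
# typing an object's name prints its current value and `Q` quits
debugonce(summarise_vector)
summarise_vector(c(1, 2, NA, 4))

# Alternatively, insert browser() into a function body to pause at that point
```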
As we wrote in Should you start your R blog now?, blogging has probably never been more accessible to the general population, R users included. Usually, the simplest solution is to host your blog via a service that provides it for free, such as Netlify, GitHub or GitLab Pages. But what if you want to host that awesome blog on your own, HTTPS-enabled domain? In this post, we will look at how to port a Hugo-based website, such as a blogdown blog, to our own domain, specifically focusing on GitLab Pages.