Introduction
Over the past months, we published and refined a series of posts on Using Spark from R for performance with arbitrary code. As the posts grew in size and scope, blog posts were no longer the best medium for sharing the content in the way most useful to readers, so we decided to compile a publication instead and open-source it for everyone to use freely.
In this post, we present Using Spark from R for performance, an open-source online publication that will serve as a medium to communicate the current and future installments of the series comprehensively, including instructions on how to use it and a Docker image with all the prerequisites needed to run the code examples.
Contents
Who is this book for?
The book is published at sparkfromr.com and is aimed at users who are interested in practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages. It explores the different interfaces available for communication between R and Spark using the sparklyr package.
We have also created a Docker image that lets you run the code in the book without having to set up the necessary software, such as Java, Spark, and the required R packages. A guide to using the book with that image is included as a separate chapter.
What are the main topics currently covered?
The main topics are summarized in the following chapters:
- Communication between Spark and sparklyr
- Non-translated functions with spark_apply
- Constructing functions by piping dplyr verbs
- Constructing SQL and executing it with Spark
- Using the lower-level invoke API to manipulate Spark’s Java objects from R
- Exploring the invoke API from R with Java reflection and examining invokes with logs
Are the sources also available?
Yes. The content is rendered and published automatically from publicly accessible Git repositories, where you can find the
- Content sources in the sparkfromr GitHub repository
- Rendered version in the sparkfrom_deployed GitHub repository
- Automatically built Docker image used to render the book, hosted on Docker Hub
- Sources used to build the Docker images in the sparkfrom_docker GitHub repository
All contributions to the above are of course most welcome.
Where can issues be raised?
If you find any errors or other issues with the book, or simply have requests for improvements or additional content, the ideal place to raise them is directly in the GitHub repositories:
- For issues in the content of the book, please raise an issue here
- For issues related to the Docker image, please raise an issue here
Acknowledgments and thank yous
Creating this book would not have been possible without many openly available resources, such as:
- The R packages around the rmarkdown ecosystem created by Yihui Xie, namely the bookdown package with which this publication is rendered
- The Rocker Project, which provides Docker images for the R environment thanks to Carl Boettiger, Dirk Eddelbuettel, and Noam Ross, and on which this project heavily relies
- Last but not least, the sparklyr package written by Javier Luraschi et al., the R programming language maintained by the R core group, and Apache Spark with its creators and maintainers, without which there would be nothing to write about in this short book
My thanks go to the creators and maintainers of all these amazing open-source tools.
Happy reading!