Introduction

Over the past months, we published and refined a series of posts on Using Spark from R for performance with arbitrary code. As the posts grew in size and scope, blog posts were no longer the best medium for sharing the content in the way most useful to readers, so we decided to compile a publication instead and open-source it for everyone to use freely.

In this post, we present Using Spark from R for performance, an open-source online publication that will serve as a medium to communicate the current and future installments of the series comprehensively, including instructions on how to use it and a Docker image with all the prerequisites needed to run the code examples.

Who is this book for?

The book is published at sparkfromr.com. It is aimed at users who want practical insights into using the sparklyr interface to gain the benefits of Apache Spark while retaining the ability to use R code organized in custom-built functions and packages. It explores the different interfaces available for communication between R and Spark using the sparklyr package.

We have also created a Docker image that lets you run the code in the book without having to set up the software requirements yourself, such as Java, Spark, and the necessary R packages. A guide to using the book with that image is included as a separate chapter.

What are the main topics currently covered?

The main topics are summarized in the following chapters:

Are the sources also available?

Yes. The content is rendered and published automatically from publicly accessible git repositories. You can find the following:

Content sources in the sparkfromr GitHub repository
Rendered version in the sparkfrom_deployed GitHub repository
Automatically built Docker image used to render the book on DockerHub
Sources used to build the Docker images in the sparkfrom_docker GitHub repository

All contributions to the above are of course most welcome.

Where can issues be raised?

If you find any errors or other issues with the book, or simply have requests for improvements or more content, the ideal place to raise them is directly in the GitHub repositories:

For issues in the content of the book, please raise an issue here
For issues related to the Docker image, please raise an issue here

Acknowledgments and thank yous

Creating this book would not have been possible without many openly available resources, such as:

The R packages around the rmarkdown ecosystem created by Yihui Xie, namely the bookdown package, with which this publication is rendered
The Rocker Project, which provides Docker images for the R environment, thanks to Carl Boettiger, Dirk Eddelbuettel, and Noam Ross, and on which this project relies heavily
Last but not least, the sparklyr package written by Javier Luraschi et al., the R programming language itself, maintained by the R core group, and Apache Spark with its creators and maintainers; without them there would be nothing to write about in this short book

My thanks go to the creators and maintainers of all these amazing open-source tools.

Logos of bookdown, Apache Spark and R

Happy reading!

Releasing and open-sourcing the Using Spark from R for performance with arbitrary code series

