R is turning 20 years old next Saturday. Here is how much bigger, stronger and faster it got over the years

Introduction

It is almost the 29th of February 2020! A day that is very interesting for R, because it marks 20 years from the release of R v1.0.0, the first official public release of the R programming language.

In this post, we will look back on the 20 years of R with a bit of history and 3 interesting perspectives: how much faster R got over the years, how many R packages have been released since 2000, and how the number of package downloads has grown.

The first release of R, 29th February 2000

The first official public release of R happened on the 29th of February, 2000. In the release announcement, Peter Dalgaard notes:

“The release of a current major version indicates that we believe that R has reached a level of stability and maturity that makes it suitable for production use. Also, the release of 1.0.0 marks that the base language and the API for extension writers will remain stable for the foreseeable future. In addition we have taken the opportunity to tie up as many loose ends as we could.”

Today, 20 years later, it is quite amazing how true the statement around the API remaining stable has proven. The original release announcement and full release statement are still available online.

You can also still download the very first public version of R. For instance, for Windows you can find it on the Previous Releases of R for Windows page. And it is quite runnable, even under Windows 10.

Further down in history, to 1977

Now, to do R justice in terms of age, we need to go even further back in history. In the full release statement of R v1.0.0, we can find that:

R implements a dialect of the award-winning language S, developed at Bell Laboratories by John Chambers et al.

With some digging, we can use the Internet Archive's Wayback Machine to find interesting notes on Version 1 of S itself, written by John Chambers, where he writes:

Over the summer of 1976, some actual implementation began. The paper record has a gap over this period (maybe we were too busy coding to write things down). My recollection is that by early autumn, a language was available for local use on the Honeywell system in use at Murray Hill. Certainly by early 1977 there was software and a first version of a user’s manual.

As we can see, the ideas and principles behind R are actually much older than 20 years, and even older than 40 years. If you are interested in the history, I recommend watching the very interesting 40 years of S talk from useR! 2016.

Faster - How performant is R today versus 20 years ago?

With the 20th birthday of R approaching, I was curious how much faster the implementation of R got with each new version. I wrote a very simple benchmark that solves the Longest Collatz sequence problem for the first 1 million numbers with a brute-force-ish algorithm.

I then executed it on the same hardware using 20 different versions of R, starting with the very original 1.0, through 2.0 and 3.0, all the way to today’s development version.

Benchmarking code

Below is the code snippet with the implementation to be benchmarked:

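# Number of Collatz steps needed to get from n down to 1
col_len <- function(n) {
  len <- 0
  while (n > 1) {
    len <- len + 1
    if ((n %% 2) == 0)
      n <- n / 2
    else {
      # For odd n, 3n + 1 is always even, so halve it right away
      # and count that as a second step
      n <- (n * 3 + 1) / 2
      len <- len + 1
    }
  }
  len
}

# Time 10 runs of the search for the longest sequence below 1 million
res <- lapply(
  1:10,
  function(i) {
    gc()  # collect garbage first so earlier runs do not skew the timing
    system.time(
      max(sapply(seq(from = 1, to = 999999), col_len))
    )
  }
)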
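The odd branch of col_len folds two Collatz steps into one iteration: 3n + 1 is always even for odd n, so the halving is done straight away and the counter is bumped twice. A minimal sketch to check that this shortcut does not change the result, where col_len_naive is a helper name introduced here purely for illustration, compares it with a one-step-at-a-time version on a small range:

```r
# Copy of the benchmarked function, repeated so the snippet is self-contained
col_len <- function(n) {
  len <- 0
  while (n > 1) {
    len <- len + 1
    if ((n %% 2) == 0)
      n <- n / 2
    else {
      n <- (n * 3 + 1) / 2
      len <- len + 1
    }
  }
  len
}

# Hypothetical reference version: exactly one Collatz step per iteration
col_len_naive <- function(n) {
  len <- 0
  while (n > 1) {
    n <- if ((n %% 2) == 0) n / 2 else n * 3 + 1
    len <- len + 1
  }
  len
}

# Both versions should agree on every starting number in a small range
small <- 1:1000
same <- all(sapply(small, col_len) == sapply(small, col_len_naive))
print(same)

# 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 takes 8 steps
print(col_len(6))
```

The shortcut roughly halves the number of loop iterations spent on odd values, which matters in an interpreter where each iteration carries overhead.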

Results

Now to the interesting part, the results. The chart below shows boxplots of the time required to execute the code, in seconds, with R versions on the horizontal axis.

We can see that the median time to execute the above code to find the longest Collatz sequence amongst the first million numbers was:

  • February 2000: More than 17 minutes with the first R version, 1.0.0
  • January 2002: A large performance boost came already with the 1.4.1 release, decreasing the time by almost 4x, to around 4.5 minutes
  • October 2004: Even more interestingly, my measurements showed another big improvement with version 2.0.0, to just 168 seconds, less than 3 minutes. I was not, however, able to get such good results for any of the later 2.x versions
  • April 2014: Another speed improvement came 10 years later, with version 3.1 decreasing the time to around 145 seconds
  • April 2017: Finally, the 3.4 release brought another significant performance boost. From this version on, the time needed to perform this calculation is less than 30 seconds
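To put the figures above side by side, here is a small sketch that turns them into speedup factors. The inputs are the approximate medians quoted in the list, converted to seconds; since some of them are bounds ("more than", "less than"), the resulting factors are rough estimates rather than measurements.

```r
# Approximate median timings in seconds, taken from the list above.
# 17 and 4.5 minutes are converted to seconds; the first and last values
# are bounds, so the derived factors are only rough estimates.
timings <- c(
  "1.0.0" = 17 * 60,
  "1.4.1" = 4.5 * 60,
  "2.0.0" = 168,
  "3.1"   = 145,
  "3.4"   = 30
)

# Speedup of each version relative to R 1.0.0
speedup_vs_first <- round(timings[["1.0.0"]] / timings, 1)

# Speedup of each version relative to the previous one in the list
speedup_vs_prev <- round(timings[-length(timings)] / timings[-1], 1)
names(speedup_vs_prev) <- names(timings)[-1]

print(speedup_vs_first)
print(speedup_vs_prev)
```

With these inputs, the overall improvement from 1.0.0 to 3.4 works out to a factor of at least about 34, with the largest single jumps at 1.4.1 and 3.4.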

Some details and notes

The above is by no means a proper benchmarking solution and was run purely out of interest. The benchmarks were run:

  • on a Windows-based PC with an Intel Core (TM) i5-4590 processor and 8 GB of DDR3 1600 MHz RAM
  • using 32-bit versions of R, with no additional packages installed
  • with the following options for R 1.0.0: --vsize=900M --nsize=20000k

Some interesting notes on running the same code with a 20-year-old version of R:

  • There was no message() function available
  • Integer literals using the L suffix were not accepted
  • The function do.call() needed a character function name as the first argument
  • The = operator was not accepted for assignment. _ was, though ;-)

Other than that, the code ran with no issues across all the tested versions.
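For contrast, here is a short sketch of the four constructs from the list as they behave in a current R version. The variable names are illustrative only.

```r
# message() is available and writes a diagnostic message to stderr
message("Starting benchmark run")

# The L suffix creates an integer literal rather than a double
n_runs <- 10L
is_int <- is.integer(n_runs)

# do.call() accepts a function object as well as a function name
total_by_object <- do.call(sum, list(1, 2, 3))
total_by_name <- do.call("sum", list(1, 2, 3))

# `=` works as an assignment operator at the top level;
# the old `_` assignment is no longer valid syntax
answer = 42

print(c(is_int, total_by_object == total_by_name, answer == 42))
```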

Stronger - How many packages were released over the years?

The power of R comes in no small part from the fact that it is easily extensible and the extensions are easily accessible through the Comprehensive R Archive Network, known to most simply as CRAN.

Next on the list of interesting numbers was to look at how CRAN has grown to the powerhouse with more than 15 000 available packages today. Namely, I looked at the numbers of new packages (first releases to CRAN), and total releases (including newer versions of existing packages) over the years using the pkgsearch package.
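The difference between the two counts can be sketched in base R. The release log below is made up for illustration and is not real CRAN data, and the snippet is not the code used for this post; it only shows how "new packages" and "total releases" differ once a table of package releases with dates is at hand.

```r
# Made-up release log for illustration only; not real CRAN data
releases <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgA", "pkgC", "pkgB", "pkgC", "pkgD"),
  date = as.Date(c(
    "2017-03-01", "2017-09-15", "2017-11-20", "2018-06-05",
    "2018-08-11", "2019-02-11", "2019-07-30", "2019-10-01"
  )),
  stringsAsFactors = FALSE
)
releases$year <- as.integer(format(releases$date, "%Y"))

# Total releases per year: every row counts, including version updates
total_per_year <- table(releases$year)

# New packages per year: only the year of each package's first release counts
first_year <- tapply(releases$year, releases$package, min)
new_per_year <- table(first_year)

print(total_per_year)
print(new_per_year)
```

In this toy log, 2019 has three releases in total but only one of them (pkgD) is a first release.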

Results

Once again, the numbers speak for themselves:

  • In 2000-2004 the number of newly released packages was fewer than 100
  • In 2010 CRAN saw more than 400 new packages
  • In 2014 more than 1000 packages had their first release
  • In 2017 over 2000 new packages were added to CRAN
  • In 2018 and 2019, the number of total CRAN releases was more than 10 000

I would like to take this opportunity to thank the team behind CRAN for making this amazing growth possible.

Bigger - How did downloads of R packages grow?

The size of the user and developer bases of programming languages is difficult to estimate, but we can use a simple proxy to get a picture in terms of growth. RStudio’s CRAN mirror provides a REST API from which we can look at and visualize the number of monthly downloads of R packages in the past 7 years:

Note that the numbers above represent just one of many CRAN mirrors, so the true number of package downloads is much higher. The informational value of the chart is mostly in the growth, which is quite impressive:

  • January 2013: around 1.1 million downloads
  • January 2015: 7.7 million downloads
  • January 2017: 26.9 million downloads
  • January 2020: more than 128 million downloads
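A quick back-of-the-envelope sketch of what these figures imply, using only the numbers quoted in the list. The yearly factor assumes steady compounding over the 7 years, which is a simplification.

```r
# Monthly download counts in millions, as quoted in the list above
# (the January 2020 figure is a lower bound: "more than 128 million")
downloads <- c(
  "2013-01" = 1.1,
  "2015-01" = 7.7,
  "2017-01" = 26.9,
  "2020-01" = 128
)

# Growth factor of each month relative to January 2013
growth_vs_2013 <- round(downloads / downloads[["2013-01"]], 1)

# Average yearly growth factor over the 7 years, assuming steady compounding
years <- 7
yearly_factor <- round(
  (downloads[["2020-01"]] / downloads[["2013-01"]])^(1 / years),
  2
)

print(growth_vs_2013)
print(yearly_factor)
```

The yearly factor comes out just under 2, meaning downloads from this mirror have come close to doubling every year on average.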

Thank you for the 20 years

And here is to 20 more.

Cheers!

Resources