R:case4base - data subsetting and manipulation with base R

Introduction

In the previous article we discussed and learned how to reshape data with base R to a form that is practical for our use. In this one, we will look at basic data manipulation techniques, namely obtaining relevant subsets of our data. The key will be safety and avoiding complication and confusion as much as possible. This is why we:

  • try to avoid using subset, as this function is implemented via non-standard evaluation.
  • also skip $ as it uses partial matching and is hardly used with variables as column names.
  • not mention more details related to list properties of data.frames here as the topic could get confusing. If you would like to go to more important detail, we strongly recommend a thorough read of the subsetting chapter of Hadley Wickham’s Advanced R

How to use this article

  1. This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a case4base after all!
  2. If you have no time for reading, you can click here to get just the code with commentary

First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):

gdi <- read.csv(
  stringsAsFactors = FALSE
, url("https://jozef.io/post/data/ESA2010_GDI.csv")
              )
head(gdi[, 1:6, drop = FALSE])
##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0

Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.

The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.

Selecting (subsetting) relevant data from a data.frame

In this paragraph, we will try to show how to subset with as little hassle as possible while preserving the maximum safety in your operations. We shall go into more detail later in the article. The standard approach to subsetting data.frames can be summarised:

dataframe_name[row_subset, col_subset, drop = FALSE]

Where:

  • dataframe_name is the name of the data.frame we are subsetting
  • row_subset is a vector specifying the subset of rows
  • col_subset is a vector specifying the subset of columns
  • drop = FALSE is to make sure the result does not get simplified when not indented. This should always be used, unless we specifically want to simplify the result (e.g. to a vector for indexing)

Constructing meaningful subsets simply and safely

In practice, we of course will most of the time not select rows and/or columns with positions known apriori, but based on more variable conditions. For this purpose, the advised way would be is to construct logical vectors:

Let us now subset the rows of our data to get the data for countries that have known (not NA) value in the year 2016 and this value is less than 1 million:

rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000
gdi[rowidx, c(1, 23), drop = FALSE]
##        country    Y.2016
## 3      Belgium 243825.50
## 4     Bulgaria  60237.00
## 8      Estonia  12548.30
## 9      Ireland  97318.90
## 11       Spain 698701.00
## 13     Croatia      0.00
## 16      Latvia  15737.79
## 17   Lithuania  24743.49
## 18  Luxembourg  20155.80
## 21 Netherlands 357383.00
## 22     Austria 214980.60
## 24    Portugal 128789.39
## 26    Slovenia  24756.63
## 27    Slovakia  48882.91
## 28     Finland 126590.00
## 33 Switzerland 458641.00

Note that when creating the rowidx we omitted the drop = FALSE despite the aforementioned best practice. This is because in this particular case we consciously welcome the result being simplified to a vector, as its use is only as an index for subsetting.

More ways to provide subset indices

Subsetting can be done in a few ways. We will now use them to show a subset the first two and the 27th row and the first, 22nd and 23rd column, giving us the GDI for EU28, Euro Area 19 and Slovakia in the years 2015 and 2016:

  1. Logical vectors TRUE for rows/columns to subset, FALSE for those to omit
st1 <- gdi[c(TRUE, TRUE, rep(FALSE, 24), TRUE, rep(FALSE, 8))
         , c(TRUE, rep(FALSE, 20), rep(TRUE, 2))
         , drop = FALSE
         ]
  1. Numeric vectors of row/column numbers to subset
st2 <- gdi[c(1:2, 27) 
         , c(1, 22:23)
         , drop = FALSE
         ]
  1. Negative numeric vectors of row/column numbers to omit
st3 <- gdi[c(-3:-26, -28:-35)
         , c(-2:-21)
         , drop = FALSE
         ]
  1. Character vectors of row/column names to subset
st4 <- gdi[c("1", "2", "27") # we do not have very meaningful rownames
         , c("country", "Y.2015", "Y.2016")
         , drop = FALSE
         ]
st4
##         country     Y.2015     Y.2016
## 1         EU 28 9439578.39 9454683.60
## 2  Euro area 19 6598231.27 6736686.43
## 27     Slovakia   47464.71   48882.91
  1. All of the above give identical results
identical(st1, st2) && identical(st2, st3) && identical(st3, st4)
## [1] TRUE

Tips

  1. The above methods are also working and safe for matrices
  2. Negative and positive numeric vectors cannot be combined

Alternatives to base R

TL;DR - Just want the code

No time for reading? Click here to get just the code with commentary

Exercises

  1. What is the difference between gdi[3, 3] and gdi[3, 3, drop = FALSE] ?
  2. What is the difference between gdi[-3, 3] and gdi[3, -3] ? What about gdi[-3, 3, drop = FALSE] ?
  3. Why cannot we omit the first part of the & in rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000. What would happen if we just did rowidx <- gdi[, "Y.2016"] < 1000000 ?
  4. Bonus question 1: Why is identical(gdi[, "Y.2016", drop = FALSE], gdi["Y.2016"])
  5. Bonus question 2: Why is identical(gdi[, "Y.2016"], gdi[["Y.2016"]])

References

  1. Advanced R’s chapter on subsetting
  2. and on data types
  3. original eurostat data source