
I have a data frame, df, which contains traffic through London Underground stations for every hour in 2018:

        Year    Month Day  Hour    Station.ID    Traffic
1       2018    1     1    0       A             1000
2       2018    1     1    0       B             1300
3       2018    1     1    0       C             956
4       2018    1     1    0       D             721
...

It is over 7,000,000 rows long. I would like an efficient way to look up the traffic at certain dates and times. For example, if I wanted to know the traffic in station 'X' at 10am on 4/5/2018 I'd currently perform:

df[df$Year==2018 & df$Month==5 & df$Day==4 & df$Hour==10 & df$Station.ID=='X',]$Traffic

But this method will needlessly look through the entire dataframe. My idea is to organize data into a hierarchical structure like so:

library(data.tree)
df$pathString <- paste("MyTree", 
                        df$Year, 
                        df$Month,
                        df$Day,
                        df$Hour,
                        df$Station.ID,
                        sep = "/")
dftree <- as.Node(df)

My previous request would now resemble:

dftree$'2018'$'5'$'4'$'10'$X$Traffic

and this will be orders of magnitude quicker. My problem is that it takes too long to actually organize df into a tree in the first place! If I take a subset of 1000 rows, then it takes a couple of minutes. With all 7,000,000 rows it churns away with no end in sight.

My questions:

1) What is the most appropriate data structure for fast look-up when data is organized by date?

2) Is df too large for this structure to be appropriate?

  • 1
    Instead of multiple `Year`, `Month` etc. columns create just one `POSIXct` (datetime) column. Then, use `data.table` and set as key `Station.Id` and the newly created column. In this way you can use binary search and extraction of desired station.id/datetime will be very quick. – nicola Jul 01 '18 at 14:52
  • Relevant vignette: [Keys and fast binary search based subset](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html). Benchmarks in the last section ("binary search vs vector scans"). – Henrik Jul 01 '18 at 16:20
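The approach suggested in the comments can be sketched as follows. This is a minimal illustration, not the questioner's actual data: the toy `df` below stands in for the 7,000,000-row data frame, and the column names are taken from the question.

```r
library(data.table)

# Toy stand-in for the questioner's 7M-row data frame
df <- data.frame(Year = 2018L, Month = 5L, Day = 4L,
                 Hour = c(9L, 10L),
                 Station.ID = "X",
                 Traffic = c(850L, 1200L),
                 stringsAsFactors = FALSE)

dt <- as.data.table(df)

# Collapse Year/Month/Day/Hour into a single POSIXct column
dt[, Datetime := as.POSIXct(sprintf("%d-%02d-%02d %02d:00:00",
                                    Year, Month, Day, Hour),
                            tz = "UTC")]

# Key on station and datetime: the table is sorted once, and
# subsequent lookups use binary search instead of a vector scan
setkey(dt, Station.ID, Datetime)

# Traffic at station "X" at 10am on 4 May 2018
dt[.("X", as.POSIXct("2018-05-04 10:00:00", tz = "UTC")), Traffic]
#> [1] 1200
```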

1 Answer


A data.table.

Using the flights data set, a query like yours takes around half a second:

library(data.table)
library(nycflights13)
flights <- as.data.table(flights)
flights7M <- rbindlist(lapply(1:22, function(x) flights))

nrow(flights7M) / 7e6  # close enough
#> [1] 1.058439

bench::system_time({
  setkey(flights7M, year, month, day, hour, origin)
  flights7M[.(2013L, 5L, 4L, 10L, "JFK")]
})
#> process    real 
#>    1.8s 587.4ms

Created on 2018-07-02 by the reprex package (v0.2.0).
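Worth noting: most of that half second is the one-off `setkey()` sort. Once the table is keyed, each individual lookup is a binary search and takes only milliseconds, so repeated queries amortize the setup cost. A minimal illustration with made-up data (timings will vary by machine):

```r
library(data.table)

set.seed(1)
dt <- data.table(id = sample(letters, 1e6, replace = TRUE),
                 val = runif(1e6))

setkey(dt, id)           # one-off sort of the whole table

# Each keyed subset is now a binary search, not a full scan;
# typically a few milliseconds even on a million rows
system.time(dt[.("q")])
```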

Hugh