I have a data frame, df, which contains traffic through London underground stations for every hour in 2018:
Year Month Day Hour Station.ID Traffic
1 2018 1 1 0 A 1000
2 2018 1 1 0 B 1300
3 2018 1 1 0 C 956
4 2018 1 1 0 D 721
...
It is over 7,000,000 rows long. I would like an efficient way to look up the traffic at certain dates and times. For example, if I wanted to know the traffic in station 'X' at 10am on 4/5/2018 I'd currently perform:
df[df$Year==2018 & df$Month==5 & df$Day==4 & df$Hour==10 & df$Station.ID=='X',]$Traffic
But this method will needlessly look through the entire dataframe. My idea is to organize data into a hierarchical structure like so:
library(data.tree)
df$pathString <- paste("MyTree",
df$Year,
df$Month,
df$Day,
df$Hour,
df$Station.ID,
sep = "/")
dftree <- as.Node(df)
My previous request would now resemble:
dftree$'2018'$'5'$'4'$'10'$X$Traffic
and this will be orders of magnitude quicker. My problem is that it takes too long to actually organize df into a tree in the first place! If I take a subset of 1000 rows, then it takes a couple of minutes. With all 7,000,000 rows it churns away with no end in sight.
My questions:
1) What is the most appropriate data structure for fast look-up when data is organized by date?
2) Is df too large for this structure to be appropriate?