I'd like to be able to join a dataframe...
concerts
that has:- a column of keys (
venue
) - a column of date-times (
start_time
)- The
start_time
s represent the time at which the concert began at the venue.
- The
- a column of keys (
...with a number of other dataframes that are effectively time series. For example, the dataframe...
temperatures
has:- a column of keys (also
venue
) - a column of hourly date-times (
datetime
) that span the entire time frame (many days before the concert times and many days after). - a
temperature
column
- a column of keys (also
What I want to have in the joined result is: the temperature at the venue at the start hour of the concert, but also at the end of the first, second, third, and fourth hours of the concert. Essentially a 4-hour 'window' of temperatures.
The only approach I can think of is to create lagged columns in temperatures
(one for each of the 1st-4th hours of the concert), and then to join with concerts
on the venue and start hours. But this is very slow when applied to my actual dataset, which has many more columns than just temperature.
Here is the example data I've cooked up.
library(lubridate)
library(tidyverse)
concerts <- tibble(venue = c("A", "A", "B", "B"),
start_time = ymd_hm(c("2019-08-09 08:05",
"2019-08-10 16:07",
"2019-08-09 09:30",
"2019-08-10 17:15"))
)
temperatures <- tibble(venue = c(rep("A", 50),
rep("B", 50)),
datetime = rep(seq(ymd_hm("2019-08-09 00:00"), by = "hour", length.out = 50), 2),
temperature = c(rnorm(50, 60, 5),
rnorm(50, 95, 5))
)
Here is my successful but expensive attempt, with the desired results in temperature_over_course_of_concerts
.
temperatures_lagged <- temperatures %>%
mutate(temperature_1hr_in = lag(temperature, 1),
temperature_2hr_in = lag(temperature, 2),
temperature_3hr_in = lag(temperature, 3),
temperature_4hr_in = lag(temperature, 4)) %>%
rename(temperature_start = temperature)
temperature_over_course_of_concerts <- concerts %>%
mutate(start_hour = floor_date(start_time, unit = "hour")) %>%
left_join(temperatures_lagged, by = c("venue" = "venue", "start_hour" = "datetime"))