1

I'd like to be able to join a dataframe...

  1. concerts that has:
    • a column of keys (venue)
    • a column of date-times (start_time)
      • The start_times represent the time at which the concert began at the venue.

...with a number of other dataframes that are effectively time series. For example, the dataframe...

  1. temperatures has:
    • a column of keys (also venue)
    • a column of hourly date-times (datetime) that span the entire time frame (many days before the concert times and many days after).
    • a temperature column

What I want to have in the joined result is: the temperature at the venue at the start hour of the concert, but also at the end of the first, second, third, and fourth hours of the concert. Essentially a 4-hour 'window' of temperatures.

The only approach I can think of is to create lagged columns in temperatures (one for each of the 1st-4th hours of the concert), and then to join with concerts on the venue and start hours. But this is very slow when applied to my actual dataset, which has many more columns than just temperature.

Here is the example data I've cooked up.

library(lubridate)
library(tidyverse)

concerts <- tibble(venue = c("A", "A", "B", "B"),
                    start_time = ymd_hm(c("2019-08-09 08:05",
                                     "2019-08-10 16:07",
                                     "2019-08-09 09:30",
                                     "2019-08-10 17:15"))
                   )

temperatures <- tibble(venue = c(rep("A", 50),
                                   rep("B", 50)),
                datetime = rep(seq(ymd_hm("2019-08-09 00:00"), by = "hour", length.out = 50), 2),
                temperature = c(rnorm(50, 60, 5),
                         rnorm(50, 95, 5))
                )

Here is my successful but expensive attempt, with the desired results in temperature_over_course_of_concerts.

temperatures_lagged <- temperatures  %>%
  mutate(temperature_1hr_in = lag(temperature, 1),
         temperature_2hr_in = lag(temperature, 2),
         temperature_3hr_in = lag(temperature, 3),
         temperature_4hr_in = lag(temperature, 4)) %>%
  rename(temperature_start = temperature)

temperature_over_course_of_concerts <- concerts %>%
  mutate(start_hour = floor_date(start_time, unit = "hour")) %>%
  left_join(temperatures_lagged, by = c("venue" = "venue", "start_hour" = "datetime"))
  • Would look into non-equi joins, which let you join on a window: https://stackoverflow.com/questions/32893022/join-two-datasets-based-on-an-inequality-condition – Chris Aug 12 '19 at 18:12

0 Answers0