1

I would like to do an interval join with an additional key. The simplest way in dplyr is quite slow

intervalDf <- tibble(id = rep(seq(1, 100000, 1), 10),
                     k1 = rep(seq(1, 1000, 1), 1000),
                     startTime = sample(seq(as.Date('1995/01/01'), as.Date('1999/06/01'), by="day"), 1000000, replace = TRUE),
                     endTime = startTime + sample.int(180, 1000000, replace = TRUE))
eventDf <- tibble(k1 = rep(seq(1, 1000, 1), 200),
                  points = sample.int(10, 200000, replace = TRUE),
                  date = sample(seq(as.Date('1995/01/01'), as.Date('2000/01/01'), by="day"), 200000, replace = TRUE))
testDf <- inner_join(intervalDf, eventDf, by = "k1") %>%
  filter(date >= startTime,
         date <= endTime) %>%
  count(id, startTime, endTime, wt = points, name = "points")

I was hoping to use interval_join in the fuzzyjoin package, but that function does not allow for an additional join key besides the interval https://rdrr.io/cran/fuzzyjoin/man/interval_join.html.

I'm guessing there's some slick way with rolling join in data.table or something, but I can't figure it out.

blahblah4252
  • 105
  • 4
  • 2
    Use a non-equi join i.e. `library(data.table);setDT(intervalDF)[eventDf, on = .(k1, date >= startTime, date<= endTime)]` – akrun Aug 15 '22 at 18:45

1 Answers1

2

we should use a non-equi join here, with slight modification to the approach outlined in the comments:

library(data.table)

setDT(eventDf)
setDT(intervalDf)

eventDf[intervalDf, on=.(k1, date>=startTime, date<=endTime)][, sum(points), .(id, startTime=date, endTime=date.1)]
akrun
  • 874,273
  • 37
  • 540
  • 662
langtang
  • 22,248
  • 1
  • 12
  • 27