interval join with extra key

Question

I would like to do an interval join with an additional key. The simplest way in dplyr is quite slow

intervalDf <- tibble(id = rep(seq(1, 100000, 1), 10),
                     k1 = rep(seq(1, 1000, 1), 1000),
                     startTime = sample(seq(as.Date('1995/01/01'), as.Date('1999/06/01'), by="day"), 1000000, replace = TRUE),
                     endTime = startTime + sample.int(180, 1000000, replace = TRUE))
eventDf <- tibble(k1 = rep(seq(1, 1000, 1), 200),
                  points = sample.int(10, 200000, replace = TRUE),
                  date = sample(seq(as.Date('1995/01/01'), as.Date('2000/01/01'), by="day"), 200000, replace = TRUE))
testDf <- inner_join(intervalDf, eventDf, by = "k1") %>%
  filter(date >= startTime,
         date <= endTime) %>%
  count(id, startTime, endTime, wt = points, name = "points")

I was hoping to use interval_join in the fuzzyjoin package, but that function does not allow for an additional join key besides the interval https://rdrr.io/cran/fuzzyjoin/man/interval_join.html.

I'm guessing there's some slick way with rolling join in data.table or something, but I can't figure it out.

Use a non-equi join i.e. `library(data.table);setDT(intervalDF)[eventDf, on = .(k1, date >= startTime, date<= endTime)]` — akrun, Aug 15 '22 at 18:45

score 2 · Accepted Answer · edited Aug 15 '22 at 19:06

2

we should use a non-equi join here, with slight modification to the approach outlined in the comments:

library(data.table)

setDT(eventDf)
setDT(intervalDf)

eventDf[intervalDf, on=.(k1, date>=startTime, date<=endTime)][, sum(points), .(id, startTime=date, endTime=date.1)]

edited Aug 15 '22 at 19:06

akrun

874,273
37
540
662

answered Aug 15 '22 at 19:03

langtang

22,248
1
12
27

interval join with extra key

1 Answers1