0

Recently I had to join two dataframes based on their timestamps. The left data contains a fixed timestamp and the right a range. I got it mostly working as you can see in my MWE, but the system tends to produce duplicate results at the crossing point from one range to the next. I've tried all the options, nothing worked.

Is there a nice way to suppress the duplicate entry?
In this example it is the bold one, number 13. Of course you can try to filter it, but that feels rather hacky.

left.data <- data.frame(
  Data.left = 1:20,
  Timestamp = seq(
    ISOdatetime(2000, 1, 1, 10, 55, 10),
    by = "5 min",
    length.out = 20
  )
)

right.data <- data.frame(
  Data.right = 1:10,
  PeriodStartTime = seq(
    ISOdatetime(2000, 1, 1, 10, 55, 10),
    by = "hour",
    length.out = 10
  ),
  PeriodEndTime = seq(
    ISOdatetime(2000, 1, 1, 11, 55, 10),
    by = "hour",
    length.out = 10
  )
)

result <- fuzzyjoin::interval_join(
    x = left.data,
    y = right.data,
    by = c(
      "Timestamp" = "PeriodStartTime",
      "Timestamp" = "PeriodEndTime"
    ),
    mode = "left"
  )
Data.left Timestamp Data.right PeriodStartTime PeriodEndTime
1 2000-01-01 10:55:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
2 2000-01-01 11:00:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
3 2000-01-01 11:05:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
4 2000-01-01 11:10:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
5 2000-01-01 11:15:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
6 2000-01-01 11:20:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
7 2000-01-01 11:25:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
8 2000-01-01 11:30:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
9 2000-01-01 11:35:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
10 2000-01-01 11:40:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
11 2000-01-01 11:45:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
12 2000-01-01 11:50:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
13 2000-01-01 11:55:10 1 2000-01-01 10:55:10 2000-01-01 11:55:10
13 2000-01-01 11:55:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
14 2000-01-01 12:00:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
15 2000-01-01 12:05:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
16 2000-01-01 12:10:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
17 2000-01-01 12:15:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
18 2000-01-01 12:20:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
19 2000-01-01 12:25:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
20 2000-01-01 12:30:10 2 2000-01-01 11:55:10 2000-01-01 12:55:10
Someone2
  • 421
  • 2
  • 15

1 Answers1

1

Maybe someone will provide another answer with interval_join, but here is something to consider with fuzzy_left_join.

Your match function match_fun could be set to allow for equality for the lower bound of the range (greater or equal to), but be less than the upper bound.

library(fuzzyjoin)

fuzzy_left_join(
  left.data,
  right.data,
  by = c(
    "Timestamp" = "PeriodStartTime",
    "Timestamp" = "PeriodEndTime"
  ),
  match_fun = c(`>=`, `<`)
)

This result would give you:

   Data.left           Timestamp Data.right     PeriodStartTime       PeriodEndTime
1          1 2000-01-01 10:55:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
2          2 2000-01-01 11:00:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
3          3 2000-01-01 11:05:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
4          4 2000-01-01 11:10:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
5          5 2000-01-01 11:15:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
6          6 2000-01-01 11:20:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
7          7 2000-01-01 11:25:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
8          8 2000-01-01 11:30:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
9          9 2000-01-01 11:35:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
10        10 2000-01-01 11:40:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
11        11 2000-01-01 11:45:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
12        12 2000-01-01 11:50:10          1 2000-01-01 10:55:10 2000-01-01 11:55:10
13        13 2000-01-01 11:55:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
14        14 2000-01-01 12:00:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
15        15 2000-01-01 12:05:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
16        16 2000-01-01 12:10:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
17        17 2000-01-01 12:15:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
18        18 2000-01-01 12:20:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
19        19 2000-01-01 12:25:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
20        20 2000-01-01 12:30:10          2 2000-01-01 11:55:10 2000-01-01 12:55:10
Ben
  • 28,684
  • 5
  • 23
  • 45