Match two tables based on a time difference criterium

Question

I have a data table (lv_timest) with time stamps every 3 hours for each date:

# A tibble: 6 × 5
     LV0_mean LV1_mean  LV2_mean Date_time           Date      
     <dbl>    <dbl>     <dbl>    <S3:POSIXct>        <date>    
1    0.778    -4.12     0.736    2016-12-28 00:00:00 2016-12-28
2    0.376    -0.234    0.388    2016-12-28 03:00:00 2016-12-28
3    0.409    1.46      0.241    2016-12-28 06:00:00 2016-12-28
4    0.760    2.07      0.460    2016-12-28 09:00:00 2016-12-28
5    0.759    2.91      0.735    2016-12-28 12:00:00 2016-12-28
6    0.857    3.00      0.803    2016-12-28 15:00:00 2016-12-28

from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):

# A tibble: 6 × 4
   Event_number Date_time           Date       Date_time_new
   <int>        <S3: POSIXct>       <date>     <S3: POSIXct>
1  75           2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2  123          2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3  264          2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4  317          2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5  318          2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6  369          2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00

For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:

Event_number Date_time.x          Date.x      Date_time_new      LV0_mean LV1_mean   LV2_mean Date_time.y          Date.y
<int>        <S3: POSIXct>        <date>      <S3: POSIXct>      <dbl>    <dbl>      <dbl>    <S3: POSIXct>        <date>                         
75           2016-12-28 08:00:00  2016-12-28 2016-12-28 08:00:00 0.760    2.07       0.460    2016-12-28 09:00:00  2016-12-28

In fact, the time difference should not be over one hour. I thought of using the fuzzyjoin package for this, and writing a function that computes the time difference between timestamps of the two table, as hours. However, fuzzy_inner_join replicates rows in the second table and takes several timestamps in the first table to match it.

require(lubridate)
require(fuzzyjoin)

diff_timest <- function(x, y){abs(x%--%y %/% hours(1)) <= 1} # time interval as hours ≤ 1 hour

match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)

# A tibble: 6 × 9
  Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y    
         <int> <dttm>              <date>     <dttm>                 <dbl>    <dbl>    <dbl> <dttm>              <date>    
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00   0.760     2.07     0.460 2016-12-28 09:00:00 2016-12-28
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00   1.24      1.83     2.05  2016-12-30 15:00:00 2016-12-30
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00  -0.128    -5.43     2.72  2017-01-07 06:00:00 2017-01-07
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.0751    0.171    2.56  2017-01-09 09:00:00 2017-01-09
5          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09
6          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09

Would there be another way to do this?

**Edit**: issue can be solved by changing the function `diff_timest`to `diff_timest <- function(x, y){abs((x%--%y) / hours(1)) <= 1}`. However the limitation is that this won’t match timestamps from different dates as @zephryl raises. Hence, using `full_join(lv_timest, by = character())` is preferred (answer below). — Lisa, Feb 08 '22 at 14:38

danlooo · Answer 1 · 2022-02-08T12:39:48.247

Joining is always a procedure of first getting all combinations of all rows followed by a filter. We can do this manually:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

datetimes_a <- tibble(
  id = seq(3),
  group = "A",
  datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
  id = seq(3),
  group = "B",
  datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)

datetimes_a %>%
  # start with cross product of all possible pairs
  expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
  mutate(diff = abs(datetime - datetime_b)) %>%
  # get shortest time difference
  group_by(id, id_b) %>%
  arrange(diff) %>%
  slice(1) %>%
  # time diff must be less than 1hr
  filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups:   id, id_b [3]
#>      id group datetime             id_b group_b datetime_b          diff     
#>   <int> <chr> <dttm>              <int> <chr>   <dttm>              <drtn>   
#> 1     1 A     2016-12-28 00:00:00     1 B       2016-12-28 00:00:10   10 secs
#> 2     2 A     2016-12-28 03:00:00     2 B       2016-12-28 03:20:00 1200 secs
#> 3     3 A     2016-12-28 23:59:59     3 B       2016-12-29 00:00:02    3 secs

^{Created on 2022-02-08 by the reprex package (v2.0.1)}

This works also if the nearest timepoint is on another date e.g. right before and after midnight.

zephryl · Accepted Answer · 2022-02-08T12:43:46.053

1

I would suggest a standard join, followed by a grouped filter to the closest instance of each timestamp:

library(tidyverse)
library(lubridate)

match_timest <- event_timest %>%
  inner_join(lv_timest, by = "Date") %>%
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, unit = "hours"))) %>%
  group_by(Date_time.y) %>%
  filter(diff <= 1 & diff == min(diff)) %>%
  ungroup() %>%
  select(!diff)

Note:

this will still match multiple rows if there’s more than one that are the exact same shortest difference from the index timestamp.
this won’t match timestamps from different dates — eg, 23:59:59 on 1/1/22 won’t be matched with 00:00:00 on 1/2/22. If you’d like to do that, you can use a full Cartesian join (full_join(lv_timest, by = character())) rather than the inner_join() above.

edited Feb 08 '22 at 12:43

answered Feb 08 '22 at 12:32

zephryl

14,633
3
11
30

Thank you, in this particular case I would suggest not using the second criterium in the filtering step (`& diff == min(diff)`) as this leads to missing some events (some have the same date but a different event number). In this case there is only one timestamp that can match with a time difference below 1h. But if that was not the case, this should be kept. – Lisa Feb 08 '22 at 14:48
I’m not sure I follow. e.g., `abs(as.numeric(lubridate::ymd_hms("2022-01-02 00:01:00") - lubridate::ymd_hms("2022-01-01 23:55:00"), unit = "hours")) == 0.1` — this still correctly shows the correct time difference with datetimes on two different days. – zephryl Feb 08 '22 at 15:22
Sorry that was unclear, the issue actually is not the time difference calculation. Since the table is grouped by datetime but some events have the same day, the time difference that needs to be retained sometimes is not the minimum of the entire group, and several lines should be kept for the same group in the filtering step (_e.g._ line 4 and 5 in the repro). Of course this all depends on the data, and can be dealt with by changing the grouping variable. – Lisa Feb 14 '22 at 12:51

Match two tables based on a time difference criterium

2 Answers2

Linked

Related