
I have two data frames that I'd like to join by date.

library(dplyr)
library(lubridate)

df1 <-
  data.frame(
    day = seq(ymd("2020-01-01"), ymd("2020-01-14"), by = "1 day"),
    key = rep(c("green", "blue"), 7),
    value_x = sample(1:100, 14)
  ) %>% 
    as_tibble()

df2 <-
  data.frame(
    day = seq(ymd("2020-01-01"), ymd("2020-01-12"), by = "3 days"),
    key = rep(c("green", "blue"), 2),
    value_y = c(2, 4, 6, 8)
  ) %>% 
  as_tibble()

I want the output to be like this

# A tibble: 14 x 4
   day        key   value_x   value_y
   <date>     <fct>   <int>     <dbl>
 1 2020-01-01 green      91         2
 2 2020-01-02 blue       28        NA
 3 2020-01-03 green      75         2
 4 2020-01-04 blue       14         4
 5 2020-01-05 green       3         2
 6 2020-01-06 blue       27         4
 7 2020-01-07 green      15         6
 8 2020-01-08 blue        7         4
 9 2020-01-09 green       1         6
10 2020-01-10 blue       10         8
11 2020-01-11 green       9         6
12 2020-01-12 blue       76         8
13 2020-01-13 green      31         6
14 2020-01-14 blue       62         8

I tried this code:

merge(df1, df2, by = c("day", "key"), all.x = TRUE)

I'd like each day in the left table to join to the most recent day in df2 that has a value. If there is no such value, the result should be NA.

Edit --

Not all the dates in df2 will appear in df1, though the two tables share a common id. Here is an example:

df1 

day           id       key  
1 2020-01-08    A    green
2 2020-01-10    A    green
3 2020-02-24    A    blue 
4 2020-03-24    A    green
   
df2 

day            id   value 
1 2020-01-03    A       2
2 2020-01-07    A       4
3 2020-01-22    A       4
4 2020-03-24    A       6   

desired output

day           id       key    value
1 2020-01-08    A    green        4   
2 2020-01-10    A    green        4
3 2020-02-24    A    blue         4
4 2020-03-24    A    green        6 

1 Answer


After merging, you can arrange the data by key and day, then use tidyr::fill() to carry the most recent non-NA value forward within each key.

library(dplyr)

merge(df1, df2, by = c('day', 'key'), all.x = TRUE) %>%
  arrange(key, day) %>%
  group_by(key) %>%
  tidyr::fill(value_y) %>%
  arrange(day)

#          day   key value_x value_y
#1  2020-01-01 green      40       2
#2  2020-01-02  blue      45      NA
#3  2020-01-03 green      54       2
#4  2020-01-04  blue      11       4
#5  2020-01-05 green      12       2
#6  2020-01-06  blue       7       4
#7  2020-01-07 green      72       6
#8  2020-01-08  blue      76       4
#9  2020-01-09 green      52       6
#10 2020-01-10  blue      32       8
#11 2020-01-11 green      69       6
#12 2020-01-12  blue      10       8
#13 2020-01-13 green      63       6
#14 2020-01-14  blue      84       8

For the updated data you can use the following:

df1 %>%
  left_join(df2, by = 'id') %>%
  # gap between each df1 day and every df2 day for that id
  mutate(diff = day.x - day.y) %>%
  group_by(id, key, day.x) %>%
  # keep the row with the smallest non-negative gap,
  # i.e. the most recent day.y on or before day.x
  filter(diff == min(diff[diff >= 0])) %>%
  arrange(day.x) %>%
  select(day = day.x, id, key, value)

#   day        id    key   value
#  <date>     <chr> <chr> <dbl>
#1 2020-01-08 A     green     4
#2 2020-01-10 A     green     4
#3 2020-02-24 A     blue      4
#4 2020-03-24 A     green     6
  • Can I group this by key? I'm wondering if I had more key values and more dates so the most recent non na value might belong to a different key – Cauder Sep 07 '20 at 00:53
  • Yes, I think it is better to group by `key`. Updated the answer. – Ronak Shah Sep 07 '20 at 00:54
  • This didn't solve my use case. I'm going to update the description to reflect why. The problem is that this assumes that the dates in df2 all appear in df1 – Cauder Sep 07 '20 at 03:05
  • @Cauder Updated the answer. – Ronak Shah Sep 07 '20 at 04:02
  • Thanks! Will this break down for datasets that are more than 20M? Wondering if I should switch to data tables instead – Cauder Sep 07 '20 at 04:22
  • 1
    I am not sure if this will break but `data.table` is usually faster. – Ronak Shah Sep 07 '20 at 04:40
  • Unfortunately this script does not work with the size of my dataset. It causes RStudio to crash out with "unexpected errors." I mentioned it on this thread, would you be open to writing this with data tables? https://stackoverflow.com/questions/63774476/what-are-helpful-optimizations-in-r-for-big-data-sets – Cauder Sep 07 '20 at 09:18
  • That's a great suggestion. I'll mark this accept and open a separate question – Cauder Sep 07 '20 at 10:09
  • Posted the question: https://stackoverflow.com/questions/63775540/join-with-fuzzy-matching-by-date-in-r-with-a-large-dataset – Cauder Sep 07 '20 at 10:12
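For large datasets like the one discussed in the comments, a `data.table` rolling join does the same "most recent value on or before each date" lookup without materializing the full cartesian join that `left_join(by = 'id')` creates. This is a sketch against the small example from the edit, not the 20M-row data; `roll = TRUE` is data.table's last-observation-carried-forward join.

```r
library(data.table)

dt1 <- data.table(
  day = as.Date(c("2020-01-08", "2020-01-10", "2020-02-24", "2020-03-24")),
  id  = "A",
  key = c("green", "green", "blue", "green")
)
dt2 <- data.table(
  day   = as.Date(c("2020-01-03", "2020-01-07", "2020-01-22", "2020-03-24")),
  id    = "A",
  value = c(2, 4, 4, 6)
)

# For each row of dt1, take the dt2 row with the same id and the
# most recent day on or before dt1's day (NA if none exists).
dt2[dt1, on = .(id, day), roll = TRUE]
```

This returns one row per row of dt1, with `value` rolled forward to 4, 4, 4, 6 for the example above.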