
I have two data frames that I'd like to join by date.

library(dplyr)
library(lubridate)

df1 <-
  data.frame(
    day = seq(ymd("2020-01-01"), ymd("2020-01-14"), by = "1 day"),
    key = rep(c("green", "blue"), 7),
    value_x = sample(1:100, 14)
  ) %>% 
    as_tibble()

df2 <-
  data.frame(
    day = seq(ymd("2020-01-01"), ymd("2020-01-12"), by = "3 days"),
    key = rep(c("green", "blue"), 2),
    value_y = c(2, 4, 6, 8)
  ) %>% 
  as_tibble()

I want the output to be like this

# A tibble: 14 x 4
   day        key   value_x   value_y
   <date>     <fct>   <int>     <dbl>
 1 2020-01-01 green      91         2
 2 2020-01-02 blue       28        NA
 3 2020-01-03 green      75         2
 4 2020-01-04 blue       14         4
 5 2020-01-05 green       3         2
 6 2020-01-06 blue       27         4
 7 2020-01-07 green      15         6
 8 2020-01-08 blue        7         4
 9 2020-01-09 green       1         6
10 2020-01-10 blue       10         8
11 2020-01-11 green       9         6
12 2020-01-12 blue       76         8
13 2020-01-13 green      31         6
14 2020-01-14 blue       62         8

I tried this code:

merge(df1, df2, by = c("day", "key"), all.x = TRUE)

I'd like each day in the left table to join to the most recent day in df2 that has a value. If there is no such value, the result should be NA.

Edit --

Not all the dates in df2 will appear in df1, though the two tables share a common id. Here is an example:

df1 

day           id       key  
1 2020-01-08    A    green
2 2020-01-10    A    green
3 2020-02-24    A    blue 
4 2020-03-24    A    green
   
df2 

day            id   value 
1 2020-01-03    A       2
2 2020-01-07    A       4
3 2020-01-22    A       4
4 2020-03-24    A       6   

desired output

day           id       key    value
1 2020-01-08    A    green        4   
2 2020-01-10    A    green        4
3 2020-02-24    A    blue         4
4 2020-03-24    A    green        6 

1 Answer


After merging, you can arrange the data by key and day, then use tidyr::fill() to carry the most recent non-NA value forward within each key.

library(dplyr)

merge(df1, df2, by = c('day', 'key'), all.x = TRUE) %>%
  arrange(key, day) %>%
  group_by(key) %>%
  tidyr::fill(value_y) %>%
  arrange(day)

#          day   key value_x value_y
#1  2020-01-01 green      40       2
#2  2020-01-02  blue      45      NA
#3  2020-01-03 green      54       2
#4  2020-01-04  blue      11       4
#5  2020-01-05 green      12       2
#6  2020-01-06  blue       7       4
#7  2020-01-07 green      72       6
#8  2020-01-08  blue      76       4
#9  2020-01-09 green      52       6
#10 2020-01-10  blue      32       8
#11 2020-01-11 green      69       6
#12 2020-01-12  blue      10       8
#13 2020-01-13 green      63       6
#14 2020-01-14  blue      84       8

For the updated data you can use the following:

df1 %>%
  left_join(df2, by = 'id') %>%
  # gap between each df1 day and every df2 day for that id
  mutate(diff = day.x - day.y) %>%
  group_by(id, key, day.x) %>%
  # keep the row with the smallest non-negative gap,
  # i.e. the most recent day.y on or before day.x
  filter(diff == min(diff[diff >= 0])) %>%
  arrange(day.x) %>%
  select(day = day.x, id, key, value)

#   day        id    key   value
#  <date>     <chr> <chr> <dbl>
#1 2020-01-08 A     green     4
#2 2020-01-10 A     green     4
#3 2020-02-24 A     blue      4
#4 2020-03-24 A     green     6
  • Can I group this by key? I'm wondering if I had more key values and more dates so the most recent non na value might belong to a different key – Cauder Sep 07 '20 at 00:53
  • Yes, I think it is better to group by `key`. Updated the answer. – Ronak Shah Sep 07 '20 at 00:54
  • This didn't solve my use case. I'm going to update the description to reflect why. The problem is that this assumes that the dates in df2 all appear in df1 – Cauder Sep 07 '20 at 03:05
  • @Cauder Updated the answer. – Ronak Shah Sep 07 '20 at 04:02
  • Thanks! Will this break down for datasets that are more than 20M? Wondering if I should switch to data tables instead – Cauder Sep 07 '20 at 04:22
  • 1
    I am not sure if this will break but `data.table` is usually faster. – Ronak Shah Sep 07 '20 at 04:40
  • Unfortunately this script does not work with the size of my dataset. It causes RStudio to crash out with "unexpected errors." I mentioned it on this thread, would you be open to writing this with data tables? https://stackoverflow.com/questions/63774476/what-are-helpful-optimizations-in-r-for-big-data-sets – Cauder Sep 07 '20 at 09:18
  • That's a great suggestion. I'll mark this accept and open a separate question – Cauder Sep 07 '20 at 10:09
  • Posted the question: https://stackoverflow.com/questions/63775540/join-with-fuzzy-matching-by-date-in-r-with-a-large-dataset – Cauder Sep 07 '20 at 10:12
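For large datasets like the one discussed in the comments, a `data.table` rolling join does the same "most recent value on or before each date" lookup without materializing the full cartesian join that `left_join(by = 'id')` creates. This is a sketch against the small example from the edit, not the 20M-row data; `roll = TRUE` is data.table's last-observation-carried-forward join.

```r
library(data.table)

dt1 <- data.table(
  day = as.Date(c("2020-01-08", "2020-01-10", "2020-02-24", "2020-03-24")),
  id  = "A",
  key = c("green", "green", "blue", "green")
)
dt2 <- data.table(
  day   = as.Date(c("2020-01-03", "2020-01-07", "2020-01-22", "2020-03-24")),
  id    = "A",
  value = c(2, 4, 4, 6)
)

# For each row of dt1, take the dt2 row with the same id and the
# most recent day on or before dt1's day (NA if none exists).
dt2[dt1, on = .(id, day), roll = TRUE]
```

This returns one row per row of dt1, with `value` rolled forward to 4, 4, 4, 6 for the example above.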