
This is a variation of the question posed here. I really liked the dplyr solution there, but I can't figure out how to apply it to the following case without looping, which is not ideal in a massive dataset.

Suppose I have a key dataframe df1 that contains all key dates for each id, including some ids with >1 key date:

    id   keydate
1   id1  2015-03-06
2   id1  2017-04-10
3   id1  2021-08-20
4   id2  2018-12-11
5   id2  2020-12-15
6   id3  2019-06-04

df1 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("id1", "id2", "id3"), class = "factor"),
                      keydate = structure(c(16500, 17266, 18859, 17876, 18611, 18051), class = "Date")),
                 row.names = c(NA, -6L), class = "data.frame")

Each keydate in df1 defines the start date of a 10-day window of interest.

I have a second dataframe df2 that has ids and dates:

    id  date
1   id1 2015-03-10
2   id1 2021-08-29
3   id1 2021-10-19
4   id2 2018-12-20
5   id2 2021-01-29
6   id3 2019-07-12
7   id3 2019-08-23

df2 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("id1", "id2", "id3"), class = "factor"),
                      date = structure(c(16504, 18868, 18919, 17885, 18656, 18089, 18131), class = "Date")),
                 row.names = c(NA, -7L), class = "data.frame")

I am trying to find all rows in df2 where keydate <= date <= keydate + 10 for any keydate in df1 with a matching id. The desired output would look like this:

    id  date
1   id1 2015-03-10
2   id1 2021-08-29
3   id2 2018-12-20

specifically returning rows 1, 2, and 4 of df2.
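
For concreteness, the row-by-row check this rule implies looks like the following base-R sketch - the kind of looping I'd like to avoid on a large dataset:

keep <- vapply(seq_len(nrow(df2)), function(i) {
  # all keydates for this row's id, then test the 10-day window
  k <- df1$keydate[df1$id == df2$id[i]]
  any(df2$date[i] >= k & df2$date[i] <= k + 10)
}, logical(1))
df2[keep, ]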

  • Have you done any searching of SO for similar questions? It seems to me I've seen several similar ones over the years. – IRTFM Apr 25 '23 at 21:56
  • Of course, but I didn't find one that had the duplicated index problem. Maybe it's an issue of me not knowing the correct terms. Anyway, the act of writing this problem out helped me solve it. – DocBuckets Apr 25 '23 at 21:58

2 Answers


I feel like a dunce. left_join is exhaustively combinatorial: within each id, every keydate is paired with every date, one row per combination. That makes the solution trivial. I'll leave it here for future people looking for the solution:

library(dplyr)

df2 %>% 
  left_join(df1, by = "id") %>% 
  filter(date >= keydate & date <= keydate + 10) %>% 
  select(-keydate)

The exhaustive nature of left_join can be seen in the result of the first step (the join itself) in my solution:

    id       date    keydate
1  id1 2015-03-10 2015-03-06
2  id1 2015-03-10 2017-04-10
3  id1 2015-03-10 2021-08-20
4  id1 2021-08-29 2015-03-06
5  id1 2021-08-29 2017-04-10
6  id1 2021-08-29 2021-08-20
7  id1 2021-10-19 2015-03-06
8  id1 2021-10-19 2017-04-10
9  id1 2021-10-19 2021-08-20
10 id2 2018-12-20 2018-12-11
11 id2 2018-12-20 2020-12-15
12 id2 2021-01-29 2018-12-11
13 id2 2021-01-29 2020-12-15
14 id3 2019-07-12 2019-06-04
15 id3 2019-08-23 2019-06-04
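
One caveat, assuming a recent dplyr release (>= 1.1.1, where the `relationship` argument was added): left_join warns when it detects an unexpected many-to-many match like this one. Declaring the relationship up front documents the intent and silences the warning:

library(dplyr)

# Each id appears multiple times on both sides, so the join is many-to-many;
# saying so explicitly avoids dplyr's warning without changing the result.
df2 %>% 
  left_join(df1, by = "id", relationship = "many-to-many") %>% 
  filter(date >= keydate & date <= keydate + 10) %>% 
  select(-keydate)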
– DocBuckets

Another approach is to use `join_by()`, available in dplyr >= 1.1.0. You can join on id and require that date fall between keydate and a computed end_date (keydate plus 10 days).

library(tidyverse)

inner_join(df2,
           mutate(df1, end_date = keydate + 10), 
           join_by(id, between(date, keydate, end_date))) %>%
  select(id, date)

Output

   id       date
1 id1 2015-03-10
2 id1 2021-08-29
3 id2 2018-12-20
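
For readers who prefer explicit comparison operators, the same non-equi join can be written without `between()` (a sketch, assuming the same dplyr >= 1.1.0; the mutate step is still needed because join_by expressions can't contain computed columns):

library(dplyr)

# Same join spelled out as two inequalities instead of between()
inner_join(df2,
           mutate(df1, end_date = keydate + 10),
           join_by(id, date >= keydate, date <= end_date)) %>%
  select(id, date)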
– Ben
  • I like your solution as it demonstrates the join_by functionality nicely. I would omit the mutate part, though. You can just use keydate + 10 like I did in my original solution. That keeps the interval bounds of `between` on one line - easier to read. – DocBuckets Apr 27 '23 at 00:34
  • @DocBuckets Interesting - I don't believe you can include `keydate + 10` in `join_by` at this time - when I tried this I received an error: "Expressions can't contain computed columns, and can only reference columns by name..." Did you try it? Did that work for you? – Ben Apr 27 '23 at 12:25