Filter a Dataframe by Another Dataframe

Question

Supposedly this question has already been answered. But the user who flagged my question failed to test the solution and the cited question does not work for my problem.

I have found questions on how to filter a dataframe using another list but I have not found something that shows how to filter a dataframe using another dataframe.

I have two dataframes and the first one can be thought of as a key of ID's and dates.

   id       date
1 id1 2016-06-23
2 id2 2016-06-25
3 id3 2016-06-23
4 id4 2016-06-25
5 id5 2016-06-27

structure(list(id = structure(1:5, .Label = c("id1", "id2", "id3", 
"id4", "id5"), class = "factor"), date = structure(c(16975, 16977, 
16975, 16977, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

I then have a second dataframe with ID's and dates and I would like to filter the second dataframe to only return rows that are after the date for the ID in the first row.

Here is the second dataframe:

   id       date
1 id1 2016-06-20
2 id1 2016-06-23
3 id1 2016-06-24
4 id2 2016-06-23
5 id3 2016-06-27

structure(list(id = structure(c(1L, 1L, 1L, 2L, 3L), .Label = c("id1", 
"id2", "id3"), class = "factor"), date = structure(c(16972, 16975, 
16976, 16975, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

And this is what the results would look like:

   id       date
1 id1 2016-06-24
2 id3 2016-06-27

Possible duplicate of [Join two datasets based on an inequality condition](https://stackoverflow.com/questions/32893022/join-two-datasets-based-on-an-inequality-condition) — C8H10N4O2, Sep 26 '17 at 17:53
Did you read [this answer](https://stackoverflow.com/a/38464306/2573061)? "I'll suppose you have a constant variable in each case called 'dummy' (or alternatively, it can be another variable to join by)" -- in your case the "alternatively" — C8H10N4O2, Sep 26 '17 at 18:17

score 3 · Answer 1 · edited Sep 27 '17 at 15:03

3

Thank god there is dplyr. The following code joins df1 which has unique identifiers, and keeps only these rows (filter) which matches condition date >= date.1.

Be careful, because by default when you have identical column names in both data.frames, dplyr will join by all of them. Then we have to specify by parameter and add suffix to variables names to differ identical column names.

library(dplyr)
library(magrittr)

df2 %>%
 left_join(df1, by = "id", suffix=c("",".2") ) %>%
 filter( date > date.2) %>%
 select( -date.2 )

#  id       date
# 1 id1 2016-06-23
# 2 id1 2016-06-24
# 3 id3 2016-06-27

edited Sep 27 '17 at 15:03

halfer

19,824
17
99
186

answered Sep 26 '17 at 17:44

GoGonzo

2,637
1
18
25

2

This result differs from OPs expected output – pogibas Sep 26 '17 at 17:49
3

replace `>=` with `>` because OP specifies "after" not "on or after" – C8H10N4O2 Sep 26 '17 at 17:51

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

3

Use non-equi-join in `data.table`

library(data.table)

setDT(df1)
setDT(df2)

setnames(df1, 'date','date1') # disambiguate for conditional join

df1[df2, on=.(id, date1<date), nomatch=0]

Returns:

  id      date1
1: id1 2016-06-24
2: id3 2016-06-27

On large datasets I expect this approach to be faster than any approach which uses dplyr and/or a cartesian join followed by a filter.

edited Jun 20 '20 at 09:12

Community

1
1

answered Sep 26 '17 at 18:01

C8H10N4O2

18,312
8
98
134

Why the downvote? This is probably the most efficient solution offered... – C8H10N4O2 Sep 26 '17 at 18:10

score 1 · Accepted Answer · answered Sep 26 '17 at 17:58

1

Solution using data.table:

library(data.table)
setDT(d1)
setDT(d2)
merge(d1, d2, "id")[date.y > date.x, .(id, date = date.y)]

    id       date
1: id1 2016-06-24
2: id3 2016-06-27

answered Sep 26 '17 at 17:58

pogibas

27,303
19
84
117

@C8H10N4O2 can you give a link? – pogibas Sep 26 '17 at 18:06
1

There is an example [here](https://stackoverflow.com/a/38464382/2573061) that is conditional only (no equi-join on id). See [my answer below](https://stackoverflow.com/a/46433040/2573061) for example w/ both. I still think this question is a dupe though. – C8H10N4O2 Sep 26 '17 at 18:07

score 0 · Answer 4 · answered Sep 26 '17 at 18:55

0

So your first dataframe is basically an index. Assuming that index is called df1, and your second dataframe that you want to filter is df2, I would do this using dplyr:

library(dplyr)

df.result <- left_join(df2, df1, by = "id") %>% 
   filter(date.x > date.y) %>% 
   select(-date.y)

eta: this would be the result:

   id     date.x
 1 id1 2016-06-24
 2 id3 2016-06-27

answered Sep 26 '17 at 18:55

kpress

136
6

This is a duplicate of Gonzo's answer – C8H10N4O2 Sep 26 '17 at 22:47
Well I didn't see that before I posted mine, but it's not an exact duplicate. Mine is simpler. – kpress Sep 27 '17 at 18:00

score 0 · Answer 5 · answered Jul 12 '23 at 16:30

join_by was added in dplyr 1.1.0 for more advanced join specifications:

library(dplyr)

inner_join(df2, df1, by = join_by(id == id, date > date)) |>
  select(id, date = date.x)

Note the equality condition and inequality conditions both have the same variable names. They are distinguished by the order of the data frames. The LHS refers to the first data frame (df2 in this case) and the RHS to the second data frame (df1 in this case).

Filter a Dataframe by Another Dataframe

5 Answers5

Use non-equi-join in `data.table`

Linked

Filter a Dataframe by Another Dataframe

5 Answers5

Use non-equi-join in data.table

Linked

Use non-equi-join in `data.table`