2

Supposedly this question has already been answered. But the user who flagged my question failed to test the solution and the cited question does not work for my problem.

I have found questions on how to filter a dataframe using another list but I have not found something that shows how to filter a dataframe using another dataframe.

I have two dataframes and the first one can be thought of as a key of ID's and dates.

   id       date
1 id1 2016-06-23
2 id2 2016-06-25
3 id3 2016-06-23
4 id4 2016-06-25
5 id5 2016-06-27

structure(list(id = structure(1:5, .Label = c("id1", "id2", "id3", 
"id4", "id5"), class = "factor"), date = structure(c(16975, 16977, 
16975, 16977, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

I then have a second dataframe with ID's and dates and I would like to filter the second dataframe to only return rows that are after the date for the ID in the first row.

Here is the second dataframe:

   id       date
1 id1 2016-06-20
2 id1 2016-06-23
3 id1 2016-06-24
4 id2 2016-06-23
5 id3 2016-06-27

structure(list(id = structure(c(1L, 1L, 1L, 2L, 3L), .Label = c("id1", 
"id2", "id3"), class = "factor"), date = structure(c(16972, 16975, 
16976, 16975, 16979), class = "Date")), .Names = c("id", "date"
), row.names = c(NA, -5L), class = "data.frame")

And this is what the results would look like:

   id       date
1 id1 2016-06-24
2 id3 2016-06-27
pogibas
  • 27,303
  • 19
  • 84
  • 117
DataTx
  • 1,839
  • 3
  • 26
  • 49
  • 3
    Possible duplicate of [Join two datasets based on an inequality condition](https://stackoverflow.com/questions/32893022/join-two-datasets-based-on-an-inequality-condition) – C8H10N4O2 Sep 26 '17 at 17:53
  • Did you read [this answer](https://stackoverflow.com/a/38464306/2573061)? "I'll suppose you have a constant variable in each case called 'dummy' (or alternatively, it can be another variable to join by)" -- in your case the "alternatively" – C8H10N4O2 Sep 26 '17 at 18:17

5 Answers5

3

Thank god there is dplyr. The following code joins df1 which has unique identifiers, and keeps only these rows (filter) which matches condition date >= date.1.

Be careful, because by default when you have identical column names in both data.frames, dplyr will join by all of them. Then we have to specify by parameter and add suffix to variables names to differ identical column names.

library(dplyr)
library(magrittr)

df2 %>%
 left_join(df1, by = "id", suffix=c("",".2") ) %>%
 filter( date > date.2) %>%
 select( -date.2 )

#  id       date
# 1 id1 2016-06-23
# 2 id1 2016-06-24
# 3 id3 2016-06-27
halfer
  • 19,824
  • 17
  • 99
  • 186
GoGonzo
  • 2,637
  • 1
  • 18
  • 25
3

Use non-equi-join in data.table

library(data.table)

setDT(df1)
setDT(df2)

setnames(df1, 'date','date1') # disambiguate for conditional join

df1[df2, on=.(id, date1<date), nomatch=0]

Returns:

  id      date1
1: id1 2016-06-24
2: id3 2016-06-27

On large datasets I expect this approach to be faster than any approach which uses dplyr and/or a cartesian join followed by a filter.

Community
  • 1
  • 1
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
1

Solution using data.table:

library(data.table)
setDT(d1)
setDT(d2)
merge(d1, d2, "id")[date.y > date.x, .(id, date = date.y)]

    id       date
1: id1 2016-06-24
2: id3 2016-06-27
pogibas
  • 27,303
  • 19
  • 84
  • 117
  • @C8H10N4O2 can you give a link? – pogibas Sep 26 '17 at 18:06
  • 1
    There is an example [here](https://stackoverflow.com/a/38464382/2573061) that is conditional only (no equi-join on id). See [my answer below](https://stackoverflow.com/a/46433040/2573061) for example w/ both. I still think this question is a dupe though. – C8H10N4O2 Sep 26 '17 at 18:07
0

So your first dataframe is basically an index. Assuming that index is called df1, and your second dataframe that you want to filter is df2, I would do this using dplyr:

library(dplyr)

df.result <- left_join(df2, df1, by = "id") %>% 
   filter(date.x > date.y) %>% 
   select(-date.y) 

eta: this would be the result:

   id     date.x
 1 id1 2016-06-24
 2 id3 2016-06-27
kpress
  • 136
  • 6
0

join_by was added in dplyr 1.1.0 for more advanced join specifications:

library(dplyr)

inner_join(df2, df1, by = join_by(id == id, date > date)) |>
  select(id, date = date.x)

Note the equality condition and inequality conditions both have the same variable names. They are distinguished by the order of the data frames. The LHS refers to the first data frame (df2 in this case) and the RHS to the second data frame (df1 in this case).

LMc
  • 12,577
  • 3
  • 31
  • 43