1

I have two dataframes from different sources that refer to the same people, but due to errors from self-reported data, the dates may be slightly off.

Example data:

df1 <- data.frame(name= c("Ann", "Betsy", "Charlie", "Dave"), 
                  dob= c(as.Date("2000-01-01", "%Y-%m-%d"), as.Date("2001-01-01", "%Y-%m-%d"), 
                         as.Date("2002-01-01", "%Y-%m-%d"), as.Date("2003-01-01", "%Y-%m-%d")), 
                  stringsAsFactors=FALSE)

df2 <- data.frame(name= c("Ann", "Charlie", "Elmer", "Fred"), 
                  dob= c(as.Date("2000-01-11", "%Y-%m-%d"), as.Date("2004-01-01", "%Y-%m-%d"), 
                         as.Date("2001-01-01", "%Y-%m-%d"), as.Date("2006-01-01", "%Y-%m-%d")), 
                  stringsAsFactors=FALSE)

I want to match by exact name, with dplyr like:

library(dplyr)
inner_join(df1, df2, by = c("name"))

name    dob.x   dob.y
Ann     2000-01-01  2000-01-11
Charlie     2002-01-01  2004-01-01

but also by dates of birth (dob) within 30 days, with the fuzzyjoin package like:

library(fuzzyjoin)

difference_inner_join(df1, df2, by=c("dob"), max_dist = 30)

name.x  dob.x   name.y  dob.y
Ann     2000-01-01  Ann     2000-01-11
Betsy   2001-01-01  Elmer   2001-01-01

But combine both criteria, so that only Ann would be returned

Prradep
  • 5,506
  • 5
  • 43
  • 84
pgcudahy
  • 1,542
  • 13
  • 36

2 Answers2

2

Relying on dplyr and base R alone. I rarely need fuzzy joins. inner_join and subsequently filter usually is enough

inner_join(df1, df2, by = c("name")) %>% 
  filter(abs(difftime(dob.x,dob.y, units = "days"))<30)

result

  name      dob.x      dob.y
 1  Ann 2000-01-01 2000-01-11
Community
  • 1
  • 1
Kresten
  • 1,758
  • 12
  • 18
1

Well you could do this:

 difference_inner_join(df1, df2, by=c("dob"), max_dist = 30) %>%
  filter(name.x == name.y)

  name.x      dob.x name.y      dob.y
1    Ann 2000-01-01    Ann 2000-01-11

Humpelstielzchen
  • 6,126
  • 3
  • 14
  • 34
  • Perfect! I had implemented this earlier with loops and it took > 5 minutes on my real data with >100,000 rows. Your solution works in <15 seconds. – pgcudahy Apr 26 '19 at 07:52
  • I was hoping to find a function call to put it all into `fuzzy_inner_join`s `match_fun=list(...)` but it just seems to take `=`, `>=`, `<=` aso.. But glad it helped. – Humpelstielzchen Apr 26 '19 at 07:58
  • I think your solution is more readable than stuffing it all in the inner_join function – pgcudahy Apr 26 '19 at 08:09