
I am working with the R programming language. Does anyone know if the "dplyr" library has any way of running "fuzzy joins"?

For instance, suppose I have the following data. I am interested in performing a "fuzzy join" on it and passing this through a SQL connection:

library(dplyr)


table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"),
                     id2 = c("11", "12", "14", "13"),
                     date_1 = c("2010-01-31", "2010-01-31", "2015-01-31", "2018-01-31"))

table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id1 = c("0123", "1233", "125  .", "125_"),
                     id2 = c("111", "112", "14", "113"),
                     date_2 = c("2009-01-31", "2010-01-31", "2010-01-31", "2010-01-31"),
                     date_3 = c("2011-01-31", "2010-01-31", "2020-01-31", "2020-01-31"))


table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

My question: Does anyone know if the code below has an equivalent in "dbplyr" (instead of "fuzzyjoin")?

library(fuzzyjoin)

join = stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
  # the date columns were converted to factors above, so coerce them before comparing
  filter(as.Date(date_1) >= as.Date(date_2), as.Date(date_1) <= as.Date(date_3))

Thanks!

stats_noob

1 Answer


You can perform a full join and then calculate a string edit distance of your choice. You need more than the dplyr package, though, as you probably do not want to implement the string-distance calculation from scratch. The package of choice would probably be stringdist, which is in fact what the fuzzyjoin function you are asking about uses under the hood.

library(dplyr)
library(stringdist)

# generate all possible combinations (a cross join) with dplyr
dplyr::full_join(table_1, table_2, by = character()) %>%
    # calculate the string editing distance and use it to filter
    dplyr::filter(stringdist::stringdist(id2.x, id2.y) < 3)

I have no idea if this will be more or less performant than the fuzzyjoin implementation. If you want to speed up the process or are facing RAM problems, you could segment table_1, full join each segment with the entire table_2, run the calculation (e.g. on multiple cores), and bind the results, as sketched below.
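
For illustration, here is a minimal sketch of that segmenting idea, run sequentially with lapply(); the chunk size of 2 rows and the variable names are just placeholders, and you could swap in something like parallel::mclapply() if you want to use multiple cores:

library(dplyr)
library(stringdist)

# split table_1 into row chunks -- the chunk size of 2 is arbitrary for this example
chunks <- split(table_1, ceiling(seq_len(nrow(table_1)) / 2))

# cross join each chunk with the full table_2, filter, then stack the pieces
result <- lapply(chunks, function(chunk) {
  dplyr::full_join(chunk, table_2, by = character()) %>%
    dplyr::filter(stringdist::stringdist(id2.x, id2.y) < 3)
}) %>%
  dplyr::bind_rows()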

DPH