1

I need to match two datasets on three variables. Two of the three variables do not present misspellings (by design). The fuzzy match is required only for the third variable.

The standard fuyyzmerge generate some issues by fuzzy-joining all three variables.

Is there a way to specify which of the three should be fuzzy matched and which exact-matched?

reproducible example:

dataset_1 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB")), c("var_1", "var_2", "var_3"))
dataset_2 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBC"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))


merged <- stringdist_join(dataset_1, dataset_2, 
                                  by=c("var_1","var_2","var_3"),  
                                  max_dist = 2, 
                                  method = c("soundex"),
                                  mode = "full", 
                                  ignore_case = FALSE)

Ideal result:

merged <- setNames(data.frame(rep(1995,4),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))
MCS
  • 1,071
  • 9
  • 23
  • Technically yes but it does a cartesian product on all `by` variables so it might or might not be advisable to do it in one operation, how big is your data ? – moodymudskipper Mar 20 '19 at 12:44
  • Given that I manage to break it down after exact-matching, I remain with around a thousand observations to fuzzy march. – MCS Mar 20 '19 at 13:05
  • `stringdist_join` is a wrapper around `fuzzy_join`, and `fuzzy_join` has a `match_fun` argument that can either be a unique function or a list of functions as long as your `by` argument, so I think you can use `fuzzy_full_join` with ``match_fun = list(`==`, `==`, function(x,y) stringdist::stringdist(x,y, "soundex") < 2)`` . It would help if you could make your example reproducible and provide your expected output. – moodymudskipper Mar 20 '19 at 13:21
  • Please find the reproducible code and expected outcome in the original post. – MCS Mar 20 '19 at 13:36

1 Answers1

1

stringdist_join is a wrapper around fuzzy_join, and fuzzy_join has a match_fun argument that can either be a unique function or a list of functions as long as your by argument, so we can use fuzzy_full_join (which is just fuzzy_join with mode= "full"):

library(fuzzyjoin)
res <- fuzzy_full_join(dataset_1, dataset_2, 
                by=c("var_1","var_2","var_3"),
                list(`==`, `==`, function(x,y) stringdist::stringdist(x,y, "soundex") <= 2))
res
#   var_1.x var_2.x var_3.x var_1.y var_2.y var_3.y var_4
# 1    1995      AA    AAAA    1995      AA    AAAA     A
# 2    1996      AA    AAAA    1996      AA    AAAA     B
# 3    1995      BB    BBBB    1995      BB    BBBB     C
# 4    1996      BB    BBBB    1996      BB    BBBC     D

Because of the nature of fuzzy matching, values are not generally the same on the lhs and rhs, so we end up with two sets of by columns, if you want to preserve only the lhs we can do :

library(dplyr)
res %>% 
  select(-ends_with(".y")) %>%
  rename_all(~sub("\\.x$","",.))

#   var_1 var_2 var_3 var_4
# 1  1995    AA  AAAA     A
# 2  1996    AA  AAAA     B
# 3  1995    BB  BBBB     C
# 4  1996    BB  BBBB     D
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • How can this be corrected so that it ignores fuzzmymatch when var_1.x or var_1.y is NA or NULL? And by extension for all variables which it is considering? – Olivier Mar 16 '23 at 16:36
  • I think you should ask a new question with sample data that reflects your problem, it will be easier – moodymudskipper Mar 16 '23 at 17:51