I want to be able to fuzzy match one column and exact match another column.

Say df1 looks like this:

[image: df1 table]

And df2 looks like this:

[image: df2 table]

I want to fuzzy match "Name" but exact match "Year", so "Ashley" and "Ashlee" in the same year would be a match. This is what I have so far:

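library(dplyr)
library(fuzzyjoin)
library(stringdist)

res <- fuzzy_left_join(
  df1,
  df2,
  by = c("Year", "Name"),
  # Year must match exactly; Name may differ by a Levenshtein distance of up to 3
  match_fun = list(`==`, function(x, y) stringdist(tolower(x), tolower(y), method = "lv") <= 3)
)
res %>%
  select(Year = Year.x, everything(), -Year.y)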

It appears to be over-matching, though. Not sure what's going on.

hy9fesh
  • What package are you using? – prosoitos Oct 18 '19 at 00:38
  • I'm using the fuzzyjoin package. – hy9fesh Oct 19 '19 at 16:11
  • OK, thanks. You should include this info when you ask a question (for instance by adding `library(fuzzyjoin)` to your code). Otherwise, people who aren't familiar with the package can't guess and can't help you – prosoitos Oct 20 '19 at 18:56
  • That said, I just realized that you had tagged it with [fuzzyjoin]. So I could have guessed :P – prosoitos Oct 20 '19 at 18:57
  • Looking at the documentation for `fuzzy_join()`, there is the argument `match_fun` which allows you to give a matching function. If the default is over-matching compared to the outcome you want to get, I guess you can play with that argument until you get the matching you want – prosoitos Oct 20 '19 at 19:03
  • If you provide some sample data (not with a picture but in a way that can be replicated), people can try to help you – prosoitos Oct 20 '19 at 19:05
  • An alternative approach, if using `match_fun` is not easy, would be to transform your `Name` column in `df2` with `gsub()` and a regexp, then use `dplyr::left_join()`. If you provide some sample data, I'll be happy to do that – prosoitos Oct 20 '19 at 19:07
  • Actually, I think that you are already using `match_fun` with: `list('==', function(x,y) stringdist(tolower(x), tolower(y), method="lv") <= 3)`. So this is what you want to play with until you get the proper matching. I don't know what `method="lv"` is, so personally I would do it with `gsub()`. – prosoitos Oct 20 '19 at 19:10
  • Did you solve your fuzzyjoin problem? – Arthur Yip Dec 08 '20 at 19:57

1 Answer

It seems you are on the right track (hard to tell without your data or you showing us your result!)

fuzzy_left_join will return every pair of rows whose years are equal and whose names are within a string distance of 3, not just the closest pair, which may be the "overmatching" you describe.
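To see how loose a threshold of 3 is for short names, here is a small check with stringdist. The names other than "Ashley" and "Ashlee" are made up for illustration, since your data was not posted in a reproducible form.

```r
library(stringdist)

# Levenshtein distance from "ashley" to three candidate names.
# "ashton" and "ashly" are hypothetical examples, not from the question's data.
d <- stringdist("ashley", c("ashlee", "ashton", "ashly"), method = "lv")
d
# → 1 3 1
```

With a cutoff of `<= 3`, all three candidates count as matches, even though "ashton" differs from "ashley" in half of its letters. Lowering the cutoff is one way to reduce the extra matches.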

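To keep only the best match for each row of df1, compute the string distance as a column (called dist here) and then use %>% group_by(Year, Name) %>% slice_min(dist), adjusting the column names to those in the joined result (for example Year.x and Name.x).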
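A minimal sketch of that idea, assuming made-up sample data (your tables were images, so `df1`, `df2`, the `Score` column, and the helper `name_dist` are all hypothetical). The join itself does not add a `dist` column, so it is computed afterwards with `mutate()`:

```r
library(dplyr)
library(fuzzyjoin)
library(stringdist)

# Hypothetical sample data standing in for the tables in the question.
df1 <- tibble(Year = c(2019, 2019, 2020),
              Name = c("Ashley", "Jordan", "Ashley"))
df2 <- tibble(Year = c(2019, 2019, 2020),
              Name = c("Ashlee", "Ashton", "Ashly"),
              Score = c(1, 2, 3))

# Hypothetical helper: case-insensitive Levenshtein distance.
name_dist <- function(x, y) stringdist(tolower(x), tolower(y), method = "lv")

# Exact match on Year, fuzzy match on Name (distance of 3 or less).
res <- fuzzy_left_join(
  df1, df2,
  by = c("Year", "Name"),
  match_fun = list(`==`, function(x, y) name_dist(x, y) <= 3)
)

# Keep only the closest df2 name for each df1 row.
best <- res %>%
  mutate(dist = name_dist(Name.x, Name.y)) %>%
  group_by(Year.x, Name.x) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Year = Year.x, Name = Name.x, Match = Name.y, dist,
         everything(), -Year.y)
```

In this sample, the 2019 "Ashley" row matches both "Ashlee" and "Ashton" in `res`, and only "Ashlee" survives in `best`. `with_ties = FALSE` picks a single row when two candidates are equally close; drop it if you would rather see all ties.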

Arthur Yip