Fuzzy left join person full names in R - handling tricky edge cases (cannot install fuzzyjoin)

Question

Here is example data with person full names from two tables that need to be left-joined together, with df1 and the left-hand side table, and df2 as the right-hand side:

df1 <- data.frame(fullName = 'Michael Gadson', age = 53) %>%
  rbind(data.frame(fullName = 'Mike Gardnero', age = 43)) %>%
  rbind(data.frame(fullName = 'Nicholas Richards', age = 13)) %>%
  rbind(data.frame(fullName = 'Mikey Richards', age = 53)) %>%
  rbind(data.frame(fullName = 'DeAndre Jamison', age = 28)) %>%
  rbind(data.frame(fullName = 'Anthony Allison', age = 21)) %>%
  rbind(data.frame(fullName = 'Ricky Smith', age = 82)) %>%
  rbind(data.frame(fullName = 'Smith Rickie', age = 60)) %>%
  rbind(data.frame(fullName = 'Johnny Williams', age = 60))

df2 <- data.frame(playerName = 'Mike Gadson', color = 'red') %>%
  rbind(data.frame(playerName = 'Anthony Allison', color = 'green')) %>%
  rbind(data.frame(playerName = 'Mike Gardnero', color = 'purple')) %>%
  rbind(data.frame(playerName = "De Andre' Jamison", color = 'orange')) %>%
  rbind(data.frame(playerName = 'Nicholas Richards III', color = 'yellow')) %>%
  rbind(data.frame(playerName = 'John Kind', color = 'grey')) %>%
  rbind(data.frame(playerName = 'Mike Richards', color = 'white')) %>%
  rbind(data.frame(playerName = 'Rick Smith', color = 'blue')) %>%
  rbind(data.frame(playerName = 'Smith Rickie', color = 'black')) %>%
  rbind(data.frame(playerName = 'Anthony Albados', color = 'violet'))

output_df <- data.frame(fullName = 'Michael Gadson', age = 53, playerName = 'Mike Gadson', color = 'red') %>%
  rbind(data.frame(fullName = 'Mike Gardnero', age = 43, playerName = 'Mike Gardnero', color = 'purple')) %>%
  rbind(data.frame(fullName = 'Nicholas Richards', age = 13, playerName = 'Nicholas Richards III', color = 'yellow')) %>%
  rbind(data.frame(fullName = 'Mikey Richards', age = 53, playerName = 'Mike Richards', color = 'white')) %>%
  rbind(data.frame(fullName = 'DeAndre Jamison', age = 28, playerName = "De Andre' Jamison", color = 'orange')) %>%
  rbind(data.frame(fullName = 'Anthony Allison', age = 21, playerName = 'Anthony Allison', color = 'green')) %>%
  rbind(data.frame(fullName = 'Ricky Smith', age = 82, playerName = 'Rick Smith', color = 'blue')) %>%
  rbind(data.frame(fullName = 'Smith Rickie', age = 60, playerName = 'Smith Rickie', color = 'black')) %>%
  rbind(data.frame(fullName = 'Johnny Williams', age = 60, playerName = NA, color = NA))

> output_df
           fullName age            playerName  color
1    Michael Gadson  53           Mike Gadson    red
2     Mike Gardnero  43         Mike Gardnero purple
3 Nicholas Richards  13 Nicholas Richards III yellow
4    Mikey Richards  53         Mike Richards  white
5   DeAndre Jamison  28     De Andre' Jamison orange
6   Anthony Allison  21       Anthony Allison  green
7       Ricky Smith  82            Rick Smith   blue
8      Smith Rickie  60          Smith Rickie  black
9   Johnny Williams  60                  <NA>   <NA>

A few comments on the tricky situations / edge cases here:

This is a left join, and so the output_df should have the same number of rows that the left-hand side dataframe df1 has.
The left join shouldn't get mixed up by similar names. Michael Gadson --> Mike Gadson, not one of the other Mike names.
The left join shouldn't get mixed up by reversed names. (Ricky Smith --> Rick Smith, not Smith Rickie)
The left join shouldn't get mixed up by III suffix to names, or by extra spaces or symbols (De Andre' vs DeAndre)

Edit: I've tried the following with the following output:

zed <- fuzzyjoin::stringdist_left_join(x=df1, y=df2, max_dist = 0.3, by=c('fullName'='playerName'), method = 'jaccard')

> zed
            fullName age            playerName  color
1     Michael Gadson  53           Mike Gadson    red
2      Mike Gardnero  43           Mike Gadson    red
3      Mike Gardnero  43         Mike Gardnero purple
4  Nicholas Richards  13 Nicholas Richards III yellow
5     Mikey Richards  53         Mike Richards  white
6    DeAndre Jamison  28     De Andre' Jamison orange
7    Anthony Allison  21       Anthony Allison  green
8      Richard Smith  82            Rich Smith   blue
9       Smith Rickie  60            Rich Smith   blue
10      Smith Rickie  60          Smith Rickie  black
11   Johnny Williams  60                  <NA>   <NA>

It does an okay job, but it is still imperfect. Most notably, Mike Gardnero and Smith Rickie are duplicated when using jaccard with max_dist of 0.3, because there are multiple rows on the right-hand side that meet the similarity criteria... however, our output should not create these duplicates (perhaps keep the right-hand side value with the highest similarity).

Is there a reason you are trying to compile the stringdist package? What OS are you using? Does `install.packages("stringdist", type="binary")` work? Or is there any other message about why exactly the compilation failed? — MrFlick, Sep 15 '20 at 02:35
I am using a mac, and `install.packages("stringdist", type="binary")` did work, as did the reinstall of fuzzy join — Canovice, Sep 15 '20 at 02:44
Not helpful at all but ... **ouch**. The use of `data.frame(...) %>% rbind(data.frame(...)) %>% rbind(data.frame(...))` is about the most inefficient, *anti*-idiomatic way of doing that in R. I cannot conceive of a situation that would justify that. (Really, I'm not judging, but ... dang, I really cannot think of a way that is more *punished* by R's memory model.) — r2evans, Sep 15 '20 at 04:25
If you wan't to filter by best fit, you can use this `fuzzyjoin::stringdist_left_join(x=df1, y=df2, max_dist = .3, by=c('fullName'='playerName'), method = 'jaccard', distance_col = "dist") %>% group_by(fullName) %>% filter(dist == min(dist) | is.na(dist)) ` . Or try different methods for a more accurate joyn. For your example, I got your desired output just with `method = "lcs", max_dist = 5". — tamtam, Sep 15 '20 at 14:52
@r2evans thanks for the feedback on how inefficient it was to create a... ...10-row dataframe for a reproducible example for stackoverflow. — Canovice, Sep 15 '20 at 17:19
@tamtam thanks for the input. i didn't realize that `distance_col` was the distance score value, so having this is helpful as well. I'll continue to play around with different methods. — Canovice, Sep 15 '20 at 17:22
@Canovice, my apologies that it came across critical, it does come across harshly. I don't always nit-pick code, but there are some coding patterns that will have significant impacts on performance. — r2evans, Sep 15 '20 at 17:42
For sure. I usually create a list and then use `do.call(rbind, my_list)` when I have very large loops to create dataframes — Canovice, Sep 15 '20 at 17:50

Fuzzy left join person full names in R - handling tricky edge cases (cannot install fuzzyjoin)

0 Answers0

Linked