Here is example data with person full names from two tables that need to be left-joined together, with df1
and the left-hand side table, and df2
as the right-hand side:
df1 <- data.frame(fullName = 'Michael Gadson', age = 53) %>%
rbind(data.frame(fullName = 'Mike Gardnero', age = 43)) %>%
rbind(data.frame(fullName = 'Nicholas Richards', age = 13)) %>%
rbind(data.frame(fullName = 'Mikey Richards', age = 53)) %>%
rbind(data.frame(fullName = 'DeAndre Jamison', age = 28)) %>%
rbind(data.frame(fullName = 'Anthony Allison', age = 21)) %>%
rbind(data.frame(fullName = 'Ricky Smith', age = 82)) %>%
rbind(data.frame(fullName = 'Smith Rickie', age = 60)) %>%
rbind(data.frame(fullName = 'Johnny Williams', age = 60))
df2 <- data.frame(playerName = 'Mike Gadson', color = 'red') %>%
rbind(data.frame(playerName = 'Anthony Allison', color = 'green')) %>%
rbind(data.frame(playerName = 'Mike Gardnero', color = 'purple')) %>%
rbind(data.frame(playerName = "De Andre' Jamison", color = 'orange')) %>%
rbind(data.frame(playerName = 'Nicholas Richards III', color = 'yellow')) %>%
rbind(data.frame(playerName = 'John Kind', color = 'grey')) %>%
rbind(data.frame(playerName = 'Mike Richards', color = 'white')) %>%
rbind(data.frame(playerName = 'Rick Smith', color = 'blue')) %>%
rbind(data.frame(playerName = 'Smith Rickie', color = 'black')) %>%
rbind(data.frame(playerName = 'Anthony Albados', color = 'violet'))
output_df <- data.frame(fullName = 'Michael Gadson', age = 53, playerName = 'Mike Gadson', color = 'red') %>%
rbind(data.frame(fullName = 'Mike Gardnero', age = 43, playerName = 'Mike Gardnero', color = 'purple')) %>%
rbind(data.frame(fullName = 'Nicholas Richards', age = 13, playerName = 'Nicholas Richards III', color = 'yellow')) %>%
rbind(data.frame(fullName = 'Mikey Richards', age = 53, playerName = 'Mike Richards', color = 'white')) %>%
rbind(data.frame(fullName = 'DeAndre Jamison', age = 28, playerName = "De Andre' Jamison", color = 'orange')) %>%
rbind(data.frame(fullName = 'Anthony Allison', age = 21, playerName = 'Anthony Allison', color = 'green')) %>%
rbind(data.frame(fullName = 'Ricky Smith', age = 82, playerName = 'Rick Smith', color = 'blue')) %>%
rbind(data.frame(fullName = 'Smith Rickie', age = 60, playerName = 'Smith Rickie', color = 'black')) %>%
rbind(data.frame(fullName = 'Johnny Williams', age = 60, playerName = NA, color = NA))
> output_df
fullName age playerName color
1 Michael Gadson 53 Mike Gadson red
2 Mike Gardnero 43 Mike Gardnero purple
3 Nicholas Richards 13 Nicholas Richards III yellow
4 Mikey Richards 53 Mike Richards white
5 DeAndre Jamison 28 De Andre' Jamison orange
6 Anthony Allison 21 Anthony Allison green
7 Ricky Smith 82 Rick Smith blue
8 Smith Rickie 60 Smith Rickie black
9 Johnny Williams 60 <NA> <NA>
A few comments on the tricky situations / edge cases here:
- This is a left join, and so the
output_df
should have the same number of rows that the left-hand side dataframedf1
has. - The left join shouldn't get mixed up by similar names.
Michael Gadson
-->Mike Gadson
, not one of the other Mike names. - The left join shouldn't get mixed up by reversed names. (
Ricky Smith
-->Rick Smith
, notSmith Rickie
) - The left join shouldn't get mixed up by
III
suffix to names, or by extra spaces or symbols (De Andre'
vsDeAndre
)
Edit: I've tried the following with the following output:
zed <- fuzzyjoin::stringdist_left_join(x=df1, y=df2, max_dist = 0.3, by=c('fullName'='playerName'), method = 'jaccard')
> zed
fullName age playerName color
1 Michael Gadson 53 Mike Gadson red
2 Mike Gardnero 43 Mike Gadson red
3 Mike Gardnero 43 Mike Gardnero purple
4 Nicholas Richards 13 Nicholas Richards III yellow
5 Mikey Richards 53 Mike Richards white
6 DeAndre Jamison 28 De Andre' Jamison orange
7 Anthony Allison 21 Anthony Allison green
8 Richard Smith 82 Rich Smith blue
9 Smith Rickie 60 Rich Smith blue
10 Smith Rickie 60 Smith Rickie black
11 Johnny Williams 60 <NA> <NA>
It does an okay job, but it is still imperfect. Most notably, Mike Gardnero
and Smith Rickie
are duplicated when using jaccard
with max_dist
of 0.3, because there are multiple rows on the right-hand side that meet the similarity criteria... however, our output should not create these duplicates (perhaps keep the right-hand side value with the highest similarity).