0

I have two tables. Table one has an id column and a full_name column. Table two has only a full name column but the names are near-matches and not full matches. I would like to apply the id column to the second table so that the ids apply to the correct near-match names. My tables look like this:

df1 <- read.table(text="
id  full_name
1   'Tom Jones'
2   'Jim James'", header = TRUE)

df2 <- read.table(text="
full_name
'Tom Jones Jr.'
'Jim James Ii'", header = TRUE)

I would like Table 2 to end up like this:

id full_name
1 Tom Jones Jr.
2 Jim James Ii

I have tried stringdist_join and would like to ideally do in in dplyr but I'm not sure thats possible. Any help would be appreciated. Thanks.

Mark
  • 7,785
  • 2
  • 14
  • 34
  • Hi Ben! Welcome to StackOverflow. `stringdist_left_join(df1, df2, by = "full_name", method = "jw", max_dist = 0.3)` works! But in my experience, the fuzzyjoin package functions are extremely temperamental, and fail to work on more than a couple of rows. So YMMV – Mark Aug 22 '23 at 13:57
  • Mark! Thank you! This works! I'm trying to keep my code clean. How would I do this in dplyr mutating df2 and only transfer over the pid instead of all the info from df1. I know I can just do it outside of dplyr and delete un-needed rows but it would be great if I could leave it as clean as possible, if possible. Thank you! – Ben Blackburn Aug 22 '23 at 14:42
  • I think you can then just select the relevant columns Ben! I would post an answer but this isn't my area of expertise, and there are a half dozen different methods of joining in stringdist_left_join, and I honestly couldn't tell you which one is best! – Mark Aug 22 '23 at 14:44
  • `stringdist_left_join(df2, select(df1, full_name, id), by = "full_name", method = "jw", max_dist = 0.3)` would take each row of `df2` and link it to all matching full_name+id from `df1`, without bringing over any other columns. – Jon Spring Aug 22 '23 at 16:56

0 Answers0