Match two columns based on string distance in R

Question

I have two very large dataframes containing names of people. The two dataframes report different information on these people (i.e. df1 reports data on health status and df2 on socio-economic status). A subset of people appears in both dataframes, and this is the sample I am interested in. I need to create a new dataframe that includes only the people appearing in both datasets. There are, however, small differences in the names, mostly due to typos.

My data is as follows:

df1
name | smoker | age
"Joe Smith" | Yes | 43
"Michael Fagin" | Yes | 35
"Ellen McFarlan" | No | 55
...
...

df2
name | occupation | location
"Joe Smit" | Postdoc | London
"Joan Evans" | IT consultant | Bristol
"Michael Fegin" | Lawyer | Liverpool
...
...

What I need is a third dataframe, df3, with the following information:

df3
name1 | name2 | distance | smoker | age | occupation | location 
"Joe Smith" | "Joe Smit" | a measure of their Jaro distance | Yes | 43 | Postdoc | London
"Michael Fagin" | "Michael Fegin" | a measure of their Jaro distance | Yes | 35 | Lawyer | Liverpool
...
...

So far I have worked with the stringdist package to get a vector of possible matches, but I am struggling to use this information to create a new dataframe with the information I need. Many thanks in advance should anyone have an idea for this!

It would be helpful if you provide the sample data using `dput(x)`. — Nad Pat, Mar 01 '22 at 11:34
Yes of course, I performed a match using `match <- amatch(df1$name, df2$name, maxDist=0.1, method="jw")`, and my vector match is as follows: `c(1, 3, NA, ...)` — srocco, Mar 01 '22 at 11:47
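For reference, a match vector like the one in the comment above can be used directly as an index into df2. The following is only a sketch, assuming the sample rows shown in the question; `idx` and `keep` are illustrative names, and `amatch` returns just the single closest candidate for each row of df1.

```r
library(stringdist)

# Sample rows copied from the question
df1 <- data.frame(
  name   = c("Joe Smith", "Michael Fagin", "Ellen McFarlan"),
  smoker = c("Yes", "Yes", "No"),
  age    = c(43, 35, 55)
)
df2 <- data.frame(
  name       = c("Joe Smit", "Joan Evans", "Michael Fegin"),
  occupation = c("Postdoc", "IT consultant", "Lawyer"),
  location   = c("London", "Bristol", "Liverpool")
)

# Position in df2 of the closest name for each row of df1 (NA if none within maxDist)
idx  <- amatch(df1$name, df2$name, maxDist = 0.1, method = "jw")
keep <- !is.na(idx)

# Keep matched rows only and compute the pairwise Jaro distance for each pair
df3 <- data.frame(
  name1      = df1$name[keep],
  name2      = df2$name[idx[keep]],
  distance   = stringdist(df1$name[keep], df2$name[idx[keep]], method = "jw"),
  smoker     = df1$smoker[keep],
  age        = df1$age[keep],
  occupation = df2$occupation[idx[keep]],
  location   = df2$location[idx[keep]]
)
```

With `method = "jw"` and the default `p = 0`, stringdist computes the plain Jaro distance, which is what the question asks for.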

score 0 · Accepted Answer · answered Mar 01 '22 at 12:45

library(tidyverse)
library(fuzzyjoin)

df1 <- tibble(
  name = c("Joe Smith", "Michael Fagin"),
  smoker = c("yes", "yes")
)

df2 <- tibble(
  name = c("Joe Smit", "Michael Fegin"),
  occupation = c("post doc", "IT consultant")
)

df1 %>%
  # keep pairs whose names are at most 3 edits apart (default method: "osa")
  stringdist_inner_join(df2, max_dist = 3)
#> Joining by: "name"
#> # A tibble: 2 × 4
#>   name.x        smoker name.y        occupation   
#>   <chr>         <chr>  <chr>         <chr>        
#> 1 Joe Smith     yes    Joe Smit      post doc     
#> 2 Michael Fagin yes    Michael Fegin IT consultant

Created on 2022-03-01 by the reprex package (v2.0.0)
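The join above uses the default edit distance and does not return the distance itself. A possible variant that is closer to the layout requested in the question is sketched below. It assumes the sample rows from the question and relies on the `method`, `max_dist` and `distance_col` arguments of fuzzyjoin's string-distance joins; the 0.1 threshold is taken from the asker's comment and may need tuning on real data.

```r
library(dplyr)
library(fuzzyjoin)

# Sample rows copied from the question
df1 <- tibble(
  name   = c("Joe Smith", "Michael Fagin", "Ellen McFarlan"),
  smoker = c("Yes", "Yes", "No"),
  age    = c(43, 35, 55)
)
df2 <- tibble(
  name       = c("Joe Smit", "Joan Evans", "Michael Fegin"),
  occupation = c("Postdoc", "IT consultant", "Lawyer"),
  location   = c("London", "Bristol", "Liverpool")
)

df3 <- df1 %>%
  stringdist_inner_join(
    df2,
    by           = "name",
    method       = "jw",       # Jaro(-Winkler) distance, between 0 and 1
    max_dist     = 0.1,        # keep pairs with distance <= 0.1
    distance_col = "distance"  # store the computed distance in the result
  ) %>%
  # rename and reorder columns to match the layout requested in the question
  select(name1 = name.x, name2 = name.y, distance,
         smoker, age, occupation, location)
```

Unlike `amatch`, this keeps every pair within the threshold, so one name in df1 can appear in several rows if more than one name in df2 is close enough.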
