Say we have the following datasets:

Dataset A:

name        age
Sally       22
Peter       35
Joe         57
Samantha    33
Kyle        30
Kieran      41
Molly       28

Dataset B:

name        company
Samanta     A
Peter       B
Joey        C
Samantha    A

I want to fuzzy-match the names in the two datasets, keep only the relevant (sufficiently close) matches, and order the result by string distance. The output should look like this:

name_a       name_b         age     company     distance
Peter        Peter          35      B           0.00
Samantha     Samantha       33      A           0.00
Samantha     Samanta        33      A           0.04166667
Joe          Joey           57      C           0.08333333
  

In this example I'm calculating the distance using method = "jw" in stringdist, but I'm happy with any other method that works. So far I've been experimenting with packages such as stringr and stringdist.

teogj
  • [Relevant](https://stackoverflow.com/questions/55959725/merging-two-dataframes-by-stringmatch-with-dplyr-and-stringdist/55961589#comment98571598_55959725) – Sotos Aug 20 '21 at 08:41

1 Answer

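You can use stringdist_inner_join from fuzzyjoin to join the two data frames on approximately matching names (by default it uses the "osa" method with max_dist = 2), then use levenshteinSim from RecordLinkage to get the similarity between the two names. Subtracting that similarity from 1 gives a normalized distance to sort by.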

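library(fuzzyjoin)
library(dplyr)

# A and B are the two data frames from the question;
# the RecordLinkage package must be installed for levenshteinSim
stringdist_inner_join(A, B, by = 'name') %>%
  # normalized Levenshtein distance = 1 - similarity
  mutate(distance = 1 - RecordLinkage::levenshteinSim(name.x, name.y)) %>%
  arrange(distance)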

#    name.x age   name.y company distance
#1    Peter  35    Peter       B    0.000
#2 Samantha  33 Samantha       A    0.000
#3 Samantha  33  Samanta       A    0.125
#4      Joe  57     Joey       C    0.250
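
The distances above are normalized Levenshtein distances, so they differ from the "jw" values shown in the question. As a rough sketch of how the question's numbers could be reproduced with stringdist alone, the snippet below builds the full distance matrix and filters it. The cutoff max_dist = 0.1 and the object names d, hits and matches are assumptions chosen for illustration, not part of the original answer.

```r
library(stringdist)

# The two datasets from the question
A <- data.frame(
  name = c("Sally", "Peter", "Joe", "Samantha", "Kyle", "Kieran", "Molly"),
  age  = c(22, 35, 57, 33, 30, 41, 28)
)
B <- data.frame(
  name    = c("Samanta", "Peter", "Joey", "Samantha"),
  company = c("A", "B", "C", "A")
)

# Pairwise "jw" distances: rows index A, columns index B
d <- stringdistmatrix(A$name, B$name, method = "jw")

# Assumed cutoff for a "relevant" match; tune it for your data
max_dist <- 0.1
hits <- which(d <= max_dist, arr.ind = TRUE)

# Assemble the matched pairs together with their distance
matches <- data.frame(
  name_a   = A$name[hits[, "row"]],
  name_b   = B$name[hits[, "col"]],
  age      = A$age[hits[, "row"]],
  company  = B$company[hits[, "col"]],
  distance = d[hits]
)

# Closest matches first; ties broken by name_a
matches <- matches[order(matches$distance, matches$name_a), ]
rownames(matches) <- NULL
matches
```

Computing every pair is fine for small inputs like these. For larger tables, passing method = "jw", a suitable max_dist and distance_col to stringdist_inner_join may be more convenient, since it performs the filtering and joining in one step.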
Ronak Shah