Say we have the following datasets:
Dataset A:
name age
Sally 22
Peter 35
Joe 57
Samantha 33
Kyle 30
Kieran 41
Molly 28
Dataset B:
name company
Samanta A
Peter B
Joey C
Samantha A
My aim is to fuzzy-match the two datasets on name, keeping only the relevant matches and ordering the matches from Dataset B by distance. In other words, the output should look as follows:
name_a name_b age company distance
Peter Peter 35 B 0.00
Samantha Samantha 33 A 0.00
Samantha Samanta 33 A 0.04166667
Joe Joey 57 C 0.08333333
In this example I'm calculating the distance using method = "jw" in stringdist, but I'm happy with any other method that might work. So far I've been experimenting with packages such as stringr or stringdist.
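
For reference, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the two datasets are data frames named a and b (hypothetical names), and it treats a pair as a "relevant match" when its distance is at most 0.1. That cutoff, stored in max_dist, is an assumption chosen to fit the sample data and would need tuning on real data.

```r
library(stringdist)

# Sample data from above; the names `a` and `b` are assumptions for illustration
a <- data.frame(
  name = c("Sally", "Peter", "Joe", "Samantha", "Kyle", "Kieran", "Molly"),
  age  = c(22, 35, 57, 33, 30, 41, 28),
  stringsAsFactors = FALSE
)
b <- data.frame(
  name    = c("Samanta", "Peter", "Joey", "Samantha"),
  company = c("A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Pairwise distances: rows correspond to a$name, columns to b$name
d <- stringdistmatrix(a$name, b$name, method = "jw")

# Hypothetical cutoff for what counts as a relevant match
max_dist <- 0.1

# Row/column positions of every pair at or below the cutoff
idx <- which(d <= max_dist, arr.ind = TRUE)

# Assemble the matched pairs with the columns from both datasets
out <- data.frame(
  name_a   = a$name[idx[, "row"]],
  name_b   = b$name[idx[, "col"]],
  age      = a$age[idx[, "row"]],
  company  = b$company[idx[, "col"]],
  distance = d[idx],
  stringsAsFactors = FALSE
)

# Closest matches first
out <- out[order(out$distance), ]
rownames(out) <- NULL
out
```

Computing the full distance matrix is simple but grows with the product of the two name counts, so for large datasets a blocking step or a join-oriented package may be a better fit.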