Applying Jaro-Winkler distance to two dataframes

Question

I have two dataframes of unequal length and would like to compare the strings in df2 against those in df1. Is it possible to apply the Jaro-Winkler distance to compute string similarity across the two dataframes using a map/lambda function?

df1
Behavioral disorders
Behçet disease
AV-Block

df2
Behavioral disorder
Behçet syndrome

The desired output is:

name_left                 name_right            score   
Behavioral disorders      Behavioral disorder   0.933333
Behçet disease            Behçet syndrome       0.865342

The scores above are hypothetical. Any help is highly appreciated.

mozway · Answer 1 · 2022-11-28T06:09:52.763

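Assuming you want, for each name in df2, the best-scoring match from df1, and that the column in both inputs is called "name":

# pip install jaro-winkler
# https://pypi.org/project/jaro-winkler/
import pandas as pd
from jaro import jaro_winkler_metric as jw

# For each name in df2, score it against every name in df1 and keep
# the (name, score) pair with the highest score. Note that `key` must
# be passed to max() as a keyword argument.
pd.DataFrame([[n2, *max([(n1, jw(n1, n2)) for n1 in df1['name']],
                        key=lambda x: x[1])]
              for n2 in df2['name']],
              index=df2.index,
              columns=['name_right', 'name_left', 'score']
            )[['name_left', 'name_right', 'score']]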
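
If the package cannot be installed (see the error reported in the comments), any other similarity function can be dropped in. Below is a minimal pure-Python sketch of the textbook Jaro-Winkler similarity plus the same "best match per df2 name" logic on plain lists. The names `jaro`, `jaro_winkler` and `best_matches` are hypothetical helpers written for this illustration, not part of any library, and the scores may differ slightly from a given package's implementation.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


def best_matches(left_names, right_names):
    """For each right name, return (best left name, right name, score)."""
    rows = []
    for right in right_names:
        left = max(left_names, key=lambda name: jaro_winkler(name, right))
        rows.append((left, right, jaro_winkler(left, right)))
    return rows


rows = best_matches(
    ["Behavioral disorders", "Behçet disease", "AV-Block"],
    ["Behavioral disorder", "Behçet syndrome"],
)
for left, right, score in rows:
    print(left, right, round(score, 6), sep=" | ")
```

With DataFrames, passing `df1['name']` and `df2['name']` to `best_matches` and wrapping the result as `pd.DataFrame(rows, columns=['name_left', 'name_right', 'score'])` should give the requested layout.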

edited Nov 28 '22 at 06:09

answered Nov 27 '22 at 22:47

When I try to install it, I get the following error: – rshar Nov 27 '22 at 22:54
ERROR: Could not find a version that satisfies the requirement jaro (from versions: none) ERROR: No matching distribution found for jaro – rshar Nov 27 '22 at 22:54
@rshar I have to say I haven't tried to install it (it seemed recent enough to me, last update being a few months ago). You can use any other function. Is your question about how to calculate the distance or, assuming an existing function, how to use it with your DataFrames to generate the desired output? – mozway Nov 27 '22 at 22:59
My bad, I checked the [page](https://pypi.org/project/jaro-winkler/) again, this is `pip install jaro-winkler` – mozway Nov 27 '22 at 23:02
Just one question. There is no mention of df1. `n2, *max([(n1, jw(n1, n2)) for n1 in df2['name']], lambda x: x[1])] for n2 in df2['name']` – rshar Nov 27 '22 at 23:20
This was a typo, `n1` is from `df1` – mozway Nov 28 '22 at 06:10
