Fuzzy match for 2 lists with very similar names

Question

I know this question has been asked in some way so apologies. I'm trying to fuzzy match list 1(sample_name) to list 2 (actual_name). Actual_name has significantly more names than list 1 and I keep runninng into fuzzy match not working well. I've tried the multiple fuzzy match methods(partial, set_token) but keep running into issues since there are many more names in list 2 that are very similar. Is there any way to improve matching here. Ideally want to have list 1, matched name from list 2, with the match score in column 3 in a new dataframe. Any help would be much appreciated. Thanks.

Have used this so far:

df1=sample_df['sample_name'].to_list()
df2=actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
   for name_master in df2:
     if fuzz.partial_ratio(name_to_find,name_master) > 90:
       response[name_to_find] = name_master
       break
for key, value in response.item():
  print('sample name' + key + 'actual_name' + value)

sample_name	actual_name
jtsports	JT Sports LLC
tombaseball	Tom Baseball Inc.
context express	Context Express LLC
zb sicily	ZB Sicily LLC
lightening express	Lightening Express LLC
fire roads	Fire Road Express
N/A	Earth Treks
N/A	TS Sports LLC
N/A	MM Baseball Inc.
N/A	Contact Express LLC
N/A	AB Sicily LLC
N/A	Lightening Roads LLC

so are you just basically trying to get a "match score" of `sample_name` column to `actual_name` column? — chitown88, Aug 25 '21 at 15:01

score 3 · Answer 1 · answered Aug 25 '21 at 15:16

Not sure if this is your expected output (and you may need to adjust the threshold), but I think this is what you are looking for?

import pandas as pd
from fuzzywuzzy import process

threshold = 50

list1 = ['jtsports','tombaseball','context express','zb sicily',
         'lightening express','fire roads']
list2 = ['JT Sports LLC','Tom Baseball Inc.','Context Express LLC',
'ZB Sicily LLC','Lightening Express LLC','Fire Road Express',
'Earth Treks','TS Sports LLC','MM Baseball Inc.','Contact Express LLC',
'AB Sicily LLC','Lightening Roads LLC']


response = []
for name_to_find in list1:
    resp_match =  process.extractOne(name_to_find ,list2)
    if resp_match[1] > threshold:
         row = {'sample_name':name_to_find,'actual_name':resp_match[0], 'score':resp_match[1]}
         response.append(row)
         print(row)

results = pd.DataFrame(response)

# If you need all the 'actual_name' tp be in the datframe, continue below
# Otherwise don't include these last 2 lines of code
unmatched = pd.DataFrame([x for x in list2 if x not in list(results['actual_name'])], columns=['actual_name'])
results = results.append(unmatched, sort=False).reset_index(drop=True)

Output:

print(results)
           sample_name             actual_name  score
0             jtsports           JT Sports LLC   79.0
1          tombaseball       Tom Baseball Inc.   81.0
2      context express     Context Express LLC   95.0
3            zb sicily           ZB Sicily LLC   95.0
4   lightening express  Lightening Express LLC   95.0
5           fire roads       Fire Road Express   86.0
6                  NaN             Earth Treks    NaN
7                  NaN           TS Sports LLC    NaN
8                  NaN        MM Baseball Inc.    NaN
9                  NaN     Contact Express LLC    NaN
10                 NaN           AB Sicily LLC    NaN
11                 NaN    Lightening Roads LLC    NaN

It is a slow algorithm. You can look into how to vectorize it (to avoid iterating with a for loop), Or as stated by the fuzzywuzzy package `"Using slow pure-python SequenceMatcher. Install python-Levenshtein"`, So try to install `python-Levenshtein` and see if that speeds it up. — chitown88, Mar 15 '22 at 11:47

jhylands · Answer 2 · 2021-08-23T15:57:07.027

It won't be the most efficient way to do it, being of order O(n) in the number of correct matches but you could calculate the Levenshtein distance between the left and right and then match based on the closest match. That is how a lot of nieve spell check systems work.

I'm suggesting that you run this calculation for each of the correct names and return the match with the lowest score.

Adjusting the code you have posted I would follow something like the following. Bear in mind the Levenshtein distance lower is closer so it'll need some adjusting. It seems the function you are using higher is more close and so the following should work using that.

df1=sample_df['sample_name'].to_list()
df2=actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
   highest_so_far = ("", 0)
   for name_master in df2:
     score = fuzz.partial_ratio(name_to_find, name_master)
     if score > highest_so_far[1]:
       highest_so_far = (name_master, score)
   response[name_to_find] = highest_so_far[0]
      
for key, value in response.item():
  print('sample name' + key + 'actual_name' + value)

I get an error saying highest_so_far is undefined. Do you know how i might change the code? Thanks. — Derek Henderson, Aug 23 '21 at 15:55
Yep sorry it's a typo the highest so far in the if statement was missing and "h" — jhylands, Aug 23 '21 at 15:56

Fuzzy match for 2 lists with very similar names

2 Answers2

Linked