How to apply entity name matching (rapidfuzz) at scale

Question

I have one dataframe of user inputted car names which I need to match against another database of car names (millions of records) and extract the unique identifier. What's the best way of improving this matching whilst using rapidfuzz?

#Libraries
import pandas as pd
from rapidfuzz import fuzz, process, utils as fuzz_utils

#Data sample
df = pd.DataFrame({'strings_variable':['Avalon Toyota loan', 'Blazer Chevrolet', 'Blazer Chevrolety', 'Blazer Chevroletys', 'Blazer Chevroletys3', 'Suzuki Vitara sales', 'Suzuki Vita sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar','beetle','aston martin']})

df2 = pd.DataFrame({'strings_variable2': ['Suzuki Vitara','Avalon Toyota loan co','some car','random car', 'Blazer Chevrolet', 'Vauxhall Astra X', 'Buick Special car', 'Ford Aerostar','Aston Martin Valkyrie','Volkswagen Beetle','GMC Syclone'],
                    'unique_id': ['100','101','102','103','104','105','106','107','108','109','110']})

I already cleaned the data (strip characters, spaces, lowercase, stopwords, etc.) so any direct matches (such as aston martin : Aston Martin) should be removed and the remaining records should have differences (extra words, abbreviations, etc.) which is the reason for why I need to do fuzzy matching.

Using this tutorial I am able to recreate this code which provides successful results

df['Match'] = ''
for index, row in df.iterrows():
    matches = []
    for i, r in df2.iterrows():
        score = fuzz.WRatio(row['strings_variable'], r['strings_variable2'])
        if score >= 90:
            matches.append(r['strings_variable2'])
            matches.append(r['unique_id'])
    df.at[index, 'Match'] = matches
    df[['match','unique_id']] = pd.DataFrame(df.Match.to_list(), index= df.index)
df = df.drop(columns = ['Match'])

strings_variable	match	unique_id
Avalon Toyota loan	Avalon Toyota loan co	101
Blazer Chevrolet	Blazer Chevrolet	104
Blazer Chevrolety	Blazer Chevrolet	104
Blazer Chevroletys	Blazer Chevrolet	104
Blazer Chevroletys3	Blazer Chevrolet	104
Suzuki Vitara sales	Suzuki Vitara	100
Suzuki Vita sales
Vauxhall Astra	Vauxhall Astra X	105
Ford Aerostar	Ford Aerostar	107
beetle	Volkswagen Beetle	109

This works fine for such a small sample. The problem is that in reality I have to match 150 thousand records against a database of 8 million records so even even trying to match one single record against the 8 mil. database takes ages. What would be the most optimal way of doing this fuzzy matching? Thank you in advance.

Have you looked into locality-sensitive-hashing? It can be used as a function that maps your strings into hash values, where similar strings map to the same hash value. Also, is this question relevant https://stackoverflow.com/questions/21408760/better-fuzzy-matching-performance? — Oli, Mar 10 '23 at 16:42

How to apply entity name matching (rapidfuzz) at scale

0 Answers0