
I have one dataframe of user-entered car names which I need to match against another database of car names (millions of records) to extract the unique identifier. What's the best way of improving this matching whilst using rapidfuzz?

#Libraries
import pandas as pd
from rapidfuzz import fuzz, process, utils as fuzz_utils
#Data sample
df = pd.DataFrame({'strings_variable':['Avalon Toyota loan', 'Blazer Chevrolet', 'Blazer Chevrolety', 'Blazer Chevroletys', 'Blazer Chevroletys3', 'Suzuki Vitara sales', 'Suzuki Vita sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar','beetle','aston martin']})

df2 = pd.DataFrame({'strings_variable2': ['Suzuki Vitara','Avalon Toyota loan co','some car','random car', 'Blazer Chevrolet', 'Vauxhall Astra X', 'Buick Special car', 'Ford Aerostar','Aston Martin Valkyrie','Volkswagen Beetle','GMC Syclone'],
                    'unique_id': ['100','101','102','103','104','105','106','107','108','109','110']})

I already cleaned the data (stripped characters and spaces, lowercased, removed stopwords, etc.), so any direct matches (such as aston martin : Aston Martin) have already been removed. The remaining records differ from the database entries (extra words, abbreviations, etc.), which is why I need fuzzy matching.

Using this tutorial I was able to put together the following code, which produces good results:

df['Match'] = ''
for index, row in df.iterrows():
    matches = []
    for i, r in df2.iterrows():
        score = fuzz.WRatio(row['strings_variable'], r['strings_variable2'])
        if score >= 90:
            matches.append(r['strings_variable2'])
            matches.append(r['unique_id'])
    df.at[index, 'Match'] = matches
# expand the [name, id] pairs into two columns (once, outside the loop)
df[['match', 'unique_id']] = pd.DataFrame(df.Match.to_list(), index=df.index)
df = df.drop(columns=['Match'])
strings_variable      match                  unique_id
Avalon Toyota loan    Avalon Toyota loan co  101
Blazer Chevrolet      Blazer Chevrolet       104
Blazer Chevrolety     Blazer Chevrolet       104
Blazer Chevroletys    Blazer Chevrolet       104
Blazer Chevroletys3   Blazer Chevrolet       104
Suzuki Vitara sales   Suzuki Vitara          100
Suzuki Vita sales
Vauxhall Astra        Vauxhall Astra X       105
Ford Aerostar         Ford Aerostar          107
beetle                Volkswagen Beetle      109

This works fine for such a small sample. The problem is that in reality I have to match 150 thousand records against a database of 8 million records, so even trying to match a single record against the 8 million-row database takes ages. What would be the most optimal way of doing this fuzzy matching? Thank you in advance.

  • Have you looked into locality-sensitive-hashing? It can be used as a function that maps your strings into hash values, where similar strings map to the same hash value. Also, is this question relevant https://stackoverflow.com/questions/21408760/better-fuzzy-matching-performance? – Oli Mar 10 '23 at 16:42
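Following up on that comment: a much simpler stand-in for LSH that works for this kind of data is token blocking, where each query is only fuzz-scored against database rows that share at least one word with it. A hypothetical sketch (the database dict and helper names are illustrative, not from the question):

```python
from collections import defaultdict

# illustrative stand-in for the 8M-row table: unique_id -> cleaned name
database = {'100': 'suzuki vitara', '104': 'blazer chevrolet',
            '109': 'volkswagen beetle'}

# build an inverted index once: token -> ids of rows containing that token
block_index = defaultdict(set)
for uid, name in database.items():
    for token in name.split():
        block_index[token].add(uid)

def candidates(query):
    # union of ids sharing any token with the query; only these
    # candidates would then be scored with fuzz.WRatio
    ids = set()
    for token in query.split():
        ids |= block_index.get(token, set())
    return ids
```

Even a noisy query like blazer chevroletys3 still shares the token blazer with its target row, so the candidate set shrinks from millions to a handful before any fuzzy scoring happens.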

0 Answers