I have one dataframe of user inputted car names which I need to match against another database of car names (millions of records) and extract the unique identifier. What's the best way of improving this matching whilst using rapidfuzz?
#Libraries
import pandas as pd
from rapidfuzz import fuzz, process, utils as fuzz_utils
#Data sample
df = pd.DataFrame({'strings_variable':['Avalon Toyota loan', 'Blazer Chevrolet', 'Blazer Chevrolety', 'Blazer Chevroletys', 'Blazer Chevroletys3', 'Suzuki Vitara sales', 'Suzuki Vita sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar','beetle','aston martin']})
df2 = pd.DataFrame({'strings_variable2': ['Suzuki Vitara','Avalon Toyota loan co','some car','random car', 'Blazer Chevrolet', 'Vauxhall Astra X', 'Buick Special car', 'Ford Aerostar','Aston Martin Valkyrie','Volkswagen Beetle','GMC Syclone'],
'unique_id': ['100','101','102','103','104','105','106','107','108','109','110']})
I already cleaned the data (strip characters, spaces, lowercase, stopwords, etc.) so any direct matches (such as aston martin : Aston Martin) should be removed and the remaining records should have differences (extra words, abbreviations, etc.) which is the reason for why I need to do fuzzy matching.
Using this tutorial I am able to recreate this code which provides successful results
df['Match'] = ''
for index, row in df.iterrows():
matches = []
for i, r in df2.iterrows():
score = fuzz.WRatio(row['strings_variable'], r['strings_variable2'])
if score >= 90:
matches.append(r['strings_variable2'])
matches.append(r['unique_id'])
df.at[index, 'Match'] = matches
df[['match','unique_id']] = pd.DataFrame(df.Match.to_list(), index= df.index)
df = df.drop(columns = ['Match'])
strings_variable | match | unique_id |
---|---|---|
Avalon Toyota loan | Avalon Toyota loan co | 101 |
Blazer Chevrolet | Blazer Chevrolet | 104 |
Blazer Chevrolety | Blazer Chevrolet | 104 |
Blazer Chevroletys | Blazer Chevrolet | 104 |
Blazer Chevroletys3 | Blazer Chevrolet | 104 |
Suzuki Vitara sales | Suzuki Vitara | 100 |
Suzuki Vita sales | ||
Vauxhall Astra | Vauxhall Astra X | 105 |
Ford Aerostar | Ford Aerostar | 107 |
beetle | Volkswagen Beetle | 109 |
This works fine for such a small sample. The problem is that in reality I have to match 150 thousand records against a database of 8 million records so even even trying to match one single record against the 8 mil. database takes ages. What would be the most optimal way of doing this fuzzy matching? Thank you in advance.