I'm trying to match two columns of ~50,000 rows each with FuzzyWuzzy. Column A (companies) contains company names, some of them with typos. Column B (correct) contains the correct company names.
I want to match each misspelled name to its correct counterpart. When I run the script below, the kernel keeps executing for hours and never returns a result.
Any ideas on how to speed this up?
Many thanks!
Update: link to the files: https://fromsmash.com/STLz.VEub2-ct
import pandas as pd
from fuzzywuzzy import process, fuzz
import matplotlib.pyplot as plt
correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")
actual_comp = []
similarity = []
for i in companies.Customers:
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
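To put the runtime in perspective: the loop above scores every name in companies against every name in correct, so with ~50,000 rows on each side that is roughly 2.5 billion string comparisons. Below is a minimal sketch of how the total time could be estimated from a small sample before committing to a full run. It is not the script above: `difflib.SequenceMatcher` from the standard library stands in for the FuzzyWuzzy scorer, and the names `best_match`, `estimate_total_seconds`, `correct_names`, and `typo_names` are hypothetical, made up for illustration.

```python
import time
from difflib import SequenceMatcher


def best_match(query, choices):
    """Brute-force best match: score the query against every choice."""
    best, best_score = None, -1.0
    for choice in choices:
        # Stand-in scorer (assumption): similarity ratio between 0.0 and 1.0.
        score = SequenceMatcher(None, query, choice).ratio()
        if score > best_score:
            best, best_score = choice, score
    # Scale to 0-100 so the score reads like a percentage.
    return best, round(best_score * 100)


def estimate_total_seconds(queries, choices, sample_size=20):
    """Time a small sample of queries and extrapolate linearly to all of them."""
    sample = queries[:sample_size]
    start = time.perf_counter()
    for q in sample:
        best_match(q, choices)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * len(queries)


# Hypothetical toy data standing in for the two Excel columns.
correct_names = ["Apple Inc", "Microsoft Corporation", "Alphabet Inc"]
typo_names = ["Aple Inc", "Microsft Corporation", "Alphabt Inc"]

# Every query is compared with every choice, so the work grows as rows x rows.
n_comparisons = 50_000 * 50_000

print(best_match("Aple Inc", correct_names)[0])
print(f"{n_comparisons:,} comparisons for the full data set")
print(f"estimated seconds (toy data): {estimate_total_seconds(typo_names, correct_names):.6f}")
```

Running the same kind of timing on, say, the first 20 rows of the real data with the real scorer should give a rough idea of whether the full run takes minutes, hours, or days.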