0

I'm trying to match 2 columns of ~50.000 instances with Fuzzywuzzy. Column A (companies) contains company names, with some typos. Column B (correct) contains the correct company names.

I'm trying to match the typo ones with correct ones. When running my script below, the kernel keeps executing for hours & doesn't provide a result.

Any ideas on how to improve?

Many thanks!

Update link to files: https://fromsmash.com/STLz.VEub2-ct

import pandas as pd
from fuzzywuzzy import process, fuzz 
import matplotlib.pyplot as plt 

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []

for i in companies.Customers: 
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])
    
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity) 

companies.head(10)
user34624
  • 1
  • 1

1 Answers1

1

There are a couple of things you can change to improve the performance:

  1. Use Rapidfuzz instead of Fuzzywuzzy, since it implements the same algorithms, but is quite a bit faster (I am the author)

  2. The process functions are preprocessing all strings you pass to them (lowercases them, removes non alpha numeric characters and trims whitespaces). Right now your preprocessing correct.Correct len(companies.Customers) times, which costs a lot of time and could be done once in front of the loop instead

  3. Your only using the best match, so it is better to use process.extractOne instead of process.extract. This is more readable and inside extractOne rapidfuzz is using the results of previous comparision to improve the performance

The following snippet implements these changes for your code. Keep in mind, that your still performing 50k^2 comparisions, so while this should be a lot faster than your current solution it will still take a while.

import pandas as pd
from rapidfuzz import process, fuzz, utils
import matplotlib.pyplot as plt 

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []

company_mapping = {company: utils.default_process(company) for company in correct.Correct}

for customer in companies.Customers:
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)
    
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity) 

companies.head(10)

Out of interest I performed a quick benchmark calculating the average runtime when using your datasets. On my machine each lookup requires around 1 second with this solution (so a total of around 4.7 hours), while your previous solution took around 55 seconds per lookup (so a total of around 10.8 days).

maxbachmann
  • 2,862
  • 1
  • 11
  • 35
  • Thanks a ton! I tried to install with pip install, but still get this error: "ImportError: DLL load failed while importing fuzz: The specified module could not be found." – user34624 Sep 14 '20 at 14:33
  • On windows you have to install the C++ 19 redistributable: https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads – maxbachmann Sep 14 '20 at 16:48
  • Thanks @maxbachmann . Should I define 'utils' somewhere? _NameError: name 'utils' is not defined_ – user34624 Sep 14 '20 at 17:18
  • This is part of the library. In my answer it is imported from rapidfuzz – maxbachmann Sep 14 '20 at 17:59
  • Thanks again, my bad! Now it's running, however, the result is showing "NaN" for both company & similarity... For easier reference, I've added the two files in the WeTransfer in the original post – user34624 Sep 14 '20 at 18:36
  • This is happening since there are elements inside your companies list, that are no strings (There are numbers in a lot of places). I guess they should not be in there, so it probably makes sence to filter all elements that are numpy.NaN or a number out before hand. (As an alternative you simply convert them all into strings, but thats probably not what you want) – maxbachmann Sep 14 '20 at 20:50