How to do multiprocessing in python on 2m rows running fuzzywuzzy string matching logic? Current code is extremely slow

Question

I am new to python and I'm running a fuzzywuzzy string matching logic on a list with 2 million records. The code is working and it is giving output as well. The problem is that it is extremely slow. In 3 hours it processes only 80 rows. I want to speed things up by making it process multiple rows at once.

If it it helps - I am running it on my machine with 16Gb RAM and 1.9 GHz dual core CPU.

Below is the code I'm running.

d = []
n = len(Africa_Company) #original list with 2m string records
for i in range(1,n):
    choices = Africa_Company[i+1:n]
    word = Africa_Company[i]
    try:
        output= process.extractOne(str(word), str(choices), score_cutoff=85)
    except Exception:
        print (word) #to identify which string is throwing an exception
    print (i) #to know how many rows are processed, can do without this also
    if output:
        d.append({'Company':Africa_Company[i], 
                  'NewCompany':output[0],
                  'Score':output[1], 
                  'Region':'Africa'})
    else:
        d.append({'Company':Africa_Company[i], 
                  'NewCompany':None,
                  'Score':None, 
                  'Region':'Africa'})


Africa_Corrected = pd.DataFrame(d) #output data in a pandas dataframe

Thanks in advance !

Passing `str(choices)` to `extractOne` seems wrong to me, shouldn't that be just `choices`? — mata, Jan 10 '17 at 15:00

Emil Vikström · Answer 1 · 2017-01-10T14:55:56.527

This is a CPU-bound problem. By going parallel you can just speed it up by a factor of two at most (because you have two cores). What you really should do is speed up the single-thread performance. Levenshtein distance is quite slow so there are lots of opportunity to speed things up.

Use pruning. Don't try to run the full fuzzywuzzy match between two strings if there is no way it will give a good result. Try to find a simple linear algorithm to filter out irrelevant choices before the fuzzywuzzy match.
Consider indexing. Is there some way you can index your list? For example: if your matching is based on whole words, create a hashmap that maps words to strings. Only try to match against choices that have at least one word in common with your current string.
Preprocessing. Is there some work done on the strings in every match that you can preprocess? If, for example, your Levenshtein implementation starts by creating sets out of your strings, consider creating all sets first so you don't have to do the same work over and over in each match.
Is there some better algorithm to use? Maybe Levenshtein distance is not the best algorithm to begin with.
Is the implementation of Levenshtein distance you're using optimal? This goes back to step 3 (preprocessing). Are there other things you can do to speed up the runtime?

Multiprocessing will only speed up with a constant factor (depending on the number of cores). Indexing can take you to a lower complexity class! So focus on pruning and indexing first, then steps 3-5. Only when you squeezed enough out of these steps should you consider multiprocessing.

How to do multiprocessing in python on 2m rows running fuzzywuzzy string matching logic? Current code is extremely slow

1 Answers1