
I have two lists to match against one another: I need to match each word in str1 against each sublist of words in str2. str2 holds about 40k words in total, so I want to try multiprocessing to make it run faster.

For example:

str1 = ['how', 'are', 'you']
str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

The code I tried:

from multiprocessing import Process, Pool
from fuzzywuzzy import process 


def f(str2, str1):
    for u in str1:
        res = []
        for i in str2:
            Ratios = process.extract(u,i)
            res.append(str(Ratios))      
    print(res)
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]
    for i in str2:
        p = Process(target=f, args=(i, str1))
        p.start()
        p.join()

This does not return what I expect. I was expecting the output to look like a data frame:

words                     how  are  you
['this', 'how', 'done']   100    0    0
['they', 'were', 'here']    0   90    0
['can', 'you', 'leave']     0   80  100
['how', 'sad']            100    0    0

1 Answer

You're not really using parallel multiprocessing because of this loop:

for i in str2:
    p = Process(target=f, args=(i, str1))
    p.start()
    p.join()

p.join() waits for each process to complete, sequentially. So there's no speedup with that construct (though spawning a fresh, clean process per case can still be useful in some situations, for instance when you're loading native code from DLLs).

You have to store the process objects and wait for them in a separate loop instead.

# create & store process objects
processes = [Process(target=f, args=(i, str1)) for i in str2]
# start processes
for p in processes:
    p.start()
# wait for processes to complete
for p in processes:
    p.join()

Note that this approach has several major issues:

  • it may create far too many processes running at the same time (one per sublist)
  • how do you simply get hold of the return values from f?

With your current method the return values are lost, unless you store them in a manager object. The Pool.map method lets you collect the results, as the examples below show.
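
For illustration, a minimal sketch of the manager route, assuming a toy worker (the `collect` helper and its exact-membership check are placeholders, not fuzzywuzzy):

from multiprocessing import Process, Manager

def collect(sub, words, shared):
    # toy worker: record which words appear verbatim in this sublist
    shared.append((sub, [w for w in words if w in sub]))

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here']]
    with Manager() as manager:
        shared = manager.list()  # proxy list every process can append to
        processes = [Process(target=collect, args=(sub, str1, shared))
                     for sub in str2]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        # copy out before the manager shuts down;
        # note that arrival order is not guaranteed
        print(list(shared))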

That's why objects like process pools exist. Small example of use:

from multiprocessing import Pool

def sq(x):
    return x**2

if __name__ == "__main__":
    p = Pool(2)
    n = p.map(sq, range(10))
    print(n)

Here only 2 processes are active at the same time.
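
Running this prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]: map distributes the calls across the two workers but still returns the results in input order.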

Your code, adapted to pools (untested):

from multiprocessing import Pool
from fuzzywuzzy import process


def f(str2, str1):
    res = []  # accumulate results across every word in str1
    for u in str1:
        for i in str2:
            ratios = process.extract(u, i)
            res.append(str(ratios))
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

    nb_processes = 4
    p = Pool(nb_processes)
    # starmap unpacks each (i, str1) tuple into f's two arguments
    results = p.starmap(f, [(i, str1) for i in str2])

results is a list of the return value (itself a list) from each call to f, in the same order as str2.

  • It does run faster than the previous one. Does map help to separate the scores based on words in str1? – code_learner Jan 14 '20 at 21:08
  • map just applies f to each (i, str1) argument. If you want to separate the scores, I suggest that you pass a combination of elements from str1/str2 instead of looping in `f`. – Jean-François Fabre Jan 14 '20 at 21:09
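
To illustrate that last suggestion, here is a sketch that submits one (word, sublist) pair per task. The `score` helper is a hypothetical stand-in that keeps only the best `fuzz.ratio` per cell, so treat it as one possible way to approximate the table from the question, not the answer's exact method:

from itertools import product
from multiprocessing import Pool
from fuzzywuzzy import fuzz

def score(word, sub):
    # hypothetical helper: best ratio of `word` against any candidate in `sub`
    return max(fuzz.ratio(word, cand) for cand in sub)

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'],
            ['can', 'you', 'leave'], ['how', 'sad']]
    # one task per table cell: sublists vary slowest, words fastest
    pairs = [(word, sub) for sub, word in product(str2, str1)]
    with Pool(4) as pool:
        flat = pool.starmap(score, pairs)
    # regroup the flat scores into one row per sublist of str2
    for row, sub in enumerate(str2):
        print(sub, flat[row * len(str1):(row + 1) * len(str1)])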