
I have two lists to match against one another: I need to match each word in str1 against each sublist of words in str2. str2 holds about 40k words in total, so I want to try multiprocessing to make it run faster.

For example:

str1 = ['how', 'are', 'you']
str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

The code I tried:

from multiprocessing import Process, Pool
from fuzzywuzzy import process 


def f(str2, str1):
    for u in str1:
        res = []
        for i in str2:
            Ratios = process.extract(u,i)
            res.append(str(Ratios))      
    print(res)
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]
    for i in str2:
        p = Process(target=f, args=(i, str1))
        p.start()
        p.join()

This does not return what I expect. I was expecting the output to look like a data frame:

words                     how  are  you
['this', 'how', 'done']   100    0    0
['they', 'were', 'here']    0   90    0
['can', 'you', 'leave']     0   80  100
['how', 'sad']            100    0    0

1 Answer

You're not really using parallel multiprocessing because of this loop:

for i in str2:
    p = Process(target=f, args=(i, str1))
    p.start()
    p.join()

p.join() waits for each process to complete, sequentially. So there's no speedup with that construct (though spawning a fresh, clean process per case can still be useful in some situations, for instance when you're loading native code from DLLs).

You have to store the process objects and wait for them in a separate loop instead.

# create & store process objects
processes = [Process(target=f, args=(i, str1)) for i in str2]
# start processes
for p in processes:
    p.start()
# wait for processes to complete
for p in processes:
    p.join()

Note that this approach has several major issues:

  • it may create far too many processes running at the same time (one per sublist)
  • how do you simply get hold of the return values from f?

With your current method the return values are lost, unless you store them in a manager object. The Pool.map method lets you collect the results, as the examples below show.
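
For illustration, a minimal sketch of the manager route, assuming a toy worker (the `collect` helper and its exact-membership check are placeholders, not fuzzywuzzy):

from multiprocessing import Process, Manager

def collect(sub, words, shared):
    # toy worker: record which words appear verbatim in this sublist
    shared.append((sub, [w for w in words if w in sub]))

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here']]
    with Manager() as manager:
        shared = manager.list()  # proxy list every process can append to
        processes = [Process(target=collect, args=(sub, str1, shared))
                     for sub in str2]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        # copy out before the manager shuts down;
        # note that arrival order is not guaranteed
        print(list(shared))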

That's why objects like process pools exist. Small example of use:

from multiprocessing import Pool

def sq(x):
    return x**2

if __name__ == "__main__":
    p = Pool(2)
    n = p.map(sq, range(10))
    print(n)

Here only 2 processes are active at the same time.
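
Running this prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]: map distributes the calls across the two workers but still returns the results in input order.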

Your code, adapted to pools (untested):

from multiprocessing import Pool
from fuzzywuzzy import process


def f(str2, str1):
    res = []  # accumulate results across every word in str1
    for u in str1:
        for i in str2:
            ratios = process.extract(u, i)
            res.append(str(ratios))
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

    nb_processes = 4
    p = Pool(nb_processes)
    # starmap unpacks each (i, str1) tuple into f's two arguments
    results = p.starmap(f, [(i, str1) for i in str2])

results is a list of the return value (itself a list) from each call to f, in the same order as str2.

  • It does run faster than the previous one. Does map help to separate the scores based on words in str1? – code_learner Jan 14 '20 at 21:08
  • map just applies f to each (i, str1) argument. If you want to separate the scores, I suggest that you pass a combination of elements from str1/str2 instead of looping in `f`. – Jean-François Fabre Jan 14 '20 at 21:09
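
To illustrate that last suggestion, here is a sketch that submits one (word, sublist) pair per task. The `score` helper is a hypothetical stand-in that keeps only the best `fuzz.ratio` per cell, so treat it as one possible way to approximate the table from the question, not the answer's exact method:

from itertools import product
from multiprocessing import Pool
from fuzzywuzzy import fuzz

def score(word, sub):
    # hypothetical helper: best ratio of `word` against any candidate in `sub`
    return max(fuzz.ratio(word, cand) for cand in sub)

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'],
            ['can', 'you', 'leave'], ['how', 'sad']]
    # one task per table cell: sublists vary slowest, words fastest
    pairs = [(word, sub) for sub, word in product(str2, str1)]
    with Pool(4) as pool:
        flat = pool.starmap(score, pairs)
    # regroup the flat scores into one row per sublist of str2
    for row, sub in enumerate(str2):
        print(sub, flat[row * len(str1):(row + 1) * len(str1)])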