I am new to the concept of multiprocessing. I am trying to apply multiprocessing to a spelling function to make it run faster. I tried the code below, but the results did not come back in their original order; `token` here is a huge list of tokenized sentences.

import numpy as np
from spellchecker import SpellChecker
from wordsegment import load, segment
from timeit import default_timer as timer
from multiprocessing import Process, Pool, Queue, Manager

def text_similarity_spellings(self, token):
    """Uses spell checker to separate incorrect spellings and correct them"""
    spell = SpellChecker()
    unknown_words = [list(spell.unknown(word)) for word in token]
    known_words = [list(spell.known(word)) for word in token]
    load()
    segmented = [[segment(word) for word in sub] for sub in unknown_words]
    flat_list = list(self.unpacker(segmented))
    new_list = [[known_words[x], flat_list[x]] for x in range(len(known_words))]
    new_list = list(self.unpacker(new_list))
    newlist = [sorted(set(mylist), key=lambda x: mylist.index(x)) for mylist in new_list]
    return newlist

def run_all(self):
    tread_vta = Manager().list()
    processes = []
    arg_split = np.array_split(np.array(token), 10)
    arg_tr_cl = []
    finds = []
    trdclean1 = []
    for count, k in enumerate(arg_split):
        arg_tr_cl.append((k, [], tread_vta, token[t]))
    for j in range(len(arg_tr_cl)):
        p = Process(target=self.text_similarity_spellings, args=arg_tr_cl[j])
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

Can anyone suggest a better way to apply multiprocessing to a specific function and get the results back in the correct order?

code_learner
    I think your application is better suited for `map`, especially if you care about the order. Process is better suited for asynchronous operations where you don't care that the results are returned in the same order. This is because individual processes can complete faster or slower. – Shffl Jun 03 '21 at 22:08

1 Answer


First, there is a certain amount of overhead in creating processes, and then more overhead again in passing arguments from the main process to a subprocess (which "lives" in another address space) and in getting return values back (by the way, you have made no provisions for actually getting return values back from worker function text_similarity_spellings). So for you to profit from using multiprocessing, the gains from performing your tasks (invocations of your worker function) in parallel must be enough to offset those additional costs. In other words, your worker function has to be sufficiently CPU-intensive to justify multiprocessing.
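
To get a feel for that overhead, here is a minimal, self-contained sketch that times a direct call against the same call run in a separate process. The worker lowercase_all is just a throwaway stand-in, not anything from your code, and note that its return value in the child process is simply lost:

from multiprocessing import Process
from timeit import default_timer as timer

def lowercase_all(words):
    # Throwaway stand-in for a real worker function
    return [w.lower() for w in words]

if __name__ == '__main__':
    words = ['Apple'] * 100_000

    start = timer()
    lowercase_all(words)
    print('direct call:      ', timer() - start)

    start = timer()
    p = Process(target=lowercase_all, args=(words,))
    p.start()
    p.join()
    # Includes process creation and the cost of pickling the argument list;
    # the child's return value is discarded.
    print('separate process: ', timer() - start)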

Second, given the cost of creating processes, you do not want to create more processes than you can actually use. If you have N tasks to complete (the length of arg_tr_cl) and M CPU cores to run them on, and your worker function is pure CPU (no I/O involved), you can never gain anything by running those tasks with more than M processes. If, however, the tasks combine some I/O, then using more processes could be profitable. If there is a lot of I/O involved and only some CPU-intensive processing, then a combination of multithreading and multiprocessing might be the way to go. Finally, if the worker function is mostly I/O, then multithreading is what you want.

There is a standard way to use X processes (for whatever value of X you have settled on) to complete the N tasks and still get return values back from your worker function: a process pool of size X.

MULTITHREADING = False

n_tasks = len(arg_tr_cl)

if MULTITHREADING:
    from multiprocessing.dummy import Pool

    # To use multithreading instead (we can use a much larger pool size):
    pool_size = min(n_tasks, 100) # 100 is fairly arbitrary

else:
    from multiprocessing import Pool, cpu_count

    # No point in creating pool size larger than the number of tasks we have
    # Otherwise, assuming we are mostly CPU-intensive, just create pool size
    # equal to the number of cpu cores that we have:
    n_processors = cpu_count()
    pool_size = min(n_tasks, n_processors)

pool = Pool(pool_size)
return_values = pool.map(self.text_similarity_spellings, arg_tr_cl)
# map() returns the results as a list, in the same order as the corresponding
# entries of arg_tr_cl, so you can simply iterate over it:
for return_value in return_values:
    ...
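
A pool also holds operating-system resources (its worker processes), so one small refinement, shown here only as a sketch of the same call, is to use the pool as a context manager so the workers are shut down automatically when the block exits:

with Pool(pool_size) as pool:
    return_values = pool.map(self.text_similarity_spellings, arg_tr_cl)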

But it may be that SpellChecker is doing a lot of I/O if each invocation has to read in an external dictionary. If that is the case, might you not get your best performance by initializing the SpellChecker once, then just looping over the words to check them, and forgetting about multiprocessing (or multithreading) completely?
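
A rough sketch of that single-process alternative, assuming (as in the question) that token is a list of tokenized sentences, i.e. a list of lists of words, might look something like this:

from spellchecker import SpellChecker
from wordsegment import load, segment

spell = SpellChecker()   # dictionary is read once, here
load()                   # wordsegment's corpus is loaded once, here

results = []
for sentence in token:
    unknown = spell.unknown(sentence)   # words not found in the dictionary
    known = spell.known(sentence)       # words found in the dictionary
    segmented = [segment(word) for word in unknown]
    results.append([list(known), segmented])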

Booboo