I am trying to use pathos to run multiprocessing from within a function. However, I see odd behaviour that I cannot explain:
import spacy
from pathos.multiprocessing import ProcessPool as Pool
nlp = spacy.load("es_core_news_sm")
def preworker(text, nlp):
    return [w.lemma_ for w in nlp(text)]
worker = lambda text: preworker(text, nlp)
texts = ["Este es un texto muy interesante en español"] * 10
# Run this in jupyter:
%%time
pool = Pool(3)
r = pool.map(worker, texts)
The output is:
CPU times: user 6.6 ms, sys: 26.5 ms, total: 33.1 ms
Wall time: 141 ms
So far so good. Now I run exactly the same calculation, but from inside a function:
def out_worker(texts, nlp):
    worker = lambda text: preworker(text, nlp)
    pool = Pool(3)
    return pool.map(worker, texts)
# Run this in jupyter:
%%time
r = out_worker(texts, nlp)
The output is now:
CPU times: user 10.2 s, sys: 591 ms, total: 10.8 s
Wall time: 13.4 s
Why is there such a large difference? My hypothesis, though I cannot explain why it would happen, is that in the second case a copy of the nlp object is sent with every single job.
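One way to sanity-check this hypothesis without spaCy is to compare how many bytes a task occupies once it is serialized, with and without the heavy object attached. The sketch below uses the standard `pickle` module and a plain dict, `fake_model`, as a hypothetical stand-in for `nlp`. pathos serializes with dill rather than pickle, so the numbers are only indicative of the order of magnitude.

```python
import pickle

# Hypothetical stand-in for the spaCy pipeline: any large Python object will do.
fake_model = {f"token{i}": f"lemma{i}" for i in range(50_000)}

text = "Este es un texto muy interesante en español"

# Payload if only the text travels to the worker process.
text_only = len(pickle.dumps(text))

# Payload if the heavy object is shipped together with the task.
text_and_model = len(pickle.dumps((text, fake_model)))

print(f"text only:      {text_only} bytes")
print(f"text and model: {text_and_model} bytes")
```

If the second figure dominates, serializing and deserializing the object for each job could plausibly account for most of the extra wall time.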
Also, how can I correctly run this multiprocessing code from within a function?
Thanks
EDIT:
For reproducibility of the issue, here is a Python script that shows the situation:
import spacy
from pathos.multiprocessing import ProcessPool as Pool
import time
# Install with python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")
def preworker(text, nlp):
    return [w.lemma_ for w in nlp(text)]
worker = lambda text: preworker(text, nlp)
texts = ["Este es un texto muy interesante en español"] * 10
st = time.time()
pool = Pool(3)
r = pool.map(worker, texts)
print(f"Usual pool took {time.time()-st:.3f} seconds")
def out_worker(texts, nlp):
    worker = lambda text: preworker(text, nlp)
    pool = Pool(3)
    return pool.map(worker, texts)
st = time.time()
r = out_worker(texts, nlp)
print(f"Pool within a function took {time.time()-st:.3f} seconds")
def out_worker2(texts, nlp, pool):
    worker = lambda text: preworker(text, nlp)
    return pool.map(worker, texts)
st = time.time()
pool = Pool(3)
r = out_worker2(texts, nlp, pool)
print(f"Pool passed to a function took {time.time()-st:.3f} seconds")
In my case, the output is:
Usual pool took 0.219 seconds
Pool within a function took 8.164 seconds
Pool passed to a function took 8.265 seconds
The spaCy nlp object is quite heavy (a few MB). My spaCy version is 3.0.3.