I have a very simple list comprehension I would like to parallelize:
nlp = spacy.load(model)
texts = sorted(X['text'])
# TODO: Parallelize
docs = [nlp(text) for text in texts]
However, when I try using Pool from the multiprocessing module like so:
docs = Pool().map(nlp, texts)
It gives me the following error:
Traceback (most recent call last):
File "main.py", line 117, in <module>
main()
File "main.py", line 99, in main
docs = parse_docs(X)
File "main.py", line 81, in parse_docs
docs = Pool().map(nlp, texts)
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 608, in get
raise self._value
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 385, in _handle_tasks
put(task)
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'
Is it possible to do this parallel computation without having to make the objects pickleable? I'm open to examples that use third-party libraries such as joblib.
Edit: I also tried
docs = Pool().map(nlp.__call__, texts)
and that didn't work either.
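For what it's worth, here is a minimal stdlib-only sketch (hypothetical names, no spaCy needed) that reproduces the same class of error. It suggests the problem is the nested function somewhere inside the spaCy pipeline, not multiprocessing itself, since Pool.map pickles the callable it is given:

```python
import pickle

def make_processor():
    # A nested (local) function: pickle can't reference it by a
    # top-level qualified name, which is the same situation as
    # 'FeatureExtracter.<locals>.feature_extracter_fwd' in my traceback.
    def process(text):
        return text.upper()
    return process

processor = make_processor()

try:
    # Pool.map() does essentially this to ship the callable to workers.
    pickle.dumps(processor)
except AttributeError as err:
    # Same class of error as in the traceback above.
    print(err)
```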