I would like to access nltk.corpus.wordnet in a multithreaded environment. As soon as I enable multithreading, methods such as synsets() fail. If I disable it, everything works fine.

The error messages vary from run to run. For example, the following assertion failure looks very much like a race condition to me:

File "/home/lhk/anaconda3/envs/dlab/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1342, in synset_from_pos_and_offset
    assert synset._offset == offset

There are other questions about this:

The solution to the first linked question was to load the corpus before the program branches into individual threads. I've done that: wordnet.ensure_loaded() is called before any threads are started.

The recommendation in the GitHub issue is to import wordnet within my threaded function. But that doesn't change anything.

Lii
lhk

1 Answer

A workaround is to make a deep copy of the corpus for every thread. Of course, this needs a lot of memory and is not very efficient:

import copy
from nltk.corpus import wordnet as wn

# force the lazily loaded corpus to load before any copy is made
wn.ensure_loaded()

# at the start of each thread's work, so that every thread has its own copy
my_wn = copy.deepcopy(wn)
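
If the failures come from threads interleaving on shared reader state (an assumption, although the failed offset assertion is consistent with it), a lighter alternative to copying might be to serialize access with a single lock. The sketch below is not NLTK code: SharedCursorReader is a hypothetical stand-in that mimics a seek-then-read sequence on one shared position, and reader_lock, safe_fetch and fetch_all are illustrative names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SharedCursorReader:
    """Hypothetical stand-in for a reader that keeps one shared position."""

    def __init__(self, records):
        self._records = records
        self._cursor = 0

    def fetch(self, offset):
        self._cursor = offset               # "seek" on the shared position
        time.sleep(0.0001)                  # widen the window for interleaving
        return self._records[self._cursor]  # "read" at whatever the cursor is now


# One lock shared by every thread that touches the reader.
reader_lock = threading.Lock()


def safe_fetch(reader, offset):
    """Run the whole seek-then-read sequence while holding the lock."""
    with reader_lock:
        return reader.fetch(offset)


def fetch_all(reader, offsets, workers=8):
    """Fetch every offset from a thread pool, one locked call at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda offset: safe_fetch(reader, offset), offsets))
```

With the real corpus, the same idea would mean wrapping calls such as wn.synsets(word) in `with reader_lock:` and extracting plain data (for example the synset names) before releasing the lock, since the returned objects may go back to the reader later. This gives up parallelism inside the locked section, so it only pays off when the threads spend most of their time elsewhere.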
lhk
  • This solution appears to generate `LookupError` for other corpus datasets that are not downloaded. I have not been able to make `deepcopy(wn)` work successfully in any case. – ely Jul 26 '19 at 13:06
  • Were you guys able to find any working solution? – Victor C Mar 16 '23 at 22:12