This is because the wordnet
corpus is lazily loaded (NLTK wraps it in a LazyCorpusLoader) and hasn't actually been read yet.
One workaround is to use the WordNetLemmatizer()
once before applying it to the Dask dataframe, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> import dask.dataframe as dd
>>> df = dd.read_csv('something.csv')
>>> df.head()
                      text  label
0       this is a sentence      1
1  that is a foo bar thing      0
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('cats') # Use it once first, to "unlazify" wordnet.
'cat'
# Now you can use it with Dask dataframe's .apply() function.
>>> lemmatize_text = lambda sent: [wnl.lemmatize(word) for word in sent.split()]
>>> df['lemmas'] = df['text'].apply(lemmatize_text)
>>> df.head()
                      text  label                          lemmas
0       this is a sentence      1         [this, is, a, sentence]
1  that is a foo bar thing      0  [that, is, a, foo, bar, thing]
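Why the one-off call helps can be seen with a minimal, plain-Python sketch of lazy loading (no NLTK required). This is a simplified stand-in for how NLTK's LazyCorpusLoader behaves, not its actual implementation: the corpus object is a proxy that only materializes on first use, so forcing one use up front means every later call, including those made inside Dask tasks, hits the already-loaded object.

```python
import types


class LazyLoader:
    """Minimal stand-in for a lazily loaded corpus: the real
    object is only built on first attribute access."""

    def __init__(self, factory):
        self._factory = factory
        self._loaded = None

    def __getattr__(self, name):
        # Only triggered for attributes not found the normal way,
        # i.e. on first real use of the "corpus".
        if self._loaded is None:
            self._loaded = self._factory()
        return getattr(self._loaded, name)


def build_corpus():
    # Stands in for the expensive one-time read of the wordnet files.
    print("loading corpus ...")
    return types.SimpleNamespace(lemma=lambda w: w.rstrip('s'))


corpus = LazyLoader(build_corpus)
# First call pays the loading cost (this is the "unlazify" step) ...
print(corpus.lemma('cats'))   # loads, then prints 'cat'
# ... later calls (e.g. inside Dask tasks) reuse the loaded object.
print(corpus.lemma('dogs'))   # prints 'dog', no reload
```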
Alternatively, you can try pywsd:

    pip install -U pywsd
Then in code:
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.131901025772095 secs.
>>> import dask.dataframe as dd
>>> df = dd.read_csv('something.csv')
>>> df.head()
                      text  label
0       this is a sentence      1
1  that is a foo bar thing      0
>>> df['lemmas'] = df['text'].apply(lemmatize_sentence)
>>> df.head()
                      text  label                          lemmas
0       this is a sentence      1         [this, be, a, sentence]
1  that is a foo bar thing      0  [that, be, a, foo, bar, thing]
Note that the two approaches give different lemmas: WordNetLemmatizer.lemmatize() defaults to treating each word as a noun, so "is" is left unchanged, while pywsd's lemmatize_sentence() POS-tags the sentence first and therefore lemmatizes "is" to "be".