
I am trying to do stemming on a Dask dataframe:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatizing(sentence):
    stemSentence = ""

    for word in sentence.split():
        stem = wnl.lemmatize(word)
        stemSentence += stem
        stemSentence += " "

    return stemSentence.strip()

df['news_content'] = df['news_content'].apply(lemmatizing).compute()

But I am getting the following error:

AttributeError: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'

I already tried what was recommended here, but without any luck.

Thanks for the help.


1 Answer


This is because the wordnet corpus is "lazily loaded" and has not been evaluated yet.

One hack to make it work is to first use the WordNetLemmatizer() once before using it in the Dask dataframe, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> import dask.dataframe as dd

>>> df = dd.read_csv('something.csv')
>>> df.head()
                      text  label
0       this is a sentence      1
1  that is a foo bar thing      0


>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('cats') # Use it once first, to "unlazify" wordnet.
'cat'

# Now you can use it with Dask dataframe's .apply() function.
>>> lemmatize_text = lambda sent: [wnl.lemmatize(word) for word in sent.split()]

>>> df['lemmas'] = df['text'].apply(lemmatize_text)
>>> df.head()
                      text  label                          lemmas
0       this is a sentence      1         [this, is, a, sentence]
1  that is a foo bar thing      0  [that, is, a, foo, bar, thing]

Alternatively, you can try pywsd:

pip install -U pywsd

Then in code:

>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.131901025772095 secs.

>>> import dask.dataframe as dd

>>> df = dd.read_csv('something.csv')
>>> df.head()
                      text  label
0       this is a sentence      1
1  that is a foo bar thing      0

>>> df['lemmas'] = df['text'].apply(lemmatize_sentence)
>>> df.head()
                      text  label                          lemmas
0       this is a sentence      1         [this, be, a, sentence]
1  that is a foo bar thing      0  [that, be, a, foo, bar, thing]
alvas
    Thank you, that helped a lot. – osterburg Mar 04 '19 at 08:30
  • Question: why does that workflow work while `.compute()` errors out? – osterburg Mar 04 '19 at 08:51
  • 1
    It's because wordnet needs to be evaluated first. Unlike modern Python libraries, where most things are pre-evaluated, many older libraries work like generators, from times when machine resources were limited. So using the `WordNetLemmatizer()` once kicks off the evaluation of wordnet. – alvas Mar 04 '19 at 09:26
  • 1
    "Lazy loading" is a design pattern that's less talked about today because machines are bigger but it can be beneficial when machine resources are limited =) https://en.wikipedia.org/wiki/Lazy_loading – alvas Mar 04 '19 at 09:28
  • Side note: if you are working in a Jupyter notebook, make sure you leave some time between executing the cell with `wnl.lemmatize('cats')` and the rest. Otherwise, you will get the same error. If the code is in one cell, add a sleep statement to wait 5 seconds or so. – osterburg Mar 05 '19 at 10:49
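The lazy-loading pattern alvas describes can be sketched in a few lines of plain Python. The names below are made up for illustration; NLTK's actual `LazyCorpusLoader` is more involved:

```python
class LazyLoader:
    """Toy illustration of lazy loading: the (pretend) expensive corpus
    is only read on first use, not when the loader object is created."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._data = None          # nothing loaded yet

    def lemmatize(self, word):
        if self._data is None:     # first call triggers the real load
            self._data = self._load_fn()
        return self._data.get(word, word)

def expensive_load():
    # stand-in for reading the wordnet files off disk
    return {"cats": "cat"}

wn = LazyLoader(expensive_load)    # cheap: no corpus read here
result = wn.lemmatize("cats")      # first use pays the loading cost
# result == 'cat'; later calls reuse the already-loaded data
```

This is also why calling the lemmatizer once up front helps with the Dask error: the first call forces the corpus to load before the object is used inside `.apply()`.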