
I am new to Dask and was wondering if anyone could give me a hand. I have a large text dataset (>20 GB) and want to lemmatize one of its columns. My current function, which works with pandas directly, is:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatizing(sentence):
    # lemmatize each whitespace-separated token and rejoin with single spaces
    return " ".join(wnl.lemmatize(word) for word in sentence.split())

I usually apply it like this:

df['news_content'] = df['news_content'].apply(lemmatizing)

I was looking at dask.delayed, but I am puzzled about how to implement it here.

Any help is highly appreciated.
