I am new to Dask and was wondering if anyone could give me a hand. I have a large text dataset (>20 GB) and need to lemmatize a column. My current function, which works with pandas directly, is:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatizing(sentence):
    stemSentence = ""
    for word in sentence.split():
        stem = wnl.lemmatize(word)
        stemSentence += stem
        stemSentence += " "
    stemSentence = stemSentence.strip()
    return stemSentence
and usually I would do the following:
df['news_content'] = df['news_content'].apply(lemmatizing)
I was looking at dask.delayed, but I am puzzled about how to implement it.
Any help is highly appreciated.