I have a pandas dataframe in which I need to lemmatize the words in two of the columns. I am using spaCy for this.

import spacy
nlp = spacy.load("en")

I am trying to use lemmatization based on this example (which works perfectly fine):

doc3 = nlp(u"this is spacy lemmatize testing. programming books are more better than others")
for token in doc3: 
    print (token, token.lemma, token.lemma_)

I have rewritten this to loop through each row of one of the columns in my dataframe:

for row in example['col1']:
    for token in row:
        print(token.lemma_)

This works and prints the lemmas, but I have not been able to figure out how to replace the words in col1 with the lemmatized words.

I have tried this, which does not return an error, but also does not replace any words. Any idea what is going wrong?

for row in example['col1']:
    for token in row:
        token = token.lemma_
Davide Fiocco
Mia

1 Answer

In the last for loop of your code, you rebind the loop variable token to token.lemma_ on every iteration. That only changes what the local name token refers to: the value is overwritten on the next iteration, the previous values are not kept anywhere, and nothing is ever written back to the dataframe.
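The effect can be reproduced without spaCy or pandas. The following is a minimal sketch in plain Python, where `str.upper` is only a stand-in for taking `token.lemma_` and the list `words` is a hypothetical stand-in for a row of tokens:

```python
# Hypothetical stand-in for one row of tokens.
words = ["books", "are", "better"]

for w in words:
    # Rebinds the local name w only; the list itself is not modified.
    w = w.upper()

# Copy taken after the loop: identical to the original list.
unchanged = list(words)

# Collecting the transformed values into a new list keeps the results.
converted = [w.upper() for w in words]

print(unchanged)
print(converted)
```

The loop leaves `words` as it was, while the list comprehension produces a new list that can then be stored wherever it is needed.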

Instead, assuming that your dataframe contains strings, as in

import pandas as pd
example = pd.DataFrame({"col1": ["this is spacy lemmatization testing.", "some programming books are better than others", "sounds like a quote from the Smiths"]})

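apply combined with a list comprehension does the job. Note that apply returns a new Series, so assign the result back to the column to replace the original text:

example["col1"] = example["col1"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))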
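Since the question mentions two columns, the same pattern can be wrapped in a small helper and applied column by column. This is only a sketch under stated assumptions: `lemmatize_text`, `FakeToken`, `fake_nlp` and the column name `col2` are hypothetical names introduced here, and `fake_nlp` is a crude stand-in (it lowercases and strips a trailing "s") so that the example runs without a spaCy model. In real use, the `nlp` object loaded in the question would be passed instead.

```python
import pandas as pd


class FakeToken:
    """Hypothetical stand-in for a spaCy token; exposes only lemma_."""

    def __init__(self, text):
        self.text = text
        # Crude placeholder rule, NOT real lemmatization.
        self.lemma_ = text.lower().rstrip("s")


def fake_nlp(text):
    """Hypothetical stand-in for a spaCy pipeline: whitespace tokenization."""
    return [FakeToken(t) for t in text.split()]


def lemmatize_text(text, nlp):
    """Join the lemmas of all tokens produced by nlp(text) into one string."""
    return " ".join(tok.lemma_ for tok in nlp(text))


example = pd.DataFrame({
    "col1": ["Books are better"],
    "col2": ["Programming books"],
})

for col in ["col1", "col2"]:
    # Assigning back to example[col] is what replaces the original text.
    example[col] = example[col].apply(lambda text: lemmatize_text(text, fake_nlp))

print(example)
```

Keeping the pipeline as a parameter of `lemmatize_text` means the helper does not depend on a global `nlp`, which makes it easy to swap in the real spaCy pipeline once it is loaded.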
Davide Fiocco