When lemmatizing a Spanish CSV with more than 60,000 words, spaCy writes certain lemmas incorrectly. I understand the model is not 100% accurate, but I have not found any alternative, since NLTK does not provide a Spanish lemmatizer.
A friend asked this question on the Spanish Stack Overflow, but that community is quite small compared with this one, and we got no answers.
code:
import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

df['column'] = df['column'].apply(lemmatizer)
I lemmatized a few of the words that I had noticed coming out wrong, to show that spaCy is not handling them correctly:
text = 'personas, ideas, cosas'
# translation: persons, ideas, things
print(lemmatizer(text))
# Current output:
personar , ideo , coser
# translation:
personify, ideo, sew
# The expected output should be:
persona, idea, cosa
# translation:
person, idea, thing