
When lemmatizing a CSV with more than 60,000 words in Spanish, spaCy does not lemmatize certain words correctly; I understand that the model is not 100% accurate. However, I have not found any other solution, since NLTK does not provide a Spanish lemmatizer.

A friend tried to ask this question on the Spanish Stack Overflow; however, that community is quite small compared with this one, and we got no answers.

code:

import spacy

# df is an existing pandas DataFrame, already cleaned with regular expressions
nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

df['column'] = df['column'].apply(lemmatizer)

To show that spaCy is not lemmatizing correctly, I tried it on some of the words that came out wrong:

text = 'personas, ideas, cosas'
# translation: persons, ideas, things

print(lemmatizer(text))
# Current output:
# personar , ideo , coser
# translation: personify, ideo, sew

# Expected output:
# persona, idea, cosa
# translation: person, idea, thing
Y4RD13
    I'm not super familiar with SpaCy, but are you retraining it on your data or using it out of the box? – Engineero Mar 04 '20 at 21:40
  • @Engineero I'm not retraining it, I'm using it directly on the df (the df is completely cleaned with regular expressions). That's why I tried with a simple text, to see if it was lemmatizing incorrectly. If there's any other library to lemmatize in Spanish, let me know! – Y4RD13 Mar 04 '20 at 21:45
  • Maybe retraining the model is a good idea, because the example didn't contain any complex words, but I don't know how to do it. – Y4RD13 Mar 04 '20 at 21:52
  • 1
    Once I tried to do lemmatization in Spanish, but the only useful thing I found was to go with stemming, using `SnowBallStemmer` from NLTK. – jjsantoso Mar 04 '20 at 21:58
  • @JuanJavierSantosOchoa, yes, I know that's my last option, but I understand lemmatization is more effective than stemming. – Y4RD13 Mar 04 '20 at 22:01
  • 2
    I'm not a Spanish speaker but for English lemmatization SpaCy relies on knowing what the part-of-speech is for each word. It gets this info during the tagging step of `nlp(text)`, however it doesn't look like your text is real sentences so it's probably getting the POS tags wrong a lot. This will lead to errors. BTW... SpaCy is only about 85% correct for English lemmatization. You might want to look at Stanford's CoreNLP or CLiPS/pattern.en, although all of these solutions only get to low 90% accuracy, and all need to know the POS of the word. – bivouac0 Mar 04 '20 at 22:24
  • 3
    If you know the part-of-speech for each word (ie... if they're all nouns) you can skip the tagging step (`nlp(text)`) and call the lemmatizer directly with the POS type. This will speed up the process significantly and will likely improve accuracy as well. – bivouac0 Mar 04 '20 at 22:36
  • @bivouac0 The test text is not a real sentence; those are words that were wrongly lemmatized in the original text. The dataframe itself is full of sentences. I think Stanford's CoreNLP doesn't have a Spanish module. – Y4RD13 Mar 05 '20 at 00:06
  • The problem with passing the POS is that the dataframe has 60k+ words. Even if I apply stopwords, it doesn't work out, because I have both verbs and nouns. – Y4RD13 Mar 05 '20 at 00:08
  • E.g., as in the question: the word `personas` is lemmatized to `personar` (person → personify). Would you recommend I use stemming instead of lemmatization? – Y4RD13 Mar 05 '20 at 00:11
  • 1
    If you know the POS for each word, try calling the lemmatizer directly and passing in the POS. If you don't know the POS for each word, then stemming is probably your only option. – bivouac0 Mar 05 '20 at 00:22

4 Answers

19

Unlike the English lemmatizer, spaCy's Spanish lemmatizer does not use PoS information at all. It relies on a lookup list of inflected forms and their lemmas (e.g., ideo → idear, ideas → idear, idea → idear, ideamos → idear, etc.) and simply outputs the first match in the list, regardless of its PoS.

I actually developed spaCy's new rule-based lemmatizer for Spanish, which takes PoS and morphological information (such as tense, gender, number) into account. These fine-grained rules make it a lot more accurate than the current lookup lemmatizer. It will be released soon!

Meanwhile, you can maybe use Stanford CoreNLP or FreeLing.
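That first-match behavior can be sketched in a few lines of plain Python (the table below is made up for illustration; it is not spaCy's actual lookup data):

```python
# Toy model of a lookup-only lemmatizer: a flat list of
# (inflected form, lemma) pairs with no PoS information attached.
LOOKUP = [
    ("ideo", "idear"),
    ("ideas", "idear"),  # verb reading: "(tú) ideas" -> idear
    ("ideas", "idea"),   # noun reading, shadowed by the entry above
    ("ideamos", "idear"),
]

def lookup_lemmatize(word):
    # The first matching entry wins, regardless of the word's
    # part of speech in the sentence.
    for form, lemma in LOOKUP:
        if form == word:
            return lemma
    return word  # unknown words pass through unchanged

print(lookup_lemmatize("ideas"))  # -> idear, even when "ideas" is a noun
```

This is exactly why a PoS-aware lemmatizer helps: with tagging information it can pick the noun lemma when the token is a noun.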

  • 1
    I'll be waiting when you have the project realased. Meanwhile I will look up Standford CoreNLP and FreeLing (in your experience which one you recommend?) – Y4RD13 Mar 05 '20 at 20:15
  • 1
    I think both are very accurate, but I haven't used them that much to have a preference. FreeLing is rule-based and Stanford is neural. – Guadalupe Romero Mar 07 '20 at 01:18
  • When you release your new rule-based lemmatizer, post it as an update to your answer. It will be really helpful. – Y4RD13 Mar 07 '20 at 03:56
  • Finally, I used StanfordNLP; it is pretty accurate and met the requirements I was looking for. – Y4RD13 Mar 18 '20 at 20:14
  • Hi @GuadalupeRomero. Thanks for the hint! Are you going to release the new Spanish lemmatizer inside the spaCy project? How can I find out about that? Also, does the same happen with `matcher` and Spanish? I have tried a lot of different options, but it always returns a dark and discouraging void – Juan Luis Chulilla Apr 04 '20 at 00:38
  • @JuanLuisChulilla Yes, there will be an official release of the new lemmatizer. I am not working on the matcher, but what do you mean exactly? – Guadalupe Romero Apr 04 '20 at 18:25
  • @GuadalupeRomero My mistake. I was wrong about Matcher. Sorry – Juan Luis Chulilla Apr 05 '20 at 12:32
  • @Y4RD13 Can you give any pointers about how to do this with StanfordNLP? The only information I can find is [this](https://github.com/stanfordnlp/CoreNLP/issues/137) Issue on github saying it is not possible. – Will Jul 26 '20 at 05:22
  • @Will it seems Stanford's is outdated. Try [Stanza](https://stanfordnlp.github.io/stanza/) instead, which is the latest, improved library created by the Stanford NLP group – Y4RD13 Jul 26 '20 at 18:26
  • 2
    `!pip install stanza import stanza` `stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=True)` `stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)` `doc = stNLP('Barack Obama nació en Hawaii.')` `print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')` – Y4RD13 Jul 26 '20 at 18:30
  • Hi @GuadalupeRomero, do you have a launch date for your Spanish lemmatizer? Thank you : ) – Rubiales Alberto Aug 09 '20 at 18:52
  • 2
    @RubialesAlberto it will be released with spacy v3 – Guadalupe Romero Aug 11 '20 at 10:34
  • Hey @GuadalupeRomero, thank you for your work on spaCy; it will be very useful for all Spanish speakers analyzing texts. I am trying to use the rule-based Spanish lemmatizer with spacy-nightly, like this: `nlp_es_trf = spacy.load('es_dep_news_trf'); config = {"mode": "rule"}; nlp_es_trf.remove_pipe("lemmatizer"); nlp_es_trf.add_pipe("lemmatizer", config=config)` But I get this error when trying to use it: `ValueError: [E1004] Missing lemmatizer table(s) found for lemmatizer mode 'rule'. Required tables: ['lemma_rules']. Found: []` Maybe I am using it wrong? – skuda Dec 16 '20 at 16:30
  • @GuadalupeRomero Thanks for your contribution to the spaCy project. I guess since I have version 3.0.3 installed in my environment, I am using the lemmatizer update you mentioned in your answer, aren't I? It seems to do its work properly, at least for my use case – Luiscri Feb 24 '21 at 11:54
3

Maybe you can use FreeLing; among many other functionalities, this library offers lemmatization in Spanish, Catalan, Basque, Italian, and other languages.

In my experience, lemmatization in Spanish and Catalan is quite accurate, and although FreeLing is written in C++, it has APIs for both Python and Java.

2

One option is to make your own lemmatizer.

This might sound frightening, but fear not! It is actually very simple to build one.

I've recently made a tutorial on how to make a lemmatizer, the link is here:

https://medium.com/analytics-vidhya/how-to-build-a-lemmatizer-7aeff7a1208c

As a summary, you'd have to:

  • Have a POS Tagger (you can use spaCy tagger) to tag input words.
  • Get a corpus of words and their lemmas - here, I suggest you download a Universal Dependencies Corpus for Spanish - just follow the steps in the tutorial mentioned above.
  • Create a lemma dict from the words extracted in the corpus.
  • Save the dict and make a wrapper function that receives both the word and its PoS.

In code, it'd look like this:

lemma_dict = {}  # {word: {pos: lemma}}, built from the corpus

def lemmatize(word, pos):
    if word in lemma_dict:
        if pos in lemma_dict[word]:
            return lemma_dict[word][pos]
    return word  # fall back to the word itself if unknown

Simple, right?
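As a rough sketch of steps 2 and 3, the lemma dict can be built from a treebank in CoNLL-U format (the tab-separated format used by Universal Dependencies: column 2 is the surface form, column 3 the lemma, column 4 the universal PoS tag). This is a minimal illustration under those assumptions, not the tutorial's exact code:

```python
from collections import defaultdict

def build_lemma_dict(conllu_lines):
    """Collect {word: {pos: lemma}} from CoNLL-U token lines."""
    lemma_dict = defaultdict(dict)
    for line in conllu_lines:
        line = line.strip()
        # Skip comments and blank lines between sentences
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1)
        if "-" in cols[0] or "." in cols[0]:
            continue
        form, lemma, upos = cols[1].lower(), cols[2], cols[3]
        lemma_dict[form].setdefault(upos, lemma)  # keep the first lemma seen
    return dict(lemma_dict)

# Tiny inline sample in CoNLL-U format (a real corpus would be a file
# such as one from UD_Spanish-AnCora)
sample = """# sent_id = 1
1\tlas\tel\tDET\t_\t_\t0\t_\t_\t_
2\tideas\tidea\tNOUN\t_\t_\t0\t_\t_\t_
""".splitlines()

d = build_lemma_dict(sample)
print(d["ideas"]["NOUN"])  # -> idea
```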

In fact, simple lemmatization doesn't require as much processing as one would think. The hard part lies in PoS tagging, but you get that for free. Either way, if you want to do the tagging yourself, you can see this other tutorial I made:

https://medium.com/analytics-vidhya/part-of-speech-tagging-what-when-why-and-how-9d250e634df6

Hope you get it solved.

Tiago Duque
1

You can use spacy-stanza. It wraps Stanza's models in spaCy's API:

import stanza
from spacy_stanza import StanzaLanguage

# Download the Spanish model once before first use
stanza.download("es")

snlp = stanza.Pipeline(lang="es")
nlp = StanzaLanguage(snlp)

text = "personas, ideas, cosas"
doc = nlp(text)
for token in doc:
    print(token.lemma_)
Guillem