
I am using the latest version of spacy_hunspell with Portuguese dictionaries. I noticed that for inflected verbs containing diacritics, such as the acute accent (´) or the tilde (~), the spellchecker returns the wrong result:

import hunspell

spellchecker = hunspell.HunSpell('/usr/share/hunspell/pt_PT.dic',
                                 '/usr/share/hunspell/pt_PT.aff')

#Verb: fazer
spellchecker.spell('fazer') # True, correct
spellchecker.spell('faremos') # True, correct
spellchecker.spell('fará') # False, incorrect
spellchecker.spell('fara') # True, incorrect
spellchecker.spell('farão') # False, incorrect

#Verb: andar
spellchecker.spell('andar') # True, correct
spellchecker.spell('andamos') # True, correct
spellchecker.spell('andará') # False, incorrect
spellchecker.spell('andara') # True, correct

#Verb: ouvir
spellchecker.spell('ouvir') # True, correct
spellchecker.spell('ouço') # False, incorrect

Another problem occurs with irregular verbs, such as ir:

spellchecker.spell('vamos') # False, incorrect
spellchecker.spell('vai') # False, incorrect
spellchecker.spell('iremos') # True, correct
spellchecker.spell('irá') # False, incorrect

As far as I can tell, the problem does not occur with nouns containing special characters:

spellchecker.spell('coração') # True, correct
spellchecker.spell('órgão') # True, correct
spellchecker.spell('óbvio') # True, correct
spellchecker.spell('pivô') # True, correct

Any suggestions?

revy

2 Answers


To clarify some important ideas: spell checking, like lemmatization, usually works from a set of predefined rules (yes, no machine learning and no extensively annotated thesaurus). However, as you noticed, some of these rules do not apply to irregular verbs and inflections.
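
As a rough illustration of what "predefined rules" means here, the following is a toy sketch, not hunspell's actual file format or algorithm. The rule class `AR`, the `SUFFIX_RULES` table, and the `accepted_forms` helper are all hypothetical names made up for this example. It only shows why a form that no rule generates ends up rejected:

```python
# Toy illustration only: real hunspell .aff/.dic files use a richer format.
# Each hypothetical rule class maps to (suffix to strip, endings to append).
SUFFIX_RULES = {
    "AR": ("ar", ["ar", "amos", "aremos"]),
}

# Base form -> rule class, loosely mimicking a dictionary entry with a flag.
DICTIONARY = {"andar": "AR"}


def accepted_forms(dictionary, rules):
    """Expand every dictionary entry with the endings of its rule class."""
    forms = set()
    for word, rule_name in dictionary.items():
        strip, endings = rules[rule_name]
        stem = word[: -len(strip)] if word.endswith(strip) else word
        forms.update(stem + ending for ending in endings)
    return forms


forms = accepted_forms(DICTIONARY, SUFFIX_RULES)
print("andamos" in forms)  # generated by a listed ending, so accepted
print("andará" in forms)   # no rule generates it, so rejected
print("vamos" in forms)    # irregular form with no entry of its own, so rejected
```

In this sketch, accepting "andará" would take an extra ending in the rule table, and accepting "vamos" would take a separate dictionary entry.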

It turns out that the spaCy model and rules for Portuguese (and not only spaCy's, but those of any tool out there for Portuguese) are very weak compared to those for other languages.

In conclusion: you are not getting wrong results because of any mistake you committed, but rather because the model provided by spacy (and hunspell) is wrong.

Since these are open source projects, you could try to enhance them yourself. If that is not an option, you could try some other tool, such as dicio (which is thesaurus based, but very slow, since you would have to integrate it with Ajax, and that would require a request for every word!).

Welcome to Portuguese NLP!

Tiago Duque
    You were right. I solved the problem by switching to the latest dictionary version provided at https://natura.di.uminho.pt/download/sources/Dictionaries/hunspell/ (2019-03-29). – revy Aug 07 '19 at 10:58

This question is about hunspell and not spacy or spacy_hunspell.

I think it is an encoding issue, even though it might not look like it in all of your test cases. I'm not sure where you found those Portuguese dictionaries, but they are not in UTF-8 and they aren't the current/standard hunspell pt_PT dictionaries, which come from LibreOffice:

https://github.com/LibreOffice/dictionaries/tree/master/pt_PT

These are the Portuguese dictionaries installed by Debian/Ubuntu if you install the package hunspell-pt-pt (e.g., with apt-get install hunspell-pt-pt), and they behave correctly with your test cases above, either with hunspell on the command line or with pyhunspell as in your code above.
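
If you want to check which encoding a given dictionary declares, a minimal sketch is to read the SET directive from its .aff file. The helper name `declared_encoding` is made up for this example, and it assumes the file has an ordinary `SET <encoding>` line. It returns None when no such line is found:

```python
import codecs
from typing import Optional


def declared_encoding(aff_path: str) -> Optional[str]:
    """Return the encoding named by the first SET directive, or None."""
    with open(aff_path, "rb") as handle:
        for raw_line in handle:
            # Drop a UTF-8 byte order mark if the file starts with one.
            if raw_line.startswith(codecs.BOM_UTF8):
                raw_line = raw_line[len(codecs.BOM_UTF8):]
            # The SET line itself is ASCII, so latin-1 decoding never fails
            # even when the rest of the file uses another encoding.
            parts = raw_line.decode("latin-1").split()
            if len(parts) >= 2 and parts[0] == "SET":
                return parts[1]
    return None
```

For example, calling `declared_encoding('/usr/share/hunspell/pt_PT.aff')` on the path from the question would show whether that file declares UTF-8 or something else, which tells you how the words you pass to the spellchecker need to be encoded.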

aab