I'm using Lucene 8.6.2 (currently the latest available) with AdoptOpenJDK 11 on Windows 10, and I'm having odd problems with the Portuguese and Brazilian Portuguese analyzers mangling the tokenization.
Let's take a simple example: the first line of the chorus from Jorge Aragão's famous samba song, "Já É", first using an org.apache.lucene.analysis.standard.StandardAnalyzer for reference.
Pra onde você for
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

String text = "Pra onde você for";
try (Analyzer analyzer = new StandardAnalyzer()) {
    try (final TokenStream tokenStream = analyzer.tokenStream("text", text)) {
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            // Print each term the analyzer produces.
            System.out.println("term: " + charTermAttribute.toString());
        }
        tokenStream.end();
    }
}
This gives me the following terms (collapsed to one line for readability):
pra onde você for
OK, that's pretty much what I would expect with any analyzer. But here is what I get if I use the org.apache.lucene.analysis.pt.PortugueseAnalyzer instead, using the no-args constructor:
pra onde
Huh? Maybe it considers "você" ("you") and "for" ("may go") to be stop words and dropped them.
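If that's what is happening, those two words should show up in the analyzer's default stop set. This is just a diagnostic sketch of mine, but as far as I know getDefaultStopSet() is a real static accessor on PortugueseAnalyzer (BrazilianAnalyzer has the same one):

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.pt.PortugueseAnalyzer;

// Diagnostic sketch: check whether the "missing" words are in the default Portuguese stop set.
CharArraySet stopWords = PortugueseAnalyzer.getDefaultStopSet();
System.out.println("você in default stop set: " + stopWords.contains("você"));
System.out.println("for in default stop set: " + stopWords.contains("for"));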
But now let's try the org.apache.lucene.analysis.br.BrazilianAnalyzer, again using the no-args constructor:
pra ond voc for
Now that is just broken and mangled. It changed "onde" ("where") to "ond", which to my knowledge is not even a Portuguese word. And for "você" it just dropped the "ê".
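My suspicion is that a stemming filter somewhere in the analysis chain is what truncates these words. The following is only a hand-built approximation of mine, not the actual chain BrazilianAnalyzer wires together, but BrazilianStemFilter, StandardTokenizer, and LowerCaseFilter are existing Lucene classes as far as I know, and this would let me see whether the stemmer alone produces "ond" and "voc":

import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.br.BrazilianStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Diagnostic sketch: feed lowercased tokens through only the Brazilian stem filter,
// with no stop-word removal, and print what comes out.
StandardTokenizer tokenizer = new StandardTokenizer();
tokenizer.setReader(new StringReader("Pra onde você for"));
try (TokenStream stream = new BrazilianStemFilter(new LowerCaseFilter(tokenizer))) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println("term: " + term.toString());
    }
    stream.end();
}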
Other lines are as bad or worse:
- Text: "A saudade é dor, volta meu amor"
  - StandardAnalyzer: a saudade é dor volta meu amor
  - PortugueseAnalyzer: saudad é dor volt amor
  - BrazilianAnalyzer: saudad é dor volt amor
Here you can see that the Portuguese and Brazilian Portuguese analyzers produced the same output, but it is the same broken output: "volta" surely needs to stay "volta" (and not "volt") if I'm ever going to get my love to come back to me.
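For the record, the per-analyzer lines above can be reproduced by running each analyzer through the same term-printing loop as in the first snippet; here is a sketch of that comparison (the analyzer array and the one-line output formatting are my own wrapping, nothing Lucene-specific):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.br.BrazilianAnalyzer;
import org.apache.lucene.analysis.pt.PortugueseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

String line = "A saudade é dor, volta meu amor";
Analyzer[] analyzers = { new StandardAnalyzer(), new PortugueseAnalyzer(), new BrazilianAnalyzer() };
for (Analyzer analyzer : analyzers) {
    StringBuilder terms = new StringBuilder();
    try (TokenStream tokenStream = analyzer.tokenStream("text", line)) {
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            terms.append(charTermAttribute.toString()).append(' ');
        }
        tokenStream.end();
    }
    // Collapse each analyzer's terms onto one line, as shown above.
    System.out.println(analyzer.getClass().getSimpleName() + ": " + terms.toString().trim());
    analyzer.close();
}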
Am I making some serious mistake with the Lucene core libraries and language analyzers? The output makes no sense, and I'm surprised that analyzers for such a common language would mangle the tokens like that.