Solr Snowball stemmer is inconsistent with Spanish

Question

I have this stemmed field:

<fieldtype name="textes" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords-es.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish" protected="protwords-es.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish" protected="protwords-es.txt"/>
  </analyzer>
</fieldtype>

I expect the search query alquileres (rents) to match alquiler (rent). But when I go to "Field Analysis" in the Solr Admin site and check an index value of alquiler against a query value of alquileres, the following happens:

When indexing alquiler, it gets stemmed into alquil.
When querying alquileres, it gets stemmed into alquiler.

So the simple case of searching the plural form of a word (alquileres) would not match its singular form (alquiler).

Shouldn't both the index and the query be stemmed into the same stem (either alquiler or alquil)? Is this a limitation of the algorithm, or a misunderstanding/misconfiguration on my part?

score 1 · Accepted Answer · answered Dec 07 '11 at 15:57


Snowball stemming is very limited. You'd get better results by using a dictionary-based stemmer (Hunspell): http://wiki.apache.org/solr/Hunspell
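As a rough illustration of the Hunspell suggestion, here is a minimal sketch of a field type that swaps the Snowball filter for `solr.HunspellStemFilterFactory`. The field type name `textes_hunspell` and the dictionary file names `es_ES.dic` and `es_ES.aff` are assumptions, not taken from the question; the files would have to be obtained separately and placed in the core's `conf` directory, and the Solr version in use must ship the Hunspell filter.

```xml
<!-- Sketch only: the field type name and dictionary file names are hypothetical -->
<fieldtype name="textes_hunspell" class="solr.TextField">
  <!-- A single analyzer is applied at both index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords-es.txt"/>
    <!-- Dictionary-based stemming instead of SnowballPorterFilterFactory -->
    <filter class="solr.HunspellStemFilterFactory" dictionary="es_ES.dic" affix="es_ES.aff" ignoreCase="true"/>
  </analyzer>
</fieldtype>
```

Whether this actually produces the same stem for alquiler and alquileres depends on the dictionary, so it is worth checking both values in "Field Analysis" before reindexing.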


Romain Meresse


Didn't know about that. I'll definitely take a look at it. Thanks! – Chewie Dec 13 '11 at 12:26

I tried Hunspell, but it suffers from the same malfunction. `alquileres` keeps stemming into `alquiler`, and `alquiler` into `alquil`. My kingdom for a decent Spanish stemmer! – Chewie Dec 20 '11 at 12:56
Could you try `solr.SpanishLightStemFilterFactory` ? – Romain Meresse Dec 21 '11 at 13:32
@Chewie I'm having the same problem with variations of "enfermería". Did you find a solution, or did you just turn off stemming? – danielv Apr 19 '12 at 15:39
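For anyone wanting to try the `solr.SpanishLightStemFilterFactory` suggested in the comments, a minimal sketch of a field type could look like the following. The name `textes_light` is hypothetical, and the filter chain is simplified from the one in the question (the `WordDelimiterFilterFactory` step is left out for brevity and can be added back).

```xml
<!-- Sketch only: the field type name is hypothetical -->
<fieldtype name="textes_light" class="solr.TextField">
  <!-- A single analyzer is applied at both index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords-es.txt"/>
    <!-- Light stemmer in place of SnowballPorterFilterFactory; it takes no language attribute -->
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldtype>
```

A light stemmer is less aggressive than Snowball, so it may leave alquiler untouched while reducing the plural; checking both values in "Field Analysis" would confirm whether they end up with the same stem.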

score 0 · Answer 2 · answered Oct 05 '15 at 19:09

I use Hunspell from OpenOffice and it does an excellent job.

My example:

URL-Elastic/_analyze?analyzer=es_AR&text=alquileres

It returns:

{
  tokens:
  [
    {
      token: "alquiler",
      start_offset: 0,
      end_offset: 10,
      type: "<ALPHANUM>",
      position: 1
    }
  ]
}

Link: https://www.openoffice.org/download/index.html
