
I'm working on a basic German analyzer in Elasticsearch, which is defined as follows:

{
  "settings": {
    "analysis": {
      "filter": {
        "german_stemmer": {
          "type": "snowball",
          "language": "German"
        },
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        }
      },
      "analyzer": {
        "german_search": {
          "filter": ["lowercase", "german_stop", "german_stemmer"],
          "tokenizer": "standard"
        }
      }
    }
  }
}

While testing it, I realized that it does not handle Kürbis and Kürbisse well. Stemming these two words produces different output, although from my understanding (just what I read online) Kürbis means pumpkin and Kürbisse means pumpkins. It looks like the stemmer does not handle plurals well.
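
The mismatch can be reproduced with the `_analyze` API. This is a minimal sketch, assuming the settings above were used to create an index hypothetically named `german_test`; the body below would then be sent to `GET /german_test/_analyze`:

```json
{
  "analyzer": "german_search",
  "text": ["Kürbis", "Kürbisse"]
}
```

The response lists the tokens emitted for each input, so the two stems can be compared side by side.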

Any ideas on how I can solve this?

Evaldas Buinauskas
Lior Magen
  • My initial gut feeling was to test different token stemmers; however, none of them produce the expected results. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html This might be an edge case, and you could deal with it using synonyms, although that's not the nicest way. – Evaldas Buinauskas Feb 02 '21 at 10:06
  • Indeed, I tried the normal `stemmer` token filter with `light_german`, `german`, `german2` and `minimal_german` languages and they all produce different tokens for singular and plural... You're not the only one, see this old thread from 12 years ago: https://lists.tartarus.org/pipermail/snowball-discuss/2009-October/001121.html – Val Feb 02 '21 at 10:10
  • I probably should have said that I tried those stemmers and they didn't work. I'm getting the same mistake with Auto and Autos. It's good to hear that this is probably an edge case; I guess I'll tackle them individually. – Lior Magen Feb 02 '21 at 10:52
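
One possible way to tackle such words individually is a `stemmer_override` token filter placed before the stemmer. The following is only a sketch, not a tested fix: the filter name `german_overrides` and the two rules are assumptions made for illustration, and everything else is copied from the settings in the question. Each rule lists both the singular and the plural, because overridden tokens are protected from the stemmer that follows, so both forms have to be mapped to the same token explicitly. The rules are written in lowercase because the filter runs after `lowercase`.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stemmer": {
          "type": "snowball",
          "language": "German"
        },
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_overrides": {
          "type": "stemmer_override",
          "rules": [
            "kürbis, kürbisse => kürbis",
            "auto, autos => auto"
          ]
        }
      },
      "analyzer": {
        "german_search": {
          "filter": ["lowercase", "german_stop", "german_overrides", "german_stemmer"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
```

The trade-off is the same as with synonyms: every problematic word pair has to be listed by hand, so this only scales to a small set of known edge cases.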

0 Answers