Elasticsearch hunspell cuts words too much

Question

Consider the following mappings as an example:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell": {
          "type": "hunspell",
          "language": "en_GB"
        }
      },
      "analyzer": {
        "my_test": {
          "type" : "custom",
          "tokenizer": "lowercase",
          "filter": ["my_hunspell"]
        }
      }
    }
  }
}

I've downloaded the hunspell dictionaries from the official Mozilla page.

Now the issue is that some words, for instance beer, are over-stemmed. The following request turns beer into bee, which is not correct:

POST /test/_analyze?analyzer=my_test&text=beer

{
   "tokens": [
      {
         "token": "bee",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 1
      }
   ]
}

Hunspell syntax is quite hard to understand. What can be done to avoid this behaviour? Is it possible to preserve some words or to add a rule?

score 1 · Accepted Answer · answered Oct 07 '15 at 20:05

If a fixed list of words to preserve works for you, the Keyword Marker Token Filter is worth looking into. It marks the listed tokens as keywords, so stemming filters later in the chain (hunspell included) leave them unchanged. Your updated analyzer might look something like:

{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell": {
          "type": "hunspell",
          "language": "en_GB"
        },
        "protect_my_words": {
          "type": "keyword_marker",
          "keywords_path": "<PATH TO TEXT FILE WITH THE WORDS>"
        }
      },
      "analyzer": {
        "my_test": {
          "type" : "custom",
          "tokenizer": "lowercase",
          "filter": ["protect_my_words", "my_hunspell"]
        }
      }
    }
  }
}
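
If the list is short, the words can also be given inline through the filter's `keywords` parameter instead of a file. A minimal sketch of just the filter definition, assuming beer is the only word to protect. The entry is lowercase because the lowercase tokenizer has already lowercased the tokens and the filter matches case-sensitively unless `ignore_case` is set:

```
"protect_my_words": {
  "type": "keyword_marker",
  "keywords": ["beer"]
}
```

As in the settings above, protect_my_words has to come before my_hunspell in the analyzer's filter list. With that in place, the _analyze request from the question should return beer rather than bee.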

There is also the Pattern Replace Token Filter, which might prove useful if you want to transform particular words or patterns. It can be placed before the keyword marker token filter in the filter chain.
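
As a rough sketch of how the two filters could be chained: the settings below first rewrite the plural beers to beer with a pattern_replace filter, then protect beer from stemming. The filter name beers_to_beer, the pattern, and the inline `keywords` list are assumptions made up for illustration, not part of the original setup:

```
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell": {
          "type": "hunspell",
          "language": "en_GB"
        },
        "beers_to_beer": {
          "type": "pattern_replace",
          "pattern": "^beers$",
          "replacement": "beer"
        },
        "protect_my_words": {
          "type": "keyword_marker",
          "keywords": ["beer"]
        }
      },
      "analyzer": {
        "my_test": {
          "type" : "custom",
          "tokenizer": "lowercase",
          "filter": ["beers_to_beer", "protect_my_words", "my_hunspell"]
        }
      }
    }
  }
}
```

The order of the filter list matters here: the replacement runs first, so the keyword marker sees the rewritten token, and hunspell runs last on whatever was not marked.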

I wanted something more dynamic and hunspell related, but keyword token filter does exactly `preserving some keywords`. I'll mark this as correct. — Evaldas Buinauskas, Oct 08 '15 at 05:12
