
I am currently using a Hunspell dictionary for search in Elasticsearch (ES). It behaves inconsistently, and I don't understand why. For example, my index has several entries containing the word "перец" (pepper) in different forms:

1 ч. л. смеси перцев горошком;
2–3 колечка красного перца чили с семенами;
черный молотый перец;

and several entries containing the word "колодец" (well) in different forms:

несколько колодцев;
3 колодца;
1 колодец;

My index has the following settings:

PUT http://localhost:9200/ingredient
Content-Type: application/json

{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ru_RU",
            "my_stemmer"
          ],
          "char_filter": [
            "html_strip"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "russian"
        },
        "ru_RU": {
          "type": "hunspell",
          "locale": "ru_RU"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}
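
The tokens this analyzer produces for each word form can be inspected with the `_analyze` API. The request below is only a minimal sketch: it assumes the same local host and index as above, and the list of word forms is simply taken from the example entries. If the forms of "перец" come back as different tokens while the forms of "колодец" collapse to one, that would point at the dictionary or the filter chain rather than at the query.

```http
GET http://localhost:9200/ingredient/_analyze?pretty
Content-Type: application/json

{
  "analyzer": "custom_analyzer",
  "text": ["перец", "перца", "перцев", "колодец", "колодца", "колодцев"]
}
```

Each entry in the returned `tokens` array shows the term that is actually written to the index for the corresponding input word.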

When I search for "колодец" like this:

GET http://localhost:9200/ingredient/_search?pretty
Content-Type: application/json

{
  "query": {
    "query_string": {
      "query": "колодец",
      "default_field": "name"
    }
  }
}

I receive the following JSON:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 5.0841255,
    "hits": [
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "2940d2bc-59ca-4c41-98d6-803d50913d04",
        "_score": 5.0841255,
        "_source": {
          "name": "несколько колодцев",
          "id": "2940d2bc-59ca-4c41-98d6-803d50913d04",
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "2940d2bc-59ca-4c41-98d6-803d50913d05",
        "_score": 5.0841255,
        "_source": {
          "name": "3 колодца",
          "id": "2940d2bc-59ca-4c41-98d6-803d50913d05",
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "2940d2bc-59ca-4c41-98d6-803d50913d06",
        "_score": 5.0841255,
        "_source": {
          "name": "1 колодец",
          "id": "2940d2bc-59ca-4c41-98d6-803d50913d06",
          "_meta": {}
        }
      }
    ]
  }
}


Response code: 200 (OK); Time: 45ms; Content length: 1199 bytes

But when I make a similar request with "перец":

GET http://localhost:9200/ingredient/_search?pretty
Content-Type: application/json

{
  "query": {
    "query_string": {
      "query": "перец",
      "default_field": "name"
    }
  }
}

I only get this:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 23,
      "relation": "eq"
    },
    "max_score": 3.1693017,
    "hits": [
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "9c72cba2-2986-40dd-b15b-0df0288e91f1",
        "_score": 2.8541024,
        "_source": {
          "name": "свежемолотый черный перец",
          "id": "9c72cba2-2986-40dd-b15b-0df0288e91f1",
          "_meta": {}
        }
      },
    ]
  }
}

I get neither "1 ч. л. смеси перцев горошком" nor "2–3 колечка красного перца чили с семенами". This seems strange to me, because "колодец" and "перец" form their morphological forms in a similar way. Is the problem that my Hunspell dictionary is incomplete? If so, where can I find the most complete Hunspell dictionary, or another dictionary, for the Russian language?
