I am currently using a hunspell dictionary in my Elasticsearch analyzer. It behaves inconsistently and I don't understand why. For example, my index contains several entries with the word "перец" in different forms:
1 ч. л. смеси перцев горошком;
2–3 колечка красного перца чили с семенами;
черный молотый перец;
and several entries with the word "колодец" in different forms:
несколько колодцев;
3 колодца;
1 колодец;
My index has the following settings:
PUT http://localhost:9200/ingredient
Content-Type: application/json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ru_RU",
            "my_stemmer"
          ],
          "char_filter": [
            "html_strip"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "russian"
        },
        "ru_RU": {
          "type": "hunspell",
          "locale": "ru_RU"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}
When I make my search query for "колодец" like this:
GET http://localhost:9200/ingredient/_search?pretty
Content-Type: application/json
{
  "query": {
    "query_string": {
      "query": "колодец",
      "default_field": "name"
    }
  }
}
I receive the following JSON:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 5.0841255,
    "hits": [
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "2940d2bc-59ca-4c41-98d6-803d50913d04",
        "_score": 5.0841255,
        "_source": {
          "name": "несколько колодцев",
          "id": "2940d2bc-59ca-4c41-98d6-803d50913d04",
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "2940d2bc-59ca-4c41-98d6-803d50913d05",
        "_score": 5.0841255,
        "_source": {
          "name": "3 колодца",
          "id": "2940d2bc-59ca-4c41-98d6-803d50913d05",
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "2940d2bc-59ca-4c41-98d6-803d50913d06",
        "_score": 5.0841255,
        "_source": {
          "name": "1 колодец",
          "id": "2940d2bc-59ca-4c41-98d6-803d50913d06",
          "_meta": {}
        }
      }
    ]
  }
}
Response code: 200 (OK); Time: 45ms; Content length: 1199 bytes
But when I make a similar request for "перец":
GET http://localhost:9200/ingredient/_search?pretty
Content-Type: application/json
{
  "query": {
    "query_string": {
      "query": "перец",
      "default_field": "name"
    }
  }
}
I only get this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 23,
      "relation": "eq"
    },
    "max_score": 3.1693017,
    "hits": [
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "9c72cba2-2986-40dd-b15b-0df0288e91f1",
        "_score": 2.8541024,
        "_source": {
          "name": "свежемолотый черный перец",
          "id": "9c72cba2-2986-40dd-b15b-0df0288e91f1",
          "_meta": {}
        }
      },
    ]
  }
}
I get neither "1 ч. л. смеси перцев горошком" nor "2–3 колечка красного перца чили с семенами".
This seems strange to me because "колодец" and "перец" form their inflected forms in a similar way. Is this happening because my hunspell dictionary is incomplete? If so, where can I find the most complete hunspell dictionary, or another dictionary, for the Russian language?
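For anyone trying to reproduce this, the tokens that custom_analyzer emits for each word form could be inspected with the _analyze API. This is only a minimal sketch, assuming the index and analyzer names from the settings above. The "explain" flag asks Elasticsearch to report the output of the tokenizer and of each token filter separately, which should show whether the forms of "перец" diverge at the hunspell step or at the stemmer step. The output is not shown here.

```http
GET http://localhost:9200/ingredient/_analyze?pretty
Content-Type: application/json

{
  "analyzer": "custom_analyzer",
  "text": "перцев перца перец колодцев колодца колодец",
  "explain": true
}
```

If all three forms of "колодец" end up as the same final token while the forms of "перец" do not, that would point at the analysis chain rather than at the query.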