I want to use the common_grams token filter, based on this link. My Elasticsearch version is 7.17.8.
Here are the settings of my index in Elasticsearch. I have defined a filter named "common_grams" whose type is "common_grams".
I have defined a custom analyzer named "index_grams" that uses "whitespace" as its tokenizer and the above filter as its token filter.
I have just one field, named "title_fa", and I use my custom analyzer for this field.
PUT /my-index-000007
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "boost": 40
      }
    }
  }
}
It works fine at index time, and the tokens are what I expect. Here I get the tokens via Kibana Dev Tools:
GET /my-index-000007/_analyze
{
  "analyzer": "index_grams",
  "text": "brown is the"
}
Here are the resulting tokens for that text:
{
  "tokens" : [
    {
      "token" : "brown",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "brown_is",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is_the",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "gram",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "the",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    }
  ]
}
When I search for "brown is the", I expect these tokens to be searched:
["brown", "brown_is", "is", "is_the", "the" ]
But these are the tokens that will actually be searched:
["brown is the", "brown is_the", "brown_is the"]
Here you can see the details
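The query-time view can also be reproduced as text with the validate API and `explain=true`, which returns the query as Elasticsearch parsed it. This is a minimal sketch, assuming the same index and field as above; the exact wording of the explanation may vary between versions.

```console
GET /my-index-000007/_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "brown is the",
      "default_field": "title_fa"
    }
  }
}
```

The `explanations` array in the response contains the rewritten query for the index, which makes it easy to compare against the `_analyze` output.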
UPDATE: I have added a sample document:
POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }
When I search for "brown coat":
GET /my-index-000007/_search
{
  "query": {
    "query_string": {
      "query": "brown coat",
      "default_field": "title_fa"
    }
  }
}
it returns the document because it searches:
["brown", "coat"]
When I search for "brown is coat" with the same request, it can't find the document, because it is searching for:
["brown is coat", "brown_is coat", "brown is_coat"]
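For comparison, the index-time tokens for the same text can be listed with the same `_analyze` request used earlier. This is a minimal sketch, assuming the same index and analyzer; only the text differs.

```console
GET /my-index-000007/_analyze
{
  "analyzer": "index_grams",
  "text": "brown is coat"
}
```

Putting this output next to the searched terms above shows where the index-time tokens and the query-time terms diverge.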
Clearly, when the query contains a common word, the search behaves differently, and I guess this is because the index-time tokens and the query-time tokens do not match.
Do you know where I am going wrong? Why does it behave differently?