
I want to use the common_grams token filter based on this link. My Elasticsearch version is 7.17.8.

Here are the settings of my index in Elasticsearch. I have defined a filter named "common_grams" that uses "common_grams" as its type.

I have defined a custom analyzer named "index_grams" that uses "whitespace" as its tokenizer and the above filter as a token filter.

I have just one field, named "title_fa", and I have used my custom analyzer for this field.

PUT /my-index-000007
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the","is" ]
        }
      }
    }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "boost": 40
      }
    }
  }
}

It works fine at index time and the tokens are what I expect them to be. Here I get the tokens via the Kibana Dev Tools console.

GET /my-index-000007/_analyze
{
  "analyzer": "index_grams",
  "text" : "brown is the"
}

Here are the resulting tokens for the text:

{
  "tokens" : [
    {
      "token" : "brown",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "brown_is",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is_the",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "gram",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "the",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    }
  ]
}

When I search the query "brown is the", I expect these tokens to be searched:

["brown", "brown_is", "is", "is_the", "the" ]

But these are the tokens that will actually be searched:

["brown is the", "brown is_the", "brown_is the"]

Here you can see the details:

[Screenshot: Query Time Tokens]
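One way to see how the query is actually rewritten (rather than a profile screenshot) is the _validate API with explain enabled; this is a sketch, and the response's explanation field should show the combination of token paths that will be searched:

GET /my-index-000007/_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "brown is the",
      "default_field": "title_fa"
    }
  }
}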

UPDATE: I have added a sample document like this:

POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }

When I search "brown coat":

GET /my-index-000007/_search
{
  "query": {
    "query_string": {
      "query": "brown coat",
      "default_field": "title_fa"
    }
  }
}

it returns the document because it searches: ["brown", "coat"]

When I search "brown is coat", it can't find the document because it is searching for

["brown is coat", "brown_is coat", "brown is_coat"]

Clearly, when the query contains a common word, it acts differently, and I guess it is because of a mismatch between the index-time and query-time tokens.

Do you know where I am getting this wrong? Why is it acting differently?

  • Note that it is advised to use a different variant of the `common_grams` filter at search time, [which specifies `"query_mode": true`](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-common-grams-tokenfilter.html#analysis-common-grams-tokenfilter-configure-parms) – Val Feb 07 '23 at 14:31
  • I tested that and these tokens will be searched: [ "is_the", "brown_is"], still not what I'm looking for... – Farnaz Maleki Feb 07 '23 at 14:50
  • What you're showing in the picture is how the query is **profiled**, not the query time tokens and how your input has been analyzed at search time. The same tokens are produced at search time since you're using the same analyzer. – Val Feb 07 '23 at 15:05
  • How can I see the exact query-time tokens? I have added an update to my question. My main concern is when I'm searching for a query which contains one of the common words. – Farnaz Maleki Feb 08 '23 at 07:44
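For reference, the search-time variant mentioned in the first comment could be wired up like this (a sketch based on the linked docs; the names "common_grams_query" and "search_grams" are illustrative, and with "query_mode": true the filter drops the common words and their unigrams on the query side):

PUT /my-index-000008
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        },
        "search_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams_query" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        },
        "common_grams_query": {
          "type": "common_grams",
          "common_words": [ "the", "is" ],
          "query_mode": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "search_analyzer": "search_grams"
      }
    }
  }
}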

0 Answers