
I have tried both, and they seem to produce the same results when I test the analyzers:

settings: {
    analysis: {
        filter: {
            ngram_filter: {
                type: "ngram",
                min_gram: 2,
                max_gram: 20
            }
        },
        tokenizer: {
            ngram_tokenizer: {
                type: "ngram",
                min_gram: 2,
                max_gram: 20
            }
        },
        analyzer: {
            index_ngram: {
                type: "custom",
                tokenizer: "keyword",
                filter: [ "ngram_filter", "lowercase" ]
            },
            index_ngram2: {
                type: "custom",
                tokenizer: "ngram_tokenizer",
                filter: [ "lowercase" ]
            }
        }
    }
}

I get the same results doing:

curl -X GET "localhost:9200/my_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "index_ngram", 
  "text":     "P&G 40-Bh"
}
'

and

curl -X GET "localhost:9200/my_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "index_ngram2", 
  "text":     "P&G 40-Bh"
}
'

Which one should I use? Is there a performance difference? It looks like they just do the same operations in a different order, but I'm not sure which is more performant, or what the better convention is.

Glen Thompson
  • Possible duplicate of [how edge ngram token filter differs from ngram token filter?](https://stackoverflow.com/questions/31398617/how-edge-ngram-token-filter-differs-from-ngram-token-filter) – sidgate Oct 27 '19 at 07:03
  • Yeah, I saw that question; it's about how `edge_ngram` differs from `ngram`, `edge` being the key difference. Mine is more about how `ngram` differs as a `tokenizer` vs. a `filter`. – Glen Thompson Oct 27 '19 at 14:22

1 Answer


It's hard to weigh in on the performance difference since I haven't run into this particular scenario myself or benchmarked it against large and varied sets of sample texts. That said, I don't think it's a good idea to apply such analyzers to large bodies of text, so I assume this isn't a common use case. If I had to guess, I'd guess the performance is very similar: in each case, the analysis process has to window over the same length of text, and as you pointed out, it must emit an identical set of tokens (ignoring the differing token offsets reported). I also observed this with a personal visualizer.
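To illustrate why the two coincide here, a rough Python sketch of the keyword-tokenizer case (the `ngrams` helper below is my own approximation, not an Elasticsearch API): because `keyword` emits the whole string as a single token, both pipelines feed the exact same input to the ngram step, and lowercasing commutes with substring extraction.

```python
def ngrams(text, min_gram=2, max_gram=20):
    # Rough stand-in for Elasticsearch's ngram tokenizer/filter:
    # emit every substring of length min_gram..max_gram.
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

text = "P&G 40-Bh"

# index_ngram: keyword tokenizer (whole string as one token) -> ngram_filter -> lowercase.
tokens_filter_route = sorted(t.lower() for t in ngrams(text))

# index_ngram2: ngram_tokenizer over the whole string -> lowercase filter.
# The input to the ngram step is identical, since keyword is a no-op split.
tokens_tokenizer_route = sorted(g.lower() for g in ngrams(text))

print(tokens_filter_route == tokens_tokenizer_route)  # True
```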

I'd go with the simpler, more concise analyzer description (ngram tokenizer) instead of going with the roundabout keyword tokenizer (a "noop" tokenizer) and defining an extra ngram filter. That may be easier to justify, understand, and explain in the future.


eemp
  • Thanks for the answer; also, cool app! The reason I am using this is essentially to perform a `contains`-type query. Do you know of a more efficient way to do this? The fields I am using it for are generally small, e.g. `P&G 40-Bh`. I need the contains to work with spaces and special characters. – Glen Thompson Oct 27 '19 at 17:04
  • You also have regexp and wildcard queries at your disposal (paired with just keyword + lowercase, without ngram). However, I think the ngram routes you have are the more popular/recommended options (and more performant at query time). Here's a similar substring/contains problem that recommends going the ngram route for challenges like yours: https://stackoverflow.com/questions/6467067/how-to-search-for-a-part-of-a-word-with-elasticsearch. – eemp Oct 27 '19 at 17:13
  • Cool, thanks. Yeah, I saw that answer, which is why I asked this question: this answer https://stackoverflow.com/a/30077747/3866246 has `ngram` as a filter rather than a `tokenizer`, so I tried both but then wasn't sure about the difference. Thanks for your help! – Glen Thompson Oct 27 '19 at 17:15
  • You hit upon a special case/scenario that has them return the same result. Without that keyword tokenizer, the two stand out pretty well on their own in most other situations which I think you may have already realized. https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch discusses each in detail for more common usage when they stand apart. – eemp Oct 27 '19 at 17:19
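To make that divergence concrete, here's a sketch (same hypothetical `ngrams` helper as above, not an ES API) of what happens once the tokenizer is not `keyword`: an ngram tokenizer windows across the raw string, spaces included, while a tokenize-then-ngram-filter pipeline only windows within each token, so grams never cross word boundaries.

```python
def ngrams(text, min_gram=2, max_gram=3):
    # Rough stand-in for Elasticsearch's ngram tokenizer/filter.
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

text = "p&g 40"

# ngram tokenizer over the raw string: grams may span the space.
tokenizer_grams = ngrams(text)

# standard-style tokenizer first (split on whitespace), then ngram filter per token.
filter_grams = [g for word in text.split() for g in ngrams(word)]

# "g 4" crosses the word boundary; only the tokenizer route produces it.
print("g 4" in tokenizer_grams)  # True
print("g 4" in filter_grams)     # False
```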