
I'm trying to replicate the mappings below using NEST and am facing an issue when mapping the token chars to the tokenizer.

{
   "settings": {
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 2,
               "max_gram": 20,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            }
         }
      }
   }
}

I was able to replicate everything except the token chars part. Can someone help me with that? Below is my code replicating the above mappings (except for the token chars part).

var nGramFilters1 = new List<string> { "lowercase", "asciifolding", "nGram_filter" };
var tChars = new List<string> { "letter", "digit", "punctuation", "symbol" };

var createIndexResponse = client.CreateIndex(defaultIndex, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("nGram_analyzer", cc => cc
                    .Tokenizer("whitespace")
                    .Filters(nGramFilters1)))
            .TokenFilters(tf => tf
                .NGram("nGram_filter", ng => ng
                    .MinGram(2)
                    .MaxGram(20))))));

References

  1. SO Question
  2. GitHub Issue

1 Answer


The NGram Tokenizer supports token characters (`token_chars`), using these to determine which characters should be kept in tokens; it splits on any character that isn't in the list.

The NGram Token Filter, on the other hand, operates on the tokens already produced by a tokenizer, so it only has options for the minimum and maximum gram lengths that should be produced.

Based on your current analysis chain, it's likely you want something like the following:

var createIndexResponse = client.CreateIndex(defaultIndex, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("ngram_analyzer", cc => cc
                    .Tokenizer("ngram_tokenizer")
                    .Filters(nGramFilters))
                )
            .Tokenizers(tz => tz
                .NGram("ngram_tokenizer", td => td
                    .MinGram(2)
                    .MaxGram(20)
                    .TokenChars(
                        TokenChar.Letter,
                        TokenChar.Digit,
                        TokenChar.Punctuation,
                        TokenChar.Symbol
                    )
                )          
            )
        )
    )
);
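
Note that `nGramFilters` here is the list of token filter names applied after tokenization (the question called it `nGramFilters1`). Since the gram generation now happens in the `ngram_tokenizer`, a minimal sketch of that list, under the assumption that the custom `nGram_filter` is no longer needed, might be:

    // Assumption: gram generation is handled by ngram_tokenizer, so the
    // analyzer only lowercases and ASCII-folds the resulting grams.
    var nGramFilters = new List<string> { "lowercase", "asciifolding" };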
  • Thanks Russ. But changing the tokenizer from `whitespace` to `ngram_tokenizer` will not have the characteristics of whitespace, right? Instead, can I add `TokenChar.Whitespace`? Also, in `.Filters(nGramFilters)` I defined a custom filter called `ngram_filter` as in my post. Should I still define it, given that that part is now taken care of by the tokenizer? – ASN Jun 29 '16 at 03:22
  • The `token_chars` for `ngram_tokenizer` are a whitelist, so any characters not covered will not be included in tokens and will be split upon. So, with the above, the `ngram_tokenizer` will split on whitespace when tokenizing and create grams between 2 and 20. The outcome will be similar to having a `whitespace` tokenizer and an `ngram` filter as part of the filters. – Russ Cam Jun 29 '16 at 08:26
  • Hi Russ, for the above mapping style on the title field, when I search for `ASN - Functional Specification for IMS v1.2` or `Elasticsearch The Definitive Guide-Ascetic_trip` on the title field (match phrase), it is not showing any results. Whereas when I searched for the `attachment` or `quiz` keyword, it showed results for the same query. My understanding is that anything with a space in the keyword returns 0 results. Query: `client.Search(s => s.Query(q => q.MatchPhrase(mp => mp.Field(fi => fi.Title).Query(keyword))));` How can I solve this? TIA – ASN Jul 01 '16 at 02:02
  • You can see what tokens will be generated for a given input and analyzer using the `_analyze` API - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html. This gives you insight into the tokens that will be stored in the inverted index **at index time**. Now, for another input, you can see what tokens are generated with the same analyzer. Does it produce the same tokens as the other input? This'll give you some indication of what a specific analyzer does. – Russ Cam Jul 01 '16 at 02:19 (a NEST sketch of such a call follows this comment thread)
  • To understand why a particular document does or does not match a specific query, you can use the `_explain` API - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html – Russ Cam Jul 01 '16 at 02:19
  • It's worth taking some time to go through the Definitive Guide as it will give you a good understanding of Analysis - https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html – Russ Cam Jul 01 '16 at 02:20
  • Sorry Russ for bothering you a lot. I already saw the tokens using `_analyze` and found that there are tokens for those keywords available in lowercase: `asn- functional specification for ims v1.2` and `elasticsearch-c13 - full text search` (`http://localhost:9200/trialforpathfiltering/_analyze?pretty=1&text=ElasticSearch-C13%20-%20Full%20Text%20Search&analyzer=myNGramAnalyzer`) – ASN Jul 01 '16 at 02:35
  • I did `_explain` too. It gave me something like this: `Failure to meet condition(s) of required/prohibited clause(s)` - `no match on required clause (title:\"elasticsearch-c13 - full text search\")` – ASN Jul 01 '16 at 02:36
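
As a follow-up to the `_analyze` suggestion in the comments above, here is a rough sketch of the same check done through NEST rather than the raw URL. It assumes the `ngram_analyzer` and `defaultIndex` names from the answer, that the index has already been created, and uses one of the titles mentioned in the comments as the sample text.

    // Sketch only: inspect the tokens an analyzer produces for a given input,
    // using NEST's Analyze call (the fluent wrapper around the _analyze API).
    var analyzeResponse = client.Analyze(a => a
        .Index(defaultIndex)
        .Analyzer("ngram_analyzer")
        .Text("ASN - Functional Specification for IMS v1.2"));

    // Each token returned is what would be stored in the inverted index at index time.
    foreach (var token in analyzeResponse.Tokens)
    {
        Console.WriteLine(token.Token);
    }

Comparing this output with the tokens produced for the query text should show whether the phrase query can match.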