
I'm struggling to make iPhone match when searching for iphone in Elasticsearch.

Since I'm indexing some source code, I definitely need a CamelCase tokenizer, but it appears to break iPhone into two terms, so iphone can't be found.

Does anyone know of a way to add exceptions when breaking camelCase words into tokens (camel + case)?

UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].

Any other solution?

UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more - NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is definitely much more useful from the searcher's point of view. Indexing is also faster! The price to pay is index size, but it is a superior solution.

tishma
  • Why don't you use the lowercase filter? It will lowercase all words – ChintanShah25 Jan 02 '16 at 14:59
  • @ChintanShah25 How does that help fix the tokenizer? (and btw - I use the lowercase filter) – tishma Jan 02 '16 at 15:33
  • tokenizers are different from filters. iPhone will be indexed as iphone with the lowercase filter. It would help if you posted your current analyzer and mapping – ChintanShah25 Jan 02 '16 at 15:34
  • unless I got it all wrong, the tokenizer acts before the filters, so the camel-case tokenizer chops iPhone into [i, Phone], and lowercase just turns that into [i, phone] – tishma Jan 02 '16 at 15:36
  • the tokenizer is the exact one from the link. Let me try to put together something more complete – tishma Jan 02 '16 at 15:37
  • actually - I won't. The complete example is contained in the linked docs page. The problem is in the tokenizer, not in the mapping or filters. Thanks for the effort. – tishma Jan 02 '16 at 15:42
  • You cannot both split on camel case and avoid breaking "iPhone" at the same time. There is no way for Elasticsearch to figure out when to apply camel-case splitting and when not to. You have to come up with those rules. There may be other tokens besides "iPhone" that you do not want broken, or are there? – bittusarkar Jan 02 '16 at 15:59
  • I'm working on an updated pattern that will not break words starting with a single 'i' followed by an uppercase letter into tokens. If I find more words I don't want broken, I guess I'll keep adding exceptions. – tishma Jan 02 '16 at 16:02
  • @bittusarkar Good point! What would I do with WiFi, and who knows what else...? I'm wondering if it's worth having the same field analyzed twice (e.g. with the standard and the camel analyzer), so I can pick up both matches. – tishma Jan 02 '16 at 16:12

1 Answer


You can achieve your requirements with the word_delimiter token filter. This is my setup:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {
  }
}

This will split words on case changes, so NullPointerException will be tokenized as null, pointer and exception, but iPhone and WiFi will remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also use preserve_original, which will help you a lot.
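
For example, a variant of the camel_filter above with preserve_original and catenate_words switched on (just a sketch - both are standard word_delimiter options, so tune them to your needs; the _analyze examples below use the original setup) keeps the original token and the concatenation of the parts alongside the split parts:

"filter": {
  "camel_filter": {
    "type": "word_delimiter",
    "generate_number_parts": false,
    "stem_english_possessive": false,
    "split_on_numerics": false,
    "preserve_original": true,
    "catenate_words": true,
    "protected_words": [
      "iPhone",
      "WiFi"
    ]
  }
}

With this, NullPointerException produces null, pointer and exception plus the original form (which, after the lowercase filter, comes out the same as the catenated nullpointerexception), at the cost of a somewhat larger index.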

GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "iphone",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 1
      }
   ]
}

Now with

GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "null",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 1
      },
      {
         "token": "pointer",
         "start_offset": 4,
         "end_offset": 11,
         "type": "word",
         "position": 2
      },
      {
         "token": "exception",
         "start_offset": 11,
         "end_offset": 20,
         "type": "word",
         "position": 3
      }
   ]
}

Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
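
If you do go the two-analyzer route, a minimal multi-fields sketch could look like this (the mapping type doc, the field name code and the sub-field name std are placeholders of mine, and string is the pre-5.x field type):

"mappings": {
  "doc": {
    "properties": {
      "code": {
        "type": "string",
        "analyzer": "camel_analyzer",
        "fields": {
          "std": {
            "type": "string",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

Queries can then target both code and code.std and pick up a match from whichever analyzer preserved the term.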

Does this help?

ChintanShah25
  • It helps a lot! I think preserve_original plus even catenate_all is a must, as it is a catch-all solution. Who wouldn't expect to find NullPointerException when searching for NullPointerException, instead of only when searching for parts of it! – tishma Jan 02 '16 at 16:34
  • As I said, it has a lot of options; you can tweak them according to your various requirements. – ChintanShah25 Jan 02 '16 at 16:36
  • ok, got it, the letter tokenizer could misbehave for things like iphone 4s, audi r8, etc. – ChintanShah25 Jan 02 '16 at 22:37