Elasticsearch custom analyzer for hyphens, underscores, and numbers

Question

Admittedly, I'm not that well versed in the analysis part of ES. Here's the index layout:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "my_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {
                    "type": "word_delimiter",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_filter"]
                }
            }
        }
    }
}

You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":

{
    "query": {
        "match": {
            "hostname": "WIN_1"
        }
    }
}

The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.

{
    "tokens": [
        {
            "token": "win_1",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "win",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "1",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 2
        }
    ]
}
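To make the over-matching concrete, here is a rough Python sketch of the analysis chain above. It is only an approximation for simple ASCII hostnames: `analyze` and `match` are hypothetical helper names, the regex stands in for the word_delimiter filter's default splitting on non-alphanumerics and letter/digit boundaries, and `match` assumes the match query's default OR operator.

```python
import re

def analyze(text):
    """Approximate whitespace tokenizer -> lowercase -> word_delimiter
    (with preserve_original) for simple ASCII hostnames."""
    tokens = set()
    for word in text.lower().split():
        tokens.add(word)  # preserve_original keeps the unsplit token
        # Split on non-alphanumerics and on letter/digit boundaries.
        tokens.update(re.findall(r"[a-z]+|[0-9]+", word))
    return tokens

def match(query, hostname):
    """With the default OR operator, a document is a hit when at least
    one analyzed query token is also among its indexed tokens."""
    return bool(analyze(query) & analyze(hostname))

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
print(sorted(analyze("WIN_1")))
print([h for h in hosts if match("WIN_1", h)])
```

All three sample hostnames come back, because each of them yields a token 1 and the query does too.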

What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in its name. Below is some test data.

{
    "ipaddress": "192.168.1.253",
    "hostname": "WIN_8_ENT_1"
}
{
    "ipaddress": "10.0.0.1",
    "hostname": "server1"
}
{
    "ipaddress": "172.20.10.36",
    "hostname": "ServA-1"
}

Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't very good with examples.

Dan Tuffery · Answer 1 · 2014-08-11T16:11:33.513

You could change your analysis to use a pattern analyzer that discards the digits and underscores:

{
   "analysis": {
      "analyzer": {
          "word_only": {
              "type": "pattern",
              "pattern": "([^\\p{L}]+)"
          }
       }
    }
}

Using the analyze API:

curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'

returns:

"tokens" : [ {
    "token" : "win",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
}, {
    "token" : "ent",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 2
} ]
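If you want to check what this analyzer does to the other test hostnames without a running cluster, a small Python approximation is enough. This is a sketch, not the real analyzer: `word_only` is a hypothetical function name, and because Python's re module has no `\p{L}` class, `[\W\d_]` is used as a stand-in for "not a letter".

```python
import re

def word_only(text):
    """Approximate the pattern analyzer: lowercase the input, split on
    runs of non-letter characters, and drop empty pieces."""
    # [\W\d_] stands in for [^\p{L}], which Python's re does not support.
    return [t for t in re.split(r"[\W\d_]+", text.lower()) if t]

for host in ["WIN_8_ENT_1", "server1", "ServA-1"]:
    print(host, word_only(host))
```

Digits, underscores, and hyphens only ever act as separators here, so they never show up as tokens.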

Your mapping would become:

{
    "event": {
        "properties": {
            "ipaddress": {
                 "type": "string"
             },
             "hostname": {
                 "type": "string",
                 "analyzer": "word_only",
                 "fields": {
                     "raw": {
                         "type": "string",
                         "index": "not_analyzed"
                     }
                 }
             }
         }
    }
}

You can use a multi_match query to get the results you want:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN_1"
       }
   }
}

The issue there is that I get back two different hostnames. One for WIN_8_ENT_1 and one for ServA-1. — Deviation, Aug 11 '14 at 14:30
Yes I did. It got me in the right direction. I'm going to post what I ended up with shortly. — Deviation, Aug 12 '14 at 19:21
Can you please tell me how to define different analyzers for this? In short, I mean one analyzer for the digits and underscores and another analyzer for words that do not contain digits or underscores. I am really stuck on this and have not found a solution anywhere. I have a similar question here; if you know anything about it, please answer it: http://stackoverflow.com/questions/38830256/use-two-filter-such-that-if-i-will-have-an-apostrohe-and-s-in-my-word-then-it-ge — Sudhanshu Gaur, Aug 08 '16 at 15:18

score 2 · Accepted Answer · answered Aug 12 '14 at 19:49

Here's the analyzer and queries I ended up with:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "hostname_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "hostname_filter": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": [
                        "(\\p{Ll}{3,})"
                    ]
                }
            },
            "analyzer": {
                "hostname_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [  "lowercase", "hostname_filter" ]
                }
            }
        }
    }
}
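For anyone who wants to see what tokens this produces, here is a rough Python sketch of the analyzer. It is an approximation under stated assumptions: `hostname_analyzer` is just a hypothetical function mirroring the analyzer name, `[a-z]{3,}` is an ASCII stand-in for the Unicode lowercase class in the real pattern, and it does not model what the real filter does with a token that contains no match at all.

```python
import re

def hostname_analyzer(text):
    """Approximate whitespace tokenizer -> lowercase -> pattern_capture
    keeping only runs of three or more lowercase letters."""
    tokens = []
    for word in text.lower().split():
        # ASCII stand-in for the (\p{Ll}{3,}) capture pattern.
        tokens.extend(re.findall(r"[a-z]{3,}", word))
    return tokens

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
for host in hosts:
    print(host, hostname_analyzer(host))

# A search for WIN is analyzed the same way and compared token by token.
wanted = set(hostname_analyzer("WIN"))
print([h for h in hosts if wanted & set(hostname_analyzer(h))])
```

WIN_8_ENT_1 is reduced to win and ent, so a search for WIN finds it, and the digits never become tokens of their own.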

Queries: Find host name starting with:

{
    "query": {
        "prefix": {
            "hostname.raw": "WIN_8"
        }
    }
}
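One thing to keep in mind with this query: hostname.raw is not_analyzed and the prefix query does not analyze its input, so the comparison is case-sensitive. A minimal Python sketch of that behaviour (`prefix_raw` is a hypothetical helper, not an ES API):

```python
def prefix_raw(prefix, hostnames):
    """Mimic a prefix query on a not_analyzed field: stored values and
    the prefix are compared verbatim, so case matters."""
    return [h for h in hostnames if h.startswith(prefix)]

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
print(prefix_raw("WIN_8", hosts))  # upper-case prefix, as in the query above
print(prefix_raw("win_8", hosts))  # lower-case prefix finds nothing
```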

Find host name containing:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN"
       }
   }
}

Thanks to Dan for getting me in the right direction.

score 1 · Answer 3 · answered Aug 15 '14 at 17:43

When ES 1.4 is released, there will be a new filter called 'keep types' that lets you keep only tokens of certain types once the string is tokenized (for example, words only or numbers only).

Check it out here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter

This may be a more convenient solution for your needs in the future.

score 0 · Answer 4 · answered Aug 11 '14 at 22:56

It looks like you want to apply two different types of searches on your hostname field. One for exact matches, and one for a variation of wildcard (maybe in your specific case, a prefix query).

After trying to implement all types of different searches using several different analyzers, I've found it sometimes simpler to add another field to represent each type of search you want to do. Is there a reason you do not want to add another field like the following:

{ "ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1", "system": "WIN" }
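The extra field can be filled in by whatever code indexes the documents. A minimal sketch, assuming the "system" value is simply the leading run of letters in the hostname (that rule and the `add_system_field` name are my own assumptions for illustration):

```python
import re

def add_system_field(doc):
    """Return a copy of the document with a 'system' field taken from
    the leading run of letters in the hostname (hypothetical rule)."""
    leading = re.match(r"[A-Za-z]+", doc["hostname"])
    enriched = dict(doc)  # leave the caller's document untouched
    enriched["system"] = leading.group(0) if leading else ""
    return enriched

doc = {"ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1"}
print(add_system_field(doc))
```

You would run this before sending each document to the index, so the analyzer on hostname can stay simple.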

Otherwise, you could consider writing your own custom filter that does effectively the same thing under the hood. Your filter will read in your hostname field and index the exact keyword and a substring that matches your stemming pattern (e.g. WIN in WIN_8_ENT_1).

I do not think there is any existing analyzer/filter combination that can do what you are looking for, assuming I have understood your requirements correctly.

answered Aug 11 '14 at 22:56

coffeeaddict

I am using multi-fields to store the full host name as well. Ideally, I'd like to avoid adding another field. – Deviation Aug 12 '14 at 12:30
Can you please tell me how to define different analyzers for this? In short, I mean one analyzer for the digits and underscores and another analyzer for words that do not contain digits or underscores. I am really stuck on this and have not found a solution anywhere. I have a similar question here; if you know anything about it, please answer it: http://stackoverflow.com/questions/38830256/use-two-filter-such-that-if-i-will-have-an-apostrohe-and-s-in-my-word-then-it-ge – Sudhanshu Gaur Aug 08 '16 at 15:20
I do not know of any existing plugin that does what you are looking for, and you can't use more than one analyzer for a field. If you want custom logic, you will need to write your own token filter that handles the use case you described, and then add that token filter to your analyzer settings. – coffeeaddict Aug 10 '16 at 18:30
