
I am using Elasticsearch version 1.2.1. I have a use case in which I would like to create a custom tokenizer that breaks tokens into fixed-length chunks, keeping whatever remains at the end as a shorter token. For example, with a chunk length of 4, the token "abcdefghij" would be split into: "abcd efgh ij".

I am wondering whether I can implement this logic without having to code a custom Lucene Tokenizer class.

Thanks in advance.

ybensimhon
  • It's a bit different than the example you've provided, but [NGram Tokenizer](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html) might be worth looking into. – femtoRgon Feb 08 '15 at 20:32

2 Answers


The Pattern Tokenizer supports a "group" parameter.

It defaults to "-1", which means the pattern is used to split the input, which is the behavior you saw.

However, by defining a capture group in your pattern and setting the "group" parameter to that group's index (>= 0), this can be done. For example, the following tokenizer splits the input into 4-character tokens:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{4})",
          "group": "1"
        }
      }
    }
  }
}

Analyzing a document via the following:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}

Results in the following tokens:

{
  "tokens": [
    {
      "token": "comm",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "a,se",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "para",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "ted,",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "valu",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 4
    }
  ]
}
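
Note that with the pattern "(.{4})" a trailing remainder shorter than 4 characters is dropped: the final "es" of "values" does not appear in the output above. If you also need to keep that remainder, as in the "abcd efgh ij" example from the question, changing the pattern to "(.{1,4})" should work, since the greedy quantifier still grabs 4 characters whenever it can and only falls back to fewer at the very end of the input. Treat this as an untested sketch of the same setup:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{1,4})",
          "group": "1"
        }
      }
    }
  }
}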
centic
  • Why is the comma not part of the tokens? `.` in the regex should match the comma as well. – Tom Raganowicz Oct 15 '18 at 14:17
  • In my test I get the comma included in the tokens; see the updated answer with the results included as well. – centic Oct 15 '18 at 15:12
  • Surprisingly, if you use the URL http://localhost:9200/my_index/_analyze?text=comma,separated,values&analyzer=my_analyzer instead of the curl command, the results are different. Perhaps the commas get weirdly encoded, not sure though. Thanks for the update, that's really a neat way of splitting tokens by length. – Tom Raganowicz Oct 15 '18 at 15:32

For your requirement, if you can't do it using the pattern tokenizer, then you'll need to code up a custom Lucene Tokenizer class yourself and ship it as a custom Elasticsearch plugin. You can refer to this for examples of how Elasticsearch plugins are created for custom analyzers.
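
For illustration only: assuming such a plugin registered a tokenizer type named "fixed_length" that takes a "chunk_length" setting (both names are hypothetical here; they would be whatever your plugin defines), wiring it into an index would look much like the pattern-tokenizer example above:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "fixed_length",
          "chunk_length": 4
        }
      }
    }
  }
}

Again, "fixed_length" and "chunk_length" are placeholders; the actual names depend on how the plugin registers its tokenizer.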

bittusarkar
  • I am trying to follow your advice and use the pattern tokenizer, but I am not sure this functionality can be achieved. The pattern I tried is the following: "([.]{0,5})", but it seems to break the token into individual characters (probably because of the greedy regex). – ybensimhon Feb 09 '15 at 09:29
  • I don't think the pattern tokenizer is applicable here based on the docs: "IMPORTANT: The regular expression should match the token separators, not the tokens themselves." But in my case I have no actual separator. – ybensimhon Feb 09 '15 at 09:57
  • I suspected that too. Seems like writing an Elasticsearch plugin with a custom analyzer is the only option left. – bittusarkar Feb 09 '15 at 11:03
  • The second link is broken; it just leads to the home page. – Clashsoft Dec 02 '21 at 19:27