The pattern tokenizer supports a "group" parameter.
Its default is -1, which means the pattern is used to split the input (the matches are treated as separators), which is what you saw.
However, if you define a capture group in your pattern and set the "group" parameter to its index (0 or higher, where 0 is the whole match), the text matched by that group is emitted as the token instead. E.g. the following tokenizer will split the input into 4-character tokens:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{4})",
          "group": "1"
        }
      }
    }
  }
}
Analyzing some text with this analyzer via the following request:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
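Results in the following tokens (note that the trailing "es" is dropped, because it is shorter than 4 characters and so never matches the pattern):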
{
  "tokens": [
    {
      "token": "comm",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "a,se",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "para",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "ted,",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "valu",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 4
    }
  ]
}
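If it helps to see the difference between the two modes outside of Elasticsearch, here is a rough sketch in Python. It only emulates the "group" semantics with Python's `re` module; it is not how Elasticsearch implements the tokenizer, and the function name `pattern_tokenize` and the "," split pattern are made up for illustration.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Rough emulation of the pattern tokenizer's "group" semantics.

    Hypothetical helper for illustration only, not Elasticsearch code.
    """
    regex = re.compile(pattern)
    tokens = []
    if group < 0:
        # group = -1: matches are separators; tokens are the text between them.
        last = 0
        for match in regex.finditer(text):
            if match.start() > last:
                tokens.append(text[last:match.start()])
            last = match.end()
        if last < len(text):
            tokens.append(text[last:])
        return tokens
    # group >= 0: every match yields one token, taken from that capture group.
    for match in regex.finditer(text):
        token = match.group(group)
        if token:  # skip empty or non-participating groups
            tokens.append(token)
    return tokens


text = "comma,separated,values"

# Split mode (default): the pattern describes the separator.
print(pattern_tokenize(text, ","))
# ['comma', 'separated', 'values']

# Group mode: the pattern describes the token itself.
print(pattern_tokenize(text, "(.{4})", group=1))
# ['comm', 'a,se', 'para', 'ted,', 'valu']

# A quantifier of {1,4} also keeps the shorter remainder at the end.
print(pattern_tokenize(text, "(.{1,4})", group=1))
# ['comm', 'a,se', 'para', 'ted,', 'valu', 'es']
```

The last call suggests that a pattern like "(.{1,4})" would keep the final "es" as its own token; I have only checked that against the regex behaviour sketched here, so verify it with _analyze on your own index.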