How can I remove one delimiter from the Elasticsearch tokenizer?

Question

I am using Elasticsearch 6.8 for text search. I realised that the Elasticsearch tokenizer breaks text into words using the delimiters listed here: http://unicode.org/reports/tr29/#Default_Word_Boundaries. I am using match_phrase to search one of the fields in my document, and I'd like to remove one delimiter from the set the tokenizer uses.

I searched and found some solutions, such as using keyword rather than text. That would have a big impact on my search function because a keyword field doesn't support partial queries.

Another solution is to query a keyword field with a wildcard to support partial matches. But this may hurt query performance. Also, I would still like the tokenizer to split on the other delimiters.

A third option is to use tokenize_on_chars to define all the characters used to tokenize text. But this requires me to list every other delimiter. So I am looking for something like tokenize_except_chars.
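To make the third option concrete, here is a minimal sketch of index settings for the char_group tokenizer, written as a Python dict that can be serialized to JSON for the create-index request. The names custom_split and custom_split_analyzer and the particular delimiter list are hypothetical, chosen only for illustration; the hyphen stands in for the one character that should not split tokens.

```python
import json

# Hypothetical names: "custom_split" (tokenizer) and "custom_split_analyzer" (analyzer).
# The delimiter list is illustrative, not exhaustive: every character that should
# split tokens has to be listed, and the character to keep ("-" here) is left out.
char_group_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "custom_split": {
                    "type": "char_group",
                    "tokenize_on_chars": [
                        "whitespace", ".", ",", ";", ":", "!", "?", "/",
                    ],
                }
            },
            "analyzer": {
                "custom_split_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_split",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

# Body for the index-creation request.
print(json.dumps(char_group_settings, indent=2))
```

This shows the drawback described above: the list has to enumerate each delimiter explicitly, because a broad class such as "punctuation" would also cover the character that is meant to be kept.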

So is there an easy way to take one character out of the delimiters the tokenizer uses in Elasticsearch 6.8?

score 0 · Answer 1 · answered Jan 20 '20 at 02:46

I found that Elasticsearch supports protected_words, a parameter of the word delimiter token filter, which can do the job. More info can be found at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/analysis-word-delimiter-tokenfilter.html
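A minimal sketch of how this might be configured, again as a Python dict to be sent as JSON. Several things here are assumptions rather than part of the answer: the names keep_hyphenated and protected_analyzer, the example words, and the choice of the whitespace tokenizer. A token filter only sees tokens after the tokenizer has run, so the sketch assumes a tokenizer that does not already split on the character in question.

```python
import json

# Hypothetical names: "keep_hyphenated" (filter) and "protected_analyzer" (analyzer).
# The protected words are examples only; a token that matches an entry
# is passed through without being split by the word_delimiter filter.
word_delimiter_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "keep_hyphenated": {
                    "type": "word_delimiter",
                    "protected_words": ["wi-fi", "e-mail"],
                }
            },
            "analyzer": {
                "protected_analyzer": {
                    "type": "custom",
                    # Assumption: whitespace tokenizer, so hyphenated tokens
                    # reach the filter intact.
                    "tokenizer": "whitespace",
                    # Assumption: lowercase runs first so tokens can match
                    # the lowercase protected entries.
                    "filter": ["lowercase", "keep_hyphenated"],
                }
            },
        }
    }
}

# Body for the index-creation request.
print(json.dumps(word_delimiter_settings, indent=2))
```

Note that protected_words protects whole tokens, not a single character. The same documentation page also describes a type_table option (for example "- => ALPHA") that remaps how one character is classified, which may be closer to removing a single delimiter; either setting is worth checking with the _analyze API before relying on it.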

answered Jan 20 '20 at 02:46

Joey Yi Zhao
