I'd like to make a regex query in Elasticsearch with word boundaries; however, it looks like the Lucene regex engine doesn't support `\b`. What workarounds can I use?

dimid
- Do you want the `4 text word and wordb` string to be returned, too (if `word` is what you are looking for)? – Wiktor Stribiżew Jan 30 '18 at 09:34
- no, just `word` – dimid Jan 30 '18 at 09:37
- If you are using a tokenizer, you may use Java regex. Then the `\b` is supported. See [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html). – Wiktor Stribiżew Jan 30 '18 at 09:40
- And it seems to me you might use something like `~([A-Za-z0-9_]word|word[A-Za-z0-9_])word~([A-Za-z0-9_]word|word[A-Za-z0-9_])` in the query that uses ES Lucene regex flavor, matching a string that does not contain `word`s with word chars on either end, a word, and again any text but a `word` which is a part of a word. – Wiktor Stribiżew Jan 30 '18 at 09:42
- Thanks, I'll try. – dimid Jan 30 '18 at 09:46
- On second thought, try `(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?` – Wiktor Stribiżew Jan 30 '18 at 09:49
- Excellent, thank you sir. Please consider making your comment an answer, and I'll be glad to accept. – dimid Jan 30 '18 at 10:13
- Added with explanations. – Wiktor Stribiżew Jan 30 '18 at 10:18
1 Answer
In the Elasticsearch regex flavor, there is no direct equivalent to a word boundary. An initial `\b` is something like `(^|[^A-Za-z0-9_])` if the word starts with a word char, and a trailing `\b` is like `($|[^A-Za-z0-9_])` if the word ends with a word char.
Thus, we need to make sure that there is either a non-word char or the start/end of string before and after `word`. Since the regex is anchored by default, all we need to do to make `[^A-Za-z0-9_]` optional at the start/end of string is to add `.*` beside it and wrap both in an optional group:
`(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?`
Details
- `(.*[^A-Za-z0-9_])?` - either start of string, or any 0+ chars (other than line break chars; use `(.|\n)*` to include them) and then any char but a word char (basically, it is start of string followed with 1 or 0 occurrences of the pattern inside the group)
- `word` - a word
- `([^A-Za-z0-9_].*)?` - an optional sequence of any char but a word char followed with any 0+ chars, followed by the end of string position (implicit in Lucene regex).
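The pattern above can be sanity-checked outside Elasticsearch. Below is a minimal Python sketch, not a definitive implementation: it uses `re.fullmatch` to imitate Lucene's implicit anchoring (Python's `re` is a different engine, but this pattern only uses syntax common to both), and it builds a `regexp` query body as a plain dict. The helper names and the `message.keyword` field are hypothetical, and the sketch assumes the search word contains no regex metacharacters and that the field is not analyzed, so the whole string is a single term.

```python
import json
import re

WORD_CHARS = "A-Za-z0-9_"


def whole_word_pattern(word: str) -> str:
    """Build the anchored whole-word pattern from the answer.

    Assumption: `word` is plain alphanumeric text with no regex
    metacharacters, so it can be embedded without escaping.
    """
    return f"(.*[^{WORD_CHARS}])?{word}([^{WORD_CHARS}].*)?"


def regexp_query(field: str, word: str) -> dict:
    """Wrap the pattern in an Elasticsearch `regexp` query body.

    `field` is a hypothetical, non-analyzed (keyword) field name.
    """
    return {"query": {"regexp": {field: {"value": whole_word_pattern(word)}}}}


def matches(text: str, word: str = "word") -> bool:
    """Emulate Lucene's implicit anchoring with re.fullmatch."""
    return re.fullmatch(whole_word_pattern(word), text) is not None


if __name__ == "__main__":
    for sample in ["word", "a word here", "word.", "wordb", "sword", "_word"]:
        print(f"{sample!r}: {matches(sample)}")
    print(json.dumps(regexp_query("message.keyword", "word"), indent=2))
```

The printed query body can be sent to a `_search` endpoint as-is; the local `matches` check is only a quick way to see which strings the pattern accepts before running it against an index.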

Wiktor Stribiżew
- What would I need to do to add certain special characters to the boundaries? Is this correct: `(.*[^A-Za-z0-9#+&=-_])?`? – Florian Walther Aug 08 '22 at 05:36
- My regex above also matches `[` and `?` which I don't want. How can I avoid this? – Florian Walther Aug 08 '22 at 05:47
- @FlorianWalther `=-_` created a range. You need to put `-` at the start of the class, `[^-A-Za-z0-9#+&=_]` – Wiktor Stribiżew Aug 08 '22 at 08:17
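The range issue from the last comment can be reproduced in a few lines. This is a small Python sketch using the standard `re` module as a stand-in, on the assumption that it treats `-` inside a character class the same way for this case: `=-_` spans every code point from `=` (0x3D) to `_` (0x5F), which includes `?` and `[`, while a leading `-` is taken literally.

```python
import re

# '=-_' is read as a range from '=' (0x3D) to '_' (0x5F).
unintended_range = re.compile(r"[=-_]")
# With '-' first, the class holds three literal characters: '-', '=', '_'.
literal_hyphen = re.compile(r"[-=_]")

for ch in "?[-":
    in_range = unintended_range.fullmatch(ch) is not None
    in_literal = literal_hyphen.fullmatch(ch) is not None
    print(f"{ch!r}: range class={in_range}, literal class={in_literal}")
```

Placing `-` first (or last) in the class is the usual way to keep it literal without escaping.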