Word boundaries in Atlas Search regex operator

Question

I use the regex operator in my MongoDB Atlas search because the data is indexed using a keyword analyzer, meaning that the whole string of a field is indexed as one single "word".

Because of this, exact word matches don't work like they would with the default analyzer. If the title of a document is `JavaScript beginner tutorial`, then a search for `JavaScript` will not match. Instead, I use regex wildcards to find single words within this larger string, like this:

'(.*)' + 'JavaScript' + '(.*)'
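For context, a pattern like this would typically sit inside a `$search` aggregation stage. The following is only a sketch of that setup: the index name `default`, the field path `title`, and the helper name `buildContainsStage` are assumptions for illustration and are not taken from the question.

```javascript
// Hypothetical helper: builds a $search stage that matches any title
// containing `word` as a substring. Index and path names are assumed.
function buildContainsStage(word) {
  return {
    $search: {
      index: "default", // assumed index name
      regex: {
        path: "title", // assumed field, indexed with the keyword analyzer
        // The whole title is one indexed term, so the wildcards on both
        // sides are what allow a match in the middle of the string.
        query: "(.*)" + word + "(.*)",
      },
    },
  };
}

const stage = buildContainsStage("JavaScript");
console.log(stage.$search.regex.query); // (.*)JavaScript(.*)
```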

This works. However, I want to give exact word matches an extra boost in the search score. To do that, I want to run an additional regex query that only matches when the word is delimited on each side by either whitespace or the beginning or end of the string.

For example, when I search for `Java`, I want `Java beginner tutorial` to rank higher than `JavaScript beginner tutorial`. Currently, this is not the case, because the wildcards match any characters.

The usual regex constructs don't seem to work with the Atlas Search `regex` operator. I tried word boundaries, as in `\bJava\b`, and a few other things, but none of them had any effect.

Try `(.*[^A-Za-z0-9_])?Java([^A-Za-z0-9_].*)?`. From what I see in the docs, the regex flavor is Lucene, and those patterns are all automatically anchored at start/end. — Wiktor Stribiżew, Jul 29 '22 at 09:30
I used it like this: `regex: { query: '\b' + word + '\b' [...]}` — Florian Walther, Jul 29 '22 at 09:30
Please see the above comment. Also, maybe `regex: { query: '.*\b' + word + '\b.*' [...]}` will work, but if it is really Lucene, this won't help. — Wiktor Stribiżew, Jul 29 '22 at 09:32
@WiktorStribiżew Your first solution seems to work! I have to try it out for a few more minutes to be really sure. The documentation says `The Atlas Search regex operator uses the Lucene regular expression engine, which differs from Perl Compatible Regular Expressions.` Is this the reason why `\b` is not working? And do I understand your first solution correctly: it excludes results where a letter or digit appears before the word? — Florian Walther, Jul 29 '22 at 09:37
@WiktorStribiżew Just found your answer on another question, which explains the Lucene behavior! Awesome! — Florian Walther, Jul 29 '22 at 09:38
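
The approach from the comments could be sketched as a `compound` query with two `should` clauses: the original substring pattern, plus the word-delimited pattern carrying a score boost. This is an untested sketch rather than a verified Atlas configuration. The index name `default`, the path `title`, the boost value `3`, and all helper names are assumptions. `emulateLucene` only approximates Lucene's implicit anchoring with a JavaScript `RegExp` so the pattern can be sanity-checked locally; it is not the Lucene engine.

```javascript
// Escape characters that Lucene's regex syntax treats as operators,
// so that input such as "C++" is matched literally.
function escapeLucene(word) {
  return word.replace(/[.?+*|{}\[\]()"\\#@&<>~]/g, "\\$&");
}

const NON_WORD = "[^A-Za-z0-9_]";

// Pattern from the comments: the word must be preceded and followed by
// either a non-word character or the start/end of the string.
function wholeWordPattern(word) {
  const w = escapeLucene(word);
  return "(.*" + NON_WORD + ")?" + w + "(" + NON_WORD + ".*)?";
}

function buildBoostedStage(word) {
  const w = escapeLucene(word);
  return {
    $search: {
      index: "default", // assumed index name
      compound: {
        should: [
          // Substring match: keeps the existing behaviour.
          { regex: { path: "title", query: ".*" + w + ".*" } },
          // Whole-word match: contributes a boosted score on top.
          {
            regex: {
              path: "title",
              query: wholeWordPattern(word),
              score: { boost: { value: 3 } }, // assumed boost value
            },
          },
        ],
        minimumShouldMatch: 1,
      },
    },
  };
}

// Local approximation only: Lucene anchors patterns implicitly, which
// is mimicked here by wrapping the pattern in ^(?:...)$.
function emulateLucene(pattern, text) {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

const p = wholeWordPattern("Java");
console.log(emulateLucene(p, "Java beginner tutorial"));       // true
console.log(emulateLucene(p, "JavaScript beginner tutorial")); // false
console.log(emulateLucene(p, "Learn Java"));                   // true
```

The resulting stage would then be the first element of an aggregation pipeline, for example `collection.aggregate([buildBoostedStage("Java")])`, where `collection` stands for whatever driver collection handle is in use.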
