
I've got a large collection of text data stored in MongoDB that users can query via keyword or phrase, and have an issue where some data has the Unicode character U+00A0 (no-break space) instead of a regular space.

Fixing up the data not being an option (those nbsps are there intentionally), I still want the user to be able to search and find that data. So I updated our Mongo query-building code to search for any whitespace [\s] in places where the user entered a space, resulting in a query like so:

{ "tt" : { "$elemMatch" : { "x" : { "$regex" : "high[\s]performance" , "$options" : "i"} }}}

(there's more to the query, that's just the relevant bit).

Unfortunately, this doesn't return the expected results. So I played around with a bunch of other ways to accomplish this, and eventually discovered that I get the correct results when I search for "not non-whitespace" [^\S], like so:

{ "tt" : { "$elemMatch" : { "x" : { "$regex" : "high[^\S]performance" , "$options" : "i"} }}}

Which leads to my question: why does "any whitespace" ([\s]) fail to find this text while "not non-whitespace" ([^\S]) finds it successfully? Does Mongo have a different set of rules for what counts as whitespace and non-whitespace?

The data is UTF-8 throughout; the MongoDB version is 2.2.2.

devin

1 Answer


I suspect the problem here is with the backslash (\), not with spaces. Can you try writing \\ instead to test my conjecture?
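
To illustrate the conjecture, here is a minimal Python sketch. It assumes (this is not confirmed in the thread) that whatever parses the query string drops the backslash from an unrecognized escape such as \s, so the regex engine actually receives [s] and [^S]. Python's re module only stands in for MongoDB's regex engine here, and the sample text is made up.

```python
import re

text = "high\u00a0performance"  # U+00A0 (no-break space) between the words

# Patterns as the regex engine would see them if the backslash were lost
# (an assumption about the query parser, not confirmed behaviour).
lost_s = "high[s]performance"       # intended: high[\s]performance
lost_not_S = "high[^S]performance"  # intended: high[^\S]performance

# [s] only matches the letter "s", so the no-break space is not found.
found_with_s = re.search(lost_s, text, re.IGNORECASE) is not None

# [^S] matches any character other than the letter "s", including U+00A0.
found_with_not_S = re.search(lost_not_S, text, re.IGNORECASE) is not None

# It also matches unrelated characters, so it is not a real whitespace test.
false_positive = re.search(lost_not_S, "high-performance", re.IGNORECASE) is not None

print(found_with_s, found_with_not_S, false_positive)
```

If that assumption holds, [^\S] only appeared to work: it would also match text like "high-performance", which is probably not what the search should return.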

Medet Tleukabiluly
Igor Chubin
  • Yup, that was it. I just realized that my upstream code already had \\, but that only generated one \ in the query; I needed to build the query with "\\\\s" :D – devin Jan 20 '14 at 20:17
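
The four backslashes follow from two layers of unescaping: first the string literal in the source code, then the parser that reads the query text. Here is a rough sketch in Python, using the standard json module as a stand-in for whatever parser the upstream code fed the query to (the thread doesn't say which language or driver was involved, so treat the names below as illustrative only).

```python
import json
import re

# Four backslashes in the source literal leave two in the runtime string.
query_text = '{"x": {"$regex": "high[\\\\s]performance", "$options": "i"}}'

# The JSON parser then turns the remaining "\\" into a single backslash.
query = json.loads(query_text)
pattern = query["x"]["$regex"]
print(pattern)  # high[\s]performance

# With only two backslashes in the source, the runtime string holds a bare
# "\s", which strict JSON rejects as an invalid escape.
bad_text = '{"x": {"$regex": "high[\\s]performance"}}'
try:
    json.loads(bad_text)
    bad_parsed = True
except json.JSONDecodeError:
    bad_parsed = False

# Python's re treats U+00A0 as whitespace for str patterns; MongoDB's regex
# engine may be configured differently, so this only checks the pattern text.
matches = re.search(pattern, "high\u00a0performance", re.IGNORECASE) is not None
```

A lenient parser might silently drop the stray backslash instead of raising an error as Python's json does, which would be consistent with the symptoms described in the question.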