How can I query for Unicode characters in MongoDB using Ruby?

Question

Let's say I have a record in my database that has:

name: "World\u0092s Greatest Jet Fighter Pilot"

OK I need to get in there and clean out the \u0092 (there were a ton of these in the db). I can query like this:

# encoding: UTF-8
...
def self.by_partial name 
  return Movie.find(:all, :conditions => {:name => /^.*#{name}.*/i})
end


# console: 
>> sel = Movie.by_partial(/Greatest/) and sel.size
=> 1

and get back the correct number of records. But when I throw in the unicode, it fails:

>> sel = Movie.by_partial(/\u0092/) and sel.size
=> 0
>> sel = Movie.by_partial(/\\u0092/) and sel.size
=> 0
>> sel = Movie.by_partial('\u0092') and sel.size
=> 0
>> sel = Movie.by_partial('\\u0092') and sel.size
=> 0

What do I need to do to query for records that contain Unicode characters? Is this a setting in the Rails console? I worked around it by iterating over the records and checking each one with if mov.name =~ /\u0092/ ..., but I can't figure out how to pass a Unicode string into my Mongoid selector. Iterating over the records seemed far too brute-force. Luckily I don't need to do this very often.

score 2 · Accepted Answer · edited May 23 '17 at 12:14

I don't think your problem is with Unicode. Your problems are:

1. The string interpolation inside by_partial.
2. \u only works inside double-quoted strings.

Second things first:

> '\u0070'
=> "\\u0070" 

> '\\u0070'
=> "\\u0070" 

> "\u0070"
=> "p"

So Movie.by_partial("\u0092") should work.
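To make the quoting difference concrete, here is a small sketch in plain Ruby (no database involved). The variable names are only illustrative, and the sample name is the one from the question:

```ruby
# encoding: UTF-8

single = '\u0092'  # single quotes: six literal characters (backslash, u, 0, 0, 9, 2)
double = "\u0092"  # double quotes: the single code point U+0092

puts single.length              # 6
puts double.length              # 1
puts double.codepoints.inspect  # [146], i.e. 0x92

name = "World\u0092s Greatest Jet Fighter Pilot"
puts name.include?(double)      # true
puts name.include?(single)      # false
```

Only the double-quoted version contains the character that is actually stored in the record.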

Your first problem is that you're passing /\u0092/ (which does match the character in question) to by_partial but by_partial does this:

/^.*#{name}.*/i

With /\u0092/ interpolated, that becomes /^.*#{/\u0092/}.*/i, which ends up as /^.*(?-mix:\u0092).*/i. I'd guess that the MongoDB driver has trouble translating that Ruby regex into a JavaScript regex.
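The interpolation can be checked in plain Ruby. This sketch only inspects the pattern that by_partial would build (inner and outer are illustrative names); it says nothing about what the driver does with it afterwards:

```ruby
# encoding: UTF-8

inner = /\u0092/
outer = /^.*#{inner}.*/i   # the same pattern by_partial builds

puts inner.to_s     # (?-mix:\u0092)
puts outer.source   # ^.*(?-mix:\u0092).*

# Ruby itself still matches with the nested group and the \u escape.
name = "World\u0092s Greatest Jet Fighter Pilot"
puts !!(name =~ outer)   # true
```

So the pattern is valid on the Ruby side, but it carries both a Ruby-specific (?-mix:...) group and a textual \u escape by the time it is handed to the driver.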

The MongoDB driver doesn't seem to like \u in a regex at all. Feeding /^\u0070/ into MongoDB doesn't get me any matches, but /^p/ does find what I'm expecting, and /^#{"\u0070"}/ also works. I'm not sure what's going on in the guts of the MongoDB regex translator, but we're not the only ones to come across this. I'd guess that the translator doesn't understand \u, so the escape is passed through as the raw text \\u0092, and since you don't have that sequence of six characters in your database, you don't find anything.
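Based on that observation, one way to build the pattern is to start from a double-quoted string so the regex source contains the real character and not a \u escape. This is a minimal sketch: partial_pattern is a hypothetical helper, not part of the original code, and it has only been checked in plain Ruby, not against a MongoDB driver:

```ruby
# encoding: UTF-8

# Hypothetical helper: builds a case-insensitive pattern from a plain string.
# Regexp.escape keeps characters such as "." or "(" in the input literal.
def partial_pattern(text)
  /#{Regexp.escape(text)}/i
end

pattern = partial_pattern("\u0092")

# The source holds the character itself, not the six-character escape.
puts pattern.source.include?('\u')   # false

name = "World\u0092s Greatest Jet Fighter Pilot"
puts !!(name =~ pattern)             # true
```

The resulting pattern would then go where by_partial currently puts its regex, i.e. as the :name value in :conditions. The leading ^.* and trailing .* are dropped here because an unanchored regex already matches anywhere in the string.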

Sweet, all it needed was the double quotes. Thanks! – jcollum, Feb 28 '12 at 23:46
