This is pretty easy with 1.9.2 as regular expressions are character-based in 1.9.2 and 1.9.2 knows the difference between bytes and characters top to bottom. You're in Rails so you should get everything in UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range so you can just remove everything that isn't between ' '
and '~'
when you have UTF-8 encoded text:
>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"
There's really no reason to go to all this trouble though. Ruby 1.9 works great with Unicode as does Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago, now it is common and fairly straight forward.
If you do manage to get text data that isn't UTF-8 then you have some options. If the encoding is ASCII-8BIT
or BINARY
then you can probably get away with s.force_encoding('utf-8')
. If you end up with something other than UTF-8
and ASCII-8BIT
then you can use Iconv to re-encode it.
References: