To convert a string to UTF-8 and replace any invalid byte sequences, you can use:

str.encode('utf-8', :invalid=>:replace)

The problem is that this does nothing if str is already tagged as UTF-8, so any invalid bytes remain:

irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false

To quote the Ruby Docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:

str.encode('utf-16', :invalid=>:replace).encode('utf-8')

For example:

irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true
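
For reuse, the round trip can be wrapped in a small method. This is only a sketch: the method name `utf8_via_utf16` and its `replacement` parameter are made up for illustration, and it assumes the input is tagged as UTF-8. It also uses the `:replace` option of `String#encode` to pick the substitute string.

```ruby
# Hypothetical helper; the name and the `replacement` parameter are
# illustrative and do not come from the post above.
def utf8_via_utf16(str, replacement = "\uFFFD")
  # Transcoding to a different encoding forces Ruby to inspect every byte,
  # so invalid sequences are swapped for `replacement` along the way.
  str.encode('utf-16', :invalid => :replace, :replace => replacement)
     .encode('utf-8')
end

puts utf8_via_utf16("foo\x92bar").valid_encoding?
puts utf8_via_utf16("foo\x92bar", "?")
```

Passing an empty string as the replacement drops the invalid bytes instead of marking them.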

Is there a better way to do this without converting to a dummy encoding?

Matt

2 Answers

Ruby 2.1 has added a String#scrub method that does what you want:

2.1.0dev :001 > x = "foo\x92bar"
 => "foo\x92bar" 
2.1.0dev :002 > x.valid_encoding?
 => false 
2.1.0dev :003 > y = x.scrub
 => "foo�bar" 
2.1.0dev :004 > y.valid_encoding?
 => true 
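
As a rough illustration of the other forms `String#scrub` accepts (Ruby 2.1 or later is assumed), it also takes a replacement string or a block that receives the invalid bytes. The variable names here are illustrative only.

```ruby
x = "foo\x92bar"

# Replace each invalid sequence with a string of your choice.
marked = x.scrub("?")

# An empty replacement removes the invalid bytes entirely.
stripped = x.scrub("")

# A block receives the offending bytes and returns their replacement;
# here they are rendered as hex inside angle brackets.
hexed = x.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts marked    # foo?bar
puts stripped  # foobar
puts hexed     # foo<92>bar
```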

The commit that added scrub also changes the behaviour of encode so that it works when the source and destination encodings are the same:

2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
 => "foo�bar" 
2.1.0dev :006 > x.valid_encoding?
 => true 

As far as I know, there is no built-in way to do this before 2.1 (otherwise scrub wouldn't be needed), so you'll need a workaround until 2.1 is released and you can upgrade.
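
One way to bridge the gap is a helper that prefers `scrub` when it exists and falls back to the UTF-16 round trip from the question otherwise. This is a minimal sketch: the method name `to_valid_utf8` is hypothetical, and it assumes the string is already tagged as UTF-8.

```ruby
# Hypothetical helper; the name is illustrative, not part of Ruby.
def to_valid_utf8(str)
  if str.respond_to?(:scrub)
    # Ruby 2.1+: built-in replacement of invalid byte sequences.
    str.scrub
  else
    # Older Rubies: round-trip through UTF-16, as in the question.
    str.encode('utf-16', :invalid => :replace).encode('utf-8')
  end
end

puts to_valid_utf8("foo\x92bar").valid_encoding?
```

Checking with `respond_to?` rather than comparing `RUBY_VERSION` means the code switches to the built-in method automatically after an upgrade.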

matt

To drop the invalid characters, filter them out:

"foo\x92bar".chars.select(&:valid_encoding?).join
# => "foobar"

Or, to replace each invalid character with a placeholder:

"foo\x92bar".chars.map{|c| c.valid_encoding? ? c : "?"}.join
# => "foo?bar"
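
Both variants can be folded into one method. This is a sketch only: the name `replace_invalid_chars` and its `replacement` parameter are invented for illustration. It uses `each_char`, which yields the same characters as `chars`.

```ruby
# Hypothetical helper; name and parameter are illustrative.
# Walks the string character by character and swaps every character
# that is not valid in the string's encoding for `replacement`.
def replace_invalid_chars(str, replacement = "?")
  str.each_char.map { |c| c.valid_encoding? ? c : replacement }.join
end

puts replace_invalid_chars("foo\x92bar")      # foo?bar
puts replace_invalid_chars("foo\x92bar", "")  # foobar
```

Since this iterates in Ruby rather than in the interpreter's C code, it is likely slower than a built-in method on long strings.
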
tihom
  • @Matt no, that's exactly what you want. Finding and removing invalid characters. – Reactormonk Oct 03 '13 at 18:13
  • @Tass There's a difference between "doing what I want it to do" and "doing it the right way". Yes, it does what I want, but so does my example above with the UTF-16 conversion. I was just hoping for a built-in way to do it, which will likely be more efficient than iterating through the string myself. I will admit, though, that this is a better solution than the one I had. – Matt Oct 03 '13 at 18:16