To convert a string to UTF-8 and replace any invalid byte sequences, you can use:
str.encode('utf-8', :invalid=>:replace)
The problem is that this has no effect if str is already tagged as UTF-8, so any invalid bytes remain:
irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false
To quote the Ruby Docs:
"Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes."
The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:
str.encode('utf-16', :invalid=>:replace).encode('utf-8')
For example:
irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true
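For reuse, the round trip above can be wrapped in a small helper. This is only a sketch of the workaround already described: the method name replace_invalid_utf8 is hypothetical, and it assumes the input string is tagged as UTF-8 but may contain invalid bytes.

```ruby
# Hypothetical helper (name is an assumption, not part of Ruby's API):
# replaces invalid bytes by transcoding to UTF-16 and back to UTF-8.
def replace_invalid_utf8(str)
  # The first encode does the real work: because the target encoding
  # differs from the source, invalid bytes are actually replaced.
  utf16 = str.encode('UTF-16', invalid: :replace)
  # The second encode converts the now-valid text back to UTF-8.
  utf16.encode('UTF-8')
end

cleaned = replace_invalid_utf8("foo\x92bar")
puts cleaned.valid_encoding?
puts cleaned.encoding
```

The cost is two full transcoding passes and a temporary UTF-16 copy of the string, which is why a direct approach would be preferable.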
Is there a better way to do this without converting to a dummy encoding?