How can I replace UTF-8 errors in Ruby without converting to a different encoding?

Question

In order to convert a string to UTF-8 and replace all encoding errors, you can do:

str.encode('utf-8', :invalid=>:replace)

The only problem is that this doesn't work if str is already UTF-8, in which case any errors remain:

irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false

To quote the Ruby Docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:

str.encode('utf-16', :invalid=>:replace).encode('utf-8')

For example:

irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true

Is there a better way to do this without converting to a dummy encoding?

matt · Accepted Answer · 2013-10-03T19:18:28.807

20

Ruby 2.1 has added a String#scrub method that does what you want:

2.1.0dev :001 > x = "foo\x92bar"
 => "foo\x92bar" 
2.1.0dev :002 > x.valid_encoding?
 => false 
2.1.0dev :003 > y = x.scrub
 => "foo�bar" 
2.1.0dev :004 > y.valid_encoding?
 => true
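
As a rough illustration of the options scrub offers (assuming Ruby 2.1 or later), it also accepts a replacement string or a block that receives the invalid bytes. The helper name scrub_variants is made up for this sketch and is not part of Ruby:

```ruby
# Requires Ruby 2.1+, where String#scrub exists.
# scrub_variants is a hypothetical helper used only for illustration.
def scrub_variants(str)
  {
    # No argument: each invalid byte sequence becomes U+FFFD.
    default: str.scrub,
    # String argument: each invalid byte sequence becomes that string.
    custom: str.scrub("?"),
    # Block: receives the invalid bytes and returns their replacement.
    hex: str.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }
  }
end

result = scrub_variants("foo\x92bar")
puts result[:custom]  # prints foo?bar
puts result[:hex]     # prints foo<92>bar
```

The block form is handy when you want to see which bytes were bad rather than just hide them.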

The same commit also changes the behaviour of encode so that it works when the source and dest encodings are the same:

2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
 => "foo�bar" 
2.1.0dev :006 > x.valid_encoding?
 => true

As far as I know, there is no built-in way to do this before 2.1 (otherwise scrub wouldn’t be needed), so you’ll need to use a workaround until 2.1 is released and you can upgrade.
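
One way to bridge the gap, sketched under the assumption that the input is tagged as UTF-8: use scrub when the running Ruby provides it, and fall back to the UTF-16 round trip from the question otherwise. The method name scrub_utf8 is hypothetical:

```ruby
# scrub_utf8 is a hypothetical helper name, not part of Ruby.
# Assumes str is tagged as UTF-8.
def scrub_utf8(str)
  if str.respond_to?(:scrub)
    # Ruby 2.1+: built-in replacement of invalid byte sequences.
    str.scrub
  else
    # Ruby 1.9/2.0: the round trip through UTF-16 from the question.
    str.encode('UTF-16', :invalid => :replace).encode('UTF-8')
  end
end

clean = scrub_utf8("foo\x92bar")
puts clean.valid_encoding?  # prints true
```

Call sites stay the same after an upgrade, and the fallback branch can be deleted once 1.9 support is dropped.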

edited Oct 03 '13 at 19:18

answered Oct 03 '13 at 18:21


Thanks for the info! Unfortunately, I am stuck with Ruby 1.9 and won't be able to upgrade (on this project, at least). – Matt Oct 03 '13 at 18:23

You can use the 'scrub_rb' gem on earlier versions of Ruby – Nigel Sheridan-Smith Oct 27 '14 at 03:47
It appears that as of 2.1, encoding to the same encoding is no longer a no-op... – rogerdpack Feb 28 '23 at 16:05

score 6 · Answer 2 · answered Oct 03 '13 at 17:45


Try this:

"foo\x92bar".chars.select(&:valid_encoding?).join
# => "foobar"

Or, to replace the invalid characters instead of removing them:

"foo\x92bar".chars.map{|c| c.valid_encoding? ? c : "?"}.join
# => "foo?bar"
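
If you need this in more than one place, the two one-liners above could be folded into a single method. This is only a sketch; replace_invalid_chars is a made-up name, and passing an empty replacement gives the "remove" behaviour:

```ruby
# replace_invalid_chars is a hypothetical helper name, not part of Ruby.
# Walks the string character by character and swaps out any character
# whose bytes are not valid in the string's encoding.
def replace_invalid_chars(str, replacement = "?")
  str.each_char.map { |c| c.valid_encoding? ? c : replacement }.join
end

puts replace_invalid_chars("foo\x92bar")      # prints foo?bar
puts replace_invalid_chars("foo\x92bar", "")  # prints foobar
```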

answered Oct 03 '13 at 17:45

tihom


@Matt no, that's exactly what you want. Finding and removing invalid characters. – Reactormonk Oct 03 '13 at 18:13
@Tass There's a difference between "doing what I want it to do" and "doing it the right way". Yes, it does what I want, but so does my example above with the UTF-16 conversion. I was just hoping for a built-in way to do it, which will likely be more efficient than iterating through the string myself. I will admit, though, that this is a better solution than the one I had. – Matt Oct 03 '13 at 18:16
