20

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it to keep backwards compatibility.

I can't know the input encoding.

Example:

> "- Men\xFC -".split("n")
ArgumentError: invalid byte sequence in UTF-8
    from (irb):4:in `split'
    from (irb):4
    from /home/fotanus/.rvm/rubies/ruby-2.0.0-rc2/bin/irb:16:in `<main>'

I can overcome this problem in one line, by using the following, for example:

> "- Men\xFC -".unpack("C*").pack("U*").split("n")
 => ["- Me", "ü -"] 

However, I would like to always ignore invalid byte sequences and disable these errors, either in Ruby itself or in Rails.

fotanus
  • Show some samples of the invalid data. What is the encoding in your database or tables? Rails needs to match that. Data Rails receives needs to be coerced to the same encoding the database will store, otherwise you have to use binary <--> ASCII or binary <--> UTF-8 encoding. – the Tin Man Jun 07 '13 at 15:35
  • @fotanus: it worked with ruby 1.8 because ruby 1.8 didn't treat encoding the same way (in fact, at all). See e.g. http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html and http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/ – Denis de Bernardy Jun 07 '13 at 16:08
  • @Denis thanks, I'm aware that it changed, which is why I'm fighting these issues. – fotanus Jun 07 '13 at 16:18
  • @theTinMan added example – fotanus Jun 07 '13 at 16:20
  • You could try going through all your strings and changing them to something that works. Another option would be to re-open the `::String` class and patch its methods. BTW this looks like it was a standard 8-bit encoding that was used by your system by default. – User Jun 07 '13 at 20:22
  • @User The objective is to avoid going all over the code changing things; it is a huge codebase. Reopening String is a good suggestion, ty. This example may be an 8-bit encoding, but in other cases it can be anything else. – fotanus Jun 08 '13 at 02:44
  • Anything else... maybe you can track down some encodings that are possible, e.g. 75% of your bytes are `\x00` => it is UTF-32; or it looks like UTF-8; you know that you do not have a Japanese 8-bit or Russian alphabet. If it is Russian, it should contain Russian words. Is something like this possible? – User Jun 08 '13 at 06:00
  • @User clever idea, but unfortunately no, because it is translation software :( – fotanus Jun 08 '13 at 06:59

5 Answers

21

I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the strings coming in have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:

s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
s.valid_encoding?  # => true

Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints because the bytes in them would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD", it would be treated as three separate bytes, and each one would get converted to the replacement character. Maybe you could do something like this:

def to_utf8(str)
  str = str.force_encoding("UTF-8")
  return str if str.valid_encoding?
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end
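A quick check of the helper above, reusing the question's example string (the definition is repeated so the snippet runs standalone):

```ruby
# to_utf8 as defined above, repeated so this snippet is self-contained
def to_utf8(str)
  str = str.force_encoding("UTF-8")
  return str if str.valid_encoding?
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end

# The invalid byte \xFC is replaced with the U+FFFD replacement character:
to_utf8("Men\xFC".dup)  # => "Men\uFFFD"

# Already-valid UTF-8 passes through untouched, avoiding the
# mangling of multibyte characters described above:
to_utf8("Menü".dup)     # => "Menü"
```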

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.

David Grayson
  • Thanks for your answer, it is the closest to a global solution. You can always redefine String methods to do this, but I think I'll have to give up and go through all the code dealing with it case by case, because adding this to String would be very hackish. – fotanus Jun 19 '13 at 17:57
6

In Ruby 2.0 you can use the String#b method, a shorthand that returns a copy of the string with its encoding set to BINARY (ASCII-8BIT).
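For example (note that #b returns a binary-tagged copy and leaves the receiver untouched, unlike force_encoding, which changes the receiver in place):

```ruby
s = "- Men\xFC -"  # invalid UTF-8, as in the question
b = s.b            # binary-encoded copy of the same bytes
b.encoding         # => #<Encoding:ASCII-8BIT>
b.split("n")       # => ["- Me", "\xFC -"] -- no ArgumentError
s.encoding         # => #<Encoding:UTF-8> (original unchanged)
```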

Bruno Coimbra
  •
    This is -very- informative, and I'm happy with the information, but maybe it fits better as a comment? – fotanus Jun 19 '13 at 17:51
3

If you just want to operate on the raw bytes, you can try encoding it as ASCII-8BIT/BINARY.

str.force_encoding("BINARY").split("n")

This isn't going to get your ü back, though, since your source string in this case is ISO-8859-1 (or something like it):

"- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
 => "- Menü -"

If you want to get multibyte characters, you have to know what the source charset is. Once you force_encoding to BINARY, you're going to literally just have the raw bytes, so multibyte characters won't be interpreted accordingly.
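To illustrate that last point: a valid UTF-8 string reinterpreted as BINARY still splits fine, but "ü" is now just the two raw bytes C3 BC rather than one character:

```ruby
s = "- Menü -".b   # valid UTF-8, reinterpreted as raw bytes
s.split("n")       # => ["- Me", "\xC3\xBC -"] -- "ü" is now the byte pair C3 BC
```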

If the data is from your database, you can change your connection mechanism to use an ASCII-8BIT or BINARY encoding; Ruby should flag the strings accordingly then. Alternately, you can monkeypatch the database driver to force encoding on all strings read from it. This is a massive hammer, though, and might be entirely the wrong thing to do.

The right answer is going to be to fix your string encodings. This may require a database fix, a database driver connection encoding fix, or some combination thereof. All the bytes are still there, but if you're dealing with a given charset, you should, if at all possible, let Ruby know that you expect your data to be in that encoding. A common mistake is to use the mysql2 driver to connect to a MySQL database which has data in latin1 encodings, but to specify a utf-8 charset for the connection. This causes Rails to take the latin1 data from the DB and interpret it as utf-8, rather than interpreting it as latin1 which you can then convert to UTF-8.
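The mislabeling described above can be reproduced in plain Ruby, without a database: the same latin1 bytes behave very differently depending on which encoding tag is attached to them.

```ruby
latin1_bytes = "Men\xFC".b  # raw bytes as they would sit in a latin1 table

# Mislabeled as UTF-8 (what a wrong connection charset effectively does):
bad = latin1_bytes.dup.force_encoding("UTF-8")
bad.valid_encoding?         # => false -- string operations can now raise

# Correctly labeled as ISO-8859-1, then transcoded:
good = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
good                        # => "Menü"
```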

If you can elaborate on where the strings are coming from, a more complete answer might be possible. You might also check out this answer for a possible global(-ish) Rails solution to default string encodings.

Chris Heald
2

If you can configure your database/page/whatever to give you strings in ASCII-8BIT, you can then recover their real encoding.

Use Ruby's stdlib encoding guessing library. Pass all your strings through something like this:

require 'nkf'
str = "- Men\xFC -"
str.force_encoding(NKF.guess(str))

The NKF library will guess the encoding (usually successfully), and force that encoding on the string. If you don't feel like trusting the NKF library totally, build this safeguard around string operations too:

begin
  str.split
rescue ArgumentError
  str.force_encoding('BINARY')
  retry
end

This will fall back to BINARY if NKF didn't guess correctly. You can turn this into a method wrapper:

def str_op(s)
  begin
    yield s
  rescue ArgumentError
    s.force_encoding('BINARY')
    retry
  end
end
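Usage of the wrapper might look like this (the definition is repeated so the snippet runs standalone): the first attempt raises ArgumentError, the rescue forces BINARY, and the retry succeeds on the raw bytes.

```ruby
# str_op as defined above, repeated so this snippet is self-contained
def str_op(s)
  begin
    yield s
  rescue ArgumentError
    s.force_encoding('BINARY')
    retry
  end
end

s = "- Men\xFC -".dup
parts = str_op(s) { |str| str.split("n") }
parts       # => ["- Me", "\xFC -"]
s.encoding  # => #<Encoding:ASCII-8BIT> (the wrapper re-tagged the string)
```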
Linuxios
2

Encoding in Ruby 1.9 and 2.0 seems to be a bit tricky. \xFC is the code for the character ü in ISO-8859-1, and the same value is ü's Unicode code point, U+00FC = "\u00FC" (which is also its UTF-16 code unit). The invalid string could be an artifact of the Ruby pack/unpack functions. Packing and unpacking Unicode characters with the U* template string is not problematic:

>> "- Menü -".unpack('U*').pack("U*")
=> "- Menü -"

You can create the "wrong" string, i.e. a string that has an invalid encoding, if you first unpack Unicode UTF-8 characters (U), and then pack unsigned characters (C):

>> "- Menü -".unpack('U*').pack("C*")
=> "- Men\xFC -"

This string no longer has a valid encoding. Apparently the conversion can be reversed by applying the operations in the opposite order (a bit like operators in quantum physics):

>> "- Menü -".unpack('U*').pack("C*").unpack("C*").pack("U*")
=> "- Menü -"

In this case it is also possible to "fix" the broken string by first tagging it as ISO-8859-1 and then converting it to UTF-8, though I am not sure whether this works only by accident, because the byte \xFC happens to be a valid character in that character set:

>> "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
=> "- Menü -"
>> "- Men\xFC -".encode("UTF-8", 'ISO-8859-1')
=> "- Menü -"
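Incidentally, this is also why the unpack("C*").pack("U*") one-liner from the question works: pack("U*") treats each byte value as a Unicode code point, and code points U+0000 through U+00FF coincide with ISO-8859-1, so the round trip is effectively a latin1-to-UTF-8 transcode:

```ruby
bytes = "- Men\xFC -".unpack("C*")  # => [45, 32, 77, 101, 110, 252, 32, 45]
bytes.pack("U*")                    # => "- Menü -" (0xFC becomes U+00FC, "ü")
```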
0x4a6f4672
  • Interesting post, but doesn't really answer the question - it is not a global solution, it is only for one string. – fotanus Oct 02 '13 at 17:56
  • Yes, probably. I got a similar problem. Where did your invalid string with the \xFC character come from? I had a UTF-8 encoded text file with special characters like ä, ö, ü in it, and somehow File.open returned invalid strings, although the encoding was recognized correctly as UTF-8 :-( – 0x4a6f4672 Oct 04 '13 at 08:42
  • My strings came from the worst possible place: a file uploaded by the user. – fotanus Oct 04 '13 at 13:39