I've just updated from ruby 1.9.2 to ruby 1.9.3p0 (2011-10-30 revision 33570). My rails application uses postgresql as its database backend. The system locale is UTF8, as is the database encoding. The default encoding of the rails application is also UTF8. I have Chinese users who input Chinese characters as well as English characters. The strings are stored as UTF8 encoded strings.

Rails version: 3.0.9

Since the update some of the existing Chinese strings in the database are no longer displayed correctly. This does not affect all strings, but only those that are part of a serialized hash. All other strings that are stored as plain strings still appear to be correct.


Example:

This is a serialized hash that is stored as a UTF8 string in the database:

broken = "--- !map:ActiveSupport::HashWithIndifferentAccess \ncheckbox: \"1\"\nchoice: \"Round Paper Clips  \\xEF\\xBC\\x88\\xE5\\x9B\\x9E\\xE5\\xBD\\xA2\\xE9\\x92\\x88\\xEF\\xBC\\x89\\r\\n\"\ninfo: \"10\\xE7\\x9B\\x92\"\n"

In order to convert this string to a ruby hash, I deserialize it with YAML.load:

broken_hash = YAML.load(broken)

This returns a hash with garbled contents:

{"checkbox"=>"1", "choice"=>"Round Paper Clips  ï¼\u0088å\u009B\u009Eå½¢é\u0092\u0088ï¼\u0089\r\n", "info"=>"10ç\u009B\u0092"}

The garbled stuff is supposed to be UTF8-encoded Chinese. broken_hash['info'].encoding tells me that ruby thinks this is #<Encoding:UTF-8>. I disagree.

Interestingly, strings that were never serialized look fine. In the same record, a different field contains Chinese characters that look just right in the rails console, the psql console, and the browser. Every string saved to the database since the update, whether part of a serialized hash or a plain string, looks fine too.


I tried to convert the garbled text from a possibly wrong encoding (like GB2312 or ANSI) to UTF-8, despite ruby's claim that it was already UTF-8, and of course I failed. This is the code I used:

require 'iconv'
Iconv.conv('UTF-8', 'GB2312', broken_hash['info'])

This fails because ruby doesn't know what to do with illegal sequences in the string.

I really just want to run a script to fix all the old, presumably broken serialized hash strings and be done with it. Is there a way to convert these broken strings to something resembling Chinese again?


I just played with the encoded UTF-8 string in the raw string (called "broken" in the above example). This is the Chinese string that is encoded in the serialized string:

chinese = "\\xEF\\xBC\\x88\\xE5\\x9B\\x9E\\xE5\\xBD\\xA2\\xE9\\x92\\x88\\xEF\\xBC\\x89\\r\\n"

I noticed that it is easy to convert this to a real UTF-8 encoded string by unescaping it (removing the escape backslashes).

chinese_ok = "\xEF\xBC\x88\xE5\x9B\x9E\xE5\xBD\xA2\xE9\x92\x88\xEF\xBC\x89\r\n"

This returns a proper UTF-8-encoded Chinese string: "（回形针）\r\n" (the parentheses are the full-width characters U+FF08 and U+FF09).

The thing falls apart only when I use YAML.load(...) to convert the string to a ruby hash. Maybe I should process the raw string before it is fed to YAML.load. Just makes me wonder why this is so...
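A minimal sketch of that preprocessing idea, assuming the only problem is the literal `\xNN` escape text in the stored YAML. The helper name `unescape_high_bytes` is made up for this example. It only touches bytes of 0x80 and above, so escaped ASCII characters (quotes, `\r`, `\n`) stay escaped and the YAML quoting is not disturbed. It does not handle a backslash that is itself escaped in front of an `x`.

```ruby
# Hypothetical helper: turn literal "\xNN" escape text back into the raw
# bytes it stands for, for bytes >= 0x80 only.
def unescape_high_bytes(raw)
  bytes = raw.b  # binary copy, so that single bytes can be substituted in
  # Each match is 4 characters of text ("\xE7"); replace it with one byte (0xE7).
  unescaped = bytes.gsub(/\\x([89A-Fa-f][0-9A-Fa-f])/) { $1.hex.chr }
  # Relabel the result as UTF-8; the bytes themselves are not changed.
  unescaped.force_encoding(Encoding::UTF_8)
end

escaped = "info: \"10\\xE7\\x9B\\x92\"\n"
fixed   = unescape_high_bytes(escaped)

puts fixed                  # the value should now read as Chinese
puts fixed.valid_encoding?  # true if the bytes form valid UTF-8
```

The cleaned string could then be passed to YAML.load in place of the raw column value. Checking `valid_encoding?` first seems prudent, since a record that was damaged in some other way would show up there.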


Interesting! This is likely due to the YAML engine "psych" that's used by default now in 1.9.3. I switched to the "syck" engine with YAML::ENGINE.yamler = 'syck' and the broken strings are correctly parsed.

mu is too short
  • What is the column type for the serialized hashes? – mu is too short Dec 19 '11 at 07:42
  • @muistooshort: the column type is `text`. –  Dec 19 '11 at 08:06
  • What happens if you change the column to `binary`? That should get the string out as ASCII-8BIT (i.e. raw bytes) and maybe that will kick `YAML.load` into shape. As a quick test you can `broken.force_encoding('binary')` before `YAML.load(broken)`. – mu is too short Dec 19 '11 at 08:20
  • Have a look at `Iconv.conv('UTF-8', 'ISO-8859-1', "\xEF\xBC\x88\xE5\x9B\x9E\xE5\xBD\xA2\xE9\x92\x88\xEF\xBC\x89")` inside `irb`. The strings claim to be UTF-8 but I think they've been mangled into Latin-1. – mu is too short Dec 19 '11 at 08:38
  • Converting to binary didn't help. The resulting hash is the same as without. –  Dec 20 '11 at 01:05
  • `YAML.load` works just fine, when I manually edit the serialized hash string by removing the double backslashes. So instead of the above string, loading this string works: `"--- !map:ActiveSupport::HashWithIndifferentAccess \ncheckbox: \"1\"\nchoice: \"Round Paper Clips \xEF\xBC\x88\xE5\x9B\x9E\xE5\xBD\xA2\xE9\x92\x88\xEF\xBC\x89\\r\\n\"\ninfo: \"10\xE7\x9B\x92\"\n"` –  Dec 20 '11 at 02:21
  • When you manually remove the double backslashes in `irb` you're leaving raw bytes (`\xEF\xBC`...) in the string, and the Ruby interpreter will give you a simple UTF-8 string with the Chinese characters intact. Do a `puts broken_string` and you'll see Chinese. – mu is too short Dec 20 '11 at 02:47

2 Answers

12

This seems to have been caused by a difference in the behaviour of the two available YAML engines "syck" and "psych". To set the YAML engine to syck:

YAML::ENGINE.yamler = 'syck'

To set the YAML engine back to psych:

YAML::ENGINE.yamler = 'psych'

The "syck" engine processes the strings as expected and converts them to hashes with proper Chinese strings. When the "psych" engine is used (default in ruby 1.9.3), the conversion results in garbled strings.

Adding the above line (the first of the two) to config/application.rb fixes this problem. The "syck" engine is no longer maintained, so I should probably only use this workaround to buy me some time to make the strings acceptable for "psych".

  • Seems that we were looking at the same things at the same time. I'd re-encode everything to the Psych format or ditch YAML completely and manually serialize using JSON or some other stable/portable format. – mu is too short Dec 20 '11 at 02:43
  • BTW, you can accept your own answer and I think it makes sense to do so in this case. – mu is too short Dec 20 '11 at 21:46
9

From the 1.9.3 NEWS file:

* yaml
  * The default YAML engine is now Psych. You may downgrade to syck by setting
    YAML::ENGINE.yamler = 'syck'.

Apparently the Syck and Psych YAML engines treat non-ASCII strings in different and incompatible ways.

Given a Hash like you have:

h = {
    "checkbox" => "1",
    "choice"   => "Round Paper Clips  （回形针）\r\n",
    "info"     => "10盒"
}

Using the old Syck engine:

>> YAML::ENGINE.yamler = 'syck'
>> h.to_yaml
=> "--- \ncheckbox: \"1\"\nchoice: \"Round Paper Clips  \\xEF\\xBC\\x88\\xE5\\x9B\\x9E\\xE5\\xBD\\xA2\\xE9\\x92\\x88\\xEF\\xBC\\x89\\r\\n\"\ninfo: \"10\\xE7\\x9B\\x92\"\n"

we get the ugly double-backslash format that you currently have in your database. Switching to Psych:

>> YAML::ENGINE.yamler = 'psych'
=> "psych"
>> h.to_yaml
=> "---\ncheckbox: '1'\nchoice: ! \"Round Paper Clips  （回形针）\\r\\n\"\ninfo: 10盒\n"

The strings stay in normal UTF-8 format. If we manually screw up the encoding to be Latin-1:

>> Iconv.conv('UTF-8', 'ISO-8859-1', "\xEF\xBC\x88\xE5\x9B\x9E\xE5\xBD\xA2\xE9\x92\x88\xEF\xBC\x89") 
=> "ï¼\u0088å\u009B\u009Eå½¢é\u0092\u0088ï¼\u0089"

then we get the sort of nonsense that you're seeing.
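The same round trip can be sketched without Iconv, using only `String#encode` and `String#force_encoding`. This assumes the damage really is "UTF-8 bytes read as Latin-1", which is my guess from the output above rather than something I have confirmed inside Psych. Both helper names are made up for the example.

```ruby
# Hypothetical helpers; neither name comes from Ruby or Rails.

# Reproduce the damage: treat every UTF-8 byte as one Latin-1 character,
# then transcode those characters to UTF-8.
def mangle_as_latin1(utf8)
  utf8.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Undo it: map each character back to a single byte, then relabel the
# bytes as UTF-8. Raises Encoding::UndefinedConversionError if the input
# contains characters outside Latin-1, which suggests the string was not
# garbled in this particular way.
def repair_latin1_mojibake(garbled)
  garbled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

garbled  = mangle_as_latin1("10\xE7\x9B\x92")
repaired = repair_latin1_mojibake(garbled)

puts garbled.inspect  # the same kind of nonsense as in the question
puts repaired         # the original Chinese text again
```

If that assumption holds, `repair_latin1_mojibake` could be applied to each value of a hash that Psych has already loaded. Checking `valid_encoding?` on the result before saving anything back would be a sensible guard.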

The YAML documentation is rather thin so I don't know if you can force Psych to understand the old Syck format. I think you have three options:

  1. Use the old, unsupported, and deprecated Syck engine; you'd need to set YAML::ENGINE.yamler = 'syck' before you load or dump any YAML.
  2. Load and decode all your YAML using Syck and then re-encode and save it using Psych.
  3. Stop using serialize in favor of manually serializing/deserializing using JSON (or some other stable, predictable, and portable text format) or use an association table so that you're not storing serialized data at all.
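
A rough sketch of what option 3 could look like with Ruby's standard `json` library. `JsonCoder` is a name invented for this example, and whether `serialize` will accept an object like this as a coder depends on the Rails version, so treat that part as an assumption to check. Calling `dump` and `load` by hand works regardless.

```ruby
require 'json'

# Hypothetical coder object with the two methods a serializer needs.
module JsonCoder
  # Hash -> JSON text; nil is stored as an empty object.
  def self.dump(value)
    JSON.generate(value || {})
  end

  # JSON text -> Hash with string keys; blank input becomes an empty Hash.
  def self.load(text)
    return {} if text.nil? || text.empty?
    JSON.parse(text)
  end
end

h = { "checkbox" => "1", "info" => "10盒" }

stored   = JsonCoder.dump(h)       # what would go into the text column
restored = JsonCoder.load(stored)  # what the application would read back

puts stored
puts restored == h
```

Note that JSON only gives back string keys, so the indifferent (symbol or string) access of HashWithIndifferentAccess would have to be restored separately if the application relies on it.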
mu is too short
  • Ha, that's cool: you've submitted your answer a minute after I figured it out. I've now temporarily fixed the applications by forcing "syck" to be used. Eventually, I will have to do it the hard way and re-encode everything with "psych". Really don't like incompatible changes. –  Dec 20 '11 at 02:47
  • @rekado: I'd move away from YAML entirely, I think it is a horrible format for data serialization and the Rails guys were foolish to use it for `serialize`. But I'm also a natural born heretic :) – mu is too short Dec 20 '11 at 03:01