17

This might sound minor, but it's been driving me nuts. Since releasing an application to production last Friday on Ruby 1.9, I've been having lots of minor exceptions related to character encodings. Almost all of it is some variation on:

Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

We have an international user base so plenty of names contain umlauts, etc. If I fix the templates to use force_encoding in a bunch of places, it pops up in the flash message helper. Et cetera.

At the moment it looks like I've nailed down all the ones I knew about, by patching ActiveSupport's string concatenation in one place and then by setting # encoding: utf-8 at the top of every one of my source files. But the feeling that I might have to remember to do that for every file of every Ruby project I ever do from now on, forever, just to avoid string assignment problems, does not sit well in my stomach. I read about the -Ku switch but everything seems to warn that it's for backwards compatibility and might go away at any time.

So my question for 1.9-experienced folks: is setting #encoding in every one of my files really necessary? Is there a reasonable way to do this globally? Or, better, a way to set the default encoding on non-literal values of strings that bypass the internal/external defaults?

Thanks in advance for any suggestions.

SFEley

4 Answers

13

Don't confuse file encoding with string encoding

The purpose of the # encoding comment at the top of a file is to tell Ruby (while it reads and interprets your code) and your editor (while you edit the file) how to handle any non-ASCII characters in it. It is only necessary if the file contains at least one non-ASCII character, e.g. in your config/locale files.

To define the encoding in all your files at once, you can use the magic_encoding gem, which can insert a utf-8 magic comment into all Ruby files in your app.
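For illustration only, here is a rough sketch of what "insert a magic comment into a file" could look like. It is not the magic_encoding gem's actual implementation; the names MAGIC_COMMENT, MAGIC_PATTERN and with_magic_comment are made up for this example, and it assumes the comment belongs on line 1, or on line 2 when line 1 is a shebang.

```ruby
# Hypothetical helper, NOT taken from the magic_encoding gem.
MAGIC_COMMENT = "# encoding: utf-8"
MAGIC_PATTERN = /\A#.*coding\s*[:=]\s*[\w.-]+/

# Returns the source text with a magic comment added, unless the slot
# where Ruby looks for it already holds one.
def with_magic_comment(source)
  lines = source.lines.to_a
  shebang = !lines.empty? && lines.first.start_with?("#!")
  slot = shebang ? 1 : 0 # the comment goes below a shebang line, if any

  return source if lines[slot].to_s =~ MAGIC_PATTERN

  # Make sure a lone shebang line ends with a newline before inserting.
  lines[0] += "\n" if shebang && !lines[0].end_with?("\n")
  lines.insert(slot, "#{MAGIC_COMMENT}\n")
  lines.join
end

# Possible usage over a project (this rewrites files in place, so try it
# on a copy first):
#   Dir.glob("**/*.rb").each do |path|
#     File.write(path, with_magic_comment(File.read(path)))
#   end
```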

The Encoding::CompatibilityError you're getting at runtime is raised when you try to concatenate two Strings during program execution and their encodings are incompatible.
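A minimal sketch that reproduces this kind of error outside Rails. The constant and method names are invented for the example, and the binary string simply stands in for bytes that arrive without a proper encoding label (for instance from a socket or a database driver).

```ruby
# "\u00FC" is a u-umlaut; \u escapes always produce a UTF-8 string.
UTF8_NAME = "J\u00FCrgen"

# The same kind of text, but labelled as raw bytes (ASCII-8BIT).
BINARY_GREETING = "Gr\u00FC\u00DFe, ".dup.force_encoding(Encoding::BINARY)

# Pure ASCII content, also labelled as raw bytes.
BINARY_ASCII = "Hello, ".dup.force_encoding(Encoding::BINARY)

# Returns the concatenated string, or the exception if Ruby refuses.
def concat_or_error(left, right)
  left + right
rescue Encoding::CompatibilityError => error
  error
end

# Both sides contain non-ASCII bytes and the labels differ: this fails.
p concat_or_error(BINARY_GREETING, UTF8_NAME).class

# A side that holds only ASCII bytes is compatible with anything
# ASCII-based, so this works.
p concat_or_error(BINARY_ASCII, UTF8_NAME).encoding
```

This is why such errors tend to show up only once real user data with umlauts reaches the code: as long as one side is plain ASCII, the mismatch goes unnoticed.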

This most likely happens when:

  • you are using L10N strings (e.g. UTF-8) and concatenate them to, e.g., an ASCII string in your view

  • the user types in a string in a foreign language (e.g. UTF-8), and your view tries to print it out along with some fixed string which you pre-defined (ASCII). force_encoding will help there. There's also Encoding::primary_encoding in Ruby 1.9 to set the default encoding for new Strings. And there is config.encoding in Rails in the config/application.rb file.

  • Strings that come from your database are combined with other Strings in your view (the mismatch could be either way around, and the encodings incompatible).
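As a sketch of how force_encoding can help in the cases above: the helper below relabels a string as UTF-8 when its bytes are already valid UTF-8, and otherwise transcodes it. The name to_utf8 is made up here, and the fallback to ISO-8859-1 is purely an assumption for the example; pick whatever legacy encoding your data really uses.

```ruby
# Hypothetical helper: returns a UTF-8 copy of the given string.
def to_utf8(string)
  return string if string.encoding == Encoding::UTF_8 && string.valid_encoding?

  # force_encoding only changes the label, never the bytes, so it is the
  # right tool when the bytes already are UTF-8 but are tagged otherwise.
  relabelled = string.dup.force_encoding(Encoding::UTF_8)
  return relabelled if relabelled.valid_encoding?

  # ASSUMPTION: anything that is not valid UTF-8 is treated as ISO-8859-1.
  # encode converts the bytes themselves, unlike force_encoding.
  string.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# UTF-8 bytes that arrived labelled as binary, e.g. from a DB driver.
from_db = "J\u00FCrgen".dup.force_encoding(Encoding::BINARY)
puts "Hello, " + to_utf8(from_db)
```

Normalising strings once at the boundary (database, request parameters, file reads) is usually less work than sprinkling force_encoding through the templates.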

Side-Note: Make sure to specify a default encoding when you create your database!

    create database yourproject  DEFAULT CHARACTER SET utf8;

If you want to use EMOJIs in your strings:

    create database yourproject DEFAULT CHARACTER SET utf8mb4 collate utf8mb4_bin;

All indexes on string columns that may contain emoji must be limited to 191 characters, and those columns need CHARACTER SET utf8mb4 COLLATE utf8mb4_bin.

The reason for this is that MySQL's utf8 character set stores at most 3 bytes per character, whereas emoji need 4 bytes of storage.
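The byte counts are easy to check from Ruby. The sketch below also shows where a figure like 191 can come from, assuming the 767-byte index key limit of older InnoDB row formats; that limit is an assumption about your MySQL setup, not something stated above, and the names are invented for the example.

```ruby
# Sample characters, written as \u escapes so they are always UTF-8.
SAMPLES = {
  "u-umlaut"      => "\u00FC",
  "euro sign"     => "\u20AC",
  "grinning face" => "\u{1F600}"
}

SAMPLES.each do |label, char|
  puts "#{label}: #{char.bytesize} byte(s) in UTF-8"
end

# ASSUMPTION: 767 bytes is the index key limit of older InnoDB row formats.
INDEX_KEY_LIMIT_BYTES = 767

# How many characters fit into an index key if every character may need
# up to bytes_per_char bytes.
def max_indexable_chars(bytes_per_char, limit = INDEX_KEY_LIMIT_BYTES)
  limit / bytes_per_char
end

puts max_indexable_chars(3) # 3-byte utf8
puts max_indexable_chars(4) # 4-byte utf8mb4
```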

Please check these Yehuda Katz articles, which cover this in depth and explain it very well (there is specifically a section 'Incompatible Encodings'):

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

http://yehudakatz.com/2010/05/17/encodings-unabridged/

and:

http://zargony.com/2009/07/24/ruby-1-9-and-file-encodings

http://graysoftinc.com/character-encodings

Tilo
  • 6
    I do not want to deal with all this encoding mess. It's nice to know all the edge cases, but I wish there were no edge cases. Simply treat everything as utf8, and if something is something else it has to be declared as such. – grosser Oct 19 '11 at 18:26
  • 2
    @grosser: I agree - it's a huge pain! and what's worse, because of it they messed up the lower-level IO classes, which used to return strings of 8-bit bytes .. now they return interpreted strings of 'who knows what' - super annoying if you need to deal with uninterpreted raw bytes.. – Tilo Oct 19 '11 at 18:28
  • 1
    @grosser - let's be honest. Before UTF8 existed, Japan had to get by on their own. With Ruby being what it is in Japan, and the presence of ISO-2022-JP and Shift_JIS, this is how it's going to be. If you want to be a REAL purist, there are still SOME characters that don't encode into UTF-8, too. But on the whole I ABSOLUTELY agree with you, we should all use UTF8 and be done with it. – makdad May 29 '12 at 08:30
6

In your config/application.rb add

    config.encoding = "utf-8"

and above the Application.initialize! line in config/environment.rb, add the following two lines:

    Encoding.default_external = Encoding::UTF_8
    Encoding.default_internal = Encoding::UTF_8
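A small sketch of what these two settings affect, using a temporary file so it can be run on its own outside Rails. The helper name read_back is made up for the example. Note that the defaults apply to data read through IO; they do not change the encoding of string literals or of String.new.

```ruby
require "tempfile"

Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

# Writes raw bytes to a temp file and reads them back without naming an
# encoding, so the defaults set above decide how the result is labelled.
def read_back(bytes)
  Tempfile.create("encoding-demo") do |file|
    file.binmode # write the bytes exactly as given
    file.write(bytes)
    file.flush
    File.read(file.path)
  end
end

raw = "Gr\u00FC\u00DFe".dup.force_encoding(Encoding::BINARY)
text = read_back(raw)

p text.encoding       # follows Encoding.default_external
p String.new.encoding # still binary; the defaults do not apply here
```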

Hope this helps.

nathanvda
  • looks promising, but still getting the same old multibyte errors when e.g. load 'xxx.rb', where xxx.rb contains utf8 – grosser Oct 17 '11 at 19:11
  • config.encoding is for Rails HTML output encoding afaik, nothing to do with Ruby's string encoding – grosser Oct 18 '11 at 09:52
  • This answer also makes the assumption (albeit fairly) that the OP is asking about Rails. – makdad May 29 '12 at 08:31
3

http://zargony.com/2009/07/24/ruby-1-9-and-file-encodings

Don't confuse file encoding and string encoding!

Trevoke
  • 1
    Thanks Trevoke; I do know the difference. However, strings inherit the encoding of the source file in which they're created. (Unless they come from an IO operation on another file; hence the default_internal and default_external properties.) So while they're not the same, they're deeply and frustratingly related. What I want is a way to set the default _string_ encoding without having to use that `#encoding` comment. – SFEley Jan 19 '10 at 19:47
  • 1
    Everything you -ever- wanted to know about encodings: http://blog.grayproductions.net/categories/character_encodings And probably more than you hoped never to learn :) – Trevoke Jan 19 '10 at 20:17
-1
    String.module_eval "def initialize\nsuper\nputs encoding\nend"
    => nil
    irb(main):006:0> String.new
    ASCII-8BIT
    => ""

Not sure how you implement your strings in your system, but by hooking into the initialize method of the String class, you can set the encoding for any strings you create in the entire application.

kojaktsl
  • does not seem to fix loading files with utf8. I tried: String.module_eval "def initialize\nsuper\nencoding = Encoding::UTF_8\nend" followed by load 'xxx.rb' – grosser Oct 17 '11 at 19:12
  • After doing a bit more testing, I did notice that the initialize method for strings is rarely called. But that was just a suggestion, maybe there's a method for all strings that you call when you create them in your application? Just add the encoding line to that instead of initialize. (and by create, I mean load into memory, parse, or what-have-you) – kojaktsl Oct 19 '11 at 04:33
  • maybe overwriting require could do the trick, but I am not willing to go this far :D – grosser Oct 19 '11 at 18:24