Suppose I want to convert "\xBD" to UTF-8.

If I use pack & unpack, I'll get ½:

puts "\xBD".unpack('C*').pack('U*')    #=> ½

as "\xBD" is ½ in ISO-8859-1.

BUT "\xBD" is œ in ISO-8859-9.

My question is: why did pack use ISO-8859-1 instead of ISO-8859-9 to convert the char to UTF-8? Is there some way to configure that character encoding?

I know I can use Iconv in Ruby 1.8.7, and String#encode in 1.9.2, but I'm curious about pack because I use it in some code.

Sony Santos

1 Answer

This actually has nothing to do with how \xBD is represented in ISO-8859-x. The critical part is the pack into UTF-8.

pack receives [189]. With the U directive, each integer is treated as a Unicode code point, and code point 189 (U+00BD) is ½. The result matches ISO-8859-1 only because Unicode deliberately reuses ISO-8859-1 for its first 256 code points. pack itself never consults an ISO-8859-x table, so there is no character set to configure.
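To make the difference concrete, here is a minimal sketch comparing the pack round trip with String#encode (the method mentioned in the question), which does take a source character set. It assumes Ruby 2.0 or later for String#b. ISO-8859-15 is my choice of contrasting charset, because in Ruby's tables it maps byte 0xBD to œ, while ISO-8859-9 maps it to ½ just like ISO-8859-1. The variable names are illustrative only.

```ruby
# One raw byte with no character set attached (ASCII-8BIT).
raw = "\xBD".b

# pack/unpack: the byte value 189 is reused as Unicode code point 189.
via_pack = raw.unpack('C*').pack('U*')

# String#encode: the byte is looked up in the charset we declare.
from_latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
from_latin9 = raw.dup.force_encoding('ISO-8859-9').encode('UTF-8')
from_latin15 = raw.dup.force_encoding('ISO-8859-15').encode('UTF-8')

puts via_pack      # ½
puts from_latin1   # ½
puts from_latin9   # ½
puts from_latin15  # œ
```

So if the input really is in some other single-byte charset, declaring it with force_encoding and transcoding with encode is the place to express that, not pack.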

Since you're trying to learn more about pack/unpack, let me explain more:

When you unpack with the C directive, Ruby ignores the string's character encoding, reads it byte by byte, and returns each byte as an unsigned 8-bit integer. In this case \xBD becomes 0xBD, a.k.a. 189. This is a really basic conversion.
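A small sketch of that byte-level behavior, assuming Ruby 2.0 or later (for String#b). The second string is the two-byte UTF-8 form of ½, added here only to show that C counts bytes, not characters.

```ruby
# A single raw byte.
raw = "\xBD".b
raw_bytes = raw.unpack('C*')          # => [189]

# The UTF-8 encoding of U+00BD is two bytes, so C yields two integers.
utf8 = "\u00BD"
utf8_bytes = utf8.unpack('C*')        # => [194, 189]

# For contrast, U decodes UTF-8 and yields code points instead of bytes.
utf8_codepoints = utf8.unpack('U*')   # => [189]

p raw_bytes, utf8_bytes, utf8_codepoints
```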

When you pack with the U directive, Ruby treats each integer in the array as a Unicode code point and writes out its UTF-8 encoding. No per-charset translation table is involved.
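As a rough illustration of the U directive, the sketch below packs two code points and inspects the resulting bytes. The second code point, 0x153 (œ), is my own addition to show what you would have to pass to pack to get that character.

```ruby
# Code point 189 (U+00BD) becomes the two-byte UTF-8 sequence C2 BD.
half = [189].pack('U*')
half_bytes = half.bytes           # => [194, 189]
half_encoding = half.encoding     # => #<Encoding:UTF-8>

# To get œ out of pack, the array must contain its code point, U+0153.
oe = [0x153].pack('U*')
oe_bytes = oe.bytes               # => [197, 147]

puts half   # ½
puts oe     # œ
```

In other words, any charset-specific mapping from byte 0xBD to code point 0x153 has to happen before pack is called.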

pack and unpack behave very specifically depending on the directives you give them. I suggest reading the documentation on ruby-doc.org. Some of the directives still don't make sense to me, so don't be discouraged.

Kelvin
  • I've read [ruby-doc.org](http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-pack) before, and there are two other good tutorials about Perl's `pack/unpack` [here](http://www.perlmonks.org/?node_id=224666) and [here](http://perldoc.perl.org/perlpacktut.html). I'm going to study them later. I hadn't found encoding information anywhere, but now I get the point. Thank you! – Sony Santos Jul 12 '12 at 21:27
  • @SonySantos A good primer on encodings and character sets: http://blog.grayproductions.net/articles/the_unicode_character_set_and_encodings . Once you're done with that, the table of contents link has even more articles. – Kelvin Jul 12 '12 at 22:12
  • "Some of the directives still don't make sense to me, so don't be discouraged.", so true! – Dorian Mar 05 '17 at 01:47