I need to convert accented Latin characters such as éáéíóúÀÉÍÓÚ into their closest equivalents without accents or other weird symbols:

é -> e
è -> e
Ä -> A

I have a file named "test.rb":

require 'iconv'

puts Iconv.iconv("ASCII//translit", "utf-8", 'è').join

When I paste those lines into irb it works, returning "e" as expected.

Running:

$ ruby test.rb

I get "?" as output.

I'm using irb 0.9.5(05/04/13) and Ruby 1.8.7 (2011-06-30 patchlevel 352) [i386-linux].

the Tin Man
zambotn
  • Yes, it is encoded as UTF-8, which is also the default system encoding. I also didn't expect it to work because, if I remember correctly, the "#encoding" magic comment was introduced in 1.9. I've just tried it anyway and it doesn't work. – zambotn Dec 09 '11 at 13:01

1 Answer

Ruby 1.8.7 is not multibyte-aware the way 1.9+ is. In general, it treats a string as a series of bytes rather than characters. If you need better handling of such characters, consider upgrading to 1.9+.
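
The byte-versus-character distinction is easy to see on a newer Ruby. The following is a minimal sketch, assuming Ruby 1.9 or later; the helper name `char_and_byte_counts` is only illustrative. Re-tagging a copy of the string as binary approximates how 1.8.7 sees it.

```ruby
# encoding: UTF-8
# Sketch only: assumes Ruby 1.9+ (the "\u" escape and String#force_encoding
# do not exist on 1.8.7).

# Hypothetical helper: report how long a string looks when it is
# treated as UTF-8 characters, as raw bytes, and as binary "characters".
def char_and_byte_counts(str)
  # A binary-tagged copy treats every byte as its own character,
  # which approximates the 1.8.7 view of the same data.
  binary = str.dup.force_encoding(Encoding::BINARY)
  { :chars => str.length, :bytes => str.bytesize, :binary_chars => binary.length }
end

counts = char_and_byte_counts("\u00E9") # "é", one character stored in two bytes
puts "characters: #{counts[:chars]}, bytes: #{counts[:bytes]}, binary length: #{counts[:binary_chars]}"
```

The single accented letter counts as one character but occupies two bytes, so any byte-oriented operation can cut it in half.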

James Gray has a series of articles about dealing with multibyte characters in Ruby 1.8. I highly recommend taking the time to read through them. It's a complex subject, so you'll want to read the entire series a couple of times.

Also, 1.8's encoding support needs the $KCODE global variable set:

$KCODE = "U"

so you'll need to add that to code running in 1.8.
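
For a script that has to run on both 1.8 and newer Rubies, one option is to guard the setting by version. This is a sketch rather than a verified 1.8.7 recipe: the helper name `utf8_chars` is hypothetical, and it assumes the source file is saved as UTF-8 and that the `u` regexp flag is used to match whole UTF-8 characters.

```ruby
# encoding: UTF-8
# Sketch only. $KCODE matters on Ruby 1.8; newer versions ignore it,
# so it is assigned only on old interpreters.
$KCODE = "U" if RUBY_VERSION < "1.9"

# Hypothetical helper: split a UTF-8 string into characters.
# The "u" flag makes the pattern UTF-8 aware; "m" lets "." match newlines.
def utf8_chars(str)
  str.scan(/./mu)
end

puts utf8_chars("éáÀ").length
```

Scanning with an explicitly UTF-8 pattern avoids depending on how `split('')` behaves on a particular interpreter.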

Here is a bit of sample code:

#encoding: UTF-8

require 'rubygems'
require 'iconv'

chars = "éáéíóúÀÉÍÓÚ"

puts Iconv.iconv("ASCII//translit", "utf-8", chars)

puts chars.split('')
puts chars.split('').join

Using ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-darwin10.7.0] and running it in IRB, I get:

1.8.7 :001 > #encoding: UTF-8
1.8.7 :002 >   
1.8.7 :003 >   require 'iconv'
true
1.8.7 :004 > 
1.8.7 :005 >   chars = "\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
"\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
1.8.7 :006 > 
1.8.7 :007 >   puts Iconv.iconv("ASCII//translit", "utf-8", chars)
'e'a'e'i'o'u`A'E'I'O'U
nil
1.8.7 :008 > 
1.8.7 :009 >   puts chars.split('')
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
nil
1.8.7 :010 > puts chars.split('').join
éáéíóúÀÉÍÓÚ

At line 9 in the output I told Ruby to split the string into its concept of characters, which in 1.8.7 was bytes. The resulting '?' marks mean the individual bytes could not be displayed as characters. At line 10 I told it to split again, which produced an array of bytes that join then reassembled into the original string, allowing the multibyte characters to be displayed normally.
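
The same split-then-join round trip can be reproduced on a newer Ruby by working on a binary-tagged copy. This is a sketch assuming Ruby 1.9+; `byte_pieces` and `rejoin` are illustrative names, not part of any library.

```ruby
# encoding: UTF-8
# Sketch only: assumes Ruby 1.9+; the method names are illustrative.

# Split a string into single bytes, roughly what split('') did on 1.8.7.
def byte_pieces(str)
  str.dup.force_encoding(Encoding::BINARY).split('')
end

# Join the pieces and tag the result as UTF-8 again.
def rejoin(pieces)
  pieces.join.force_encoding(Encoding::UTF_8)
end

chars  = "\u00E9\u00E1" # "éá"
pieces = byte_pieces(chars)

# Each piece is half of a two-byte character, so on its own it is
# not valid UTF-8 and there is nothing sensible to display for it.
lone_valid = pieces.first.dup.force_encoding(Encoding::UTF_8).valid_encoding?

puts pieces.length           # 4 bytes for 2 characters
puts lone_valid              # false
puts rejoin(pieces) == chars # true
```

Nothing is lost by splitting into bytes; the data only becomes unreadable while the halves of a character are printed separately.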

Running the same code using Ruby 1.9.2 shows the expected, more desirable behavior:

1.9.2p290 :001 > #encoding: UTF-8
1.9.2p290 :002 >   
1.9.2p290 :003 >   require 'iconv'
true
1.9.2p290 :004 > 
1.9.2p290 :005 >   chars = "éáéíóúÀÉÍÓÚ"
"éáéíóúÀÉÍÓÚ"
1.9.2p290 :006 > 
1.9.2p290 :007 >   puts Iconv.iconv("ASCII//translit", "utf-8", chars)
'e'a'e'i'o'u`A'E'I'O'U
nil
1.9.2p290 :008 > 
1.9.2p290 :009 >   puts chars.split('')
é
á
é
í
ó
ú
À
É
Í
Ó
Ú
nil
1.9.2p290 :010 > puts chars.split('').join
éáéíóúÀÉÍÓÚ

Ruby maintained the multibyte-ness of the characters through the split('').

Notice that in both cases Iconv.iconv did the right thing: it created characters that were visually similar to the input characters. While the leading apostrophe (or backtick, for À) looks out of place, it's there as a reminder that the characters were originally accented.
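
If only the bare letters are wanted, the marker characters can be removed after transliteration. The sketch below is an assumption-laden illustration: `strip_translit_marks` is a hypothetical helper, and the set of markers (apostrophe, backtick, caret, tilde, double quote) is a guess at what an iconv build may emit. It works on the transliterated string only, so it does not need Iconv itself.

```ruby
# Sketch only. Assumes the transliterated output marks accents with
# one of these ASCII characters placed before the base letter:
#   '  `  ^  ~  "
# Hypothetical helper: remove those markers from a transliterated string.
def strip_translit_marks(str)
  str.delete("'`^~\"")
end

# Input copied from the IRB output shown above.
puts strip_translit_marks("'e'a'e'i'o'u`A'E'I'O'U")
```

This also deletes any genuine apostrophes or quotes in the text, so it only suits input where those characters are not meaningful.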

For more information, see the related questions on Stack Overflow, or search SO for [ruby] [iconv].

the Tin Man
  • OK, thank you so much, I understand a bit more! Now I have two problems. First, I can't upgrade because the software *must* also work on 1.8.7. Second, as in the title: if I run your script using `ruby` (instead of IRB) the output differs, and it seems that Iconv doesn't work in this case. – zambotn Dec 09 '11 at 23:55
  • Also, my output for the 7th line is different: I get `eaeiouAEIOU` instead of your `'e'a'e'i'o'u\`A'E'I'O'U`. – zambotn Dec 10 '11 at 00:15
  • Ruby 1.9+ automatically does a `require 'rubygems'`, so you'll need to add that to your script with 1.8.7. I'll add it to my sample to reduce confusion. – the Tin Man Dec 12 '11 at 16:24
  • @user1089668, check out the link to James Gray's articles I added to the example also. – the Tin Man Dec 12 '11 at 16:31
  • I spent a lot of time solving the problem: I convinced my boss to upgrade Ruby! =) – zambotn Feb 03 '12 at 18:15