
I am trying to convert a string from ISO-8859-1 encoding to UTF-8 but I can't seem to get it to work. Here is an example of what I have done in irb.

irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "NorrlandsvÃ¤gen"

I am not sure why Norrlandsvägen in iso-8859-1 is converted into NorrlandsvÃ¤gen in utf-8.

I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird work-arounds I could think of but nothing seems to work. Can someone please help me/point me in the right direction?

Ruby newbie still pulling hair like crazy but feeling grateful for all the replies here... :)

Background of this question: I am writing a gem that will download an xml file from some websites (which will have iso-8859-1 encoding) and save it in a storage and I would like to convert it to utf-8 first. But words like Norrlandsvägen keep messing me up. Really any help would be greatly appreciated!

[UPDATE]: I realized running tests like this in the irb console might give me different behaviors so here is what I have in my actual code:

def convert_encoding(string, originalEncoding) 
  puts "#{string.encoding}" # ASCII-8BIT
  string.encode(originalEncoding)
  puts "#{string.encoding}" # still ASCII-8BIT
  string.encode!('utf-8')
end

but the last line gives me the following error:

Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8

Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:

irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"

I have also tried to assign a new variable to the result of string.encode(originalEncoding) but got an even weirder error:

newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')

and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1

I am still quite lost in all of this encoding mess but I am really grateful for all the replies and help everyone has given me! Thanks a ton! :)

charint
  • What I learned from this: basically don't trust anything lol (your browsers, your text editors, your code, irb, the header in the xml, your console, etc.) All of them can go wrong and disguise the encoding problem so double check each one of the points of failure as you go. Happy Debugging! :) – charint Jul 27 '15 at 21:06

3 Answers


You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]

What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
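
If text has already been double-encoded this way (for example, it was saved before the bug was noticed), the steps above can usually be run in reverse. A minimal sketch, assuming the damage is exactly one round of UTF-8 bytes misread as ISO-8859-1; `repair_mojibake` is a hypothetical helper name, not a Ruby built-in:

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes misread as ISO-8859-1".
def repair_mojibake(text)
  # Transcode back to ISO-8859-1: each character becomes the single byte
  # it was originally mistaken for.
  latin1 = text.encode('ISO-8859-1')
  # Relabel those bytes as the UTF-8 they were all along.
  repaired = latin1.force_encoding('UTF-8')
  # If the result is not valid UTF-8, the input was not this kind of damage.
  repaired.valid_encoding? ? repaired : text
rescue Encoding::UndefinedConversionError
  # The text has characters outside ISO-8859-1, so leave it untouched.
  text
end

garbled = "Norrlandsv\u00C3\u00A4gen" # the two characters Ã and ¤
puts repair_mojibake(garbled)
```

This only helps with text that is already damaged. Reading the source with the right encoding in the first place is still the real fix.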

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"

EDIT For your specific problem, this should work:

require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE turns off certificate checking; fine for a quick test only
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# The body arrives as raw bytes: label them as ISO-8859-1, then transcode to UTF-8
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
Amadan
  • Oh thank you! What you said makes sense, but somehow when I read from my web-service, Ruby actually think it's ASCII-8BIT instead of ISO-8859-1. – charint Jul 27 '15 at 04:27
  • [Here](https://gist.githubusercontent.com/chuyaguo2014/67317d3399407f85d246/raw/5b54195084c8d5210da6a02daed2dd8c138624da/rubyEncodingMadness.txt) is an example of the xml I am trying to get. My code is: `def convert_encoding(string, originalEncoding)` `string.encode(originalEncoding)` `string.encode!('utf-8')` (with some puts in between to show encoding & contents of the string) but I get UndefinedConversionError "\xC3" from ASCII-8BIT to UTF-8 Am I missing something obvious here? – charint Jul 27 '15 at 04:35
  • How are you reading the XML from the web service? BTW, the file you linked to is UTF-8, not `ISO-8859-1`, whatever it itself claimed. So you actually have two-byte UTF-8 representation in the file, the first byte being `\xC3`; and ASCII-8BIT -> UTF-8 conversion chokes on it. The irony is, you don't even need conversion :) Just either open the stream properly as UTF-8, or force the string to UTF-8 when you've read it. – Amadan Jul 27 '15 at 05:21
  • I am using `response = Net::HTTP.get_response(uri)` and `response.body` to get the xml. What's curious is that, if I skip the encoding conversion step, save the file naively in my storage (AWS S3) and manually download the file, I still see `"NorrlandsvÃ¤gen"` instead of `"Norrlandsvägen"`. Also can you please teach me how to identify the actual encoding of a file (regardless of its claim :))? The gist is a simplified example I have been working with; eventually I would like to get files like [this one](https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full) – charint Jul 27 '15 at 05:52
  • Thank you for all your help so far though! I don't have a solution yet but your answers helped me clarify my confusion and I think I am getting closer to a solution. :) – charint Jul 27 '15 at 05:56
  • That file is `CP1252`, or `ISO-8859-1` :) The easiest way is to open it in a browser and check the autodetected Encoding (assuming you can see the file properly; if not, then fiddle with Encoding until you can). – Amadan Jul 27 '15 at 05:58
  • I cannot believe how long it took me to actually ask the right question (and I am so glad I finally did!) - thank you for showing me how to detect the encoding on a browser. Turns out, my code did work as it should and it was my browser (Chrome) that somehow had the wrong encoding set in the display (so it was a false-positive). Once I changed the encoding in my browser everything worked perfectly. Thank you so much for your patience and help as I worked through this issue!!! :D – charint Jul 27 '15 at 21:00
  • oh and I just saw your edit 6 hours ago - yup yup that is exactly what I have in my code! :) – charint Jul 27 '15 at 21:01
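
On identifying the real encoding from Ruby rather than a browser: there is no fully reliable test, but checking whether the raw bytes form valid UTF-8 is a common heuristic, since ISO-8859-1 text with accented characters very rarely happens to be valid UTF-8. A minimal sketch under that assumption; `to_utf8` and its ISO-8859-1 fallback are illustrative choices, not something from the thread:

```ruby
# Hypothetical helper: return a UTF-8 copy of raw bytes whose encoding is unknown.
def to_utf8(raw, fallback: 'ISO-8859-1')
  # First guess: the bytes are already UTF-8, whatever the file claims.
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # Otherwise label the bytes with the fallback encoding and transcode.
  raw.dup.force_encoding(fallback).encode('UTF-8')
end

utf8_bytes   = "Norrlandsv\xC3\xA4gen".b # ä stored as two bytes
latin1_bytes = "Norrlandsv\xE4gen".b     # ä stored as one byte
puts to_utf8(utf8_bytes)
puts to_utf8(latin1_bytes)
```

Both calls should print the same word. The heuristic cannot tell ISO-8859-1 apart from other single-byte encodings such as CP1252, so the fallback still has to be chosen by the caller.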

There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding. Consequently, the following code causes your problem:

string = "Norrlandsvägen"
string.force_encoding('iso-8859-1')
puts string.encode('utf-8') # NorrlandsvÃ¤gen

Whereas the following code will actually correctly encode your contents:

string = "Norrlandsvägen".encode('iso-8859-1')
string.encode!('utf-8')

Here's an example running in irb:

irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
=> "Norrlandsv\xE4gen"
irb(main):024:0> string.encoding
=> #<Encoding:ISO-8859-1>
irb(main):025:0> string.encode!('utf-8')
=> "Norrlandsvägen"
irb(main):026:0> string.encoding
=> #<Encoding:UTF-8>
Aeyrix
  • Thank you so much for your reply! It works in irb but when I tried it in my gem I got `UndefinedConversionError "\xC3" from ASCII-8BIT to UTF-8` It seems as if Ruby actually thinks the incoming string from the web service is ASCII-8BIT instead of ISO-8859-1 (even though the declaration at the beginning of the xml says ISO-8859-1). Could you give me another hint about what I might be missing here? Much appreciated!!! :) – charint Jul 27 '15 at 04:41
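
As observed in this thread, the response body arrives tagged as ASCII-8BIT (binary). Ruby has no mapping from binary bytes above 0x7F to UTF-8, which is why `encode` raises. A minimal sketch of the asker's `convert_encoding` with that accounted for, assuming the caller knows the real source encoding; the `.b` call below only simulates a binary response body:

```ruby
# Sketch: relabel the raw bytes with their real encoding, then transcode.
def convert_encoding(string, original_encoding)
  # dup so the caller's string is not mutated by force_encoding
  string.dup.force_encoding(original_encoding).encode('UTF-8')
end

raw = "Norrlandsv\xE4gen".b # simulated ISO-8859-1 body, tagged ASCII-8BIT

begin
  raw.encode('UTF-8') # transcoding straight from binary is not defined
rescue Encoding::UndefinedConversionError => e
  puts "direct encode fails: #{e.message}"
end

puts convert_encoding(raw, 'ISO-8859-1')
```

Note that `string.encode(original_encoding)` in the original function returns a new string and leaves `string` itself unchanged, which is why the encoding printed afterwards was still ASCII-8BIT.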

The above answer was spot on. Specifically this point here:

There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding.

In my situation, I had a text file with iso-8859-1 encoding. By default, Ruby uses UTF-8 encoding, so if you were to try to read the file without specifying the encoding, then you would get an error:

results = File.read(file)
results.encoding
# => #<Encoding:UTF-8>
results.split("\r\n")
ArgumentError: invalid byte sequence in UTF-8

You get an invalid byte sequence error because bytes that are valid ISO-8859-1 characters (such as 0xE4 for ä) do not form valid UTF-8 sequences. Consequently, you need to specify the encoding to the File API. Think of it like force_encoding:

results = File.read(file, encoding: "iso-8859-1")

So everything is good, right? No, not if you want to start parsing the iso-8859-1 string using UTF-8 string literals:

results = File.read(file, encoding: "iso-8859-1")
results.each_line do |line|
  puts line.split('¬')
end
Encoding::CompatibilityError: incompatible character encodings: ISO-8859-1 and UTF-8

Why this error? Because the literal '¬' in your source code is a UTF-8 string, and you are using it against an ISO-8859-1 string. Once both contain non-ASCII characters, the two encodings are incompatible. So after you read the file as ISO-8859-1, ask Ruby to encode that ISO-8859-1 string into UTF-8. From then on you are working with UTF-8 strings only, and the problem goes away:

results = File.read(file, encoding: "iso-8859-1").encode('UTF-8')
results.encoding
results = results.split("\r\n")
results.each do |line|
  puts line.split('¬')
end

Ultimately, with some Ruby APIs, you do not need to use force_encoding('ISO-8859-1'). Instead, you just specify the expected encoding to the API. However, you must convert it back to UTF-8 if you plan to parse it with UTF-8 strings.
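
The read-then-encode step can also be folded into the read itself. A minimal sketch, assuming an "external:internal" encoding pair is passed to `File.read`, so that Ruby treats the bytes on disk as ISO-8859-1 and hands back UTF-8; `read_latin1_as_utf8` is a hypothetical helper and the temp file only stands in for the real input:

```ruby
require 'tempfile'

# Hypothetical helper: read an ISO-8859-1 file and get UTF-8 back in one step.
def read_latin1_as_utf8(path)
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

Tempfile.create('latin1') do |tmp|
  tmp.binmode
  tmp.write("a\xACb\n".b) # '¬' is the single byte 0xAC in ISO-8859-1
  tmp.flush

  text = read_latin1_as_utf8(tmp.path)
  puts text.encoding
  # Splitting on a UTF-8 literal is now safe because text is UTF-8 too.
  text.each_line(chomp: true) { |line| p line.split('¬') }
end
```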

Daniel Viglione