Converting integers to UTF-8 (Korean)

Question

I'm running Ruby 1.9.2 and trying to fix some broken UTF-8 text input. The text is literally "\\354\\203\\201\\355\\221\\234\\353\\252\\205" and I want to change it into the correct Korean, "상표명".

However, after searching for a while and trying a few methods, I still get gibberish. It's confusing because the escaped-characters example below works fine.

# encoding: utf-8
puts "상표명" # Target string
# Output: "상표명"

puts "\354\203\201\355\221\234\353\252\205" # Works with escaped characters like this
# Output: "상표명"

# Real input is a string
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"

# After some manipulation got it into an array of numbers
puts [354, 203,201,355,221,234,353,252,205].pack('U*').force_encoding('UTF-8')
# Output: ŢËÉţÝêšüÍ (gibberish)

I'm sure this must have been answered somewhere but I haven't managed to find it.

score 10 · Accepted Answer · edited Aug 27 '11 at 02:18

This is what you want to do to get your UTF-8 Korean text:

s = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
k = s.scan(/\d+/).map { |n| n.to_i(8) }.pack("C*").force_encoding('utf-8')
# "상표명"

And this is how it works:

1. The input string is nice and regular, so we can use scan to pull out the individual numbers.
2. Then a map with to_i(8) converts the octal values (as noted by Henning Makholm) to integers.
3. Now we need to convert our list of integers to bytes, so we pack('C*') to get a byte string. This string will have the BINARY encoding (AKA ASCII-8BIT).
4. We happen to know that the bytes really do represent UTF-8, so we can force the issue with force_encoding('utf-8').
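
The steps above can be traced one at a time. This is only a minimal sketch of the same one-liner split into stages; the variable names (digits, bytes, raw, korean) are illustrative and not part of the original answer.

```ruby
# encoding: utf-8

# The literal input: each backslash is followed by three octal digits.
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"

# Step 1: pull out each run of digits as a string.
digits = input.scan(/\d+/)
# => ["354", "203", "201", "355", "221", "234", "353", "252", "205"]

# Step 2: read each run as an octal number.
bytes = digits.map { |n| n.to_i(8) }
# => [236, 131, 129, 237, 145, 156, 235, 170, 133]

# Step 3: pack the integers as raw 8-bit bytes (an ASCII-8BIT string).
raw = bytes.pack("C*")

# Step 4: relabel those bytes as UTF-8. force_encoding changes the
# encoding tag on the string in place; the bytes stay the same.
korean = raw.force_encoding("UTF-8")
korean.valid_encoding?  # => true
puts korean             # 상표명
```

Checking valid_encoding? after the last step is a cheap way to confirm the bytes really are well-formed UTF-8 before using the string.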

The main thing you were missing was the pack format. 'U' means "UTF-8 character" and expects an array of Unicode code points, each represented by a single integer; 'C' expects an array of bytes, and that's what we had.
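
To make the difference between the two pack formats concrete, here is a small sketch that feeds the same nine integers to both. The variable names are illustrative only.

```ruby
# The nine byte values obtained from the octal escapes.
bytes = [236, 131, 129, 237, 145, 156, 235, 170, 133]

# 'U' treats each integer as a Unicode code point and UTF-8-encodes it,
# so nine integers become nine characters (U+00EC, U+0083, ...).
as_codepoints = bytes.pack("U*")
as_codepoints.length    # => 9
as_codepoints.bytesize  # => 18 (each of these code points needs two bytes)

# 'C' writes each integer as a single byte, so the nine integers become
# nine bytes, which decode as three 3-byte UTF-8 characters.
as_bytes = bytes.pack("C*").force_encoding("UTF-8")
as_bytes.length    # => 3
as_bytes.bytesize  # => 9

# Going the other way, 'U*' yields the actual code points of the text.
codepoints = as_bytes.unpack("U*")
# => [49345, 54364, 47749]
```

So 'U*' would be the right format only if the input were the three code points in the last line, not the nine UTF-8 bytes.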

score 2 · Answer 2 · answered Aug 27 '11 at 01:39

The \354 and so forth are octal escapes, not decimal, so you cannot just write them as 354 to get the integer values of the bytes.
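
A quick way to see the octal-versus-decimal difference in Ruby (a small illustrative sketch; the variable names are made up for the example):

```ruby
decimal_reading = "354".to_i         # => 354, too large for a single byte
octal_reading   = "354".to_i(8)      # => 236, fits in a byte
same_via_oct    = "354".oct          # => 236, String#oct also reads octal
octal_literal   = 0354               # => 236, a leading 0 marks an octal literal
escaped_byte    = "\354".bytes.to_a  # => [236], the escape is one byte
```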

hmakholm left over Monica

+1, perfectly valid answer. I'm just wondering as mainly a C# programmer, will `force_encoding` really do what he thinks it should do? It seems... odd to let you change the encoding on the fly like that. – Blindy Aug 27 '11 at 01:48
@Blindy: yes, apparently [that's how Ruby handles encodings](http://blog.grayproductions.net/articles/ruby_19s_string). – hmakholm left over Monica Aug 27 '11 at 01:56
@Blindy: Sort of. It will only work if the bytes really do represent UTF-8 text; you'd use [`Iconv`](http://ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html) if you want to transcode a string while preserving the characters. – mu is too short Aug 27 '11 at 02:11
It seems Array.pack accepts decimals, but after converting the octal values to decimals I tried `[236, 131, 129, 237, 145, 156, 235, 170, 133].pack('U*')` and it outputted different gibberish. I'm missing something here. – benui Aug 27 '11 at 02:16
See mu's answer; he noticed that `U*` is not what you want. – hmakholm left over Monica Aug 27 '11 at 02:19
Thanks for the correction BTW, "UTF-8 codepoint" doesn't make any sense. – mu is too short Aug 27 '11 at 02:44
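
The `force_encoding` point raised in the comments above can be illustrated with a short sketch. Note that it uses `String#encode` for the transcoding case rather than the `Iconv` library named in the comment; that is a swapped-in choice for this example, picked because `String#encode` is built into Ruby 1.9 strings. The variable names are illustrative.

```ruby
# Bytes of the first character only, tagged ASCII-8BIT by pack.
raw = [236, 131, 129].pack("C*")

# force_encoding relabels the string; the bytes are untouched.
utf8 = raw.dup.force_encoding("UTF-8")
same_bytes = (utf8.bytes.to_a == raw.bytes.to_a)  # => true
utf8.valid_encoding?                              # => true

# A wrong or incomplete label is accepted silently. Here the sequence
# is truncated, and only valid_encoding? reveals the problem.
truncated = [236, 131].pack("C*").force_encoding("UTF-8")
truncated.valid_encoding?  # => false

# Transcoding keeps the character but changes the bytes.
utf16 = utf8.encode("UTF-16BE")
utf16_bytes = utf16.bytes.to_a                # => [192, 193]
round_trip = (utf16.encode("UTF-8") == utf8)  # => true
```

In short, `force_encoding` answers "what are these bytes?", while transcoding answers "give me the same characters in another encoding".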
