0

I'm running Ruby 1.9.2 and trying to repair some broken UTF-8 text input. The text is literally "\\354\\203\\201\\355\\221\\234\\353\\252\\205", and I want to turn it into the correct Korean, "상표명".

However, after searching for a while and trying a few methods, I still get gibberish. It's confusing, because the escaped-characters example in the code below works fine.

# encoding: utf-8
puts "상표명" # Target string
# Output: "상표명"

puts "\354\203\201\355\221\234\353\252\205" # Works with escaped characters like this
# Output: "상표명"

# Real input is a string
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"

# After some manipulation got it into an array of numbers
puts [354, 203,201,355,221,234,353,252,205].pack('U*').force_encoding('UTF-8')
# Output: ŢËÉţÝêšüÍ (gibberish)

I'm sure this must have been answered somewhere but I haven't managed to find it.

benui

2 Answers

10

This is what you want to do to get your UTF-8 Korean text:

s = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
k = s.scan(/\d+/).map { |n| n.to_i(8) }.pack("C*").force_encoding('utf-8')
# "상표명"

And this is how it works:

  1. The input string is nice and regular, so we can use `scan` to pull out the individual numbers.
  2. Then a `map` with `to_i(8)` converts the octal values (as noted by Henning Makholm) to integers.
  3. Now we need to convert our list of integers to bytes, so we `pack('C*')` to get a byte string. This string will have the BINARY encoding (AKA ASCII-8BIT).
  4. We happen to know that the bytes really do represent UTF-8, so we can force the issue with `force_encoding('utf-8')`.
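The four steps above can be sketched with each intermediate value named. This is only an unrolled version of the one-liner; the variable names (`digits`, `bytes`, `binary`, `packed_encoding`, `korean`) are illustrative and not part of the original answer.

```ruby
# encoding: utf-8
# Unrolled version of the one-liner, one step per line.
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"  # literal backslashes and digits

digits = input.scan(/\d+/)             # step 1: ["354", "203", "201", ...]
bytes  = digits.map { |n| n.to_i(8) }  # step 2: [236, 131, 129, 237, 145, 156, 235, 170, 133]
binary = bytes.pack("C*")              # step 3: a 9-byte string
packed_encoding = binary.encoding      # ASCII-8BIT (BINARY) at this point

# step 4: force_encoding relabels the same string object in place and returns it
korean = binary.force_encoding("utf-8")

puts korean  # 상표명
```

Note that `force_encoding` mutates its receiver, so after the last step `binary` and `korean` are the same UTF-8 string; call `dup` first if the binary-tagged copy is still needed.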

The main thing you were missing was the right `pack` format. 'U' means "UTF-8 character" and expects an array of Unicode codepoints, each represented by a single integer. 'C' expects an array of bytes, and bytes are what we had.
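A small comparison may make the difference concrete. It feeds the same nine integers to both formats; the variable names are illustrative, and the codepoints in the last line (U+C0C1, U+D45C, U+BA85) were worked out by hand from the bytes rather than taken from the original post.

```ruby
# encoding: utf-8
bytes = [236, 131, 129, 237, 145, 156, 235, 170, 133]

# 'U' treats every integer as a codepoint: U+00EC, U+0083, U+0081, ...
as_codepoints = bytes.pack("U*")
# 'C' treats every integer as one raw byte; the label is then set to UTF-8
as_bytes = bytes.pack("C*").force_encoding("utf-8")

puts as_codepoints.length    # 9  (nine unrelated characters)
puts as_codepoints.bytesize  # 18 (each codepoint in U+0080..U+07FF takes two bytes)
puts as_bytes.length         # 3  (three Korean syllables)
puts as_bytes.bytesize       # 9

# 'U' is the right format when the integers already are codepoints:
puts [0xC0C1, 0xD45C, 0xBA85].pack("U*") == as_bytes  # true
```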

hmakholm left over Monica
mu is too short
2

The \354 and so forth are octal escapes, not decimal, so you cannot just write them as 354 to get the integer values of the bytes.
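As a quick check of the octal point, the following sketch reads the same three digits both ways; the variable names are made up for illustration.

```ruby
decimal_reading = "354".to_i     # 354: too large for a single byte (maximum 255)
octal_reading   = "354".to_i(8)  # 236: 3*64 + 5*8 + 4, i.e. 0xEC

puts decimal_reading
puts octal_reading
puts 0354 == octal_reading                  # true: a leading 0 marks an octal integer literal
puts "\354".bytes.to_a == [octal_reading]   # true: the string escape \354 is that same byte
```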

hmakholm left over Monica
  • +1, perfectly valid answer. I'm just wondering as mainly a C# programmer, will `force_encoding` really do what he thinks it should do? It seems... odd to let you change the encoding on the fly like that. – Blindy Aug 27 '11 at 01:48
  • @Blindy: yes, apparently [that's how Ruby handles encodings](http://blog.grayproductions.net/articles/ruby_19s_string). – hmakholm left over Monica Aug 27 '11 at 01:56
  • @Blindy: Sort of. It will only work if the bytes really do represent UTF-8 text, you'd use [`Iconv`](http://ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html) if you want to transcode a string while preserving the characters. – mu is too short Aug 27 '11 at 02:11
  • It seems Array.pack accepts decimals, but after converting the octal values to decimals I tried `[236, 131, 129, 237, 145, 156, 235, 170, 133].pack('U*')` and it outputted different gibberish. I'm missing something here. – benui Aug 27 '11 at 02:16
  • See mu's answer; he noticed that `U*` is not what you want. – hmakholm left over Monica Aug 27 '11 at 02:19
  • Thanks for the correction BTW, "UTF-8 codepoint" doesn't make any sense. – mu is too short Aug 27 '11 at 02:44
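
On the `force_encoding` question raised in the comments, a rough sketch of the difference between relabeling and transcoding follows. It uses `String#encode`, which is built into Ruby 1.9 and later, in place of the `Iconv` mentioned above; that substitution and the variable names are assumptions for illustration only.

```ruby
# encoding: utf-8
raw = [236, 131, 129].pack("C*")  # three raw bytes, tagged ASCII-8BIT

# force_encoding only changes the label; the bytes are untouched.
relabeled = raw.dup.force_encoding("utf-8")
puts relabeled.bytes.to_a == raw.bytes.to_a  # true
puts relabeled.valid_encoding?               # true: the bytes happen to be valid UTF-8

# A wrong label is accepted silently; nothing is converted or checked.
mislabeled = raw.dup.force_encoding("us-ascii")
puts mislabeled.valid_encoding?              # false

# encode transcodes: the character stays the same, the bytes change.
utf16 = relabeled.encode("utf-16be")
puts utf16.bytes.to_a.inspect                # [192, 193]
puts utf16.encode("utf-8") == relabeled      # true
```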