1

I have a string in UTF-8 hex like this:

s = "0059006F007500720020006300720065006400690074002000680061007300200067006F006E0065002000620065006C006F00770020003500200064006F006C006C006100720073002E00200049006600200079006F00750020006800610076006500200061006E0020004100640064002D004F006E0020006F007200200042006F006E0075007300200079006F007500720020007200650073006F00750072006300650073002000770069006C006C00200077006F0072006B00200075006E00740069006C0020006500780068006100750073007400650064002E00200054006F00200074006F00700020007500700020006E006F007700200076006900730069007400200076006F006400610066006F006E0065002E0063006F002E006E007A002F0074006F007000750070"

I want to convert this into an actual UTF-8 string. It should read:

Your credit has gone below 5 dollars. If you have an Add-On or Bonus your resources will work until exhausted. To top up now visit vodafone.co.nz/topup

This works:

s.scan(/.{4}/).map { |a| [a.hex].pack('U') }.join

but I'm wondering if there's a better way to do this, for example whether I should be using Encoding::Converter#convert.

sawa
patrickdavey
  • I don't think that is UTF-8, because `00` is a control character (NUL). It looks more like some kind of 16-bit encoding. – Craig S. Anderson Apr 25 '15 at 06:54
  • @CraigS.Anderson doesn't UTF-8 have the ability of being 8 or 16-bit? – vol7ron Apr 25 '15 at 07:02
  • @vol7ron UTF-8 is a variable-length encoding. ASCII characters (0-127) are mapped to one byte; others are longer, up to 4 bytes. If the leading bit is 0, then it is a one-byte encoding of the code point. – Craig S. Anderson Apr 25 '15 at 07:04
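
To make the variable-length point in the comments concrete, here is a small sketch that compares byte counts in UTF-8 and UTF-16BE. The sample characters and the helper names `byte_sizes` and `hex_in` are made up for illustration and are not from the question.

```ruby
# Returns [utf8_bytes, utf16be_bytes] for a one-character string.
# (byte_sizes is a hypothetical helper name.)
def byte_sizes(ch)
  [ch.encode("UTF-8").bytesize, ch.encode("UTF-16BE").bytesize]
end

# Hex dump of a string's bytes after encoding it as `encoding`.
# (hex_in is a hypothetical helper name.)
def hex_in(str, encoding)
  str.encode(encoding).unpack1("H*")
end

# Sample characters chosen for illustration: Y, e-acute, euro sign, an emoji.
["Y", "\u00E9", "\u20AC", "\u{1F600}"].each do |ch|
  n8, n16 = byte_sizes(ch)
  puts "U+%04X: %d byte(s) in UTF-8, %d in UTF-16BE" % [ch.ord, n8, n16]
end

puts hex_in("Yo", "UTF-8")     # 596f
puts hex_in("Yo", "UTF-16BE")  # 0059006f
```

The second hex dump matches the start of the string in the question, which is what points to UTF-16BE rather than UTF-8.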

3 Answers

5

The extra 00s suggest that the string is actually the hex representation of a UTF-16 string rather than UTF-8. Assuming that is the case, there are three steps to get a UTF-8 string: first, convert the string into the actual bytes the hex digits represent (Array#pack can be used for this); second, mark those bytes as being in the appropriate encoding with force_encoding (it looks like UTF-16BE); and finally, use encode to convert the result to UTF-8:

[s].pack('H*').force_encoding('utf-16be').encode('utf-8')
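
The three steps can also be written out one at a time. This is only a sketch under the same assumption (that the hex really is UTF-16BE); the method name `hex_utf16be_to_utf8` and the short sample inputs are made up for illustration.

```ruby
# Decode a hex dump of UTF-16BE text into a UTF-8 String.
# (hex_utf16be_to_utf8 is a hypothetical name, not a library method.)
def hex_utf16be_to_utf8(hex)
  bytes = [hex].pack("H*")                  # step 1: hex digits -> raw bytes (binary string)
  utf16 = bytes.force_encoding("UTF-16BE")  # step 2: relabel the bytes; nothing is converted yet
  utf16.encode("UTF-8")                     # step 3: transcode UTF-16BE -> UTF-8
end

puts hex_utf16be_to_utf8("0059006F0075")  # You

# A character outside the Basic Multilingual Plane is stored as a surrogate
# pair (two 16-bit units); encode combines the pair into one character.
emoji = hex_utf16be_to_utf8("D83DDE00")
puts emoji.length    # 1
puts emoji.bytesize  # 4
```

One reason to prefer this over splitting the hex into four-digit groups is the surrogate-pair case: a group-by-group conversion treats each half of the pair as a separate code point.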
matt
2

I think there are extra null characters all along the string (it's valid, but wasteful), but you can try:

[s].pack('H*').force_encoding('utf-8')

Although the result seems to read "Your credit has gone below 5 dollars"...

The string prints with puts, but I can't read all the Unicode characters on the terminal when the string is dumped.

Myst
  • It isn't UTF-8, so forcing the encoding to be UTF-8 is not correct. – Craig S. Anderson Apr 25 '15 at 07:06
  • @CraigS.Anderson : It **IS** UTF-8 and the encoding is valid - check with `[s].pack('H*').force_encoding('utf-8').valid_encoding?` ... Also, you can use `encode` instead of `force_encoding`, like this: `[s].pack('H*').encode('utf-8')` - but why waste resources if the encoding is valid? – Myst Apr 25 '15 at 07:09
  • The UTF-8 encoding of 'Yo' is 0x596F, not 0x0059006F. – Craig S. Anderson Apr 25 '15 at 07:12
  • @CraigS.Anderson , the answer still answers the question - even IF the string was malformed. I'm not sure, but I think you can always add null at the beginning of a value under 127, so Y become 0x0059 and o becomes 0x006F. The fact of the matter is that **the string passes validation** and prints correctly. Who am I to argue with the computer? – Myst Apr 25 '15 at 07:17
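
The disagreement in these comments can be checked directly. The sketch below uses only the first three characters of the question's string; `forced_utf8` is a made-up helper name. It suggests that both observations hold at once: the forced string passes validation, because U+0000 is a legal code point, yet it is not the expected text, because every other character is a NUL.

```ruby
# Relabel the raw bytes as UTF-8 without converting them.
# (forced_utf8 is a hypothetical name used only in this sketch.)
def forced_utf8(hex)
  [hex].pack("H*").force_encoding("UTF-8")
end

forced = forced_utf8("0059006F0075")  # "You" as UTF-16BE hex

p forced.valid_encoding?  # true
p forced.length           # 6
p forced.count("\0")      # 3
p forced == "You"         # false
```

Many terminals show nothing for a NUL character, which would explain why the string appears to print correctly with puts even though it contains twice as many characters as intended.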
1

If you intend to use this on other strings padded the same way, you could strip the leading byte of each pair (this is only safe when every leading byte is 00):

[s.gsub(/..(..)/,'\1')].pack('H*')

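Or keep them and convert each four-digit group to its character (passing an encoding to chr so that values above 255 also work):

s.gsub(/..../) { |p| p.hex.chr('UTF-8') }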

If you want to use Encoding::Converter:

ec = Encoding::Converter.new('UTF-16BE','UTF-8')     # save converter for reuse
ec.convert( [s].pack('H*') )                         # or:  ec.convert [s].pack'H*'
vol7ron
  • What if your string had a relevant double zero, such as `"s\n"` (== `200a` in hex)...? I wouldn't remove those double zeros. – Myst Apr 25 '15 at 09:44
  • @Myst you are right, that was irresponsible and I've updated the answer :) – vol7ron Apr 25 '15 at 18:12
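
To illustrate the concern raised here about discarding bytes, a short sketch compares stripping the leading byte of each pair with decoding the pairs as UTF-16BE. The sample input and both helper names are made up for illustration.

```ruby
# Drop the first byte of every 16-bit unit (the approach being questioned).
# (strip_high_bytes is a hypothetical name.)
def strip_high_bytes(hex)
  [hex.gsub(/..(..)/, '\1')].pack("H*")
end

# Decode the same hex as UTF-16BE and transcode to UTF-8.
# (decode_utf16be is a hypothetical name.)
def decode_utf16be(hex)
  [hex].pack("H*").force_encoding("UTF-16BE").encode("UTF-8")
end

hex = "20AC0035"  # euro sign followed by "5", as UTF-16BE hex

p strip_high_bytes(hex).bytes   # [172, 53]
p decode_utf16be(hex).bytesize  # 4
puts decode_utf16be(hex)
```

Stripping turns the euro sign's 20AC into the single byte AC, which is no longer the same character, while decoding keeps it intact as a three-byte UTF-8 sequence.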