This is related to the Ruby gem ruby-smpp, which I'm using for my project.

I have a string of bytes \u0000\xE0\u0000\xE2\u0000\xE1\u0000\xE8\u0000\xEA\u0000\xE9\u0000\xE7. It represents the body of a received (i.e. MO, or mobile-originated) message in French. The actual content of this message is àâáèêéç. How can I convert \u0000\xE0\u0000\xE2\u0000\xE1\u0000\xE8\u0000\xEA\u0000\xE9\u0000\xE7 to àâáèêéç in Ruby?

I've tried

["\u0000\xE0\u0000\xE2\u0000\xE1\u0000\xE8\u0000\xEA\u0000\xE9\u0000\xE7"].pack('H*')

=> "\x00\x02\x01\b\n\t\a"

and

['E0','E2','E1','E8', 'EA', 'E9', 'E7'].pack('H*')
=> "\xE0"

Both are wrong.

Thanks in advance!

jl118

1 Answer

Looks like your string is UTF-16BE encoded:

str = "\u0000\xE0\u0000\xE2\u0000\xE1\u0000\xE8\u0000\xEA\u0000\xE9\u0000\xE7"

str.encode('UTF-8', 'UTF-16BE')
#=> "àâáèêéç"
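A minimal sketch of the same conversion, assuming the message body arrives as raw bytes. The variable names (`raw`, `decoded`, `relabelled`) are illustrative only, and the bytes are built with `Array#pack` so the example does not depend on how a mixed `\u`/`\x` literal is parsed. It shows two equivalent routes: passing the source encoding to `String#encode`, or relabelling the bytes with `force_encoding` first.

```ruby
# Raw message body as bytes (assumed input), built from integers so the
# byte values are unambiguous.
raw = [0x00, 0xE0, 0x00, 0xE2, 0x00, 0xE1, 0x00, 0xE8,
       0x00, 0xEA, 0x00, 0xE9, 0x00, 0xE7].pack('C*')

# Route 1: tell String#encode which encoding the bytes are in.
decoded = raw.encode('UTF-8', 'UTF-16BE')

# Route 2: relabel a copy of the bytes as UTF-16BE, then transcode to UTF-8.
relabelled = raw.dup.force_encoding('UTF-16BE').encode('UTF-8')

puts decoded
puts decoded == relabelled
```

Both routes leave `raw` untouched; `force_encoding` only changes the encoding label, which is why it is applied to a copy here.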
Stefan
  • What tells you it's UTF-16BE encoded (from someone who knows nothing of encoding)? – Cary Swoveland Mar 20 '19 at 20:11
  • @CarySwoveland (1) because the values looked like _words_ (0x00E0, 0x00E2, ...), i.e. 16-bit values, and (2) because 0xE0 (224) is the codepoint for `à` in Unicode. – Stefan Mar 21 '19 at 00:07
  • I see there's a lazy-person's way of answering my question: `Encoding.list.find { |e| str.encode('UTF-8', e) == "àâáèêéç" rescue nil } #=> #<Encoding:UTF-16BE>`. See [Encoding#list](https://ruby-doc.org/core-2.2.0/Encoding.html). Note: `Encoding.list.size #=> 101`. – Cary Swoveland Mar 21 '19 at 06:39
  • @CarySwoveland I know that way ... quicker, easier, more seductive :-) – Stefan Mar 21 '19 at 09:05
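The reasoning in the comments (16-bit words whose values are Unicode code points) can be checked by hand. The following is only a sketch with illustrative variable names: it reads the bytes as big-endian 16-bit integers with the `n` directive and packs them back as UTF-8 with the `U` directive. It assumes every character fits in a single 16-bit unit; text containing surrogate pairs would need `String#encode` instead.

```ruby
# Same bytes as in the question, built from integers (assumed input).
raw = [0x00, 0xE0, 0x00, 0xE2, 0x00, 0xE1, 0x00, 0xE8,
       0x00, 0xEA, 0x00, 0xE9, 0x00, 0xE7].pack('C*')

# 'n*' reads the bytes as 16-bit unsigned big-endian integers.
code_units = raw.unpack('n*')
puts code_units.map { |u| format('0x%04X', u) }.join(' ')

# 'U*' packs each integer as a UTF-8 encoded code point.
# Only valid here because no surrogate pairs are present.
text = code_units.pack('U*')
puts text
```

Each printed value (0x00E0, 0x00E2, and so on) is the Unicode code point of the corresponding accented letter, which is what points to UTF-16BE.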