What's the most efficient way to decode a UTF16 binary?

Question

As Rebol 3 supports Unicode and uses UTF-16 internally when needed (a string containing only ASCII characters is stored as ASCII), decoding should be as simple as copying the memory content from the binary and setting up the REBVAL structure. However, the only way I have found is to iterate over the binary and convert each character individually.

Same question applies to encoding a string in UTF16.
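
For reference, the per-character approach described above can be sketched in C roughly as follows. This is only an illustration, not Rebol's actual code: `utf16_bytes_to_units` is a hypothetical helper, and `uint16_t` stands in for a 16-bit character type such as REBUNI. Assembling each unit from two bytes makes the result independent of the host's byte order. A raw memory copy would only give the same values when the input byte order happens to match the host's.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Rebol: assemble 16-bit code units from
   UTF-16 bytes. big_endian selects the byte order of the input.
   dst must have room for nbytes / 2 units. Returns the number of code
   units written; a trailing odd byte is ignored. Surrogate pairs are
   copied through as two separate units, not combined. */
static size_t utf16_bytes_to_units(const uint8_t *src, size_t nbytes,
                                   int big_endian, uint16_t *dst)
{
    size_t n = nbytes / 2;
    for (size_t i = 0; i < n; i++) {
        uint8_t b0 = src[2 * i];      /* first byte of the pair  */
        uint8_t b1 = src[2 * i + 1];  /* second byte of the pair */
        dst[i] = big_endian ? (uint16_t)((b0 << 8) | b1)
                            : (uint16_t)((b1 << 8) | b0);
    }
    return n;
}
```

Calling it with the ten bytes `68 00 65 00 6C 00 6C 00 6F 00` and `big_endian` set to 0 fills `dst` with the five code units of "hello".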

Red doesn't use a fixed internal UTF-16 representation; instead it picks a size based on the [highest codepoint in the string](http://www.red-lang.org/2012/09/plan-for-unicode-support.html). Rebol should be doing this as well, however, so any temptation to do magic that takes advantage of the implementation details of [REBUNI](https://github.com/rebol/rebol/blob/25033f897b2bd466068d7663563cd3ff64740b94/src/include/reb-c.h#L149) should take that into account. – HostileFork says dont trust SE, Dec 03 '14 at 22:02

score 3 · Accepted Answer · answered Dec 04 '14 at 21:58


OK, there doesn't seem to be an easy way to do it, so I just added two codecs, UTF-16LE and UTF-16BE, for this purpose. See this commit: https://github.com/zsx/r3/commit/630945070eaa4ae4310f53d9dbf34c30db712a21

With this change, you can do:

    >> b: encode 'utf-16le "hello"
    == #{680065006C006C006F00}

    >> s: decode 'utf-16le b
    == "hello"

    >> b: encode 'utf-16be "hello"
    == #{00680065006C006C006F}

    >> s: decode 'utf-16be b
    == "hello"
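
To show where the two byte layouts above come from, here is a minimal C sketch of the encoding direction. It is not the code from the commit: `utf16_units_to_bytes` is a hypothetical helper, and `uint16_t` again stands in for a 16-bit character type. Each code unit is split into a high and a low byte, and the codec's byte order decides which one is written first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the code from the commit: write each 16-bit
   code unit as two bytes in the requested order. dst must have room for
   2 * n bytes. Returns the number of bytes written. */
static size_t utf16_units_to_bytes(const uint16_t *src, size_t n,
                                   int big_endian, uint8_t *dst)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t hi = (uint8_t)(src[i] >> 8);    /* high byte of the unit */
        uint8_t lo = (uint8_t)(src[i] & 0xFF);  /* low byte of the unit  */
        dst[2 * i]     = big_endian ? hi : lo;
        dst[2 * i + 1] = big_endian ? lo : hi;
    }
    return 2 * n;
}
```

For the code units of "hello" (`0x68 0x65 0x6C 0x6C 0x6F`), the little-endian order gives the bytes `68 00 65 00 6C 00 6C 00 6F 00` and the big-endian order gives `00 68 00 65 00 6C 00 6C 00 6F`, matching the two binaries in the console session.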


Shixin Zeng


Looks cool...I'm not sure how far the "in-the-box" codecs were planning to go vs. done with extensions... – HostileFork says dont trust SE Dec 05 '14 at 13:55
Yes, this is something I am not sure about either, and that's why I asked this question in the first place: to make sure I hadn't missed anything obvious before adding a new codec. Once we start adding codecs, we need to decide where to draw the line. – Shixin Zeng Dec 08 '14 at 16:02
