
I'm looking into using ICU for Unicode string processing in a native Node.js module because it seems to me that v8::String (according to these docs) doesn't have a C++ API for this purpose.

To my knowledge V8 expects UTF-16 in ExternalStringResource and other APIs, so I'd like to use ICU for UTF-16 processing. I specifically need to:

  • Iterate over the characters (code points, not just the 16-bit code units) of a UTF-16 string
  • Tell the number of characters (code points, not just the 16-bit code units) that a UTF-16 string contains

So I looked at the ICU documentation and found the UnicodeString and CharacterIterator classes. However, UnicodeString doesn't have a fromUTF16 method, only fromUTF8 and fromUTF32.

The other thing I'm unsure about is whether the UnicodeString constructor copies the data I give it. I'd much prefer a zero-copy approach: the string is immutable, so nothing should need to be copied, and ICU could just read the buffer I point it at.

I'm also unsure whether I can just use UCharIterator (assuming I can somehow obtain a UChar* from my UTF-16 strings).

So my question is: How do I use ICU for the above purposes?

halfer
Venemo

1 Answer


UnicodeString stores its contents as UTF-16. That's why it only has fromUTF8 and fromUTF32: from UTF-16 there is no conversion to be made, and the constructors accept UTF-16 code units directly.

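The ordinary constructors copy the data. It is an owning string, much like std::string. (UnicodeString also has a read-only aliasing constructor, the one that takes an isTerminated flag, which wraps an existing buffer without copying as long as the string is not modified.)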

You can use UCharIterator if you don't want to copy the data. UChar is a 16-bit value. You can force it to be whatever 16-bit type you prefer working with by defining the UCHAR_TYPE macro:

Define UChar to be UCHAR_TYPE, if that is #defined (for example, to char16_t), or wchar_t if that is 16 bits wide; always assumed to be unsigned.

If neither is available, then define UChar to be uint16_t.

This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.
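
To make the code unit versus code point distinction concrete, here is a minimal sketch in plain C++ (no ICU required) that walks a char16_t buffer in place, without copying it. The helper names next_code_point and count_code_points are made up for this illustration, and passing unpaired surrogates through unchanged is an assumption of the sketch. ICU provides the same decoding step as the U16_NEXT macro in unicode/utf16.h, counting as u_countChar32 in unicode/ustring.h, and iteration through uiter_setString with uiter_next32 in unicode/uiter.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: decode the code point that starts at units[i]
// and advance i past it (by one or two code units).
// Assumption: an unpaired surrogate is returned unchanged.
inline char32_t next_code_point(const char16_t* units, std::size_t length,
                                std::size_t& i) {
    assert(i < length);
    char32_t c = units[i++];
    // A lead surrogate followed by a trail surrogate encodes one
    // supplementary code point (U+10000 and above).
    if (c >= 0xD800 && c <= 0xDBFF && i < length) {
        const char16_t trail = units[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}

// Hypothetical helper: count code points by decoding the buffer once.
inline std::size_t count_code_points(const char16_t* units,
                                     std::size_t length) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < length) {
        next_code_point(units, length, i);
        ++count;
    }
    return count;
}
```

For the four code units 'a', 0xD83D, 0xDE00, 'b', the loop visits three code points, the middle one being U+1F600. The buffer is only read, so it can point straight at the memory backing an external string.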

Community
R. Martinho Fernandes
  • Martinho, thanks for the answer! :) Where is it documented that `UCharIterator` works with UTF-16? I couldn't find it in the docs. – Venemo Nov 07 '13 at 20:46
  • @Venemo see [uiter_SetString](http://icu-project.org/apiref/icu4c/uiter_8h.html#a373dbf81553f2f3553b64c31e3c6147f). It just says it iterates over a "string", however the ICU API and docs often take that just to mean a UTF-16 string (there are historical reasons for this thing; ICU has been around for a long time). You can see that it uses UTF-16 because it takes a pointer to UChar. – R. Martinho Fernandes Nov 08 '13 at 09:38
  • Yes, but I wasn't able to find it in the docs where it says that a `UChar` means *UTF-16 code unit* – Venemo Nov 08 '13 at 11:16