
Due to program requirements (fast access to individual characters), I am using uint32_t[] to store characters. The array simply stores code points, not UTF-32 code units, because I don't think a UTF-32 code unit and a Unicode code point are the same thing, so I want to keep them separate.

The code points come from the next32PostInc function.

I need to encode these code points into a UTF-8 chunk using ICU (libICU), and it is hard to find a character-level, accumulating encoder. I can see a way using UnicodeString::append(), but it needs a double conversion. The ucnv_convert functions seem to do the job, but only with UTF-32 code units, and I am really not sure whether it is safe to use them with code points. Currently I am looking for something that is the inverse of the next32PostInc function. How can I do that? If my understanding of code points and code units is wrong, please correct me.

eonil

1 Answer


The current Unicode specification defines a UTF-32 code unit as being equal to the code point it encodes.

From the Unicode FAQ:

Given that any industrial-strength text and internationalization support API has to be able to handle sequences of characters, it makes little difference whether the string is internally represented by a sequence of UTF-16 code units, or by a sequence of code-points ( = UTF-32 code units). Both UTF-16 and UTF-8 are designed to make working with substrings easy, by the fact that the sequence of code units for a given code point is unique.

So just use the UTF-32 functions.
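For the "inverse of next32PostInc" part, ICU's unicode/utf8.h has, as far as I know, a U8_APPEND macro that writes one code point into a UTF-8 buffer. To show what such a per-code-point append does without depending on ICU, here is a minimal hand-written sketch in plain C++. The names appendUtf8 and encodeUtf8 are hypothetical helpers made up for this illustration; they are not ICU APIs.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU): appends the UTF-8 encoding of one
// code point to `out`. Returns false and leaves `out` untouched when `cp`
// is not a Unicode scalar value (a surrogate, or anything above U+10FFFF).
bool appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return true;
}

// Hypothetical helper: encodes a whole code-point buffer, appending to `out`.
// Stops at the first invalid value and returns false; bytes appended for the
// earlier, valid code points remain in `out`.
bool encodeUtf8(const std::vector<std::uint32_t>& codePoints, std::string& out) {
    for (std::uint32_t cp : codePoints) {
        if (!appendUtf8(out, cp)) {
            return false;
        }
    }
    return true;
}
```

Because the uint32_t[] values are both code points and UTF-32 code units, the same buffer can also be passed to ICU's UTF-32 conversion routines directly. The sketch only rejects surrogates and out-of-range values; it does no normalization.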

eonil