0

I have a function that takes a wchar_t* as input. Inside this function I have to process code points.

Since this program should run on multiple platforms, I can assume very little about the encoding of the wchar_t*.

I tried to get a conversion from wchar_t* to char32_t* via std::codecvt<char32_t, wchar_t, std::mbstate_t>. Sadly, this specialization does not seem to exist.

Then I thought I might be able to use the wchar_t* directly as a read-only input buffer for icu::UnicodeString, but it seems I first have to convert to UChar* via u_strFromWCS. For that, in turn, I first need to allocate a UChar buffer holding the correct number of UChar code units.

Can someone tell me what the most effective way of accessing code points in a wchar_t* is?

Example:

If I am not mistaken, the following example should make use of two code units per code point.

const wchar_t *test = L"A  剝Ц B";
abergmeier

2 Answers

1

The standard says very little about wchar_t or its encoding, so you cannot have a solution without making some assumptions.

A reasonable assumption is that if sizeof(wchar_t) == 2 (as on Windows), the string is UTF-16, and if sizeof(wchar_t) == 4 (as on most Unix-like systems), it is UTF-32, so you can use macros or templates to select the decoding at compile time. If the wchar_t data might instead be in some legacy encoding, there is no general way to detect that automatically, so you have to get the encoding information from elsewhere.
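To make that concrete, here is a minimal sketch of the size-based approach. The helper name to_code_points is made up for illustration (it is not part of the standard library or ICU). It assumes the input is null-terminated and is either UTF-16 (2-byte wchar_t) or UTF-32 (anything else). Unpaired surrogates are replaced with U+FFFD, which is one possible policy, not the only one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: decode a null-terminated wchar_t string into code points.
// Assumption: 2-byte wchar_t means UTF-16, any other size means UTF-32.
std::vector<char32_t> to_code_points(const wchar_t* s) {
    std::vector<char32_t> out;
    if (!s) return out;
    for (std::size_t i = 0; s[i] != L'\0'; ++i) {
        char32_t c = static_cast<char32_t>(s[i]);
        if (sizeof(wchar_t) == 2) {
            c &= 0xFFFFu;  // guard against sign extension of a 16-bit unit
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High surrogate: s[i + 1] is at worst the terminator,
                // so reading it is safe.
                char32_t lo = static_cast<char32_t>(s[i + 1]) & 0xFFFFu;
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                    ++i;  // consume the low surrogate as well
                } else {
                    c = 0xFFFD;  // unpaired high surrogate
                }
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                c = 0xFFFD;  // stray low surrogate
            }
        }
        out.push_back(c);
    }
    return out;
}
```

Calling to_code_points(test) on the string from the question would then give one char32_t per code point on either kind of platform, as long as the encoding assumption holds.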

Siyuan Ren
-1

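Put simply, each wchar_t holds one Unicode character (strictly, one code unit, which is a whole character only when no surrogate pairs are involved). In my code, I often access each character code by index (if I didn't misunderstand your question).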

const wchar_t* unicodeString = L"this is a unicode string";

unicodeString[0] is a single character

Hai Nguyen