Accessing code points of wchar_t*

Question

I have a function, which gets a wchar_t* as input. Now in this function I have to process code points.

Since this program should run on multiple platforms, I have very little knowledge about the encoding in the wchar_t*.

I tried to get a conversion from wchar_t* to char32_t* via std::codecvt<char32_t, wchar_t, std::mbstate_t>. Sadly, this specialization does not seem to exist.

Then I thought that I might be able to simply use the wchar_t* as a read-only input buffer to icu::UnicodeString, but it seems I first have to convert to UChar* via u_strFromWCS. For that, in turn, I first need to allocate a UChar buffer with the correct number of UChar code units.

Can someone tell me what the most effective way of accessing code points in a wchar_t* is?

Example:

If I am not mistaken, the following example should make use of two code units per code point.

const wchar_t *test = L"A  剝Ц B";

"very little knowledge about the encoding". Without knowing, it's impossible. — deviantfan, Apr 27 '14 at 09:44
Are you sure it is Unicode? The Chinese/Japanese also use DBCS, which is different from Unicode. If it is something like GB or Big5 or Shift-JIS, it is DBCS. — cup, Apr 27 '14 at 13:56
You should not be using wchar_t in this case. See utf8everywhere.org. — Pavel Radzivilovsky, Apr 28 '14 at 16:46
@abergmeier u_strFromWCS does its level best. But, if you're already using ICU, you should just use ICU functionality everywhere. Whether L"川" is Unicode or not depends on your compiler flags and platform. — Steven R. Loomis, Apr 29 '14 at 00:26

score 1 · Answer 1 · answered Apr 27 '14 at 13:46

The standard says very little about wchar_t or its encoding, so you cannot have a solution without making some assumptions.

A reasonable assumption is that if sizeof(wchar_t) == 2 (as on Windows), the string is UTF-16, while if sizeof(wchar_t) == 4 (as on Unix), it is UTF-32, so you can use macros or templates to select the right decoding at compile time. If the wchar_t data might instead be in some legacy encoding, you have to get the encoding information from elsewhere, because there is no general way to detect an encoding automatically.
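
As a minimal sketch of that assumption (not a definitive implementation), the following decodes a wchar_t string into code points by checking sizeof(wchar_t). The function name to_code_points is hypothetical, and replacing unpaired surrogates with U+FFFD is one possible design choice, not something the standard prescribes. If the input is not actually UTF-16 or UTF-32, the result is meaningless.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: decodes a null-terminated wchar_t string into
// Unicode code points.
// ASSUMPTION: a 2-byte wchar_t holds UTF-16, any other size holds UTF-32.
std::vector<char32_t> to_code_points(const wchar_t* s) {
    std::vector<char32_t> out;
    if (s == nullptr) return out;
    const bool utf16 = (sizeof(wchar_t) == 2);
    for (std::size_t i = 0; s[i] != L'\0'; ++i) {
        char32_t c = static_cast<char32_t>(s[i]);
        if (utf16) {
            c &= 0xFFFF;  // guard against sign extension if wchar_t is signed
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High surrogate: expect a low surrogate next.
                // Reading s[i + 1] is safe: at worst it is the terminator.
                char32_t lo = static_cast<char32_t>(s[i + 1]) & 0xFFFF;
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                    ++i;  // consume the low surrogate as well
                } else {
                    c = 0xFFFD;  // unpaired high surrogate
                }
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                c = 0xFFFD;  // stray low surrogate
            }
        }
        out.push_back(c);  // UTF-32 case: one code unit is one code point
    }
    return out;
}
```

Calling to_code_points(L"\U0001F600") should yield a single element on both kinds of platform, because the compiler stores that literal as one code unit when wchar_t is 4 bytes and as a surrogate pair when it is 2 bytes.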

score -1 · Answer 2 · answered Apr 27 '14 at 10:28

Simply put, a wchar_t contains a Unicode character. In my code, I often access each character code by index (if I didn't misunderstand your question).

const wchar_t* unicodeString = L"this is a unicode string";

unicodeString[0] is a single character

Hai Nguyen

This probably does not hold up when the platform defines `wchar_t` as 16 bit and you have a non UTF-16 character in there. – abergmeier Apr 27 '14 at 10:31
Right, you cannot assume anything in a cross-platform way. – Steven R. Loomis Apr 29 '14 at 00:24
