2

I've seen a few other posts on this issue, but I was unable to find any details on how to determine programmatically whether a code point needs more than one 2-byte (on Windows) wchar_t.

An example:

const wchar_t* s2 = L"\U0002008A"; // U+2008A, a Han character outside the BMP
std::wstring in(s2);               // length() == 2

I'd like to know how to determine when a character will have a length() > 1.

Vitaly
  • Just check for the proper ranges according to the UTF-16 encoding (easy to google). You most likely won't find anything more sophisticated. – Šimon Tóth Apr 18 '13 at 16:39

1 Answer

5

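Any code point above U+FFFF is encoded in UTF-16 as a surrogate pair, that is, two 16-bit code units. Surrogate values lie in the range D800-DFFF: the first unit of a pair (the high surrogate) is in D800-DBFF and the second (the low surrogate) is in DC00-DFFF. On Windows, where wchar_t is 16 bits, a wchar_t in the D800-DBFF range therefore marks the start of a character that takes up two elements of the string.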
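As a rough sketch of that range check: the helper names below (is_high_surrogate, is_low_surrogate, utf16_length, count_code_points) are made up for illustration and are not part of any library. The sketch uses char16_t and std::u16string so that it behaves the same on every platform; on Windows the same comparisons should apply to wchar_t and std::wstring, assuming wchar_t holds UTF-16 code units there.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: classify a single UTF-16 code unit.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Number of UTF-16 code units needed to encode a code point:
// anything above U+FFFF needs a surrogate pair.
inline std::size_t utf16_length(char32_t cp) { return cp > 0xFFFF ? 2 : 1; }

// Count code points in a UTF-16 string by treating each well-formed
// high+low surrogate pair as one character. Unpaired surrogates are
// counted as one character each (an assumption of this sketch).
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i; // skip the low surrogate of this pair
        }
        ++n;
    }
    return n;
}
```

With the string from the question, the size of u"\U0002008A" is 2 code units, while count_code_points reports a single character.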

bames53