
Windows defines the wchar_t type to be 16 bits long. However, the UTF-16 encoding it uses means that some symbols are actually encoded with two 16-bit code units, i.e. 4 bytes (32 bits).

Does this mean that if I'm developing an application for Windows, the following statement:

wchar_t symbol = ... // Whatever

might only represent a part of the actual symbol?


And what will happen if I do the same under *nix, where wchar_t is 32 bits long?

Yippie-Ki-Yay

1 Answer


Yes, it means that symbol may hold only one half of a surrogate pair on Windows. On *nix systems wchar_t is 32 bits long and can hold any Unicode code point. Note that a Unicode code point doesn't necessarily represent a character, since some characters are encoded by more than one code point, so it doesn't make sense to count characters this way at all. In particular, this implies that it doesn't make sense to use anything other than UTF-8 encoded narrow-char strings anywhere outside Unicode libraries, even on Windows.
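
To make the surrogate-pair point concrete, here is a minimal sketch. It uses char16_t rather than wchar_t so that it behaves the same on Windows and *nix; the helper names (is_high_surrogate, is_low_surrogate, combine_surrogates, count_code_points) are made up for illustration and are not part of any standard library.

```cpp
#include <cstddef>
#include <string>

// A UTF-16 code unit in 0xD800..0xDBFF is the first half of a surrogate pair.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// A UTF-16 code unit in 0xDC00..0xDFFF is the second half of a surrogate pair.
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a high/low surrogate pair into the code point it encodes.
// Assumes the caller has already checked both halves.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000
         + ((static_cast<char32_t>(hi) - 0xD800) << 10)
         + (static_cast<char32_t>(lo) - 0xDC00);
}

// Counts code points in a UTF-16 string. A well-formed surrogate pair
// counts once; an unpaired surrogate is counted as one unit (an
// arbitrary choice for this sketch).
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}
```

For example, U+1F600 is stored as the two code units 0xD83D 0xDE00, so a single 16-bit variable can only ever hold one of the two halves.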

Read this old thread for details.

Yakov Galka
  • You mistook _code point_ for _code unit_. Each character is associated with only one code point and may be represented by more than one code unit. – ExpExc Dec 04 '11 at 14:30
  • @ExpExc: No, I didn't. A character may be represented by more than one *codepoint*, and of course by more than one *codeunit*. E.g. `U+0061 U+0306` is two *code-points* and represents the single character "ă". In CJK scripts it's even more apparent. – Yakov Galka Dec 04 '11 at 15:51
  • Also on Windows, you should NOT use UTF-8 encoded strings when interacting with the OS, since the OS doesn't natively interpret UTF-8 strings. When interacting with Windows APIs, you should use UTF-16 strings. If you insist on using UTF-8, you need to call MultiByteToWideChar (specifying CP_UTF8) to convert from UTF-8 to UTF-16 before passing the string to the Windows APIs. It's far easier to simply code your application as a UTF-16 application than to deal with the UTF-8->UTF-16 conversion. 8-bit characters in Windows are NOT UTF-8 - they're either in the ANSI code page or in the OEM code page. – Larry Osterman Dec 04 '11 at 15:53
  • ... and due to an old & deeply rooted Windows bug (multi-byte is interpreted as double-byte) you can't set `CP_UTF8` as the ANSI code page. – MSalters Dec 05 '11 at 08:27
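
The code unit vs. code point distinction argued about in the comments can be illustrated with a small sketch. It stores the sequence U+0061 U+0306 in three encodings and compares the counts; utf8_code_points and the three constants are hypothetical names for this example, and the counter assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "a" followed by U+0306 (combining breve): one visible character,
// two code points.
const std::string    decomposed_utf8  = "a\xCC\x86";  // 3 code units (bytes)
const std::u16string decomposed_utf16 = u"a\u0306";   // 2 code units
const std::u32string decomposed_utf32 = U"a\u0306";   // 2 code units
```

Here the number of code units differs per encoding (3, 2 and 2), the number of code points is 2 in all of them, and the reader still sees a single character. None of the three counts tells you how many characters are displayed.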