How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

Question

I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

That is:

- Many programmers seem to feel that UTF-16 is harmful because it is a variable-length code.
- wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/MacOS.
- The Windows APIs use wide characters, not Unicode.

So what does Windows do when you want to code something like (U+2008A) Han Character on Windows?

That's what I thought too. However, I just successfully edited a filename on my Windows computer to contain a (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T. (see http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful). So clearly, even if Windows is using UCS-2, it allows me to put an SMP character in a filename. So how does it do that? — vy32, Oct 24 '11 at 01:25
According to [Wikipedia](http://en.wikipedia.org/wiki/NTFS#Internals), "NTFS allows any sequence of 16-bit values for name encoding (file names, stream names, index names, etc.). This means UTF-16 codepoints are supported, but the file system does not check whether a sequence is valid UTF-16 (it allows any sequence of short values, not restricted to those in the Unicode standard)". — Keith Thompson, Oct 24 '11 at 02:30
@K-ballo: Windows hasn't used UCS-2 since NT4. Starting with Windows 2000, everything uses UTF-16 now. — Remy Lebeau, Oct 26 '11 at 00:48
@hippietrail, what's with the new tag? It doesn't seem relevant here whatsoever... — Charles, May 30 '13 at 01:31
@Charles: It's specifically for "unicode characters outside the basic multilingual plane", an exact phrase taken from the question title. But my tag wiki has not yet been approved by a mod, which explains it. Basically "astral plane" has become a common unofficial term for the Unicode planes beyond the BMP. I avoided "BMP" in the tag because it would create worse confusion due to the image file format called BMP. If you click on the tag and see which other questions now use it I believe you will see the relevance. — hippietrail, May 30 '13 at 01:36
@hippietrail, it kind of strikes me as a bad tag name, especially given that the term appears in only one question on the entire site *and* that it's unofficial slang... I suppose it makes sense, but it still just rubs me the wrong way, that's all. — Charles, May 30 '13 at 01:46
Strange. I'm finding more and more questions using the term. Please feel free to suggest tag synonyms but Unicode doesn't offer a single term to cover all the other planes, just four or so ugly and unwieldy names for each of them and so far nobody is asking questions about those individual planes. I did put some thought into the name and so far it strikes me as the best compromise. Most people are using wording like "not in", "beyond", "other than" together with "bmp" or "basic multilingual plane" but those don't seem to lead to great tag names ... — hippietrail, May 30 '13 at 02:01

score 17 · Accepted Answer · answered Oct 24 '11 at 19:50

The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units.

So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
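As a rough illustration of the "higher level processing" mentioned above, here is a minimal sketch. It uses `char16_t` and `std::u16string` as a portable stand-in for the 16-bit `wchar_t` on Windows (that substitution is an assumption), and the helper names `append_utf16` and `count_code_points` are hypothetical, not part of any Windows or standard library API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode scalar value to a UTF-16 string.
// BMP code points take one code unit; everything above U+FFFF takes two.
void append_utf16(std::u16string& out, char32_t cp) {
    // Hypothetical helper: only valid scalar values are accepted.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}

// Count code points, treating a well-formed surrogate pair as one character.
// An unpaired surrogate is counted as one (ill-formed) item.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_low = i + 1 < s.size() &&
                              s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_low) ++i;  // skip the low half of the pair
        ++n;
    }
    return n;
}
```

Appending U+1D565 this way gives a string whose `size()` is 2, while `count_code_points` reports 1. The string class itself only ever sees the two code units.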

Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can call a file .txt and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).

But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact you can create a file called .txt in the same folder as .txt, where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt programmatically.
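To make the last example concrete, a low surrogate followed by a high surrogate is not valid UTF-16, yet nothing stops it from being stored as a sequence of 16-bit units. The check below is a minimal sketch, again using `char16_t` as an assumed stand-in for the Windows `wchar_t`; `is_well_formed_utf16` and `check_examples_from_answer` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if every high surrogate is immediately followed by a low
// surrogate and no low surrogate appears on its own.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: the next unit must be a low surrogate.
            if (i + 1 == s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return false;
            }
            ++i;  // consume the low surrogate too
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;  // low surrogate with no preceding high surrogate
        }
    }
    return true;
}

// The swapped pair from the answer is ill-formed; the same units in the
// right order form one valid supplementary character.
void check_examples_from_answer() {
    const std::u16string swapped = {char16_t(0xDC01), char16_t(0xD801)};
    const std::u16string paired = {char16_t(0xD801), char16_t(0xDC01)};
    assert(!is_well_formed_utf16(swapped));
    assert(is_well_formed_utf16(paired));
}
```

A UTF-16-oblivious layer never runs a check like this, which is why both sequences are accepted as names.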

This is how pedants can have a nice long and basically meaningless argument about whether Windows “supports” UTF-16 strings or only UCS-2.

+1 for "supports". First define your nomenclature, then argue. :) — Prof. Falken, Oct 24 '11 at 20:03
Those `.txt`-s are all the same (a "?") for me in Chrome, that is not intended, right? ;) — mlvljr, Nov 08 '14 at 05:45

score 9 · Answer 2 · answered Oct 24 '11 at 19:56

Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.

Not all third party programs handle this correctly and so may be buggy with data outside the BMP.

Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).
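The need for a wider type can be sketched as follows. A single 16-bit unit cannot hold a code point above U+FFFF, so any per-character interface has to return something larger. This is only an illustration: `char16_t` stands in for the Windows `wchar_t`, `next_code_point` is a hypothetical helper and not the Windows function mentioned above, and returning U+FFFD for an unpaired surrogate is one assumed policy among several.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// The result is a char32_t because a supplementary character does not fit
// in one 16-bit unit. An unpaired surrogate yields U+FFFD.
char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    const char16_t u = s[i++];
    if (u >= 0xD800 && u <= 0xDBFF && i < s.size()) {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;  // consume the low surrogate
            return 0x10000 + ((static_cast<char32_t>(u) - 0xD800) << 10) +
                   (static_cast<char32_t>(lo) - 0xDC00);
        }
    }
    if (u >= 0xD800 && u <= 0xDFFF) {
        return 0xFFFD;  // unpaired surrogate
    }
    return u;  // ordinary BMP code unit
}
```

Called on the units D840 DC8A, this returns 0x2008A and advances the index by two, which is exactly the information a function taking one wchar_t has no way to receive or return.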
