The wrong answer
Use WM_UNICHAR, it handles UTF-32 and can handle Unicode Supplementary Plane characters.
This is almost true, but the complete truth looks like this:
- WM_UNICHAR is a hack designed for ANSI windows to receive Unicode characters. Create a Unicode window and you will never receive it.
- Create an ANSI window and you will be surprised that it still doesn't work as expected. The catch is that when the window is created, you receive a WM_UNICHAR with 0xffff (UNICODE_NOCHAR), to which you must react by returning 1 (the default window procedure will return 0). Fail to do this, and you will never see a WM_UNICHAR again. Good job that the official documentation doesn't tell you that.
- Run your program on a system that, for mysterious reasons, doesn't support WM_UNICHAR (such as my 64-bit Windows 7 system) and it still won't work, even if you do everything correctly.
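The handshake from the second point can be modelled without any Windows headers. This is only a sketch of the decision a window procedure has to make: `sketch_on_unichar` and the `SKETCH_` constant are hypothetical names, and returning 0 for an ordinary character is an assumption, not something verified above.

```c
#include <stdint.h>

/* Same value as UNICODE_NOCHAR in the Windows SDK headers; repeated here
   so the sketch compiles without <windows.h>. */
#define SKETCH_UNICODE_NOCHAR 0xFFFFu

/* Hypothetical helper: given the wParam of a WM_UNICHAR message, returns
   the value the window procedure should return.
   - For the capability probe (0xFFFF) it returns 1 and sets *got_char to 0.
   - For anything else it stores the UTF-32 code point in *out_cp,
     sets *got_char to 1 and returns 0 (assumed return value). */
static int sketch_on_unichar(uint32_t wparam, uint32_t *out_cp, int *got_char)
{
    *got_char = 0;
    if (wparam == SKETCH_UNICODE_NOCHAR) {
        return 1;               /* "yes, this window understands WM_UNICHAR" */
    }
    *out_cp = wparam;           /* a full code point, no surrogates involved */
    *got_char = 1;
    return 0;
}
```

In a real window procedure this logic would sit in the `case WM_UNICHAR:` branch, before falling back to the default window procedure for everything else.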
The theoretically correct answer
There is nothing to audit or to pay attention to.
Compile with UNICODE defined, or explicitly create your window class as well as your window using a "W" function (such as RegisterClassW and CreateWindowExW), and use WM_CHAR as if this was the most natural thing to do. That's it. It is indeed the most natural thing.
WM_CHAR uses UTF-16 (except when it doesn't, such as under Windows 2000). Of course, a single UTF-16 code unit cannot represent code points outside the BMP, but that is not a problem because you simply get two WM_CHAR messages containing a surrogate pair. It's entirely transparent to your application; you do not need to do anything special. Any Windows API function that accepts a wide character string will happily accept these surrogates, too.
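If you do want the actual code point rather than just passing the two halves along, combining them is simple arithmetic. The following is a minimal sketch in portable C, assuming each call receives one 16-bit unit (for example the wParam of a WM_CHAR message); `sketch_utf16_state` and `sketch_feed_unit` are hypothetical names, and silently dropping unpaired surrogates is a simplification chosen for brevity.

```c
#include <stdint.h>

/* Hypothetical decoder state: remembers a high surrogate until its
   partner arrives. Zero means "nothing pending". */
typedef struct {
    uint16_t pending_high;
} sketch_utf16_state;

/* Feed one UTF-16 code unit. Returns 1 and writes *out_cp when a complete
   code point is available, 0 when more input is needed or when an
   unpaired surrogate was dropped. */
static int sketch_feed_unit(sketch_utf16_state *st, uint16_t unit, uint32_t *out_cp)
{
    if (unit >= 0xD800u && unit <= 0xDBFFu) {   /* high surrogate: remember it */
        st->pending_high = unit;
        return 0;
    }
    if (unit >= 0xDC00u && unit <= 0xDFFFu) {   /* low surrogate */
        uint16_t high = st->pending_high;
        st->pending_high = 0;
        if (high == 0) {
            return 0;                           /* unpaired low surrogate: drop */
        }
        *out_cp = 0x10000u
                + (((uint32_t)high - 0xD800u) << 10)
                + ((uint32_t)unit - 0xDC00u);
        return 1;
    }
    st->pending_high = 0;                       /* drop any stray high surrogate */
    *out_cp = unit;                             /* BMP code point, used as-is */
    return 1;
}
```

Feeding 0xD83D and then 0xDE00 yields U+1F600 on the second call; anything within the BMP comes straight back on the first.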
The only thing to be aware of is that now the character length of a string (obviously) is no longer simply the number of 16-bit words. But that was a wrong assumption to begin with, anyway.
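To make the length point concrete, here is a small sketch that counts code points in a UTF-16 buffer instead of counting 16-bit words. `sketch_count_code_points` is a hypothetical helper, and counting each unpaired surrogate as one code point is an assumption made to keep it short. Note that a code point count is still not the same thing as the number of user-perceived characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts code points in a buffer of n UTF-16 code units. A low surrogate
   that directly follows a high surrogate is part of the code point that
   was already counted, so it is skipped. */
static size_t sketch_count_code_points(const uint16_t *units, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        int is_low = units[i] >= 0xDC00u && units[i] <= 0xDFFFu;
        int prev_is_high = i > 0
                        && units[i - 1] >= 0xD800u
                        && units[i - 1] <= 0xDBFFu;
        if (!(is_low && prev_is_high)) {
            ++count;
        }
    }
    return count;
}
```

For the four units 0x0041, 0xD83D, 0xDE00, 0x0042 this returns 3, not 4.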
The sad truth
In reality, on many (most? all?) systems, you just get a single WM_CHAR message with wParam containing the low 16 bits of the character code. Which is mighty fine for anything within the BMP, but sucks otherwise.
I have verified this both by using Alt-keypad codes and by creating a custom keyboard layout which generates code points outside the BMP. In either case, only a single WM_CHAR is received, containing the lower 16 bits of the character. The upper bits are simply thrown away.
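The difference between what arrives and what should arrive is easy to show in portable C. This is only an illustration of the arithmetic, not of any Windows API: `sketch_truncate` and `sketch_encode_utf16` are hypothetical names, and the encoder assumes a valid code point (no range or surrogate checks).

```c
#include <stdint.h>

/* What the behaviour described above delivers: the low 16 bits only. */
static uint16_t sketch_truncate(uint32_t cp)
{
    return (uint16_t)(cp & 0xFFFFu);
}

/* What a correct UTF-16 encoding would deliver. Writes one or two code
   units to out and returns how many were written. Assumes cp is a valid
   code point. */
static int sketch_encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;                  /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000u;                             /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));  /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+1F600, the correct encoding is the pair 0xD83D 0xDE00, whereas truncation gives 0xF600, which is an unrelated code point in the Private Use Area. The original character cannot be recovered from it.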
In order for your program to work 100% correctly with Unicode, you must apparently use the input method manager (ImmGetCompositionStringW), which is a nuisance and badly documented. For me, personally, this simply means: "OK, screw that". But if you are interested in being 100% correct, look at the source code of any editor using Scintilla (link to line) which does just that and works perfectly.