
I can't use prepackaged Unicode string libraries, such as ICU, because they blow up the size of the binary to an insane degree (it's a 200k program; ICU is 16MB+!).

I'm using the builtin wchar_t string type for everything already, but I want to ensure I'm not doing anything stupid in terms of doing iteration on strings, or things like that.

Are there tools like Fuzzers do for security but for Unicode? That is, throw characters outside of the Basic Multilingual Plane at my code and ensure things get handled correctly as UTF-16?

(Oh, and obviously a cross platform solution works, though most cross platform things would have to support both UTF-8 and UTF-16)

EDIT: Also note things that are less obvious than UTF-16 surrogate pairs -- things like accent marks!

Billy ONeal
  • +1, it's pretty cool that you worry about that. – zneak Jun 20 '11 at 15:45
  • +1 good question. Just a note: `wchar_t` doesn't *imply* that you're doing Unicode at all. Functions like `wprintf` don't really handle Unicode correctly, and you actually have to make sure that your string manipulations take into account characters with multiple code points. In fact, I *think* (though I'm not 100% sure) that functions like `wcsstr` **don't** handle characters above U+0000FFFF correctly, because they just treat the string as though it was using fixed-length encoding. – user541686 Jun 20 '11 at 15:45
  • @Mehrdad: On pretty much every Windows compiler, `wchar_t` means UTF-16. The standard doesn't require that, but all the Windows API functions are written that way. – Billy ONeal Jun 20 '11 at 15:46
  • @Mehrdad: `wcsstr` does not need to be updated to handle characters greater than `U+FFFF`. A plain bytewise comparison is just fine (this is one of the great things about both UTF-8 and UTF-16). Things get more complicated when you want to do things like sorting. – Billy ONeal Jun 20 '11 at 15:48
  • @Billy: On Windows compilers, `wchar_t` is 16-bit. That much is true. But I don't believe it implies UTF-16 for everything -- I'm not even sure `wcsstr` really counts as the "Windows API". (Edit 2: Oops, yes you're right... I forgot that the continuation bits aren't set in the first characters, so yes, it shouldn't be a problem.) – user541686 Jun 20 '11 at 15:48
  • I'm not aware of any unicode compliance libraries that can do this. You could read some japanese strings from a text file and make your own smoke test perhaps? – AJG85 Jun 20 '11 at 15:57
  • @AJG85: I don't think Japanese is enough (does it need multiple code points? I'm not sure) -- he would need to test characters that need multiple code points as well (which might be hieroglyphics, I don't know :P). – user541686 Jun 20 '11 at 16:14
  • @Mehdrad I suppose it could get quite complicated you'd probably need something else like arabic to test RTL as well. – AJG85 Jun 20 '11 at 16:18
  • Why is the size of binary important? – Puppy Jun 20 '11 at 16:32
  • @DeadMG: 1. because a lot of my users are on dialup. 2. because this thing gets downloaded (as part of ComboFix) some 4 million times a month, and I'm (well, friends of mine are) paying for bandwidth. – Billy ONeal Jun 20 '11 at 16:34
  • @Billy: You say "wchar_t means UTF16" on Windows -- I very much doubt that. If you convert the Unicode string "\U0010FFFF" into UTF16 (two code units)", `wcslen()` will say "2", not "1". I'm fairly sure that the wc* routines _expect_ fixed-width strings, and in fact the very definition of `wchar_t` says "big enough for fixed-width use". (It just means Windows isn't enabled for non-BMP Unicode by default.) – Kerrek SB Jun 20 '11 at 17:23
  • @Kerrek: 1. `wcslen` is not a Windows API function. 2. `wcslen` never claimed to do code point decoding. Just as `strlen` is worthless for number of printed characters for UTF-8, `wcslen` is worthless for that for UTF-16. Even if you made it smart enough to handle surrogate pairs, you still wouldn't have a true character count, because things like accent marks are full code points but contribute to a single character. 3. With the exception of `wcslen` and `wcschr`, I'm not aware of problems with UTF-16 that would break any of the `wcsXxx` functions. – Billy ONeal Jun 20 '11 at 17:31

2 Answers


The wrong answer

Use WM_UNICHAR, it handles UTF-32 and can handle Unicode Supplementary Plane characters.

While this is almost true, the complete truth looks like this:

  1. WM_UNICHAR is a hack designed for ANSI Windows to receive Unicode characters. Create a Unicode window and you will never receive it.
  2. Create an ANSI window and you will be surprised that it still doesn't work as expected. The catch is that when the window is created, you receive a WM_UNICHAR with wParam set to 0xFFFF (UNICODE_NOCHAR), to which you must react by returning 1 (the default window procedure returns 0). Fail to do this, and you will never see a WM_UNICHAR again. Naturally, the official documentation doesn't tell you that.
  3. Run your program on a system that, for mysterious reasons, doesn't support WM_UNICHAR (such as my Windows 7 64 system) and it still won't work, even if you do everything correctly.

The theoretically correct answer

There is nothing to audit or to pay attention to.

Compile with UNICODE defined, or explicitly create your window class as well as your window using a "W" function, and use WM_CHAR as if this was the most natural thing to do. That's it. It is indeed the most natural thing.

WM_CHAR uses UTF-16 (except when it doesn't, such as under Windows 2000). Of course, a single UTF-16 character cannot represent code points outside the BMP, but that is not a problem because you simply get two WM_CHAR messages containing a surrogate pair. It's entirely transparent to your application, you do not need to do anything special. Any Windows API function that accepts a wide character string will happily accept these surrogates, too.
The only thing to be aware of is that the character length of a string is (obviously) no longer simply the number of 16-bit code units. But that was a wrong assumption to begin with, anyway.

The sad truth

In reality, on many (most? all?) systems, you just get a single WM_CHAR message with wParam containing only the low 16 bits of the code point. Which is mighty fine for anything within the BMP, but sucks otherwise.

I have verified this both by using Alt-keypad codes and creating a custom keyboard layout which generates code points outside the BMP. In either case, only a single WM_CHAR is received, containing the lower 16 bits of the character. The upper 16 bits are simply thrown away.

In order for your program to work 100% correctly with Unicode, you must apparently use the input method manager (ImmGetCompositionStringW), which is a nuisance and badly documented. For me, personally, this simply means: "OK, screw that". But if you are interested in being 100% correct, look at the source code of any editor using Scintilla (link to line) which does just that and works perfectly.

Damon

Some things to check:

  • Make sure that instead of handling WM_CHAR you're handling WM_UNICHAR:

    The WM_UNICHAR message is the same as WM_CHAR, except it uses UTF-32. It is designed to send or post Unicode characters to ANSI windows, and it can handle Unicode Supplementary Plane characters.

  • Do not assume that the ith character is at index i. It obviously isn't, and if you happen to use that fact for, say, breaking a string in half, then you could be splitting it in the middle of a surrogate pair.

  • Don't tell the user (in a status bar or something) that the user has N characters just because the character array has length N.

user541686
  • @Billy: See my edit. (Something tells me Windows doesn't always really mean "UTF-16" when it says "UTF-16"...) – user541686 Jun 20 '11 at 15:57
  • @Mehrdad: Ah -- I see now. `WM_CHAR` passes a single `wchar_t`, so there'd be no way to pass a surrogate pair. (I was thinking strings, but if you're passing a single codepoint it makes sense) – Billy ONeal Jun 20 '11 at 16:00
  • @Billy: Yeah, but the issue is that if it only passes a single `wchar_t`, then it's not really UTF-16, is it?... – user541686 Jun 20 '11 at 16:01
  • @Mehrdad: It can still be UTF-16 and be a single character. It's just boxed to one character. I guess that's also the same as UCS-2, but it's not necessarily a limitation of the system itself, merely that they only allocated 2 bytes for the character in the returned structure. – Billy ONeal Jun 20 '11 at 16:03
  • @Billy: I see what you mean, but to the end user (and the developer), it *is* a limitation of the system, isn't it? – user541686 Jun 20 '11 at 16:06
  • @Mehrdad: I would agree that it's a limitation. On the other hand I don't think it's wrong to call it UTF-16. – Billy ONeal Jun 20 '11 at 16:17