
As we know, on Windows we can set the locale language for non-Unicode programs under "Control Panel\Clock, Language, and Region". But what does a locale language mean to an application? To my understanding, an application is a compiled binary executable containing only machine code instructions and no data, so how does the character encoding affect the way it runs?

One guess is that if the executable file contains literal strings in its code segment, it will use some internal charset to encode them. If that charset is not Unicode, then it will display garbage. But isn't the internal charset a fixed one? For example, in Java, the spec defines the internal encoding to be UTF-16.

Hope someone can answer my questions,

Thanks.

Alfred
  • Keep in mind that Unicode does not imply UTF-16, but on Windows it does. They should have gone with UTF-8 over 15 years ago, and this problem would not exist. – Matt Joiner Oct 07 '10 at 09:20
  • @Matt Joiner: Actually, this issue would still exist. Remember, we're talking about _non-Unicode_ programs here. They don't care at all whether _Unicode_ programs use UTF-8 or UTF-16. – MSalters Oct 08 '10 at 14:35

3 Answers


Windows has two methods by which programs can talk to it, called the "ANSI API" and the "Unicode API", and a "non-unicode application" is one that talks to Windows via the "ANSI API" rather than the "Unicode API".

What that means is that any string that the application passes to Windows is just a sequence of bytes, not a sequence of Unicode characters. Windows has to decide which characters that sequence of bytes corresponds with, and the Control Panel setting you're talking about is how it does that.

So for example, a non-unicode program that outputs a byte with value 0xE4 on a PC set to use Windows Western will display the character ä, whereas one set up for Hebrew will display the character ה.
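
As an illustration, here is a minimal C sketch of such a program (built without UNICODE defined, so it uses the *A entry points); the byte 0xE4 it passes is interpreted according to whatever ANSI code page that Control Panel setting selects:

    /* Minimal sketch of a "non-Unicode" (ANSI) program.
     * Build without defining UNICODE, e.g.:  cl ansi_demo.c user32.lib
     * The byte 0xE4 has no meaning by itself; Windows interprets it using
     * the system ANSI code page selected in the Control Panel setting. */
    #include <windows.h>

    int main(void)
    {
        /* "\xE4" is just the single byte 0xE4 plus a terminating NUL. */
        MessageBoxA(NULL, "\xE4", "ANSI demo", MB_OK);
        /* Code page 1252 (Western): the box shows ä
         * Code page 1255 (Hebrew):  the same byte shows ה */
        return 0;
    }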

RichieHindle
  • And with the "ANSI API", one byte means one character on screen. In Unicode, a character on screen can be represented by more than one byte. – Prof. Falken Oct 07 '10 at 08:23
  • @Amigable Clark Kant: Not always true - "double-byte character sets" (see http://msdn.microsoft.com/en-us/library/dd317794%28VS.85%29.aspx) still use the ANSI API. Otherwise there could have been no Chinese version of Windows before Unicode! – RichieHindle Oct 07 '10 at 08:33
  • It should also be noted that Microsoft could easily add UTF-8 as a supported multibyte character set and make the whole problem go away, but they *refuse to do so*. – R.. GitHub STOP HELPING ICE Oct 07 '10 at 16:40
  • @RichieHindle: Nice explanation. As you said, when an application calls a Windows API it just passes in "a sequence of bytes". So is that "sequence of bytes" in the same encoding as the source code? I mean, if the source code is written in UTF-8, then they are UTF-8; if the source code is in GBK, then the sequence of bytes is in GBK. That would mean ANSI C does not have a fixed internal encoding the way Java does (UTF-16). – Alfred Oct 07 '10 at 19:40
  • @Guoqin: No, C does not define a standard encoding for its source code, or for string literals. A string literal output by a non-Unicode program will consist of the same bytes that were present in the source code, whatever encoding it used. – RichieHindle Oct 07 '10 at 22:54
  • @RichieHindle: Actually, the compiler has to translate from the _source character set_ to the _execution character set_, so technically, a string literal output by a non-Unicode program _doesn't need to_ consist of the same bytes present in the source code. – ninjalj Oct 07 '10 at 23:31
  • @Guoqin: the character set (and the encoding!) of C source doesn't need to be the same as the character set used in object files. In fact, properly internationalizable C source for Win32 ANSI will typically be pure ASCII (i.e. 0-127), and characters outside ASCII will appear only in resource files. – ninjalj Oct 07 '10 at 23:35
  • @ninjalj: Then how is the string literal in the source converted when compiled into the object file? How is the execution character set decided? – Alfred Oct 08 '10 at 00:10
  • @Guoqin: the "source character set" and "execution character set", as far as the compiler is concerned, usually only include the subset of ASCII which is mandated to exist (but not necessarily with ASCII encoding) by the standard. Since this will have the same (ASCII) encoding regardless of what locale/codepage junk is selected, it's largely irrelevant. Source/execution character set differences would only come into play if you had a cross-compiler running on an ASCII-based machine compiling binaries for an EBCDIC-based machine, or vice-versa. – R.. GitHub STOP HELPING ICE Oct 08 '10 at 07:04
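
A small, portable C sketch illustrating the point made in the comments above: the bytes stored for a string literal are whatever the compiler wrote for it (in the usual case where source and execution character sets match, the same bytes as in the source file), not some fixed internal encoding.

    /* Dumps the raw bytes the compiler stored for a string literal.
     * The output depends on the encoding of this source file and on the
     * compiler's execution character set - there is no fixed internal encoding. */
    #include <stdio.h>

    int main(void)
    {
        const char s[] = "ä";   /* 1 byte in a CP1252/Latin-1 source, 2 bytes in a UTF-8 source */
        for (size_t i = 0; s[i] != '\0'; ++i)
            printf("%02X ", (unsigned char)s[i]);
        putchar('\n');
        return 0;
    }

Compiled from a file saved as CP1252 this typically prints E4; compiled from the same file saved as UTF-8 it typically prints C3 A4.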

RichieHindle correctly explains that there are two variants of most APIs, a *W (Unicode) and an *A (ANSI) variant. But after that he's slightly wrong.

It's important to know that the *A variants (such as MessageBoxA) are just wrappers for the *W versions (such as MessageBoxW). They take the input strings and convert them to Unicode; they take the output strings and convert them back.
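
Conceptually, an *A wrapper behaves something like the sketch below (a hypothetical illustration, not Microsoft's actual code): it converts the caller's bytes to UTF-16 using the system ANSI code page, CP_ACP, which is exactly what that Control Panel setting selects, and then calls the *W function.

    /* Hypothetical sketch of what an *A wrapper does internally;
     * buffer sizes are simplified and error handling is omitted. */
    #include <windows.h>

    int MyMessageBoxA(HWND hwnd, const char *text, const char *caption, UINT type)
    {
        WCHAR wtext[1024], wcaption[1024];

        /* CP_ACP = the system ANSI code page, i.e. the code page chosen by the
         * "language for non-Unicode programs" setting. */
        MultiByteToWideChar(CP_ACP, 0, text,    -1, wtext,    1024);
        MultiByteToWideChar(CP_ACP, 0, caption, -1, wcaption, 1024);

        return MessageBoxW(hwnd, wtext, wcaption, type);   /* forward to the Unicode API */
    }

The real wrappers manage buffers and errors properly; the point is only that the conversion uses CP_ACP, which is why the Control Panel setting affects *A calls and only *A calls.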

In the Windows SDK, for all such A/W pairs, there is a #ifdef UNICODE block such that MessageBox() is a macro that expands to either MessageBoxA() or MessageBoxW(). Because all macros use the same condition, many programs use either 100% *A functions or 100% *W functions. "non-Unicode" applications are then those that have not defined UNICODE, and therefore use the *A variants exclusively.
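
Simplified, that pattern in the SDK headers looks roughly like this:

    /* Rough sketch of the A/W macro pattern used throughout the Windows SDK headers. */
    #ifdef UNICODE
    #define MessageBox  MessageBoxW    /* UNICODE defined: calls resolve to the *W (Unicode) function */
    #else
    #define MessageBox  MessageBoxA    /* UNICODE not defined: calls resolve to the *A (ANSI) function */
    #endif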

However, there is no reason why you can't mix-and-match *A and *W functions. Would programs that mix *A and *W functions be considered "Unicode", "non-Unicode" or even something else? Actually, the answer is also mixed. When it comes to that Clock, Language, and Region setting, an application is considered a Unicode application when it's making a *W call, and a non-Unicode application when it's making a *A call - the setting controls how the *A wrappers translate to *W calls. And in multi-threaded programs, you can therefore be both at the same time (!)

So, to come back to RichieHindle's example, if you call a *A function with value (char)0xE4, the wrapper will forward to the *W function with either L'ä' or L'ה' depending on this setting. If you then call the *W function directly with the value (WCHAR)0x00E4, no translation happens.
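
In code, that last point looks like this (a sketch mixing both variants in one program):

    /* Sketch: mixing *A and *W calls in a single program. */
    #include <windows.h>

    int main(void)
    {
        /* Goes through the ANSI wrapper: the byte 0xE4 is translated according to
         * the "language for non-Unicode programs" setting - to L'ä' (U+00E4) on a
         * Western system, or to L'ה' (U+05D4) on a Hebrew one. */
        MessageBoxA(NULL, "\xE4", "via the *A call", MB_OK);

        /* Calls the Unicode API directly: no translation, always U+00E4 (ä). */
        MessageBoxW(NULL, L"\x00E4", L"via the *W call", MB_OK);

        return 0;
    }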

MSalters

A non-Unicode application is one that primarily uses a multi-byte encoding, where the strings are represented by char*, not wchar_t*:

char* myString;

By changing the encoding used, you change the character set available to the application.
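
For example, a small sketch showing the two flavours side by side:

    /* Sketch: the same text as a multi-byte ("ANSI") string and as a wide string. */
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* Multi-byte string: how the byte 0xE9 is displayed depends on the
         * code page chosen for non-Unicode programs (it is é under CP1252). */
        const char    *narrow = "caf\xE9";

        /* Wide string: on Windows, wchar_t holds 16-bit UTF-16 code units,
         * so U+00E9 (é) means the same thing regardless of that setting. */
        const wchar_t *wide   = L"caf\x00E9";

        printf("narrow is %zu bytes, wide is %zu code units\n",
               strlen(narrow), wcslen(wide));   /* prints 4 and 4 */
        return 0;
    }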

And most applications contain both instructions and data.

Alexander Rafferty
  • @Amigable Clark Kant: No, "multi-byte" is correct for the ANSI API and for using `char`. For instance, see the `MultiByteToWideChar` API, where `MultiByte` means non-Unicode and `WideChar` means Unicode. – RichieHindle Oct 07 '10 at 08:36
  • Answers and comments should explain that this is incorrect terminology created by Microsoft. The primary encoding for Unicode is UTF-8, a multibyte encoding, and there exist systems where wide character encoding is not Unicode. In fact, one could argue that it's not Unicode on Windows since Windows' `wchar_t` is too small to store arbitrary Unicode codepoints... – R.. GitHub STOP HELPING ICE Oct 07 '10 at 16:42
  • @Alexander Rafferty: So for the data segment, what is the internal encoding used in ANSI C? Is it not defined by C, or can we change it? – Alfred Oct 07 '10 at 19:33
  • @RichieHindle: MultiByte means multibyte, and WideChar means wide char. There are lots of systems out there using utf-8 for multibyte characters, and there's nothing in the C standard specifying that wide chars should be Unicode or ISO/IEC 10646. – ninjalj Oct 07 '10 at 23:41
  • @Guoqin: I hope you're not confusing ANSI C (roughly equivalent to ISO 9899, ISO C) with the Windows ANSI API, so called because some of the codepages used by Windows were based on drafts of ANSI standards. – ninjalj Oct 07 '10 at 23:49
  • @ninjalj: One could argue that the C standard does imply `wchar_t` *should* be Unicode via specifying the `__STDC_ISO_10646__` macro which is predefined when `wchar_t` is Unicode. – R.. GitHub STOP HELPING ICE Oct 08 '10 at 07:01
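
A tiny sketch to check those last two points on a given compiler: the width of wchar_t and whether __STDC_ISO_10646__ is defined.

    /* Reports the width of wchar_t and whether the implementation claims that
     * wchar_t values are ISO 10646 (Unicode) code points. */
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Typically 16 bits on Windows, 32 bits on most Unix-like systems. */
        printf("wchar_t is %u bits\n", (unsigned)(sizeof(wchar_t) * 8));

    #ifdef __STDC_ISO_10646__
        printf("__STDC_ISO_10646__ = %ld\n", (long)__STDC_ISO_10646__);
    #else
        printf("__STDC_ISO_10646__ is not defined by this implementation\n");
    #endif
        return 0;
    }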