
When I run the chcp command in a cmd.exe window, it shows the code page that Windows is currently using.

I thought Windows uses the Unicode character set.

So, my questions are:

  1. Why does Windows use ANSI codepages instead of Unicode?

  2. Does Windows use UTF-16 or UCS-2? Can I check this (via a command or an MSDN link)?

  3. Are UTF-16 and UCS-2 just encodings, or are they also character sets?

  4. Do UTF-8, UTF-16, UTF-32, etc. have different character set sizes?

I'm so confused. Please, somebody, define these terms.

Remy Lebeau
JaeHyeok Kim
  • Changing the console's code page only affects non-Unicode applications. AFAIK the console still only supports UCS-2, but of course most Windows applications are GUI and don't use the console anyway. – Harry Johnston Oct 11 '17 at 00:24
  • Character cells in the console use a 16-bit character code. This limits what it can display to the BMP. A UTF-16 surrogate pair can be written to adjacent cells, in which case they'll show up as two default glyphs, such as a boxed question mark. FWIW, you can copy and paste the surrogate pair to another window. The console also doesn't use Uniscribe or DirectWrite, so there's no support for complex scripts, combining characters, and automatic fallback fonts. You can improve glyph coverage with manual font linking in the registry. – Eryk Sun Oct 11 '17 at 00:53
  • *1) Why does Windows use ANSI Code page instead of UNICODE?* Really, the console uses both the Unicode and the multibyte APIs. All internal functions use Unicode, and text is displayed as Unicode. The code page is only used to translate input/output between Unicode and multibyte. If we call `WriteConsoleW`, the text is displayed as-is and the current code page has no effect. If we call `WriteConsoleA`, the text is first translated to Unicode via `MultiByteToWideChar`, with the code page passed as its first argument. So the result of an `A` API call depends on the current code page, while a `W` call does not (see the sketch after these comments). And `chcp` only has an effect for the current `cmd.exe`. – RbMm Oct 11 '17 at 01:16
  • @RbMm, I assume you mean for the current console, not just a CMD shell that's attached to the console. CMD is just a console client application, like any other console application. chcp.com is a simple console app that calls `GetConsoleCP`, `SetConsoleCP` and `SetConsoleOutputCP`. It doesn't allow setting the output codepage independent of the input codepage. Notably, the console's input and output codepages are used when using it as a generic file via `ReadFile` and `WriteFile`, for which UTF-16LE (codepage 1200) is not supported. – Eryk Sun Oct 11 '17 at 01:36
  • Thanks for your response. I added a 4th question; please respond to that too. – JaeHyeok Kim Oct 11 '17 at 07:47
  • @eryksun - yes :) To be exact, I mean *conhost.exe* (the console server process), to which *cmd.exe* and *chcp.com* are both attached. A call to `SetConsole[Output]CP` from any process attached to the console (*conhost.exe*) leads to a call to `SrvSetConsoleCP` in *conhost.exe*, which is what actually sets the code page. So the code page is just a variable/state in *conhost.exe*, and it affects the processes attached to it. If we launch a new cmd from the current one, it is affected too (it attaches to the same *conhost.exe*), but a cmd launched from Explorer gets a separate *conhost.exe* and is not affected. – RbMm Oct 11 '17 at 07:58
  • So the code page (in the console server process *conhost.exe*) is a variable used to perform the multibyte <-> Unicode conversion when the ANSI API versions are used, or when reading/writing the console as a file. But these are implementation details: if we use the *W* API to interact with the console, there is no conversion and the current code page has no effect at all. – RbMm Oct 11 '17 at 08:03
  • [*Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16*](https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx) – RbMm Oct 11 '17 at 08:12
  • and [Surrogates and Supplementary Characters](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx) – RbMm Oct 11 '17 at 08:14
  • @RbMm, generally we shouldn't consider undocumented implementation details, but we need to be aware of the bugs. For example, using 65001 (UTF-8) for the output codepage was buggy prior to Windows 8, in that `WriteFile` and `WriteConsoleA` returned the number of UTF-16 codes written instead of the number of bytes written. Even worse, setting the input codepage to 65001 fails at reading input beyond 7-bit ASCII, even in Windows 10 Creators update, due to static assumptions about the number of ANSI bytes per character when sizing the internal buffer used for the `WideCharToMultiByte` call. – Eryk Sun Oct 11 '17 at 11:19
  • @RbMm, another internal change (IMO not really a bug) is that the new console in Windows 10 no longer calls `MultiByteToWideChar` (for `WriteConsoleA` / `WriteFile`) with the flag `MB_USEGLYPHCHARS`. The old console implementation used this flag to substitute the classic OEM PC glyph characters for ASCII control characters. Arguably this is an enhancement, since the screen buffer in the new console contains exactly the ASCII characters written to it instead of implicitly substituted characters. – Eryk Sun Oct 11 '17 at 11:22
  • Your second sentence contradicts your first question. Please clarify. – user207421 Oct 11 '17 at 21:54
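To make the `WriteConsoleA` vs `WriteConsoleW` behavior described in these comments concrete, here is a minimal C++ sketch (not from the original thread; the strings, and the assumption of an attached console, are illustrative):

```cpp
#include <windows.h>

int main() {
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;

    // W API: UTF-16 text is displayed as-is; the current code page
    // (and therefore chcp) has no effect on this call.
    const wchar_t wide[] = L"caf\u00E9\n"; // "café", 5 UTF-16 code units
    WriteConsoleW(out, wide, 5, &written, nullptr);

    // A API: these bytes are first converted to UTF-16 via
    // MultiByteToWideChar using GetConsoleOutputCP(), so what you see
    // depends on the active code page (compare chcp 1252 vs chcp 437).
    const char narrow[] = "caf\xE9\n"; // 0xE9 is 'é' only in CP1252
    WriteConsoleA(out, narrow, 5, &written, nullptr);
    return 0;
}
```

Running this under `chcp 1252` and then `chcp 437` should leave the first line unchanged while the accented character in the second line changes, which is exactly the A-vs-W difference RbMm describes.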

1 Answer

  1. Historical reasons, and backwards compatibility. Windows itself is a Unicode-based OS, and has been since the NT days. But many legacy (and even current) apps are not written for Unicode. Unicode-enabled apps do not use ANSI codepages, unless they need to convert runtime data between ANSI and Unicode.

  2. Microsoft switched to UTF-16 in Windows 2000. Before that, it used UCS-2. See Unicode in Microsoft Windows.

  3. Both UTF-16 and UCS-2 are just encodings of the same Unicode character set. UTF-16 was invented to support encoding codepoints above U+FFFF, which UCS-2 cannot handle.

  4. All UTFs (including many you haven't named) are just encodings of the same Unicode character set. The number in the name is the size in bits of the encoded code units (UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, etc). See the sketch just below this list.
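To illustrate point 4, here is a small C++ sketch (an illustration added here, not part of the original answer) that hand-encodes the single codepoint U+1F600 in each of the three common UTFs; the values are the standard encodings of that codepoint:

```cpp
#include <cstdio>

int main() {
    // U+1F600 (grinning face emoji), hand-encoded in each UTF.
    // Same Unicode character set, different code unit sizes and counts:
    const unsigned char utf8[]  = { 0xF0, 0x9F, 0x98, 0x80 }; // 4 x 8-bit units
    const char16_t      utf16[] = { 0xD83D, 0xDE00 };         // 2 x 16-bit units (a surrogate pair)
    const char32_t      utf32[] = { 0x0001F600 };             // 1 x 32-bit unit

    std::printf("UTF-8 : %zu code units of %zu bits\n",
                sizeof utf8  / sizeof utf8[0],  8 * sizeof utf8[0]);
    std::printf("UTF-16: %zu code units of %zu bits\n",
                sizeof utf16 / sizeof utf16[0], 8 * sizeof utf16[0]);
    std::printf("UTF-32: %zu code units of %zu bits\n",
                sizeof utf32 / sizeof utf32[0], 8 * sizeof utf32[0]);
    return 0;
}
```

All three arrays represent the same character; only the number and width of the code units differ, which is why the UTFs do not differ in character set size.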

Remy Lebeau
  • UTF-16 is a character encoding. UCS-2 is a character set. When work on Windows NT started, they were essentially the same thing (numerically, not semantically). The distinction wasn't all that important until Windows 2000, as you point out in points 2 and 3. – IInspectable Oct 26 '17 at 21:14
  • "Unicode-enabled apps do not use ANSI codepages" how do you "unicode-enable" an app? I can't find anything about that – Barnack May 14 '19 at 13:30
  • @Barnack by using Unicode strings and Unicode APIs in your code, instead of ANSI strings and ANSI APIs. Start by making sure that your project is configured to use the Unicode character set, so that the `UNICODE` and `_UNICODE` conditionals are defined during compiling, making all `TCHAR`/`_TCHAR`-based variables and C/Win32 APIs use `wchar_t` instead of `char`. Check your compiler's documentation for more details. – Remy Lebeau May 14 '19 at 17:00
  • @RemyLebeau `wchar_t` does a somewhat better job at handling Unicode just because it holds up to two bytes, but that is still not full support for a Unicode encoding. And it certainly excludes UTF-8, which is what you should be using in a program that handles mostly Western strings, to avoid wasting memory. – Barnack May 14 '19 at 17:05
  • @Barnack the OP's question pertains to Windows. And on Windows, Unicode is handled by `wchar_t` and UTF-16. While you can certainly use UTF-8 in your code if you want to, you will have to convert to/from UTF-16 when interacting with the OS. – Remy Lebeau May 14 '19 at 17:11
  • @RemyLebeau what i'm saying is wchar_t just handles 2 byte characters, and will happen to work for utf-16 characters that take up to 2 bytes. Won't it have issues for ones that take more? – Barnack May 14 '19 at 17:27
  • @Barnack No, because UTF-16 uses 2 16-bit values (known as a surrogate pair) to represent codepoints above U+FFFF, and those surrogates fit perfectly fine in 2-byte `wchar_t`s. It is no different than UTF-8 using more than 1 8-bit `char` to represent codepoints above U+007F. All UTFs support the full range of Unicode codepoints (U+0000..U+10FFFF). – Remy Lebeau May 14 '19 at 18:21
  • @RemyLebeau so on Windows `wchar_t` is always interpreted as UTF-16, while `char` is never interpreted as UTF-8, and there's no way to make the OS interpret a `char` string as UTF-8. Did I understand correctly? – Barnack May 14 '19 at 18:22
  • @Barnack `wchar_t` was interpreted as UCS-2 prior to Windows 2000, but since then `wchar_t` is interpreted as UTF-16, yes. As for `char` and UTF-8, most versions of Windows do not understand UTF-8 (other than a few isolated cases, such as the `MultiByteToWideChar()`/`WideCharToMultiByte()` APIs, extensions to `fopen()`, the `chcp` command of the `cmd` console, etc; see the sketch below these comments). But in Windows 10 insider build 17035, Microsoft finally added UTF-8 codepage support to the legacy Win32 ANSI APIs (i.e., to interpret `char` strings as UTF-8 instead of as ANSI), though the feature is currently in beta. – Remy Lebeau May 14 '19 at 18:30
  • Under the advanced language settings there is a (beta) option to change to UTF-8. Change that and see how installation software, and Windows itself, crumbles. It's not only Microsoft but also all the legacy third-party software that prevents successful adoption of Unicode or UTF-8 in Windows. – theking2 Jan 15 '23 at 20:40
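Here is a minimal sketch (an illustration, not from the thread) of the conversion path these comments describe, round-tripping one supplementary codepoint between UTF-8 and UTF-16 with the Win32 APIs named above; the specific codepoint is just an example:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    const char utf8[] = "\xF0\x9F\x98\x80"; // U+1F600 in UTF-8 (4 bytes)

    // UTF-8 -> UTF-16: the supplementary codepoint becomes a surrogate
    // pair, i.e. two 2-byte wchar_t units, as discussed above.
    wchar_t utf16[4] = {};
    int units = MultiByteToWideChar(CP_UTF8, 0, utf8, 4, utf16, 4);
    std::printf("%d UTF-16 units: 0x%04X 0x%04X\n",
                units, utf16[0], utf16[1]); // expected: 2, 0xD83D 0xDE00

    // UTF-16 -> UTF-8: back to the original 4 bytes.
    char back[8] = {};
    int bytes = WideCharToMultiByte(CP_UTF8, 0, utf16, units,
                                    back, sizeof back, nullptr, nullptr);
    std::printf("%d UTF-8 bytes\n", bytes); // expected: 4
    return 0;
}
```

This is one of the isolated cases where `char` data can carry UTF-8 on Windows: the conversion APIs accept `CP_UTF8` even where the rest of the ANSI API surface does not.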