
I am writing a program that outputs Chinese characters, using Dev-C++.

I've added -finput-charset=big5 -fexec-charset=big5 to the compiler parameters. I also set the console code page to 950 (Traditional Chinese).

It works perfectly with a simple cout like this:

cout << "中文字";

But with a character array it goes wrong, as expected:

char chin[] = "中文字";
cout << chin[0]; // outputs nothing
cout << chin[0] << chin[1]; // outputs the first Chinese character, since one Chinese character occupies 2 bytes

So I decided to use wchar_t instead, and I have to use wcout with wchar_t, or else a number is shown.

However, wcout shows nothing in the console. None of the following shows anything:

wcout << L"中文字";
wchar_t chin2[] = L"中文字";
wcout << chin2[0]; 

What did I miss in using wchar_t to output Chinese (or other East Asian) characters? I really don't want to write two array members to show a single Chinese character.

  • What about wchar_t * chin2 = L"中文字"; ? – Leo Chapiro Jul 29 '14 at 09:17
  • On Windows these things are trickier than they should be. If it works well with your first example then **your console is UTF8**. You should first use [`_setmode()`](http://msdn.microsoft.com/en-us/library/tw4k6df8(v=vs.110).aspx) to set `stdout` to UTF16 (because with `wchar_t` and `wcout` you're outputting UTF16). Moreover, you **must not** access characters by index: UTF16 is not a fixed-size encoding (some characters take 2 `wchar_t` elements), and this is especially true for Traditional Chinese. To print a single character you may need "some" extra code. – Adriano Repetti Jul 29 '14 at 09:35
  • @duDE warning already while compiling: [Warning] deprecated conversion from string constant to 'wchar_t*' [-Wwrite-strings] – Alexander Xanotos Jul 30 '14 at 02:01
  • @AdrianoRepetti No, the console is using code page 950 (Traditional Chinese). I tried _setmode() to switch to UTF-16, but it still shows nothing. – Alexander Xanotos Jul 30 '14 at 02:06
  • You may not be able to embed those characters in your source, even though your runtime may support them. Try using unicode character constants in the source file. – M.M Jul 30 '14 at 02:33
  • Try `wprintf("%s\n", L"中文字")`. – Siyuan Ren Jul 30 '14 at 02:58
  • I temporarily solved the problem by using 'typedef char cchar[3]; cchar chi[3]={"中","文","字"}; cout << chi[0] << chi[1] << chi[2] << endl;'. From some research and the comments on the answer below, I suspect the reason wchar_t shows nothing is that the Windows XP console does not support UTF-16, while there may be a font support problem for UTF-8 (I'm not sure about this, as I would expect some question marks in the output rather than nothing). – Alexander Xanotos Jul 31 '14 at 06:55
  • @C.R. My compiler strangely told me "cannot convert const char* to const wchar_t*". – Alexander Xanotos Jul 31 '14 at 07:11
  • @AlexanderXanotos: My bad. Try `wprintf(L"%s\n", L"中文字");`. – Siyuan Ren Jul 31 '14 at 11:16
  • @C.R Wow! I succeeded with: _setmode(_fileno(stdout), _O_U16TEXT); wprintf(L"%s\n",L"中文"); Thank you! Thanks to Adriano for the _setmode()! Thank you all for all the information! – Alexander Xanotos Aug 02 '14 at 01:59
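
The combination reported as working in the last comment can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive fix: `set_stdout_utf16`, `print_wide_line` and `kOnWindows` are hypothetical names, and the `#ifdef _WIN32` guard and the `%ls` conversion (instead of the `%s` used in the comment) are my own choices so that the snippet also compiles off Windows, where the mode switch simply does nothing.

```cpp
#include <cstdio>
#include <cwchar>
#ifdef _WIN32
#include <fcntl.h>  // _O_U16TEXT
#include <io.h>     // _setmode
#endif

// Hypothetical helper constant: true only when built for Windows.
#ifdef _WIN32
constexpr bool kOnWindows = true;
#else
constexpr bool kOnWindows = false;
#endif

// Switches stdout to UTF-16 text mode, as in the comment above.
// Returns true if the switch was made, false otherwise
// (including on non-Windows builds, where nothing is changed).
bool set_stdout_utf16() {
#ifdef _WIN32
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    return false;
#endif
}

// Prints a wide string followed by a newline through the wide stdio API.
// %ls is used here (an assumption on my part) because it denotes a wide
// string both in standard C and in the Microsoft runtime.
void print_wide_line(const wchar_t* text) {
    std::wprintf(L"%ls\n", text);
}
```

A program would call `set_stdout_utf16()` once at startup and then `print_wide_line(L"\u4E2D\u6587")`. After the switch it is probably safest to use only wide output functions on `stdout` and not mix in narrow `cout` or `printf` calls.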

1 Answer


There are subtle problems going on here.

The C++ compiler does not understand Big5 as anything more than a multi-byte encoding. When you create a source code file and display it, you may see your familiar Chinese characters, but the compiler sees a string of bytes. Big5 is a double-byte character set, so each Chinese character is represented by 2 bytes (two char elements) inside the compiler.
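
To illustrate the two-bytes-per-character point, here is a rough sketch that walks a Big5 byte string and groups the bytes into characters, so that one element of the result can be printed with a plain narrow `cout`. The names `is_big5_lead` and `split_big5` are hypothetical, and the lead-byte range 0x81 to 0xFE is my assumption based on code page 950; the sketch does not validate trail bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: in code page 950 a byte in 0x81..0xFE starts a two-byte character.
inline bool is_big5_lead(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

// Splits a Big5-encoded byte string into one std::string per character.
// ASCII bytes become one-byte strings; a lead byte takes the next byte with it.
// A dangling lead byte at the end of the input is kept as a one-byte string.
std::vector<std::string> split_big5(const std::string& bytes) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        const std::size_t len =
            (is_big5_lead(b) && i + 1 < bytes.size()) ? 2 : 1;
        chars.push_back(bytes.substr(i, len));
        i += len;
    }
    return chars;
}
```

With the compiler flags from the question, `split_big5("中文字")[0]` would hold both bytes of the first character, so `cout << split_big5(chin)[0]` prints it without indexing the raw array twice.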

When that string of bytes is fed to a suitable output device, the Chinese characters appear again. Code page 950 is compatible with Big5, so you see the "right" thing. But when you try to build on this, confusion is the result. Your second code sample uses L"" strings, but I expect each short (wchar_t) in those strings will contain only half a character.
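
One way to rule out source-encoding problems in wide literals, along the lines of the comment on the question that suggests Unicode character constants, is to spell the characters as universal character names. A minimal sketch, with `kChinese` and `wide_length` as hypothetical names:

```cpp
#include <cstddef>
#include <cwchar>

// The three characters written as universal character names, so the result
// does not depend on how the compiler decodes the source file.
const wchar_t kChinese[] = L"\u4E2D\u6587\u5B57";

// Number of wchar_t elements before the terminating null.
std::size_t wide_length(const wchar_t* s) {
    return std::wcslen(s);
}
```

If `wide_length` reports 3 for this literal but 6 for the same text typed directly into the source, that would suggest the compiler is not decoding the Big5 source bytes into wide characters.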

The only "safe" character set you can use is Unicode. Windows internals were historically UCS-2 (each character is a single short) but are now theoretically UTF-16 (the element is still a short, but a character may be a multi-element sequence). Not all existing software and older APIs fully support UTF-16 (or need to). Windows has very limited support for UTF-8 or other encodings. Everything gets converted into Unicode, so it is best to just leave it that way.
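
The difference between UCS-2 and UTF-16 comes down to surrogate pairs. The following sketch counts characters (code points) in a UTF-16 string; it uses `char16_t` rather than `wchar_t` so that it behaves the same on platforms where `wchar_t` is not 16 bits, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// UTF-16 surrogate ranges: a high surrogate followed by a low surrogate
// together encode one character outside the Basic Multilingual Plane.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points, treating each valid surrogate pair as one character.
// Unpaired surrogates are counted as one character each.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}
```

For the three characters in the question, the element count and the character count agree, which is the UCS-2 case. A character such as U+20000 occupies two elements, and indexing into the string by element would split it.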

In practice, you should build your C++ code with Unicode settings, for UCS-2, and exercise caution if you need characters that would require multi-element sequences. You should ensure that any source code you write and any input text files are identified as whatever encoding they need to be, but are translated into Unicode internally. Leave your console as the default Unicode encoding, and everything will just work.

It is almost impossible to sensibly use Big5 as an internal encoding in a Windows program. Best not to try.

david.pfx
  • All "wide" API functions expect UTF-16 (see msdn). Which ones exactly are lacking, according to you? – rubenvb Jul 29 '14 at 17:43
  • The default code page is already 950 (Traditional Chinese), as I changed the non-Unicode program setting in Regional and Language Options, and theoretically Dev-C++ can understand Big5 source code because I added -finput-charset=big5 to the compiler settings. – Alexander Xanotos Jul 30 '14 at 02:12
  • @AlexanderXanotos: It appears your compiler understands Big5 as a multi-byte encoding, not as a wide encoding, and this works because your console decodes the byte sequences. I am saying that the only practicable wide encoding is Unicode, and that you need to configure your compiler to convert Big5 into Unicode for wide to work. – david.pfx Jul 30 '14 at 02:34
  • @rubenvb: Older software, older APIs and older versions of Windows may have either limited or no support for multi-element (UTF-16) Unicode. Single element (UCS-2) Unicode works everywhere (since NT 3.x), multi-element Unicode may not. – david.pfx Jul 30 '14 at 02:36
  • Are you just saying things without knowing whether they are true? I'm not saying I don't believe you (Microsoft pulls this kind of thing) but right now you're just spreading FUD. Nobody uses anything pre-XP anymore, and even XP is officially dead. – rubenvb Jul 30 '14 at 05:28
  • @rubenvb: There are plenty of software products and APIs produced in the last 10-15 years which do not correctly support UTF-16 surrogates or characters outside the basic plane. If you look, you'll find, but this is not something to debate here in the comments. Ask a good question and I might be tempted to answer it. – david.pfx Jul 30 '14 at 06:49