Why can the `const char *` type in C++ store Unicode?

Question

I can write code like this:

const char * a = "你好";
cout<<a;

But when I write it like this:

char a[] = {'你','好'};
cout<<a;

It outputs garbled characters.

I thought Chinese characters had to be stored in wchar_t, so how can a const char * contain Chinese characters?

@ratsafalig How many is there chinese chars? And what's the size_of char -> which gives you how many unique chars you can store? — Verthais, Mar 17 '20 at 08:46
@idclev 463035818 but if so,how `cout` recognize wich is single char and which isn't? — ratsafalig, Mar 17 '20 at 08:52
I dont know. Btw please include the output in the question, not that I dont believe you, but when I try to compile that line I get errors. Before your edit it also was an error, now it is "garbled codes", what is it really? Please show the output — 463035818_is_not_an_ai, Mar 17 '20 at 08:54
@interjay sorry,did you mean the second isn't end with '\0' or? but even if i add the '\0',it still don't output as i expected — ratsafalig, Mar 17 '20 at 09:05

phuclv · Answer 1 · 2020-04-11T16:59:49.790

When you write char a[] = {'你','好'}; you declare a char array of 2 elements (i.e. 2 chars). Since it is not null-terminated, it is not a string that cout can print properly, and attempting to print it invokes undefined behavior. But even if you add a null terminator, as in { '你', '好', '\0' };, it still won't work because a 1-byte char can't store a Chinese character. In fact, if the content between the two single quotes is longer than 1 byte (like 'abcd' or '你' in this case), the value of the literal is implementation-defined. See Multicharacter literal in C and C++

However, if you enclose the characters in double quotes, "你好" is a null-terminated string literal, though not a 3-byte one: it is a longer byte sequence in some encoding. The C++ standard doesn't specify which encoding is used in a string literal, but it's generally whatever bytes were saved in the source file in its encoding, which is often the current ANSI code page on Windows and UTF-8 on Linux. std::string stores its contents as a sequence of char, so the same applies to it

UTF-8 is a variable-length encoding whose code unit is a byte, like other multi-byte encodings, so its underlying representation can be a char[] array, and in UTF-8 "你好" is a string of 6 code units. You can check that with strlen(). On the other hand, cout knows nothing about those characters and doesn't care whether a character is a single byte or longer. It simply passes the byte stream to the terminal, and it's the terminal's job to display them on the screen. A program that needs to can still easily determine how long each character is, just like terminals and text editors do, because that is defined by the character encoding

There are other character types in C++: wchar_t, char8_t (since C++20), char16_t and char32_t. Their corresponding string types are std::wstring, std::u8string, std::u16string and std::u32string.

Just like char*, the encoding in wchar_t* is not defined by the standard, but it's often UTF-16 on Windows and UTF-32 on Linux. It's recommended to use char8_t, char16_t and char32_t, which mandate UTF-8, UTF-16 and UTF-32 encoding respectively, regardless of compiler settings and source file encoding

To convert between encodings you can use std::codecvt.
There are also the converters std::wstring_convert / std::codecvt_utf8 / std::codecvt_utf16 / std::codecvt_utf8_utf16 (deprecated since C++17) and the conversion routines of each system: iconv on Unix and WideCharToMultiByte / MultiByteToWideChar on Windows, but it's better to use standard functions for portability


is there any way to convert `const wchar * a` which contains chinese characters into a `const char * b`? — ratsafalig, Mar 17 '20 at 09:21
@ratsafalig there's no `wchar`, only `wchar_t`, but yes you can convert the encoding because you just need to know the encoding rules — phuclv, Mar 17 '20 at 09:28
@ratsafalig there are plenty of APIs/libraries readably available to handle such conversions (`WideCharToMultiByte()` on Windows, `std::wstring_convert` in C++11..C++17, iconv, ICU, etc) — Remy Lebeau, Mar 17 '20 at 18:13

tenfour · Answer 2 · 2020-03-17T09:19:37.697

There are a couple of string encoding features at play here. Namely:

1. Character encoding

There are many ways of encoding strings. A char is one byte, but that does not mean every character fits in one char. Multibyte character sets (MBCS) existed for decades before Unicode, and this is probably how your compiler is interpreting the literal Chinese characters. If you look into the memory that represents this string, you'll almost certainly see that the characters are represented by more than 1 byte each.

This is a common source of headache though, and the reason Unicode was conceived. Everything needs to be using the same character encoding for proper string representation. Between your text file saved on disk, your compiler, your code which handles the string (and all libraries like std::), the stream you're writing to, the font... everything needs to agree on the encoding.

We avoid this headache in modern times by using Unicode of some form.

The shortest answer though here is that it depends on how your compiler is interpreting your source. It's implementation-defined, and normally there is a compiler-specific way of specifying this behavior (for msvc: /utf-8).

This means your second example, which does assume that characters are 1 byte each, could only succeed if your compiler were operating with an encoding where these characters fit into a single byte, which I suspect is impossible. The compiler will thus truncate each value to 1 byte, and you'll get basically garbage.

2. Null-termination

Strings are generally null-terminated in C and C++, meaning that after the last character, a value of 0 indicates the end of the string. A string like "abc" is represented in memory as 4 bytes: 'a', 'b', 'c', 0

In your first example, the compiler automatically adds the null-terminating character for you.

In your second example, there is no null terminator. So when you print the string to the console, the printing routine doesn't know how long your string is, and prints until it finds a null in garbage memory.

VLL · Answer 3 · 2020-03-17T09:16:47.280

When you write a string literal in your code, using characters longer than 1 byte, it is converted for you by the compiler. Consider this:

const char * a = "你好";
cout << strlen(a); // Prints 6

std::cout prints the bytes as is, and the characters are recognized by the Windows terminal.

With the character array, similar conversion might not be done, even if you add the missing zero. This is implementation-defined behavior. For example, with the compiler I used, each character is interpreted as a multi-character literal of type int, and then truncated to 1-byte char type.

Here's an [example](https://godbolt.org/z/TFNkom) to accompany this answer. – Ted Lyngmo Mar 17 '20 at 09:16
I now have a `const wchar_t * a` which contains the data,how can i get the data and convert the type into a `const char * b`?(the `const wchar_t * a` have characters longer than 1 byte) – ratsafalig Mar 17 '20 at 09:18
@ratsafalig You should use a conversion function such as `WideCharToMultiByte`: https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte – VLL Mar 17 '20 at 09:20
