There are a couple of string encoding features at play here. Namely:
1. Character encoding
There are many ways of encoding strings, and `char` does not imply 1-byte characters. Multibyte character sets (MBCS) existed for decades before Unicode, and that is probably how your compiler is interpreting the literal Chinese characters. If you look at the memory that represents this string, you'll almost certainly see that each character is represented by more than one byte.
This is a common source of headaches, though, and the reason Unicode was conceived. Everything needs to use the same character encoding for the string to be represented properly: the text file saved on disk, your compiler, the code that handles the string (and all libraries, like the `std::` string types), the stream you're writing to, the font... everything needs to agree on the encoding.
We avoid this headache in modern times by using some form of Unicode.
The shortest answer here, though, is that it depends on how your compiler interprets your source. It's implementation-defined, and there is normally a compiler-specific way of controlling this behavior (for MSVC: `/utf-8`).
This means your second example, which does assume that each character fits in a single byte, could only succeed if your compiler used an encoding where these characters are one byte each, and no such single-byte encoding exists for Chinese characters. The compiler will thus truncate each character constant to one byte, and you'll get garbage.
2. Null-termination
Strings are generally null-terminated in C and C++, meaning that after the last character, a value of `0` marks the end of the string. A string like `"abc"` is represented in memory as 4 bytes: `'a'`, `'b'`, `'c'`, `0`.
In your first example, the compiler automatically adds the null-terminating character for you.
In your second example, there is no null terminator. So when you print the string to the console, the printing routine doesn't know how long your string is, and it reads past the end of the array (undefined behavior) until it happens to find a zero byte in garbage memory.