
I used to think that C++ std::string could only support ASCII, and that other character sets, for example French or Japanese, would need an encoding such as UTF-8 or UTF-16.

I just tried the code below, and it seems that std::string already supports Japanese and French characters. I guess it supports all other languages too.

How is this possible? Does it mean that std::string is all we need to handle every human language?

string s = "今年1年の世相を4字で振り返る恒例の「創作四字熟語」の優秀・入選50作品を発表した";
string t = "Vélo, sac, appareil photo: bleu en un «Klein» d'œil pour Noël";
cout << s << '\n';
cout << t << '\n';

Output:

今年1年の世相を4字で振り返る恒例の「創作四字熟語」の優秀・入選50作品を発表した

Vélo, sac, appareil photo: bleu en un «Klein» d'œil pour Noël

lst
  • C++ strings are just a container of bytes. If your source file has some sort of encoding (say UTF-8) and the compiler passes the bytes through as-is, and the `cout` passes the bytes through as-is, and the environment you are using can interpret those bytes (again, say UTF-8) and use them to display the expected characters... voila! All is well. If any part of that end-to-end chain breaks down, it'll fail. – Eljay Dec 17 '18 at 12:19
  • Note that `std::string` is a typedef for `std::basic_string<char>`. The full template is `template <class charT, class traits = char_traits<charT>, class Alloc = allocator<charT>> class basic_string`. You can instantiate it with your own character type that has its own defined properties. – Pete Becker Dec 17 '18 at 12:47
  • I tried the Xcode debugger, and this is the info: Printing description of s.__r_: (std::__1::__compressed_pair, std::__1::allocator >::__rep, std::__1::allocator >) __r_ = { std::__1::__compressed_pair_elem, std::__1::allocator >::__rep, 0, false> = { __value_ = { = { __l = (__cap_ = 129, __size_ = 123, __data_ = "今年1年の世相を4字で振り返る恒例の「創作四字熟語」の優秀・入選50作品を発表した") – lst Dec 18 '18 at 12:20
  • __s = { = (__size_ = '\x81', __lx = '\x81') __data_ = { [0] = '\0' [1] = '\0' [2] = '\0' [3] = '\0' [4] = '\0' [5] = '\0' [6] = '\0' [7] = '{' [8] = '\0' [9] = '\0' [10] = '\0' [11] = '\0' [12] = '\0' [13] = '\0' [14] = '\0' [15] = '\xa0' [16] = '\x01' [17] = 'p' [18] = '\0' [19] = '\x01' [20] = '\0' [21] = '\0' [22] = '\0' } } __r = { __words = ([0] = 129, [1] = 123, [2] = 4302307744) } } } } } – lst Dec 18 '18 at 12:24
  • `__size_ = 123` means the data is 123 bytes long for "今年1年の世相を4字で振り返る恒例の「創作四字熟語」の優秀・入選50作品を発表した". There is also: Printing description of s.__r_.__value_.__l.__data_: (std::__1::basic_string, std::__1::allocator >::pointer) __data_ = 0x00000001007001a0 "今年1年の世相を4字で振り返る恒例の「創作四字熟語」の優秀・入選50作品を発表した". The 0x00000001007001a0 is the address in memory where the Japanese characters are stored. So it seems that std::string is just a container and does not care about the actual encoding? – lst Dec 18 '18 at 13:14
  • @Eljay Literals are emitted by the compiler using the compilation's execution character encoding. @lst I have never seen a C++ compiler where either the source character encoding or execution character encoding could be set to ASCII. – Tom Blodget Dec 18 '18 at 21:46
  • @TomBlodget I tried below code: cout< – lst Dec 20 '18 at 12:44
  • There is no text but encoded text. In a text file or text data structure, you would use exactly one character encoding. For string literals from source code, the compiler uses the character encoding you specify as the "execution charset". You would want libraries, including cout, to know what that is. – Tom Blodget Dec 21 '18 at 02:43

1 Answer


A std::string can support an arbitrary byte stream, including UTF-8, which is what you're seeing here. On the input side, your compiler evidently supports it, and on the output side your terminal program does.

Where things might break down is if your code assumes that one char in your std::string corresponds to one character on the screen. That is not true for UTF-8, as you probably already know.

Paul Sanders
  • Does the compiler need to support the string's encoding, or is editor support enough? I copied the example sentence into a new file on OS X and saved it, and the file size is 123 bytes. When the sentence is assigned to string s, are just the 123 raw bytes copied into s? I guess the compiler has nothing to do with the content? When the application runs and tries to display the output, is the essential element that the OS supports the font? Can it be understood that as long as the font is present in the OS, the terminal program can display it? – lst Dec 20 '18 at 12:49
  • The compiler doesn't really need to do anything special, just accept character string literals verbatim. The editor does need to support it, of course, as does the OS. – Paul Sanders Dec 20 '18 at 13:15