3

This is my first attempt at dealing with multiple languages in a program. I would really appreciate it if someone could point me to some study material and explain how to approach this type of problem.

The question is about representing a string that contains text in multiple languages. For example, think of a string that has "Hello" in many languages, all comma-separated. What I want to do is separate these words. So my questions are:

  1. Can I use std::string for this or should I use std::wstring?
  2. If I want to tokenize each of the words in the string and put them into a char*, should I use wchar_t? But characters in some encodings, such as the UTF family, can be bigger than what wchar_t can hold.
  3. Overall, what is the 'accepted' way of handling this type of case?

Thank you.

madu

1 Answer

2

Can I use std::string for this or should I use std::wstring?

Both can be used. If you use std::string, the encoding should be UTF-8, which avoids the null bytes that UTF-16, UCS-2 etc. would put into the string. If you use std::wstring, you can also use encodings that need larger code units to represent individual characters, i.e. UCS-2 and UCS-4 will typically be fine, but strictly speaking this is implementation-dependent (wchar_t is 16 bits wide on Windows and 32 bits wide on most Unix-like systems). In C++11, there is also std::u16string (good for UTF-16 and UCS-2) and std::u32string (good for UCS-4).

So, which of these types to use depends on which encoding you prefer, not on the number or type of languages you want to represent.

As a rule of thumb, UTF-8 is great for storing large texts, while UCS-4 is best if memory footprint does not matter so much but you want character-level iteration and position arithmetic to be convenient and fast. (Example: Skipping n characters in a UTF-8 string is an O(n) operation, while it is an O(1) operation in UCS-4.)
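To make the O(n) versus O(1) difference concrete, here is a minimal sketch. The function names `utf8_offset_of` and `utf32_offset_of` are made up for this illustration, and the UTF-8 version assumes the input is valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Byte offset of the n-th code point (0-based) in a UTF-8 string, or
// s.size() if the string holds fewer than n code points.
// Hypothetical helper; assumes s is valid UTF-8.
std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;  // step past the lead byte of the current code point
        // Continuation bytes have the bit pattern 10xxxxxx; skip them.
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
        --n;
    }
    return i;  // had to walk over every code point before position n
}

// In UTF-32 (UCS-4) every element is one code point, so the position
// is simply the index: no scanning required.
std::size_t utf32_offset_of(const std::u32string& s, std::size_t n) {
    return n < s.size() ? n : s.size();
}
```

Note that both functions count code points, not what a user perceives as a character; a base letter followed by a combining mark still counts as two.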

If I want to tokenize each of the words in the string and put them into a char*, should I use wchar_t? But characters in some encodings, such as the UTF family, can be bigger than what wchar_t can hold.

I would use the same data type for the words as I would use for the text itself. I.e. words of a std::string text should also be std::string, and words from a std::wstring should be std::wstring.
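For the comma-separated "Hello" example from the question, a rough sketch with UTF-8 in std::string could look like this. The name `split_on_comma` is invented for the example, and it assumes the separator is the plain ASCII comma (not a full-width comma such as "，"). Splitting on the byte value of ',' is safe in UTF-8 because every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character.

```cpp
#include <string>
#include <vector>

// Splits UTF-8 text at each ASCII comma and returns the pieces as
// std::string, i.e. the same type as the input text.
// Hypothetical helper; an empty input yields one empty word.
std::vector<std::string> split_on_comma(const std::string& text) {
    std::vector<std::string> words;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type comma = text.find(',', start);
        if (comma == std::string::npos) {
            words.push_back(text.substr(start));  // last word
            break;
        }
        words.push_back(text.substr(start, comma - start));
        start = comma + 1;  // continue after the comma
    }
    return words;
}
```

Each word keeps its original bytes, so a Japanese word of five characters comes back as a 15-byte std::string.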

(If there really is a good reason to switch from a string datatype to a character-pointer datatype, then of course char* is right for std::string and wchar_t* is right for std::wstring. Similarly, for the C++11 types there are char16_t* and char32_t*.)

Overall, what is the 'accepted' way of handling this type of case?

The first question you need to answer for yourself is which encoding you want to use for storage and processing. In highly international settings, only Unicode encodings are truly eligible, but there is still more than one to choose from: UTF-8, UCS-2 and UCS-4 are the most common ones. As described above, which one you choose has implications for memory footprint and processing speed, so think carefully about what types of operations you need to perform. For optimal space and time behavior, it may be necessary to convert from one encoding to another at certain points in your program. Once you know which encoding you want to use in each part of the program, choose the data type accordingly.

Once encoding and data types have been decided, you might also need to look into Unicode normalization. In many languages, the same character (or character/diacritics combination) can be represented by more than one sequence of Unicode code points (esp. when combining characters are used). To deal with these cases properly, you may need to apply Unicode normalizations (such as NFKC) to the strings. Note that there is no built-in support for this in the C++ Standard Library.
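A small sketch of why this matters: the two strings below both display as "é", but they are different code point sequences, so a plain comparison treats them as different. The names `precomposed`, `decomposed` and `same_code_points` exist only for this illustration. Making them compare equal would take a normalization step from an external library (ICU is one commonly used option), which is not shown here.

```cpp
#include <string>

// "é" as a single precomposed code point, U+00E9.
const std::u32string precomposed = U"\u00E9";

// "é" as the letter "e" followed by U+0301 (combining acute accent).
const std::u32string decomposed = U"e\u0301";

// Plain comparison looks only at the code point sequences, so it
// cannot tell that the two strings represent the same text.
bool same_code_points(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

Here `same_code_points(precomposed, decomposed)` is false even though a reader would see the same character twice.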

jogojapan
  • Thank you very much for your detailed answer jogojapan. I want to clarify one more thing. How can I specify that my string is UTF-8 encoded? For example, on my machine, I have Japanese characters installed. If I write a C++ program which has both English and Japanese character strings, and use std::string for them, it would work on my machine. What if I run this same program on another machine? How would it know that my string is UTF-8 encoded? How can I specify to the system that my program is using strings which are encoded in UTF-8? Does that question make sense? – madu Dec 02 '13 at 04:33
  • 1
    It depends on how you generate these strings. If the strings are part of your C++ program, they are generated by the editor (or IDE) you use to create your program. There should be an option somewhere to choose which encoding you want to use when saving the file. If you can't find it, you can try loading the file in another editor (or in a web browser) that allows you to choose the display encoding. If you choose UTF-8 and then cannot read the characters, you know that the file was saved in a different encoding. – jogojapan Dec 02 '13 at 04:38
  • Thank you jogojapan. So it seems that if I'm using UTF-8 encoding, and other systems which will use this program also have UTF-8 support, I do not need to do any special handling to have a string with multiple languages? But when I write the program, suppose my IDE is using UTF-8 encoding, what would happen when my executable is run on another machine? For example, if I create this program as a Win32 console EXE, and then run it on a different Windows machine, how does that machine know to interpret the strings as UTF-8? That information is not included in the executable, is it? Thank you. – madu Dec 02 '13 at 04:58
  • 1
    If the strings are part of the program (i.e. if they are defined as literals, e.g. if you have code like `std::string s = "文字列";` _in your code_), then the strings will be part of the executable too. They will have the same encoding on every machine you run it on. – jogojapan Dec 02 '13 at 05:00
  • 1
    You may have problems when you _compile_ your code on a different platform. The compiler there may reject the UTF-8 encoded strings in your code. In that case it's best to store the strings in a separate file and load that file from your code. You can then be sure that the encoding will be the same on every platform. – jogojapan Dec 02 '13 at 05:04
  • Thank you very much jogojapan. It's pretty clear now. Just a bit confused as to how the executable will have no issue with the encoding. I imagined the system should be aware of what encoding was used during compilation to decode back the string (and display it correctly on the console). Appreciate your help! – madu Dec 02 '13 at 06:16
  • 1
    If you have a program like `std::cout << "文字" << std::endl;`, then the compiler will compile the byte sequence of "文字" into the program. If you transfer this to another computer and execute it there, the exact same sequence of bytes (same encoding) will be produced there. If the console there doesn't support UTF-8, or doesn't have the right font, or whatever, then the string will be displayed wrongly. Neither the OS nor the console will try to guess the encoding, decode it to something else and then output it. – jogojapan Dec 02 '13 at 06:19
  • Thank you! That's what was confusing. – madu Dec 02 '13 at 06:42
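
As an illustration of the point made in the comments (the literal is just a fixed byte sequence inside the executable), here is a small sketch that shows the bytes of "文字" when it is encoded as UTF-8. The helper name `to_hex` is invented for this example, and the literal is written with hex escapes so that the result does not depend on how the source file itself was saved.

```cpp
#include <string>

// Formats every byte of the string as two uppercase hex digits,
// separated by spaces. Hypothetical helper for inspection only.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0F];  // low nibble
    }
    return out;
}

// "文字" spelled out as its six UTF-8 bytes (U+6587, U+5B57).
const std::string moji = "\xE6\x96\x87\xE5\xAD\x97";
```

Calling `to_hex(moji)` gives `E6 96 87 E5 AD 97` on every machine. Whether those six bytes show up as "文字" on screen depends only on the console interpreting its input as UTF-8 and having a suitable font.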