1

Recently I have been dealing with converting UTF-8 encoded data to strings and vice versa. I understand that UTF-8 can represent almost every character in the world, whereas char, the built-in data type used for strings, can only store ASCII values. A character in UTF-8 takes anywhere from one to four bytes in memory, but a 'char' is usually one byte.

My question is: what happens in a conversion from wstring to string, or from wchar to char? Are characters that require more than one byte skipped? It seems to depend on the implementation, but I want to know the correct way of doing it.

Also, is wchar required to store Unicode characters? As far as I understand, Unicode characters can be stored in a normal string as well. Why should we use wstring or wchar?

evk1206
  • `char` is not an encoding, but a data-type. And there is no conversion defined, only a plethora of conversion-functions, and you have to pick the appropriate one. – Deduplicator Dec 01 '14 at 09:26
  • @Deduplicator: Thanks for correcting the mistake. Do we need the wchar/wstring types for UTF-8 encoding? I understood that we can use a normal string or char – evk1206 Dec 01 '14 at 09:32

2 Answers

4

It depends on how you convert them.
You need to specify the source encoding and the target encoding.
wstring is not a format, it just defines a data type.
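
To illustrate that the type and the encoding are separate things, here is a minimal sketch of a plain std::string holding UTF-8 bytes. The helper name utf8_code_points is made up for this example, and it assumes the input is already valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard function): counts Unicode code points
// in a UTF-8 encoded std::string by counting every byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "héllo" (`"h\xC3\xA9llo"`), size() reports 6 bytes while utf8_code_points reports 5 code points. Nothing is lost by storing UTF-8 in a std::string; only the byte count and the character count differ.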

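Now, on Windows, when one says "Unicode" one usually means UTF-16, which is what the Windows API uses, and that is usually what wstring contains there. On most other platforms wchar_t is 32 bits wide and a wstring typically holds UTF-32.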

So, the right way to convert from UTF8 to UTF16:

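     #include <codecvt>  // std::codecvt_utf8_utf16
     #include <locale>   // std::wstring_convert
     #include <string>

     std::string utf8String = "blah blah";

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::wstring utf16String = convert.from_bytes( utf8String );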

And the other way around:

     std::wstring utf16String = L"blah blah";  // wide literal needs the L prefix

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::string utf8String = convert.to_bytes( utf16String );
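
Note that std::wstring_convert and the <codecvt> facets are deprecated since C++17. As a rough sketch of what such a conversion does internally, here is a hand-written UTF-16 to UTF-8 encoder. The function name utf16_to_utf8 is hypothetical, it takes std::u16string so that the code unit size is the same on every platform, and the only invalid input it rejects is an unpaired surrogate:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the standard library): encodes UTF-16
// code units as UTF-8 bytes, combining surrogate pairs along the way.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80) {            // 1 byte: plain ASCII
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {    // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {  // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                    // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

No character is skipped in a conversion like this: a code point outside the ASCII range simply becomes two to four bytes in the resulting std::string.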

And to add to the confusion:
When you pass a std::string to the Windows API (as you do in a multibyte compilation), it is NOT treated as UTF-8. Windows uses the "ANSI" encoding,
or more specifically, the default code page configured for your Windows installation.

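Windows API functions come in two variants, and the unsuffixed name resolves to one of them depending on whether you compile as multibyte or as Unicode. They expect these formats: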

CommandA - multibyte - ANSI
CommandW - Unicode - UTF16

Yochai Timmer
  • Windows uses UTF-16 by default and uses the wstring data type for it. Is that correct? I am working on some HTTP encoding and decoding where the characters are UTF-8 encoded, so for this implementation do I need wstring, or is the string data type enough? – evk1206 Dec 01 '14 at 09:38
  • @vinothkumareswaran yes, when you use "unicode" it's UTF16. Also C# and Java use UTF16 strings. – Yochai Timmer Dec 01 '14 at 09:39
  • If one says Unicode one means UTF-16? Nowhere but windows. Also, `wstring` is only UTF-16 on windows, it's UTF-32 otherwise. – Deduplicator Dec 01 '14 at 09:55
  • @Deduplicator It's what you put into it. Also on windows it could be UTF-32 – Yochai Timmer Dec 01 '14 at 10:28
  • ?? It's what you put into it? What do you mean? And a `wstring` on windows that's UTF-32 would be a severe aberration. – Deduplicator Dec 01 '14 at 10:35
  • wstring is a type, not a format. You could use it to store GB 18030 , UCS-2 and of course UTF16 as well. – Yochai Timmer Dec 01 '14 at 10:57
1

Make your source files UTF-8 encoded and set the character set to Unicode in your IDE.
Use std::string and widen the strings for Windows API calls:

     std::string somestring = "こんにちは";
     WindowsApiW(widen(somestring).c_str());

I know it sounds kind of hacky, but a more thorough explanation of this issue can be found at utf8everywhere.org.
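
The widen helper is not a standard function, so its implementation is up to you. As a rough, portable sketch (an assumption on my part, not necessarily what utf8everywhere.org ships), it could decode UTF-8 by hand and emit UTF-16 where wchar_t is 16 bits, as on Windows, and UTF-32 elsewhere. It rejects malformed lead and continuation bytes but does not check for overlong encodings:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and stores the result in a std::wstring.
// Emits UTF-16 when wchar_t is 2 bytes (Windows), UTF-32 otherwise.
std::wstring widen(const std::string& utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)              { cp = lead;        len = 1; }  // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }  // 11110xxx
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // Outside the BMP: split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

With this sketch the 15 UTF-8 bytes of "こんにちは" become 5 wide characters, which is what a W function expects on Windows.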

  • In Visual Studio all WinApi calls automatically use the wide string version if you set the encoding to Unicode, so you don't need to explicitly end the function with W. – Katrin Meißner Dec 01 '14 at 10:32
  • You are assigning some special characters to std::string. If I pass somestring.c_str() directly, the result is undefined. You have used a constant string, which I think is stored in the data segment, so at conversion time you are using the Windows API to convert it to a wide string. Is that correct? – evk1206 Dec 01 '14 at 10:49
  • If you'd do somestring.c_str() you'd get a compiler error, because the WindowsAPI uses wide characters for UNICODE (UTF-16). This is just a workaround because Windows does not natively support UTF-8. – Katrin Meißner Dec 01 '14 at 10:55
  • std::string somestring = "こんにちは"; cout<<"\n"< – evk1206 Dec 01 '14 at 10:56