
Let's see if I can explain this without too many factual errors...

I'm writing a string class and I want it to use UTF-8 (stored in a std::string) as its internal storage. I want it to be able to take both "normal" std::string and std::wstring as input and output.

Working with std::wstring is not a problem: I can use std::codecvt_utf8<wchar_t> to convert both from and to std::wstring.
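For illustration, the std::wstring leg can be done along these lines (a minimal sketch; the helper names are mine, not from any library; note that std::codecvt_utf8 was later deprecated in C++17, and on Windows, where wchar_t is 16 bits, std::codecvt_utf8_utf16 is the facet that handles surrogate pairs correctly):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Example helpers (names are mine) for the std::wstring <-> UTF-8 leg.
std::string wide_to_utf8(const std::wstring& w) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(w);
}

std::wstring utf8_to_wide(const std::string& u8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(u8);
}
```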

However, after extensive googling and searching on SO, I have yet to find a way to convert between a "normal/default" C++ std::string (which I assume in Windows is using the local system locale?) and a UTF-8 std::string.

I guess one option would be to first convert the std::string to a std::wstring using std::codecvt<wchar_t, char> and then convert it to UTF-8 as above, but this seems quite inefficient given that, if I understand correctly, at least the first 128 values of a char should translate straight over to UTF-8 without conversion regardless of locale.
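The first step of that two-step route could be sketched like this (the helper name is mine; std::mbsrtowcs interprets the bytes according to the current C locale, so the result depends on what locale the program has set):

```cpp
#include <cwchar>
#include <stdexcept>
#include <string>

// Example (name is mine): decode a locale-encoded std::string into a
// std::wstring using the C locale machinery. The result could then be
// fed to the UTF-8 codecvt for the second step.
std::wstring locale_to_wide(const std::string& s) {
    std::mbstate_t state{};
    const char* src = s.c_str();
    // First pass: dst == nullptr asks for the required length only.
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid byte sequence for current locale");
    std::wstring result(len, L'\0');
    src = s.c_str();
    state = std::mbstate_t{};
    // Second pass: the actual conversion into the sized buffer.
    std::mbsrtowcs(&result[0], &src, len, &state);
    return result;
}
```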

I found this similar question: C++: how to convert ASCII or ANSI to UTF8 and stores in std::string. However, I'm a bit skeptical of that answer, as it's hard-coded to Latin-1, and I want this to work with all locales to be on the safe side.

No answers involving Boost, thanks; I don't want the headache of getting my codebase to work with it.

DaedalusAlpha
    First you need to somehow get the question mark out of "(which I assume in Windows is using the local system localization?)". `std::string` does not have a normal/default encoding. You can choose to assume that the `std::string` you have is encoded according to locale, but if for example you've just read it from a file then that might be untrue, since it will be encoded however the file is encoded. – Steve Jessop Feb 05 '14 at 11:08
  • Well typically when reading raw text files there just is no way to know what encoding it has. Lacking this information it seems more likely for the file to have been created on a system with the same encoding, and therefore I assume the input of reading the file is in the local encoding. – DaedalusAlpha Feb 05 '14 at 11:13
  • OK, so you can indeed remove the question mark :-) There is no doubt that you are assuming the locale-specific encoding. – Steve Jessop Feb 05 '14 at 11:15

1 Answer


If your "normal string" is encoded using the system's code page and you want to convert it to UTF-8 then this should work:

std::string codepage_str;

// Each conversion is done in two calls: the first, with a null output
// buffer, returns the required length; the second fills the buffer.
// First leg: ANSI code page -> UTF-16.
int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                               (int)codepage_str.length(), nullptr, 0);
std::wstring utf16_str(size, '\0');
MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                    (int)codepage_str.length(), &utf16_str[0], size);

// Second leg: UTF-16 -> UTF-8, same two-call pattern.
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                                    (int)utf16_str.length(), nullptr, 0,
                                    nullptr, nullptr);
std::string utf8_str(utf8_size, '\0');
WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                    (int)utf16_str.length(), &utf8_str[0], utf8_size,
                    nullptr, nullptr);
Simple
  • This isn't much different from my naive solution in the question, is it? First convert to wstring and then to UTF-8, meaning at least 4 passes (check size, convert, check size, convert) through the data, whereas if the input is typical English text using only ASCII characters, one pass and no conversion would be enough. – DaedalusAlpha Feb 05 '14 at 13:56
  • @DaedalusAlpha unless you want to look elsewhere for a Unicode library that handles Windows code pages, this is the best you can do using the Win32 API. You have to handle those characters outside of the 7-bit range, so just using one loop isn't enough. – Simple Feb 05 '14 at 13:59
  • What if you first ran a loop checking that each character is within the 7-bit range and adding it to the UTF-8 string, and as soon as a check failed you would clear the string and fall back to this? In that case it would surely be at least 1000% faster for ASCII text and at most about 20% slower for non-ASCII. A lot of text files, for example, are in the ASCII range. – DaedalusAlpha Feb 05 '14 at 14:06
  • @DaedalusAlpha depends on whether you live in America or not. – Simple Feb 05 '14 at 14:07
  • Not necessarily; I would argue that a lot of text files are output from programs that are coded in English and output in English. Also, a lot of text files contain no text at all, only numbers, which fall within the ASCII range. I live in Sweden and can safely say that at least 99% of the text files on my computer are in English or contain only numbers, such as CSV files. – DaedalusAlpha Feb 05 '14 at 14:10
  • @DaedalusAlpha well try it out. You could store the position up to the first non-ASCII character and then use that as the pointer to pass to those functions. – Simple Feb 05 '14 at 14:12
  • that's a great idea! I will try that, that's essentially all of the bonuses with no drawbacks. – DaedalusAlpha Feb 05 '14 at 14:14
  • Hey, great work, it saved my life. I need to do the same thing for Android and iOS (cross-compiling C++). Any ideas how to do this without windows.h? – hugo411 Mar 16 '18 at 16:04
  • I too am curious as to how to do this in a cross-platform manner. – Oscar Apr 06 '21 at 03:02
  • Be careful with MB_COMPOSITE: Microsoft devs recommend never using this flag at all, and there are certain encodings for which it won't work. Worse, it will silently pass on Windows 10 while it can explicitly fail on Windows 7. Reference: https://learn.microsoft.com/en-us/archive/blogs/shawnste/dont-use-mb_composite-mb_precomposed-or-wc_compositecheck – p0358 Apr 14 '23 at 11:44
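The ASCII fast path discussed in the comments above could be sketched like this (a sketch only; the names are mine, and the fallback is passed in by the caller since the real conversion, e.g. the MultiByteToWideChar/WideCharToMultiByte pair from the answer, is platform-specific):

```cpp
#include <cstddef>
#include <string>

// Bytes below 0x80 are identical in every Windows code page and in
// UTF-8, so a pure-ASCII string needs no conversion at all. Returns
// the length of the leading run of ASCII bytes.
std::size_t ascii_prefix_len(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size() && static_cast<unsigned char>(s[i]) < 0x80)
        ++i;
    return i;
}

// Convert with a caller-supplied fallback that is only invoked when
// the input actually contains non-ASCII bytes.
template <class Fallback>
std::string to_utf8(const std::string& s, Fallback full_convert) {
    if (ascii_prefix_len(s) == s.size())
        return s;                // fast path: already valid UTF-8
    return full_convert(s);     // slow path: real code-page conversion
}
```

Simple's refinement of using the prefix position as an offset into the buffer (converting only the tail) would avoid re-scanning the ASCII part in the fallback as well.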