I am wondering, in C++, how can we support UTF-8 encoding? I think C++ only supports `char` and `wchar_t`, but I am wondering how to support UTF-8?
-
Windows? Linux? Android NDK? – bmargulies Aug 19 '14 at 00:00
-
And support UTF-8 in what way exactly? Opening UTF-8 encoded filenames? Writing UTF-8 encoded Unicode data into the file? – Remy Lebeau Aug 19 '14 at 00:02
-
If using C++11/C++14, you can imbue the stream supporting classes with a codecvt or whatever it is called.. to support UTF8. Btw.. I thought it already supported UTF8.. But I might be mixing that up with ANSI. – Brandon Aug 19 '14 at 00:04
-
I mean reading a UTF-8 encoded file. – Adam Lee Aug 19 '14 at 00:08
-
Under the latest Mac OS X. – Adam Lee Aug 19 '14 at 00:09
-
Support in the standard libraries is very poor. You might want to consider using a separate library like ICU: http://site.icu-project.org/ – Galik Aug 19 '14 at 00:13
-
@AdamLee You can definitely use `ifstream` to read a UTF-8 encoded file into memory. In memory, it will of course still be UTF-8 -- is that a problem? Is your actual question about how to convert to it so it becomes UCS-2 or UCS-4 in memory? – jogojapan Aug 19 '14 at 00:17
-
UTF-8 was specifically designed to be compatible even with 7-bit character systems. You will only need to worry about encoding if you try to count characters (the number of characters will be different from the length of the buffer), modify the information (change case etc.), match stuff with regular expressions, and so on. In that case you might want to convert to Unicode and use `std::wstring` instead of `std::string`, and everything that involves use of `wchar_t` instead of `char`. – Havenard Aug 19 '14 at 01:27
-
@Brandon There's nothing C++11 about imbuing a stream or a streambuf with a locale; it's been present in C++ since C++98. – James Kanze Aug 19 '14 at 08:27
1 Answer
UTF-8 is supported just fine; UTF-8 uses eight-bit code units, with each character encoded as one or more of them. The standard guarantees that `char` will be at least eight bits, so every conforming C++ implementation can read, write and process UTF-8 text. Since 7-bit ASCII is a strict subset of UTF-8, conversion between `char` strings and UTF-8 is also not a problem.
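For example, reading a UTF-8 encoded file into memory needs no special handling at all; the bytes land unchanged in a `std::string`. A minimal sketch (the file name `utf8.txt` is just a placeholder):

```cpp
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    // Read the whole file as raw bytes; a std::string holds UTF-8 unchanged.
    std::ifstream in("utf8.txt", std::ios::binary);
    std::string utf8((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    // Note: size() counts bytes, not characters -- a multi-byte sequence
    // still represents a single code point.
    std::cout << "bytes read: " << utf8.size() << '\n';
}
```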
What is a problem is converting between other encodings (code pages such as Latin-1 or other Unicode encodings such as UTF-16, UCS-2, UTF-32 and UCS-4) and UTF-8. Here's a rough outline of the situation:
- C++98 provided the `wchar_t` type and allowed wide-string literals in the form `L"XXX"`, but left most of the details implementation-defined. So VC++ treats `wchar_t` as 16-bit and encodes wide-string literals as UTF-16; GCC treats `wchar_t` as 32-bit and encodes wide-string literals as UTF-32.
- C++11 provides some extra types, `char16_t` and `char32_t`, as well as 16- and 32-bit string literals in the form `u"XXX"` and `U"XXX"`. These, however, are not yet supported by VC++ (GCC has them).
- Conversion between encodings is supported by the `codecvt` template. This was added in C++98, but support has been spotty, to say the least. Today, VC++ seems to have reasonable support, but GCC's support is lacking. A rough conversion sketch follows this list.
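Here is a minimal sketch of such a conversion using the C++11 facilities, assuming a compiler and standard library that actually provide `char16_t` and `std::codecvt_utf8_utf16` from `<codecvt>` (which, as noted above, is not a given):

```cpp
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // "élève" written out as its UTF-8 byte sequence.
    std::string utf8 = "\xC3\xA9l\xC3\xA8ve";

    // Convert UTF-8 <-> UTF-16 through the codecvt_utf8_utf16 facet.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string utf16 = conv.from_bytes(utf8);
    std::string back = conv.to_bytes(utf16);

    std::cout << "UTF-16 code units: " << utf16.size() << '\n';  // 5
    std::cout << "UTF-8 bytes: " << back.size() << '\n';         // 7
}
```

`std::wstring_convert` wraps the facet so no explicit stream or `std::mbstate_t` handling is needed; whether it works in practice depends on the library implementation, as the comments below illustrate.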

– Tom
-
Why the C++11 in the last paragraph? `codecvt` has been present since C++98. What C++11 does add is the required presence of `std::codecvt_utf8` and a few others. – James Kanze Aug 19 '14 at 08:34
-
7-bit ASCII may be a subset of UTF-8, but it's not guaranteed that `char` is ASCII (those weird IBMs still exist). – MSalters Aug 19 '14 at 09:06
-
@JamesKanze It has been there since 98 but it has been absolutely crap until C++11. Not many libraries used it actually. If you haven't noticed, there is STILL a HUGE "lack" of support for it in gcc. You can't say that for more than 11 years, it still lacks this badly. It doesn't work at ALL on g++4.8.1 for windows. I've asked at least 10 questions on codecvt and unicode. Most popular being: http://stackoverflow.com/questions/21370710/unicode-problems-in-c-but-not-c None of the answers would work without visual studio and even then, it has huge bugs. codecvt became popular with C++11. – Brandon Aug 19 '14 at 23:00
-
@Brandon It's been there since C++98, and it hasn't changed since then. It's also true that even today, real support for it hasn't been up to par. But there's no way it can be considered a C++11 feature. I don't think its increasing popularity is a C++11 thing; I suspect that it's more a question of more users wanting to use UTF-8, and insisting that it works. – James Kanze Aug 20 '14 at 09:11