
I'm parsing a file that among other things contains various strings in different encodings. The way these strings are stored is this:

0xFF 0xFF - block header                   2 bytes
0xXX 0xXX - length in bytes                2 bytes
0xXX      - encoding (can be 0, 1, 2, 3)   1 byte
...       - actual string                  num bytes per length

This is generally quite easy; however, I'm not sure how to deal with the encodings. The encoding can be one of:

0x00 - regular ascii string (that is, actual bytes represent char*)
0x01 - utf-16 with BOM (wchar_t* with the first two bytes being 0xFF 0xFE or 0xFE 0xFF)
0x02 - utf-16 without BOM (wchar_t* directly)
0x03 - utf-8 encoded string (char* to utf-8 strings)

I need to read/store this somehow. Initially I was thinking of a simple string, but that wouldn't work with wchar_t*. Then I thought about converting everything to wstring, yet this would involve quite a bit of unnecessary conversion. The next thing that came to mind was boost::variant<string, wstring> (I'm already using boost::variant elsewhere in the code). This seems to me to be a reasonable choice. So now I'm a bit stuck on parsing it. I'm thinking somewhere along these lines:

//after reading the bytes, I have these:
int length;
char encoding;
char* bytes;

boost::variant<string, wstring> value;
switch(encoding) {
    case 0x00:
    case 0x03:
        value = string(bytes, length);
        break;
    case 0x01:
        value = wstring(??);
        //how do I use BOM in creating the wstring?
        break;
    case 0x02:
        value = wstring(bytes, length >> 1);
        break;
    default:
        throw ERROR_INVALID_STRING_ENCODING;
}

As I do little more than print these strings later, I can store UTF8 in a simple string without too much bother.

The two questions I have are:

  1. Is such approach a reasonable one (i.e. using boost::variant)?

  2. How do I create wstring with a specific BOM?

Some programmer dude
Aleks G
  • Have a look here: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring (the top answer). If you are on Windows only, wstring is a solid choice — I mean throughout the software, not the variant approach. If you plan cross-platform, I suggest using Qt for its text conversion power (handling all of them in QString). – Najzero Jan 07 '13 at 11:07
  • @Najzero I'm developing on linux, but the result must be able to compile under windows, linux and mac os x. Also, note that I'm aiming for under 300K statically compiled executable on any platform (outside requirements, I don't control those), therefore linking ICU or QT is most likely not an option. – Aleks G Jan 07 '13 at 11:07

2 Answers


UTF-16 needs to be distinguished between LE and BE.

I suspect that 0x02 - utf-16 without BOM (wchar_t* directly) is actually UTF-16 BE. With the "with BOM" encoding, LE vs. BE is indicated by the BOM itself.

Unicode support in the C++ standard library is very limited, and I don't think vanilla C++ will handle UTF-16 LE/BE properly, not to mention UTF-8. Many Unicode applications use third-party support libraries such as ICU.

For the in-memory representation, I would stick to std::string, because std::string can hold text in any encoding, while std::wstring is not much help in this multiple-encoding situation. If you need to use std::wstring and the related std::iostream functions, be careful with the system locale and std::locale settings.

Mac OS X uses UTF-8 as its only default text encoding, whereas Windows uses UTF-16 LE. I think you need only one text encoding internally, plus a few converting functions, and that will serve your purpose.
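As an illustration of such a converting function, here is a minimal sketch of a UTF-16LE to UTF-8 converter (the function name `utf16le_to_utf8` is made up for this example). It handles surrogate pairs but does no validation of unpaired surrogates or truncated input:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-16LE code units from raw bytes and re-encode as UTF-8.
std::string utf16le_to_utf8(const unsigned char* p, size_t nbytes) {
    std::string out;
    for (size_t i = 0; i + 1 < nbytes; ) {
        uint32_t cp = p[i] | (p[i + 1] << 8);
        i += 2;
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < nbytes) {
            uint32_t lo = p[i] | (p[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        // Emit the code point as 1-4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```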

9dan
  • 4,222
  • 2
  • 29
  • 44
  • Fair enough. The issue with ICU is that it's heavy, and I'm trying to avoid external libs as much as possible. I'm aiming at a statically linked executable (i.e. no external dll or so files) under 300K, so I can't really afford to link ICU in. Unless I can find a small header-only set of templates/functions/macros, I'll probably end up writing a few functions myself for converting strings. – Aleks G Jan 07 '13 at 11:09

After some research and trial and error, I decided to go with UTF8-CPP, which is a lightweight, header-only set of functions for converting to/from UTF-8. It includes functions for converting from UTF-16 to UTF-8 and, from my understanding, can deal correctly with a BOM.

Then I store all strings as std::string, converting utf-16 strings to utf-8, something like this (from my example above):

//after reading the bytes, I have these:
int length;
char encoding;
char* bytes;

string value;
switch(encoding) {
    case 0x00:
    case 0x03:
        value = string(bytes, length);
        break;
    case 0x01:
    case 0x02: {
        //utf16to8 expects 16-bit code units; wchar_t is 32 bits on
        //Linux, so cast to a 16-bit type rather than wchar_t*
        vector<unsigned char> utf8;
        uint16_t* input = (uint16_t*)bytes;
        utf16to8(input, input + (length >> 1), back_inserter(utf8));
        value = string(utf8.begin(), utf8.end());
        break;
    }
    default:
        throw ERROR_INVALID_STRING_ENCODING;
}

This works fine in my quick test. I'll need to do more testing before final judgement.
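For the 0x01 case, the BOM still sits at the front of the payload, so before calling utf16to8 one can inspect the first two bytes, drop them, and byte-swap if the stream was big-endian. A sketch (the helper name `strip_bom_to_utf16` is made up; it defaults to LE when no BOM is present):

```cpp
#include <cstdint>
#include <vector>

// Strip an optional UTF-16 BOM and return native 16-bit code units,
// byte-swapping big-endian input along the way.
std::vector<uint16_t> strip_bom_to_utf16(const unsigned char* bytes, int length) {
    bool big_endian = false;
    int start = 0;
    if (length >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) { big_endian = true; start = 2; }
        else if (bytes[0] == 0xFF && bytes[1] == 0xFE) { start = 2; }
    }
    std::vector<uint16_t> units;
    for (int i = start; i + 1 < length; i += 2) {
        uint16_t u = big_endian
            ? static_cast<uint16_t>((bytes[i] << 8) | bytes[i + 1])
            : static_cast<uint16_t>(bytes[i] | (bytes[i + 1] << 8));
        units.push_back(u);
    }
    return units;
}
```

The resulting vector can then be fed to utf16to8 via units.begin()/units.end() regardless of the original byte order.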

Aleks G