C++ how to get the next multibyte character

Question

Is there a way to get the next full character in a multibyte string? For example, "z\u00df\u6c34\U0001d10b" ("zß水𝄋") would be represented as 4 characters, excluding null termination, in a wide string, but perhaps 10 bytes in a multibyte string (as it would be in UTF-8). I was using the code below to convert to and from string, since I used wstring internally, but there seem to be subtle issues if the proper length is not given to __wideToString, even if the length is larger than it needs to be. I have also realized that I can probably skip the whole conversion to and from wstring and use only string, if I can simply find out how many chars in the multibyte string make up the next full character. So, say, in the string u8"\u6c34\U0001d10b", which is stored in 7 chars, I would only want the next 3, which would be "水". Can anyone guide me in solving this issue?

I have been having this kind of Unicode issue for a while now, and there doesn't seem to be a lot of information on how it's handled in C++, save for third-party solutions, which I am trying to avoid.

static 
std::string __wideToString(const std::wstring & ws){
    if(ws.empty()){throw std::invalid_argument("Wide string must have length >= 1");}
    std::setlocale(LC_ALL, "");
    size_t length = sizeof(wchar_t)*ws.length();
    std::string str(length,' ');
    if((length=wcstombs(&str[0], ws.c_str(), length))==size_t(-1)){//return -1 on invalid conversion
        throw std::length_error("Conversion Error Invalid Wide Character"); 
    }
    str.resize(length); // Shrink to fit.
    return str;
}

static 
std::wstring __stringToWide(const std::string & str){
    if(str.empty()){throw std::invalid_argument("String must have length >= 1");}
    std::setlocale(LC_ALL, "");
    size_t length = str.length();
    std::wstring ws(length, L' '); // Overestimate number of code points.
    if((length=mbstowcs(&ws[0], str.c_str(), length))==size_t(-1)){//return -1 on invalid conversion
        throw std::length_error("Conversion Error Invalid Multibyte Character");    
    } 
    ws.resize(length); // Shrink to fit.
    return ws;
}

The encoding of the example string is not consistent. There is no way of knowing how many bytes a character is. — Some programmer dude, Jul 08 '14 at 18:31
You cannot process multi-byte characters without knowing the encoding used. The encoding tells you how to interpret the individual bytes correctly. — Remy Lebeau, Jul 08 '14 at 22:07
@JoachimPileborg it was only an example since the input is random words from a dictionary and this issue only began when i started to use the file /usr/share/dict/words as a source, which has multibyte characters — kdgwill, Jul 08 '14 at 22:46
Agree that there seems to be a lack of a definitive guide to unicode string processing in C++ — M.M, Jul 09 '14 at 00:26

Lasse Reinhold · Answer 1 · 2014-07-09T00:08:29.517

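wcstombs() only works for characters that the current locale's multibyte encoding can represent; in a single-byte locale, that generally rules out anything beyond Unicode code points 0 to 0xFF.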

Otherwise it will either fail with a return value of (size_t)-1 (for Chinese characters, etc.) or silently produce bad output (such as removing the diacritic from 'ā' so it becomes 'a').

The problem is that what you're doing doesn't make sense if you have characters that cannot be represented by a normal std::string. There is no operating system API or C++03/11 feature that supports what you are trying to do.

Functions with names like wideToString() do not make sense unless you only have a limited ANSI-like character set. stringToWide() would make sense, though.

Back to your question: Windows stores the wstring payload as UTF-16, and each wchar_t inside it is a single 16-bit UTF-16 code unit (so you need two wchar_ts for characters beyond code point 0xFFFF). On Linux, a wchar_t is a 32-bit UTF-32 code unit, so each one holds a whole code point; it is narrow strings that are typically UTF-8 there.

So on Windows you can search for some UTF-16 decoding functions on the net to find out where the next character begins. But again, it won't help you out.
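
As a rough illustration of the UTF-16 point above (not part of the original answer), the sketch below reports how many 16-bit code units the next character occupies by checking for a surrogate pair. The name utf16NextLength is hypothetical, and std::u16string is used instead of std::wstring as an assumption, so that the code behaves the same regardless of the platform's wchar_t width.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: number of UTF-16 code units (1 or 2) that make up
// the code point starting at s[pos].
std::size_t utf16NextLength(const std::u16string& s, std::size_t pos)
{
    assert(pos < s.size());  // precondition: pos must index into the string
    const char16_t unit = s[pos];

    // A high surrogate (0xD800-0xDBFF) must be followed by a low surrogate.
    if (unit >= 0xD800 && unit <= 0xDBFF) {
        if (pos + 1 >= s.size() || s[pos + 1] < 0xDC00 || s[pos + 1] > 0xDFFF) {
            throw std::runtime_error("Unpaired high surrogate");
        }
        return 2;
    }

    // A low surrogate (0xDC00-0xDFFF) may not start a character.
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        throw std::runtime_error("Unexpected low surrogate");
    }

    return 1;  // any other code unit is a complete code point on its own
}
```

For the string from the question, u"z\u00df\u6c34\U0001d10b" holds five code units: the first three characters take one unit each, and the last one takes two.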

score 1 · Answer 2 · answered Sep 06 '17 at 16:08

This function will get you the byte length and code point of the next character, assuming the string is UTF-8:

#include <cstddef>   // std::size_t
#include <stdexcept> // std::runtime_error

void getNextCharByteLengthAndCodePoint(const char* ch, std::size_t& byteLength, char32_t& codePoint)
{
    unsigned char firstByte(*ch);

    // 0xxxxxxx: high bit clear, so this is a single-byte (ASCII) character
    if ((firstByte & 0x80) == 0x00)
    {
        // Code point is the low 7 bits
        codePoint = firstByte & 0x7F;
        byteLength = 1;
    }
    // 110xxxxx: lead byte of a 2-byte sequence
    else if ((firstByte & 0xE0) == 0xC0)
    {
        // Code point starts with the low 5 bits
        codePoint = firstByte & 0x1F;
        byteLength = 2;
    }
    // 1110xxxx: lead byte of a 3-byte sequence
    else if ((firstByte & 0xF0) == 0xE0)
    {
        // Code point starts with the low 4 bits
        codePoint = firstByte & 0x0F;
        byteLength = 3;
    }
    // 11110xxx: lead byte of a 4-byte sequence
    else if ((firstByte & 0xF8) == 0xF0)
    {
        // Code point starts with the low 3 bits
        codePoint = firstByte & 0x07;
        byteLength = 4;
    }
    else
    {
        throw std::runtime_error("Invalid UTF8 encoding");
    }

    for (std::size_t i = 1; i < byteLength; ++i)
    {
        unsigned char next = static_cast<unsigned char>(ch[i]);

        // Every following byte must be a continuation byte (10xxxxxx);
        // this also stops us from reading past the terminating null
        if ((next & 0xC0) != 0x80)
        {
            throw std::runtime_error("Invalid UTF8 encoding");
        }

        // Shift in the 6 payload bits of each continuation byte
        codePoint = ((codePoint << 6) | (next & 0x3F));
    }
}
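
To tie this back to the question of getting "the next full character" out of a std::string, here is a minimal self-contained sketch, assuming the input is UTF-8. The names utf8SequenceLength and splitUtf8 are hypothetical and not part of the answer above; the sketch only reads the lead byte to find each character's length and does not validate continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: byte length of the UTF-8 sequence whose first byte is `lead`.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    throw std::runtime_error("Invalid UTF-8 lead byte");
}

// Hypothetical helper: split UTF-8 text into one std::string per code point.
std::vector<std::string> splitUtf8(const std::string& text)
{
    std::vector<std::string> chars;
    for (std::size_t pos = 0; pos < text.size();) {
        const std::size_t len =
            utf8SequenceLength(static_cast<unsigned char>(text[pos]));
        assert(len >= 1 && len <= 4);  // invariant of utf8SequenceLength

        // Refuse a sequence that would run past the end of the string.
        if (pos + len > text.size()) {
            throw std::runtime_error("Truncated UTF-8 sequence");
        }
        chars.push_back(text.substr(pos, len));
        pos += len;
    }
    return chars;
}
```

Given the question's example text "zß水𝄋" encoded as UTF-8 (10 bytes), splitUtf8 returns four strings of 1, 2, 3 and 4 bytes, so the whole thing can stay in std::string without a round trip through wstring.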
