Character length in bytes

Question

Given the first byte of a multi-byte character and the canonical name of the charset, how can I determine the byte length of that character?

Ideally this would use the ICU library.

Say you start reading a stream: you see the first byte and you know the charset. Can you tell how many more bytes you need to read to get the whole character? Any library that does this would be appreciated. Thanks. – Michal May 31 '13 at 06:57

score 2 · Accepted Answer · answered Jun 03 '13 at 13:13

Use ucnv_getNextUChar from the ICU library. The following code splits a byte stream into characters and prints the size of each character in bytes:

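#include <cstring>
#include <iostream>
#include <unicode/errorcode.h>
#include <unicode/ucnv.h>

int main() {
    const char* utf8_strings[] = {"Samotność - cóż po ludziach, czym śpiewak dla ludzi"};
    icu::ErrorCode err;
    UConverter* conv = ucnv_open("UTF-8", err);
    const char* curr = utf8_strings[0];
    const char* end = curr + std::strlen(curr);  // fixed end of the input buffer
    while (err.isSuccess() && curr < end) {
        const char* prev = curr;
        ucnv_getNextUChar(conv, &curr, end, err);  // advances curr past one character
        // Print the character's bytes, then its length in bytes.
        std::cout.write(prev, curr - prev) << "  " << curr - prev << std::endl;
    }
    ucnv_close(conv);
    return 0;
}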

score 1 · Answer 2 · answered May 29 '13 at 16:27

Most character encodings are designed so that the byte length of a character can be determined from its first byte. For example:

In UTF-16, each character takes two bytes, or four bytes (a surrogate pair) for code points above 0xFFFF.
In UTF-8, there are four cases:
1. Code points below 0x80 are encoded as 0xxxxxxx (1 byte).
2. Code points from 0x80 to 0x7FF are encoded as 110xxxxx 10xxxxxx (2 bytes).
3. Code points from 0x800 to 0xFFFF are encoded as 1110xxxx 10xxxxxx 10xxxxxx (3 bytes).
4. Code points from 0x10000 to 0x10FFFF are encoded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes).
In GBK, you can tell whether the character has a second byte by checking whether its first byte is larger than 0x7F.
For ISO-8859-1 (Latin-1) and similar single-byte encodings, every character is one byte.
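
As a rough illustration of the UTF-8 rule, the leading bits of the first byte are enough to tell how long the sequence is, including the four-byte form 11110xxx used for code points above 0xFFFF. The function name utf8_sequence_length below is hypothetical (it is not an ICU call), and the sketch only classifies the lead byte; it does not check that the following bytes form a well-formed sequence.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of ICU: returns the total length in bytes of
// a UTF-8 sequence given its first byte, or 0 if the byte cannot start one.
// It inspects only the lead byte and does not validate the trailing bytes.
std::size_t utf8_sequence_length(std::uint8_t first) {
    if (first < 0x80) return 1;            // 0xxxxxxx: single byte
    if ((first & 0xE0) == 0xC0) return 2;  // 110xxxxx: one byte follows
    if ((first & 0xF0) == 0xE0) return 3;  // 1110xxxx: two bytes follow
    if ((first & 0xF8) == 0xF0) return 4;  // 11110xxx: three bytes follow
    return 0;                              // 10xxxxxx continuation byte or invalid lead
}
```

When reading from a stream, the number of bytes still to read would be the returned length minus one; a result of 0 means the reader is not positioned at the start of a character.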

Characters above 0xFFFF use 4 bytes in UTF-16. – Peter Lawrey May 29 '13 at 17:42
Any library supporting this operation? – Michal May 31 '13 at 10:20
