0

Given the first byte of a (possibly multi-byte) character and the canonical name of the charset, how can I determine the byte length of that character?

Ideally this would use the ICU library.

Michal
  • Let's say you start reading a stream: you see the first byte and you know the charset. Can you tell how many more bytes you need to read to get the whole character? Any library that does this would be appreciated. Thanks. – Michal May 31 '13 at 06:57

2 Answers

2

Use ucnv_getNextUChar from the ICU library. The following code splits a byte stream into characters and prints the size in bytes of each character:

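#include <cstring>
#include <iostream>
#include <unicode/errorcode.h>
#include <unicode/ucnv.h>

int main() {
    const char * utf8_strings[] = {"Samotność - cóż po ludziach, czym śpiewak dla ludzi"};

    icu::ErrorCode err;
    UConverter* conv = ucnv_open("UTF-8", err);
    const char* curr = utf8_strings[0];
    // Fixed end of input; the limit must not move as curr advances.
    const char* end = curr + std::strlen(curr);
    while (curr < end && err.isSuccess()) {
        const char* prev = curr;
        ucnv_getNextUChar(conv, &curr, end, err);  // advances curr past one character
        // Print all bytes of the character, then its length in bytes.
        std::cout.write(prev, curr - prev) << "  " << (curr - prev) << std::endl;
    }
    ucnv_close(conv);
}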
Michal
1

Most character encodings are designed so that the byte length of a character can be determined from its first byte. For example:

  • In UTF-16, each character takes two bytes, or four bytes when it is encoded as a surrogate pair (the first 16-bit unit is then in the range 0xD800 to 0xDBFF).
  • In UTF-8, there are four cases:
    1. characters below 0x80 have the form 0xxx xxxx
    2. characters from 0x80 to 0x7FF have the form 110x xxxx 10xx xxxx
    3. characters from 0x800 to 0xFFFF have the form 1110 xxxx 10xx xxxx 10xx xxxx
    4. characters from 0x10000 to 0x10FFFF have the form 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
  • In GBK, a first byte greater than 0x7F (a lead byte in the range 0x81 to 0xFE) means a second byte follows; otherwise the character is a single byte.
  • In ISO-8859-1 (Latin-1) and other single-byte charsets, every character is one byte.
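
The rules above can be sketched as a small lookup that needs no library. This is only an illustration: the function name charLengthFromFirstByte, its signature, and the charset spellings it accepts are assumptions made for this example and are not part of ICU. It also assumes that for UTF-16 the byte order is known to be big-endian, because in little-endian data the first byte is the low byte of the code unit and does not reveal a surrogate.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not an ICU function): returns the total byte length of
// the character that starts with `first`, or 0 if `first` cannot start a
// character in that charset or the charset is not handled by this sketch.
std::size_t charLengthFromFirstByte(const std::string& charset, std::uint8_t first) {
    if (charset == "UTF-8") {
        if (first < 0x80) return 1;                    // 0xxx xxxx
        if (first >= 0xC2 && first <= 0xDF) return 2;  // 110x xxxx (0xC0, 0xC1 would be overlong)
        if ((first & 0xF0) == 0xE0) return 3;          // 1110 xxxx
        if (first >= 0xF0 && first <= 0xF4) return 4;  // 1111 0xxx, up to U+10FFFF
        return 0;                                      // continuation byte or invalid lead byte
    }
    if (charset == "UTF-16BE") {
        // Big-endian: `first` is the high byte of the first 16-bit unit.
        if (first >= 0xD8 && first <= 0xDB) return 4;  // lead surrogate, a pair follows
        if (first >= 0xDC && first <= 0xDF) return 0;  // trail surrogate cannot start a character
        return 2;
    }
    if (charset == "GBK") {
        if (first < 0x80) return 1;                    // ASCII range
        if (first >= 0x81 && first <= 0xFE) return 2;  // lead byte of a two-byte character
        return 0;                                      // 0x80 and 0xFF are not lead bytes
    }
    if (charset == "ISO-8859-1") return 1;             // single-byte charset
    return 0;                                          // charset not covered here
}
```

For example, charLengthFromFirstByte("UTF-8", 0xC5) gives 2, so exactly one more byte has to be read from the stream. A result of 0 means the byte cannot start a character, which a reader can treat as a decoding error.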
kyriosli