
I am reading someone else's code and came across the function below.

According to the comment, this function converts a UCS character to a UTF-8 string. But what is a UCS character, what is the rule for converting UCS to Unicode, and where can I find the documentation?

```c
/*
 * Convert a UCS character to an UTF-8 string
 *
 * Returns the string length of the result
 */
size_t
tUcs2Utf8(ULONG ulChar, char *szResult, size_t tMaxResultLen)
{
    if (szResult == NULL || tMaxResultLen == 0) {
        return 0;
    }

    if (ulChar < 0x80 && tMaxResultLen >= 2) {
        szResult[0] = (char)ulChar;
        szResult[1] = '\0';
        return 1;
    }
    if (ulChar < 0x800 && tMaxResultLen >= 3) {
        szResult[0] = (char)(0xc0 | ulChar >> 6);
        szResult[1] = (char)(0x80 | (ulChar & 0x3f));
        szResult[2] = '\0';
        return 2;
    }
    if (ulChar < 0x10000 && tMaxResultLen >= 4) {
        szResult[0] = (char)(0xe0 | ulChar >> 12);
        szResult[1] = (char)(0x80 | (ulChar >> 6 & 0x3f));
        szResult[2] = (char)(0x80 | (ulChar & 0x3f));
        szResult[3] = '\0';
        return 3;
    }
    if (ulChar < 0x200000 && tMaxResultLen >= 5) {
        szResult[0] = (char)(0xf0 | ulChar >> 18);
        szResult[1] = (char)(0x80 | (ulChar >> 12 & 0x3f));
        szResult[2] = (char)(0x80 | (ulChar >> 6 & 0x3f));
        szResult[3] = (char)(0x80 | (ulChar & 0x3f));
        szResult[4] = '\0';
        return 4;
    }
    szResult[0] = '\0';
    return 0;
} /* end of tUcs2Utf8 */
```
roger
  • Really? [this](https://www.google.com/search?q=ucs+character&oq=ucs+character&aqs=chrome..69i57j69i60&sourceid=chrome&es_sm=122&ie=UTF-8) did not help? – Sourav Ghosh Jan 18 '16 at 09:57
  • @SouravGhosh, I can read the code, but why does it work this way? I want to know the rule behind the conversion. – roger Jan 18 '16 at 10:00
  • Please don't roll your own code when tested and stable alternatives exist. If this is Windows specific, you can use `MultiByteToWideChar` and/or `WideCharToMultiByte`. Otherwise you can use ICU. – user4520 Jan 18 '16 at 10:21
  • The function name is misleading. A UCS-2 code unit only covers the range U+0000 to U+FFFF. What this function actually does is convert a full Unicode character by its code point number (U+0000 to U+10FFFF) to a UTF-8 byte sequence. – bobince Jan 18 '16 at 10:31
  • @bobince: I think the 2 in the function name means "to" and not "two", i.e. it is the result of a poor choice of name for the function. – dalle Jan 18 '16 at 11:37
  • UCS is what happens when ISO adopts a great standard. They'll make another one. – Hans Passant Jan 18 '16 at 15:05

1 Answer


The Universal Character Set (UCS) is an ISO standard (ISO/IEC 10646). It defines the same characters as Unicode, so there's no need for character conversion. In scope, every version of UCS is essentially a subset of a certain version of the Unicode standard: it covers the character repertoire and encoding forms, while Unicode adds further material such as character properties and algorithms. New characters are first added to Unicode and, every so often, UCS is synchronized with Unicode. Appendix C of the Unicode standard contains a table that shows the relationship between the different versions.

Also note that the code you posted uses a non-standard upper limit of 0x200000. This should be changed to 0x110000, because the highest Unicode code point is U+10FFFF.
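For illustration, here is a minimal sketch of the same conversion with the standard upper limit. The name `encode_utf8_checked` is hypothetical, and `unsigned long` stands in for the original `ULONG`. Rejecting the surrogate range U+D800 to U+DFFF is an extra check that the posted code does not perform; I am assuming you want only valid scalar values in the output.

```c
#include <stddef.h>

/*
 * Hypothetical variant of the posted function (not the original author's code).
 * Encodes one code point as a NUL-terminated UTF-8 string.
 * Returns the number of bytes written, excluding the NUL, or 0 on failure.
 */
size_t
encode_utf8_checked(unsigned long cp, char *out, size_t max_len)
{
    /* Lead-byte prefixes for 1-, 2-, 3- and 4-byte sequences (index 0 unused). */
    static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
    size_t n, i;

    if (out == NULL || max_len == 0) {
        return 0;
    }
    out[0] = '\0';

    /* Standard limit: code points end at U+10FFFF. */
    if (cp >= 0x110000UL) {
        return 0;
    }
    /* Assumption: surrogates are rejected (the posted code accepts them). */
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) {
        return 0;
    }

    /* Sequence length depends only on the magnitude of the code point. */
    n = cp < 0x80UL ? 1 : cp < 0x800UL ? 2 : cp < 0x10000UL ? 3 : 4;
    if (max_len < n + 1) {
        return 0; /* no room for the bytes plus the terminating NUL */
    }

    /* Continuation bytes carry 6 bits each, filled from the last byte back. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    out[0] = (char)(lead[n] | cp);
    out[n] = '\0';
    return n;
}
```

With a 5-byte buffer, U+20AC (the euro sign) should come out as the three bytes E2 82 AC, while 0x110000 and 0xD800 return 0.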

nwellnhof