Converting an "HTML entity" emoticon code in UTF16 (in c++)

Question

I'm currently writing my own DrawTextEx() function that supports emoticons. Each time this function finds an emoticon in the text, it calls a callback, giving the caller the opportunity to replace the text segment containing the emoticon with an image. For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image while the text is drawn. This function currently works fine.

Now I want to implement an image library on the caller side. I receive a text segment like 0x3DD8 0x00DE in my callback function, and my idea is to use this code as the key in a map containing all the Unicode combinations, each one linked to a structure containing the image to draw. I found a good package on the http://emojione.com/developers/ website. Every package available on this site contains a set of files, each named with a hexadecimal code. So I can iterate through the files contained in the package and build my map automatically.

However I found that these codes are part of another standard, and are in fact a set of items named "HTML entity", apparently used in the web development, as can be seen on the http://graphemica.com/%F0%9F%98%80 website. So, to be able to use these files, I need a way to convert the HTML entity values contained in their names into UTF-16 codes. For example, in the case of the above mentioned smiling face, I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF16 code.

A brute-force approach would be to write a map that converts these codes, adding each of them to my code one by one. But as the Unicode standard contains, in the most optimistic scenario, more than 1800 combinations for the emoticons, I want to know if there is an existing solution, such as a known API or function, that I could use to do the job. Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)

Regards

score 2 · Answer 1 · answered Sep 20 '16 at 23:08

For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image

The character U+1F600 Grinning Face is represented by the UTF-16 code unit sequence 0xD83D, 0xDE00.

(Graphemica swapping the order of the bytes for each code unit is super misleading; ignore that.)
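To see where the values in the question presumably come from, here is a minimal sketch that swaps the two bytes of each code unit. The helper name swapBytes is made up for this illustration and is not part of any library.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only): swap the two bytes of a 16-bit code unit.
constexpr std::uint16_t swapBytes(std::uint16_t unit)
{
    return static_cast<std::uint16_t>((unit >> 8) | ((unit & 0xFF) << 8));
}

// The correct UTF-16 code units for U+1F600, byte-swapped, give the
// values quoted in the question.
static_assert(swapBytes(0xD83D) == 0x3DD8, "high surrogate, bytes swapped");
static_assert(swapBytes(0xDE00) == 0x00DE, "low surrogate, bytes swapped");
```

So 0x3DD8 0x00DE is the same pair of code units with the bytes of each unit reversed, not a different encoding.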

I found that these codes are part of another standard, and are in fact a set of items named "HTML entity", apparently used in the web development

HTML has nothing to do with it. They're plain Unicode characters—just ones outside the Basic Multilingual Plane, above U+FFFF, which is why it takes more than one UTF-16 code unit to represent them.

HTML numeric character references like &#x1F600; (often incorrectly referred to as entities) are a way of referring to characters by code point number, but the escape string is only effective in an HTML (or XML) document, and we're not in one of those.

So:

I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF16 code.

sounds more like:

I need to convert representations of U+1F600 Grinning Face: from the code point number 0x1F600 to the UTF-16 code unit sequence 0xD83D, 0xDE00

Which in C# would be:

string face = Char.ConvertFromUtf32(0x1F600); // "😀" aka "\uD83D\uDE00"

or in the other direction:

int codepoint = Char.ConvertToUtf32("\uD83D\uDE00", 0); // 0x1F600

(the name ‘UTF-32’ is poorly-chosen here; we are talking about an integer code point number, not a sequence of four-bytes-per-character.)
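For the direction from a surrogate pair back to a code point, a rough C++ counterpart could look like the sketch below. The names isHighSurrogate, isLowSurrogate and toCodePoint are assumptions made for this example, and the caller is expected to check that it really has a high surrogate followed by a low surrogate.

```cpp
// Hypothetical helpers (illustration only).
constexpr bool isHighSurrogate(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t unit)  { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Combine a high/low surrogate pair into the code point it encodes.
// Assumes isHighSurrogate(high) && isLowSurrogate(low); no checking is done here.
constexpr char32_t toCodePoint(char16_t high, char16_t low)
{
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)  // top 10 bits
         + (static_cast<char32_t>(low) - 0xDC00);          // bottom 10 bits
}

static_assert(isHighSurrogate(0xD83D) && isLowSurrogate(0xDE00), "valid pair");
static_assert(toCodePoint(0xD83D, 0xDE00) == 0x1F600, "U+1F600 Grinning Face");
```

This can be handy in the callback: the pair received from the text can be turned into a code point and compared with the number taken from a file name.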

Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)

In C++ things are more annoying; there's not (that I can think of) anything that directly converts between code points and UTF-16 code units. You could use various encoding functions/libraries to convert between UTF-32-encoded byte sequences and UTF-16 code units, but that can end up more faff than just writing the conversion logic yourself. For example, in its most basic form for a single character:

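#include <string>

// Encodes a single code point as UTF-16, assuming wchar_t holds UTF-16 code units.
std::wstring fromCodePoint(int codePoint) {
    if (codePoint < 0x10000) {
        // Basic Multilingual Plane: one code unit is enough.
        return std::wstring(1, static_cast<wchar_t>(codePoint));
    }
    // Above U+FFFF: split the 20-bit offset into a surrogate pair.
    const int offset = codePoint - 0x10000;
    const wchar_t codeUnits[2] = {
        static_cast<wchar_t>(0xD800 + (offset >> 10)),   // high surrogate: top 10 bits
        static_cast<wchar_t>(0xDC00 + (offset & 0x3FF))  // low surrogate: bottom 10 bits
    };
    return std::wstring(codeUnits, 2);
}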

This is assuming the wchar_t type is based on UTF-16 code units, same as C#'s string type is. On Windows this is probably true. Elsewhere it is probably not, but on platforms where wchar_t is based on code points, you can just pull each code point out of the string as a character with no further processing.
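If the same code has to build on both kinds of platform, one option is to branch on the size of wchar_t. The sketch below is a hypothetical variant (the name fromCodePointPortable is made up), and it assumes that a 2-byte wchar_t means UTF-16 and that a wider wchar_t holds whole code points.

```cpp
#include <cassert>
#include <string>

// Hypothetical variant: choose the encoding from the size of wchar_t.
// Assumption: 2-byte wchar_t is UTF-16, wider wchar_t stores code points directly.
std::wstring fromCodePointPortable(char32_t codePoint)
{
    // Precondition: not beyond the last Unicode code point.
    assert(codePoint <= 0x10FFFF);

    if (sizeof(wchar_t) > 2 || codePoint < 0x10000) {
        // Either wchar_t is wide enough, or the code point fits in one UTF-16 unit.
        return std::wstring(1, static_cast<wchar_t>(codePoint));
    }

    // 2-byte wchar_t and a code point above U+FFFF: emit a surrogate pair.
    const char32_t offset = codePoint - 0x10000;
    std::wstring result;
    result += static_cast<wchar_t>(0xD800 + (offset >> 10));
    result += static_cast<wchar_t>(0xDC00 + (offset & 0x3FF));
    return result;
}
```

With this, fromCodePointPortable(0x1F600) yields two code units where wchar_t is 2 bytes and a single character elsewhere.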

(Optimisation and error handling left as an exercise for the reader.)
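Applied to the original problem, the map keys could be built straight from the image file names. The sketch below is only an illustration under a few assumptions: the file extension has already been stripped, each name is one or more hexadecimal code points separated by '-' (for example "1f600" or "1f1e6-1f1e8"), and std::u16string is used instead of std::wstring so the result does not depend on the size of wchar_t. The names appendUtf16 and keyFromFileStem are made up for this example.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: append one code point to a UTF-16 string.
// Precondition: cp is a valid scalar value (the caller validates it).
void appendUtf16(std::u16string& out, unsigned long cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
        return;
    }
    const unsigned long offset = cp - 0x10000;
    out += static_cast<char16_t>(0xD800 + (offset >> 10));
    out += static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
}

// Hypothetical helper: turn a file name stem such as "1f600" or
// "1f1e6-1f1e8" into the UTF-16 sequence to use as the map key.
std::u16string keyFromFileStem(const std::string& stem)
{
    std::u16string key;
    std::istringstream parts(stem);
    std::string hex;
    while (std::getline(parts, hex, '-')) {
        std::size_t used = 0;
        // std::stoul throws if nothing could be parsed as a hexadecimal number.
        const unsigned long cp = std::stoul(hex, &used, 16);
        if (used != hex.size() || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a code point: " + hex);
        appendUtf16(key, cp);
    }
    return key;
}
```

For "1f600" this produces the two code units 0xD83D, 0xDE00, which is the sequence the callback sees in the text.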

score 0 · Answer 2 · answered Sep 21 '16 at 14:53

I'm using the RAD Studio compiler, and fortunately it provides an implementation for the ConvertFromUtf32 and ConvertToUtf32 functions mentioned by bobince. I tested them and they do exactly what I needed.

For those who don't use the Embarcadero products, the fromCodePoint() implementation provided by bobince also works well. For information, here is also the ConvertFromUtf32() function as implemented in RAD Studio, and translated into C++:

#include <stdexcept>
#include <string>

std::wstring ConvertFromUtf32(unsigned c)
{
    const unsigned unicodeLastChar  = 0x10FFFF; // 1114111, the last Unicode code point
    const unsigned minHighSurrogate = 0xD800;
    const unsigned minLowSurrogate  = 0xDC00;
    const unsigned maxLowSurrogate  = 0xDFFF;

    // is UTF32 value out of bounds, or inside the surrogate range?
    if (c > unicodeLastChar || (c >= minHighSurrogate && c <= maxLowSurrogate))
        throw std::out_of_range("Argument out of range - invalid UTF32 value");

    std::wstring result;

    // is UTF32 value a 16 bit value that can fit inside a wchar_t?
    if (c < 0x10000)
        result = wchar_t(c);
    else
    {
        // subtract the offset, leaving a 20 bit value to split into 2 chars
        c -= 0x10000;

        // high surrogate takes the top 10 bits, low surrogate the bottom 10 bits
        result  = wchar_t((c / 0x400) + minHighSurrogate);
        result += wchar_t((c % 0x400) + minLowSurrogate);
    }

    return result;
}

Thanks to bobince for his response, which pointed me in the right direction and helped me to solve this problem.

Regards
