
So, I'm having some problems with the following char (an en dash, "–") in a port of imgui to Kotlin.

After digging into Charsets and encodings the whole day, I came down to my only hope: relying on Unicode code points.

That char on the JVM

"–"[0].toInt() // same as codePointAt()

returns code point U+2013.

In C, I'm not sure, but this is what is being done:

const ImFontGlyph* ImFont::FindGlyph(ImWchar c) const
{
    if (c >= IndexLookup.Size)
        return FallbackGlyph;
    const ImWchar i = IndexLookup.Data[c];
    if (i == (ImWchar)-1)
        return FallbackGlyph;
    return &Glyphs.Data[i];
}

Where

typedef unsigned short ImWchar;

and

ImVector<ImWchar> IndexLookup; // Sparse. Index glyphs by Unicode code-point.

So, doing this

const char* a = "–";
int b = (unsigned char)a[0]; // cast so the byte isn't sign-extended

returns 0x96, i.e. code point U+0096.

As far as I read, it looks like above 127 (0x7F) we are in "Extended ASCII" territory, which is bad, because there appear to be different versions/interpretations of it.

For example, this encoding table doesn't match my code point, but the Cp1252 encoding does, so I'm inclined to think that this is what is actually being used in C.

In the table at the bottom of the link just mentioned, you can see that 150 (decimal, counting from the column with the given number on the right) does indeed correspond to 2013 (hex; mixing bases is a little incoherent, but anyway).
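
You can check that mapping directly on the JVM; here's a quick sketch ("windows-1252" is the canonical charset name the JVM uses for Cp1252):

import java.nio.charset.Charset

fun main() {
    val cp1252 = Charset.forName("windows-1252")
    // Encoding the en dash gives the single CP1252 byte 0x96 (150)...
    val bytes = "–".toByteArray(cp1252)
    println(bytes[0].toInt() and 0xFF)            // 150
    // ...and decoding that byte back yields U+2013 (8211) again.
    println(String(bytes, cp1252).codePointAt(0)) // 8211
}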

To solve this, I tried to convert my Strings in Kotlin to the same encoding (ignoring for the moment that this is of course platform-specific), so for every c: Char

"$c".toByteArray(Charset.forName("Cp1252"))[0].toUnsignedInt

This works, but it breaks rendering for foreign fonts, such as Chinese, Japanese, etc.
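
That failure is expected: CP1252 is a single-byte charset, so anything outside its 256 entries has no encoding at all. A minimal sketch of the failure mode, using the same JVM charset as above:

import java.nio.charset.Charset

fun main() {
    val cp1252 = Charset.forName("windows-1252")
    // CP1252 has no mapping for CJK characters, so the encoder
    // substitutes its replacement byte '?' (0x3F) and the code point is lost.
    println("中".toByteArray(cp1252)[0].toInt() and 0xFF) // 63 ('?'), not U+4E2D
}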

So, my question is: why the difference between U+2013 on the JVM and U+0096 in C?

Which is the right way to deal with this?

elect
  • Try `L"–"` in C, which creates UTF-16 strings on Windows. (You can also just do `L'–'` without having to create a string and then read a character from it.) – Rup Apr 03 '19 at 16:45
  • I can't see the point of that; anyway, with `char* a = L"–";` I get the following: `error C2440: 'initializing': cannot convert from 'const wchar_t [2]' to 'char *'` – elect Apr 03 '19 at 16:48
  • It should work with `ImWchar*`, which is what you want, isn't it? Maybe that would need a cast, but it would be the correct data. – Rup Apr 03 '19 at 16:49
  • It wanted the cast indeed. And it's actually `8211` in this case – elect Apr 03 '19 at 16:53
  • 8211 is 0x2013 in decimal. – Rup Apr 03 '19 at 16:57
  • yeah, but what is going on then without `L`? – elect Apr 03 '19 at 16:59
  • Like you said, it’s a CP1252 string. You can convert that to wide chars using [MultiByteToWideChar](https://learn.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-multibytetowidechar). – Rup Apr 03 '19 at 17:04
  • "Extended ASCII": there are only two interpretations of it. 1) One very specific character encoding (in which case there is no reason to call it "Extended ASCII"; just say which it is). 2) Adaptive to the user's system's current settings, followed up by optional locale changes within the program (which is rarely, but sometimes, exactly what you want). – Tom Blodget Apr 03 '19 at 17:10
  • In practice, I think ‘extended ASCII’ usually refers to Windows Latin-1, AKA CP1252. (That's mostly the same as ISO Latin-1, AKA ISO 8859-1, which matches the first 256 Unicode chars, except that CP1252 has things like curly quotes and long dashes in chars 128–159 where ISO Latin-1 has control characters; the sketch after these comments prints exactly those differences.) – gidds Apr 03 '19 at 17:14
  • **Which C++** and with what options? Your claimed 'C' is actually C++, although you can have the same issue in C. The Java standard defines it to use UTF-16, plus there is one dominant implementation (Sun-now-Oracle), but the C++ and C standards make (source and execution) character encodings implementation-dependent and there are many thousands of C and C++ implementations which make many different choices in this area; some compilers support multiple options. – dave_thompson_085 Apr 03 '19 at 18:03
  • @dave_thompson_085, sorry, but I have no idea which options exactly you're referring to (I just cloned it and opened it in VS). Anyway, I kind of found a solution for the moment – elect Apr 09 '19 at 08:00
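
To make gidds's point concrete, here is a small JVM sketch that prints exactly where CP1252 and ISO-8859-1 disagree (the 0x80–0x9F range); the differing entries are the same ones remapped in the answer below:

import java.nio.charset.Charset

fun main() {
    val cp1252 = Charset.forName("windows-1252")
    val latin1 = Charset.forName("ISO-8859-1")
    for (b in 0x80..0x9F) {
        val bytes = byteArrayOf(b.toByte())
        val cp = String(bytes, cp1252).codePointAt(0)
        val iso = String(bytes, latin1).codePointAt(0)
        // print only the bytes the two charsets decode differently
        if (cp != iso) println("0x%02X: CP1252=U+%04X, ISO-8859-1=U+%04X".format(b, cp, iso))
    }
}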

1 Answer


At the moment I solved it like this on Windows: I inserted this function before retrieving the char code point. It basically remaps all the chars that differ from ISO-8859-1. You can see them in this table: they are all the ones with the light grey border.

internal fun Char.remapCodepointIfProblematic(): Int {
    val i = toInt()
    return when (Platform.get()) {
        /*  https://en.wikipedia.org/wiki/Windows-1252#Character_set
         *  manually remap the difference from  ISO-8859-1 */
        Platform.WINDOWS -> when (i) {
            // 0x80 row of the CP1252 table (decimal 128..143)
            0x20AC -> 128 // €
            0x201A -> 130 // ‚
            0x0192 -> 131 // ƒ
            0x201E -> 132 // „
            0x2026 -> 133 // …
            0x2020 -> 134 // †
            0x2021 -> 135 // ‡
            0x02C6 -> 136 // ˆ
            0x2030 -> 137 // ‰
            0x0160 -> 138 // Š
            0x2039 -> 139 // ‹
            0x0152 -> 140 // Œ
            0x017D -> 142 // Ž
            // 0x90 row of the CP1252 table (decimal 144..159)
            0x2018 -> 145 // ‘
            0x2019 -> 146 // ’
            0x201C -> 147 // “
            0x201D -> 148 // ”
            0x2022 -> 149 // •
            0x2013 -> 150 // –
            0x2014 -> 151 // —
            0x02DC -> 152 // ˜
            0x2122 -> 153 // ™
            0x0161 -> 154 // š
            0x203A -> 155 // ›
            0x0153 -> 156 // œ
            0x017E -> 158 // ž
            0x0178 -> 159 // Ÿ
            else -> i
        }
        else -> i // TODO
    }
}
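
A quick usage sketch: on a Windows JVM the en dash comes back as the CP1252 byte value, elsewhere it stays the Unicode code point.

fun main() {
    println('–'.remapCodepointIfProblematic()) // 150 on Windows, 8211 (0x2013) otherwise
}
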
elect