Conversion from UTF-8 to ANSI: wcstombs fails at one special character

Question

I want to change a wchar_t* like it is displayed to a char*. No conversions like in WideCharToMultiByte should be done.

I found the wcstombs function and it looked like it works perfectly, but there is one char which does not get changed correctly.

It is the 'œ', it has the ANSI number 156, but in UTF-8 it is the number 339. Of course ASCII does not have so many numbers, but why does it get the wrong one?

Here is a part of my source code; I added a loop and an if so that it works:

    wchar_t *wc;    // source string
    char *cc;       // destination string
    int len = 0;    // length of the strings

    ...

    for(int i = 0; i < len; i++) {
            if(wc[i] != 339) {
                    cc[i] = wc[i];
            }else{
                    cc[i] = 156;
            }
    }

This code is working, but seriously, is this the best way to solve that problem?

Many thanks in advance!

Are you sure you should be using a `char` and not `unsigned char`? What are you expecting when you assign a value to `char` which is higher than its [upper limit](http://msdn.microsoft.com/en-us/library/296az74e.aspx)? Or are you questioning why `wcstombs` can generate a non-valid ASCII value? — EdChum, Nov 30 '12 at 15:13
"but in UTF-8 it is the number 339." That is just wrong. You are using the term UTF-8 to mean something else (UTF-16? Unicode? UTF-32?). — R. Martinho Fernandes, Nov 30 '12 at 15:14
339 is the Unicode code point (in decimal, [the formal format is `U+0153`](http://www.fileformat.info/info/unicode/char/153/index.htm)) for `œ`. UTF-8 is an encoding, where that character will be encoded as `0xC5 0x93`. `ANSI` in this context most likely means code page 1252, also called Windows-1252. — Esailija, Nov 30 '12 at 15:22
You demand not using WideCharToMultiByte() but then use an alternative that's completely broken. You've only found one bad conversion, there are many more. No, it's not the best way. — Hans Passant, Nov 30 '12 at 15:34
"ANSI" is actually an incorrect name for several [Windows code pages](http://en.wikipedia.org/wiki/Windows_code_pages). — Keith Thompson, Nov 30 '12 at 21:15
If you're converting from `wchar_t*`, you're almost certainly *not* converting from UTF-8, which is a representation of Unicode using 8-bit characters. `wchar_t` is typically 16 or 32 bits; if it's 16 bits, it's likely to be UTF-16, or perhaps UCS-2. — Keith Thompson, Nov 30 '12 at 21:16
There are libraries for converting between different character representations. See if you can find [iconv](http://en.wikipedia.org/wiki/Iconv) for your system. — Keith Thompson, Nov 30 '12 at 21:17

score 0 · Answer 1 · answered Nov 30 '12 at 19:16

> I want to change a wchar_t* like it is displayed to a char*.

Okay, you want to convert from wchar_t strings to char strings.

> No conversions like in WideCharToMultiByte should be done.

What? I presume you don't mean 'no conversion should be done,' but with only one example I can't tell what you want to avoid. Just WideCharToMultiByte, or are there other functions?

> I found the wcstombs function and it looked like it works perfectly,

wcstombs seems like WideCharToMultiByte to me, but I guess it's different in some way that's important to you? It'd be good if you could describe what exactly makes wcstombs acceptable and WideCharToMultiByte unacceptable.

> but there is one char which does not get changed correctly.

Sounds like it's not working perfectly...

> It is the 'œ', it has the ANSI number 156, but in UTF-8 it is the number 339. Of course ASCII does not have so many numbers, but why does it get the wrong one?

You probably mean that in CP1252 'œ' is encoded as 156 in decimal or 0x9C in hex, and that this character has the Unicode codepoint 339 in decimal, or more conventionally U+0153. I don't see where UTF-8 comes into this at all.
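To make the distinction concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name `utf8_encode` is made up for this illustration and is not part of any library. It shows that U+0153 becomes the two bytes 0xC5 0x93 in UTF-8, which is neither 339 nor the single CP1252 byte 0x9C.

```c
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-8.
   out must have room for 4 bytes. Returns the number of bytes written,
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Calling `utf8_encode(0x0153, buf)` returns 2 and fills `buf` with 0xC5 and 0x93, so a `wchar_t` holding 339 is carrying a code point (UTF-16 or UTF-32, depending on the platform), not UTF-8.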

> Here is a part of my source code; I added a loop and an if so that it works:

As for why you're not getting the results you expect, it's probably because you're not using wcstombs() correctly. It's hard to tell because you're not showing how you're doing the conversion with wcstombs().

wcstombs() converts between wchar_t and char using the encodings specified by the program's current C locale. If you've set the locale to one that uses a Unicode encoding for wchar_t and uses CP1252 for char then it should do what you expect.
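As a rough sketch of what that looks like, the helper below switches `LC_CTYPE` to a named locale, calls wcstombs(), and restores the previous locale. The name `convert_in_locale` is invented for this example. The locale string you pass is an assumption about your platform: the Microsoft C runtime accepts code-page names such as ".1252", while other systems use different names and may not ship a CP1252 locale at all.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts src into dst using the encoding of the
   named C locale, then restores the previous LC_CTYPE locale.
   Returns the number of bytes written (excluding the terminator), or
   (size_t)-1 if the locale is unavailable or a character has no
   representation in it. */
size_t convert_in_locale(const char *locale_name,
                         const wchar_t *src, char *dst, size_t dst_size)
{
    char saved[128] = "C";
    const char *current;
    size_t n;

    if (dst_size == 0)
        return (size_t)-1;

    /* Copy the current locale name; later setlocale calls may overwrite it. */
    current = setlocale(LC_CTYPE, NULL);
    if (current != NULL && strlen(current) < sizeof saved)
        strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;               /* locale not known on this system */

    /* Leave room for the terminator: wcstombs does not write one
       when the output buffer fills up. */
    n = wcstombs(dst, src, dst_size - 1);
    if (n != (size_t)-1)
        dst[n] = '\0';

    setlocale(LC_CTYPE, saved);          /* restore the caller's locale */
    return n;
}
```

With the Microsoft runtime, something like `convert_in_locale(".1252", wc, cc, size)` should turn U+0153 into the byte 0x9C. Note that setlocale() changes process-wide state, so this sketch is not safe to call while other threads depend on the locale.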

> This code is working, but seriously, is this the best way to solve that problem?

No.

Esailija · Answer 2 · 2012-11-30T15:33:24.493

Please bear with my complete ignorance of C/C++, but you can either use a custom lookup table or some premade function.

Here is an array of 256 integers, where index i contains the Unicode code point for the Windows-1252 byte value i. Entries of 0xFFFD mark byte values that Windows-1252 leaves undefined.

So for instance, index 156 contains 0x0153, which is 339 in decimal.

    const int windows1252ToUnicodeCodePoints[256] = {
         0x0000,0x0001,0x0002,0x0003,0x0004,0x0005,0x0006,0x0007,0x0008,0x0009,0x000A,0x000B,0x000C,0x000D,0x000E,0x000F
        ,0x0010,0x0011,0x0012,0x0013,0x0014,0x0015,0x0016,0x0017,0x0018,0x0019,0x001A,0x001B,0x001C,0x001D,0x001E,0x001F
        ,0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0027,0x0028,0x0029,0x002A,0x002B,0x002C,0x002D,0x002E,0x002F
        ,0x0030,0x0031,0x0032,0x0033,0x0034,0x0035,0x0036,0x0037,0x0038,0x0039,0x003A,0x003B,0x003C,0x003D,0x003E,0x003F
        ,0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,0x0047,0x0048,0x0049,0x004A,0x004B,0x004C,0x004D,0x004E,0x004F
        ,0x0050,0x0051,0x0052,0x0053,0x0054,0x0055,0x0056,0x0057,0x0058,0x0059,0x005A,0x005B,0x005C,0x005D,0x005E,0x005F
        ,0x0060,0x0061,0x0062,0x0063,0x0064,0x0065,0x0066,0x0067,0x0068,0x0069,0x006A,0x006B,0x006C,0x006D,0x006E,0x006F
        ,0x0070,0x0071,0x0072,0x0073,0x0074,0x0075,0x0076,0x0077,0x0078,0x0079,0x007A,0x007B,0x007C,0x007D,0x007E,0x007F
        ,0x20AC,0xFFFD,0x201A,0x0192,0x201E,0x2026,0x2020,0x2021,0x02C6,0x2030,0x0160,0x2039,0x0152,0xFFFD,0x017D,0xFFFD
        ,0xFFFD,0x2018,0x2019,0x201C,0x201D,0x2022,0x2013,0x2014,0x02DC,0x2122,0x0161,0x203A,0x0153,0xFFFD,0x017E,0x0178
        ,0x00A0,0x00A1,0x00A2,0x00A3,0x00A4,0x00A5,0x00A6,0x00A7,0x00A8,0x00A9,0x00AA,0x00AB,0x00AC,0x00AD,0x00AE,0x00AF
        ,0x00B0,0x00B1,0x00B2,0x00B3,0x00B4,0x00B5,0x00B6,0x00B7,0x00B8,0x00B9,0x00BA,0x00BB,0x00BC,0x00BD,0x00BE,0x00BF
        ,0x00C0,0x00C1,0x00C2,0x00C3,0x00C4,0x00C5,0x00C6,0x00C7,0x00C8,0x00C9,0x00CA,0x00CB,0x00CC,0x00CD,0x00CE,0x00CF
        ,0x00D0,0x00D1,0x00D2,0x00D3,0x00D4,0x00D5,0x00D6,0x00D7,0x00D8,0x00D9,0x00DA,0x00DB,0x00DC,0x00DD,0x00DE,0x00DF
        ,0x00E0,0x00E1,0x00E2,0x00E3,0x00E4,0x00E5,0x00E6,0x00E7,0x00E8,0x00E9,0x00EA,0x00EB,0x00EC,0x00ED,0x00EE,0x00EF
        ,0x00F0,0x00F1,0x00F2,0x00F3,0x00F4,0x00F5,0x00F6,0x00F7,0x00F8,0x00F9,0x00FA,0x00FB,0x00FC,0x00FD,0x00FE,0x00FF
    };

What you need is this table inverted (or a linear scan every time); in any other language I would use a construct like Map<int, int>.
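A minimal sketch of the linear-scan variant in C follows. It relies on the fact that Windows-1252 only differs from the code point value in the range 0x80..0x9F, so only those 32 entries of the table need to be searched. The function names and the choice of '?' as a replacement character are assumptions made for this example, not an established API.

```c
#include <stddef.h>
#include <wchar.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F, copied from the
   table above. 0xFFFD marks bytes that have no assigned character. */
static const unsigned short cp1252_high[32] = {
    0x20AC,0xFFFD,0x201A,0x0192,0x201E,0x2026,0x2020,0x2021,
    0x02C6,0x2030,0x0160,0x2039,0x0152,0xFFFD,0x017D,0xFFFD,
    0xFFFD,0x2018,0x2019,0x201C,0x201D,0x2022,0x2013,0x2014,
    0x02DC,0x2122,0x0161,0x203A,0x0153,0xFFFD,0x017E,0x0178
};

/* Returns the Windows-1252 byte for a code point, or -1 if there is none. */
int unicode_to_cp1252(unsigned long cp)
{
    int i;
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
        return (int)cp;                  /* same value in both encodings */
    if (cp == 0xFFFD)
        return -1;                       /* do not match the "undefined" slots */
    for (i = 0; i < 32; i++)             /* linear scan of the special range */
        if (cp1252_high[i] == cp)
            return 0x80 + i;
    return -1;
}

/* Converts a NUL-terminated wide string, writing '?' for anything that
   Windows-1252 cannot represent. Returns the number of bytes written,
   excluding the terminator. */
size_t wide_to_cp1252(const wchar_t *src, char *dst, size_t dst_size)
{
    size_t n = 0;
    if (dst_size == 0)
        return 0;
    while (src[n] != L'\0' && n + 1 < dst_size) {
        int b = unicode_to_cp1252((unsigned long)src[n]);
        dst[n] = (char)(b >= 0 ? b : '?');
        n++;
    }
    dst[n] = '\0';
    return n;
}
```

Here `unicode_to_cp1252(339)` gives 156, matching the special case in the question, and the same lookup also covers the other characters in that range, such as '€' (U+20AC, byte 0x80). Where `wchar_t` is 16 bits, characters outside the Basic Multilingual Plane arrive as two surrogates and each one becomes '?'.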
