
For this code:

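#include <codecvt>
#include <iomanip>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

int main()
{
    std::wstring wstr = L"é";
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;

    std::stringstream ss;
    ss << std::hex << std::setfill('0');

    // Emit each UTF-8 byte as two hex digits
    for (auto c : myconv.to_bytes(wstr))
    {
        ss << std::setw(2) << static_cast<unsigned>(c);
    }
    std::string ssss = ss.str();
    std::cout << "ssss = " << ssss << std::endl;
}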



Why does this print ffffffc3ffffffa9 instead of c3a9?

Why is ffffff prepended to each byte? If you want to run it, it is on ideone: https://ideone.com/qZtGom

user123456
  • Because `sizeof(wchar_t) != sizeof(unsigned)` and `wchar_t` is probably signed, there is a conversion that preserves the value of the last bit. – Martin York Oct 25 '22 at 22:32
  • @MartinYork even if I cast to int, it still has ffffff prepended – user123456 Oct 25 '22 at 22:41
  • The problem is the sign extension in `static_cast<unsigned>(c)`. Make sure the object you are initially working with is unsigned. Then you can extend it if you require. – Martin York Oct 25 '22 at 22:49
  • Try this: `static_cast<unsigned>(c & 0xFF)` – Eljay Oct 26 '22 at 01:00
  • @MartinYork why does sign extension occur here? – user123456 Oct 26 '22 at 22:04
  • It seems like `std::wstring_convert::to_bytes()` is returning a [byte string](https://en.cppreference.com/w/cpp/locale/wstring_convert/to_bytes). Each member of that byte string is a signed type (in your implementation). Note: `char` is either `signed` or `unsigned` depending on the implementation (you have to check manually or read the docs). So here I am just making sure it is an `unsigned char` before allowing the object to be put in a larger type (thus avoiding sign extension). – Martin York Oct 26 '22 at 22:29
  • @MartinYork Also, it promotes the char to an int, right? Because char is only 8 bits, it can only store the c3, so the other ffffff is stored in the other 3 bytes of the int, right? – user123456 Oct 27 '22 at 15:19

1 Answer


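c is of type char, which is a signed type on many platforms (its signedness is implementation-defined). Converting a negative char to unsigned sign-extends the value, so the upper bytes of the result are filled with ff.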

Examples:

  • char(0x23) aka 35 --> unsigned(0x00000023)
  • char(0x80) aka -128 --> unsigned(0xFFFFFF80)
  • char(0xC3) aka -61 --> unsigned(0xFFFFFFC3)

[edit: My first suggestion didn't work; removed]

You can cast it twice: `ss << std::setw(2) << static_cast<int>(static_cast<unsigned char>(c));`

The first (inner) cast gives you an unsigned type with the same bit pattern, and since unsigned char is the same size as char, there is no sign extension.

But if you just output static_cast<unsigned char>(c), the stream will treat it as a character and write the raw byte rather than its hex value; what you see then depends on your terminal and locale.

The second (outer) cast gives you an int, which the stream will output as a number.

Marshall Clow
  • It still returns the same results if I cast to an int, for this code: https://ideone.com/qZtGom – user123456 Oct 25 '22 at 22:46
  • This will do what you want, but it's ugly: `ss << std::setw(2) << static_cast<int>(static_cast<unsigned char>(c));` Cast the value to an unsigned type which is the same size as `char`, and then to an `int`, and output it as hex two characters wide. – Marshall Clow Oct 25 '22 at 22:48
  • why does sign extension occur here? – user123456 Oct 26 '22 at 22:03