I'm writing a program in C++ that has to handle Unicode characters. The main problem is that my algorithms need to process my strings/wstrings character by character:

std::wstring word = L"Héllo";
for (auto &e : word) {
    // do something with e
}

But if I run this:

std::wstring word = L"Héllo";
for (auto &e : word)
    std::wcout << e << std::endl;

I get this output:

H
?
l
l
o

Am I doing something wrong?

Do note that word prints properly when I use std::wcout << word;.

EDIT FOR @Ben Voigt

Here is the output with std::wcout << std::hex << std::setw(4) << (int)e << L" " << e << L'\n'; :

  48 H
  e9 �
  6c l
  6c l
  6f o
LightMan
  • 4
    Does your console support unicode? – NathanOliver Jun 05 '17 at 15:25
  • 1
    Observation: Not dealing with a surrogate pair, because there's only one row of output for the unknown character. – Ben Voigt Jun 05 '17 at 15:25
  • It does, because when I print the whole string with std::wcout I can see the 'é' – LightMan Jun 05 '17 at 15:26
  • Would you mind adding the hexadecimal printout of the element values, e.g. `std::wcout << hex << setw(4) << (int)e << L" " << e << L'\n'`; – Ben Voigt Jun 05 '17 at 15:27
  • `wchar_t`, `std::wstring`, and `std::wcout` are not unicode. AFAIK, they're just "wide characters" supporting 16-bit characters instead of 8-bit characters. Proper unicode would store everything as codepoints (which would have to be at least 32-bit) or use some kind of encoding scheme like UTF-8 (which is very common) or UTF-16/UCS-2 (which are commonly used in `std::wstring` but not mandated by the container) – Xirema Jun 05 '17 at 15:27
  • @Xirema: The output shows that there are only 5 elements, no variable-length encodings, at least in this particular case. – Ben Voigt Jun 05 '17 at 15:28
  • Okay, but shouldn't wide chars be able to store the 'é' character? – LightMan Jun 05 '17 at 15:29
  • @BenVoigt I'm more just contesting the decision to tag this question with the `unicode` tag. – Xirema Jun 05 '17 at 15:29
  • I'm not on Windows, I'm on Linux – LightMan Jun 05 '17 at 15:29
  • 1
    @Ðаn Not sure about those dupes. The OP says the full string prints properly, just not when it goes character by character. – NathanOliver Jun 05 '17 at 15:30
  • @Xirema: Furthermore on Linux, `wchar_t` probably is 32-bit, capable of holding any Unicode codepoint without variable-length encoding. – Ben Voigt Jun 05 '17 at 15:31
  • @BenVoigt I edited the question with your line and its output. – LightMan Jun 05 '17 at 15:34
  • 1
    There is the basic issue of putting quoted, non-ASCII string-literals in the source code itself. It isn't guaranteed that what you typed inside those quotes are going to be the actual characters displayed. – PaulMcKenzie Jun 05 '17 at 15:39
  • @PaulMcKenzie I already tried reading from standard input; the result is the same – LightMan Jun 05 '17 at 15:40
  • I deleted my answer, so I copy here: `0xE9` is indeed the representation of `é` in wide unicode, so the output (the binary value, not the shown symbol) appears correct. I don't know why string output would differ or why the correct value doesn't work in your console. – eerorika Jun 05 '17 at 15:56
  • 1
    The reason for the failure has to be that a different `operator<<` is selected when printing a single character, compared to an entire string. To find out what is going wrong, one would need to write a custom `streambuf` class that logs every call to `xsputn` and `overflow` and see how the call sequence varies. To work around it, it might be enough to call `write` or `put` directly on `wcout`, or cast before using `<<`. e.g. I expect that `wcout << wstring(1, e);` would get the "writing a whole string" behavior. – Ben Voigt Jun 05 '17 at 16:04
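
The diagnostic suggested in the last comment could be sketched roughly as follows. This is only an illustration, not code from the thread: `LoggingBuf`, `Trace`, and `trace` are hypothetical names, and the buffer only records what reaches it instead of forwarding to the real console. It shows which virtual function (`xsputn` or `overflow`) each way of writing ends up calling, which may differ between standard library implementations.

```cpp
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical helper: a wide stream buffer that records every call
// to xsputn and overflow, plus the characters it was handed.
class LoggingBuf : public std::wstreambuf {
public:
    std::vector<std::string> calls;  // e.g. "xsputn(5)" or "overflow"
    std::wstring data;               // every character received, in order

protected:
    std::streamsize xsputn(const wchar_t* s, std::streamsize n) override {
        calls.push_back("xsputn(" + std::to_string(n) + ")");
        data.append(s, static_cast<std::size_t>(n));
        return n;
    }

    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);  // nothing to write
        calls.push_back("overflow");
        data.push_back(traits_type::to_char_type(ch));
        return ch;
    }
};

// Hypothetical result type: the call log and the captured characters.
struct Trace {
    std::vector<std::string> calls;
    std::wstring data;
};

// Run `write` against a std::wostream backed by a fresh LoggingBuf.
template <typename Writer>
Trace trace(Writer write) {
    LoggingBuf buf;
    std::wostream os(&buf);
    write(os);
    return Trace{buf.calls, buf.data};
}
```

For example, comparing `trace([](std::wostream& os) { os << std::wstring(L"H\u00e9llo"); }).calls` with the log produced by `os << L'\u00e9'`, `os.put(L'\u00e9')`, or `os << std::wstring(1, L'\u00e9')` shows whether the single-character path really takes a different route. To investigate the actual console, the overrides would additionally have to forward to the original buffer of `std::wcout`.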

1 Answer

To print the characters as you intend, try the following:

#include <string>
#include <iostream>
#include <cstdio>   // stdout
#include <fcntl.h>  // _O_U16TEXT (Windows-only)
#include <io.h>     // _setmode (Windows-only)

int main()
{
    std::wstring word = L"Héllo";
    // Switch stdout to UTF-16 text mode before any wide output.
    _setmode(_fileno(stdout), _O_U16TEXT);
    for (auto &e : word)
        std::wcout << e << std::endl;
    return 0;
}

More detailed description in this post: Why are certain Unicode characters causing std::wcout to fail in a console app?
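
Since `_setmode` and `<io.h>` are Windows-specific and the asker is on Linux, a different technique may be needed there. One possible alternative, which is not part of this answer, is to skip `std::wcout` entirely and encode each element as UTF-8 by hand before writing it to the narrow `std::cout`. The sketch below assumes that every `wchar_t` holds one complete code point (plausible with a 32-bit `wchar_t`, as noted in the comments) and that the terminal expects UTF-8. `to_utf8` and `utf8_chars` are hypothetical helper names.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// No validation is done (surrogates and values above U+10FFFF
// are not rejected); this is only a sketch.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// One UTF-8 string per element of `word`.
// Assumption: each wchar_t is a whole code point (no surrogate pairs).
std::vector<std::string> utf8_chars(const std::wstring& word) {
    std::vector<std::string> out;
    for (wchar_t e : word)
        out.push_back(to_utf8(static_cast<char32_t>(e)));
    return out;
}
```

Each element can then be printed on its own line with `for (const auto& s : utf8_chars(word)) std::cout << s << '\n';`. Mixing narrow and wide output on the same stream is best avoided, so this approach would replace the `std::wcout` calls instead of complementing them.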

Ravi
  • 1
    These functions don't exist [on Linux](https://stackoverflow.com/questions/44372286/using-wstring-and-wcout-doesnt-get-the-expected-output#comment75745817_44372286) do they? – Ben Voigt Jun 05 '17 at 15:58