15

I am looking for a code snippet that reads a Unicode text with cin, concatenates another Unicode text to it, and then writes the result with cout.

P.S. This code will help me solve a bigger problem with Unicode, but first I need to accomplish what I am asking here.

ADDED: BTW, I can't type any Unicode symbol on the command line when I run the executable file. How should I do that?

Cœur
  • 37,241
  • 25
  • 195
  • 267
Narek
  • 38,779
  • 79
  • 233
  • 389
  • 3
    "Unicode" is not exact enough. Are you using UTF-[8/16/32]? Do you want to use the same representation internally and when it is serialized to a file? If you want to convert between representations, do you want to do it manually or via the locale, using a codecvt facet? – Martin York Jul 08 '10 at 21:19
  • As you wish! No files and nothing else: just cin and cout, that's all! – Narek Jul 09 '10 at 11:52
  • After having read various threads on this topic, my conclusion is that it is impossible to do in C++. Drop `cin`, `cout` and everything else from the C++ and C standards and use the plain Windows functions `ReadConsoleW` and `WriteConsoleW`. The C and C++ standards are just broken in this respect. – Philipp Jul 09 '10 at 20:51
  • 1
    @philip - The C++ standard simply doesn't address Unicode at all. Just like it doesn't address communicating with a network layer. C++0x does address Unicode in some way that I haven't familiarized myself with yet....at which point you'll have standard C++ functionality to do Unicode stuff. Though C++ doesn't know WTF a "console" is I'd bet that it will be taken care of. – Edward Strange Jul 10 '10 at 11:06

5 Answers

12

I had a similar problem in the past, in my case imbue and sync_with_stdio did the trick. Try this:

#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main() {
    ios_base::sync_with_stdio(false);
    wcin.imbue(locale("en_US.UTF-8"));
    wcout.imbue(locale("en_US.UTF-8"));

    wstring s;
    wstring t(L" la Polynésie française");

    wcin >> s;
    wcout << s << t << endl;
    return 0;
}
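Constructing `locale("en_US.UTF-8")` throws `std::runtime_error` when the system has no locale with that name, which is one plausible reason the `imbue` line fails on some machines. Below is a minimal sketch of a guard. The helper name `localeOrClassic` is hypothetical (it is not part of the code above), and falling back to the classic "C" locale is just one possible choice.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: return the named locale if the system provides it,
// otherwise fall back to the classic "C" locale instead of letting the
// std::runtime_error from the std::locale constructor escape.
std::locale localeOrClassic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

It could be used as `wcin.imbue(localeOrClassic("en_US.UTF-8"));`. With the fallback in effect, non-ASCII input may still fail to convert, but the program no longer terminates because of the missing locale.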
Post Self
  • 1,471
  • 2
  • 14
  • 34
Bolo
  • 11,542
  • 7
  • 41
  • 60
  • 2
    I have debugged it; it seems this line is the problem: wcin.imbue(locale("en_US.UTF-8")); – Narek Jul 09 '10 at 11:50
  • 1
    @Narek Yes, I did test the code. It runs without problems on my Ubuntu. What system do you have? – Bolo Jul 09 '10 at 17:54
  • 5
    `wcin` and `wcout` don't work on Windows, just like the equivalent C functions. Only the native API works. – Philipp Jul 10 '10 at 07:40
  • Thank you. Your trick fixed my problem (cin being skipped if the input contains an accented letter). – Aminos Dec 26 '16 at 21:02
10

It depends on what kind of Unicode you mean. I assume you are just working with std::wstring, though. In that case, use std::wcin and std::wcout.

For conversion between encodings you can use the functions your OS provides (on Win32: WideCharToMultiByte and MultiByteToWideChar), or you can use a library such as libiconv.
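To illustrate what such a conversion does, here is a hand-written sketch of one direction, UTF-32 code points to UTF-8 bytes. It is not `WideCharToMultiByte` or libiconv, and the names `appendUtf8` and `toUtf8` are hypothetical; in real code the OS functions or a library are the better choice.

```cpp
#include <string>

// Hypothetical helper (not a library function): append the UTF-8 encoding
// of one Unicode code point to `out`.
void appendUtf8(std::string& out, char32_t cp) {
    // Surrogates and values above U+10FFFF are not valid code points;
    // substitute U+FFFD (the replacement character).
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx followed by 3 x 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole UTF-32 string as UTF-8 bytes.
std::string toUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) appendUtf8(out, cp);
    return out;
}
```

For example, `toUtf8(U"\u00E9")` yields the two bytes `C3 A9`. The resulting `std::string` holds raw bytes, so it can be written with `cout` only if the terminal expects UTF-8.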

Brian R. Bondy
  • 339,232
  • 124
  • 596
  • 636
  • 1
    At which point you can use UTF-16 instead of UTF-8 iff your OS understands it. – Edward Strange Jul 08 '10 at 20:31
  • +1: wcout for wstring for wchar_t (primarily Windows' UTF-16), cout for string for char (Linux, UTF-8 by default) – rubenvb Jul 08 '10 at 20:49
  • @Philipp: In what way do `wcin` and `wcout` not work for you? They won't display Unicode characters not supported by your console font, but that's a fault of the console and not iostreams. – Ben Voigt Jul 10 '10 at 07:48
  • 1
    @Ben Voigt: They don't display Unicode characters at all, even if the font supports them. See my answer for an example. The reason is that they don't wrap `ReadConsoleW`/`WriteConsoleW`. – Philipp Jul 10 '10 at 08:00
8

Here is an example that shows four different methods, of which only the third (C conio) and the fourth (native Windows API) work, and only if stdin/stdout aren't redirected. You still need a font that contains the characters you want to show (Lucida Console supports at least Greek and Cyrillic). Everything here is completely non-portable; there is simply no portable way to input or output Unicode strings on the terminal.

#ifndef UNICODE
#define UNICODE
#endif

#ifndef _UNICODE
#define _UNICODE
#endif

#define STRICT
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>

#include <conio.h>
#include <windows.h>

void testIostream();
void testStdio();
void testConio();
void testWindows();

int wmain() {
    testIostream();
    testStdio();
    testConio();
    testWindows();
    std::system("pause");
}

void testIostream() {
    std::wstring first, second;
    std::getline(std::wcin, first);
    if (!std::wcin.good()) return;
    std::getline(std::wcin, second);
    if (!std::wcin.good()) return;
    std::wcout << first << second << std::endl;
}

void testStdio() {
    wchar_t buffer[0x1000];
    if (!_getws_s(buffer)) return;
    const std::wstring first = buffer;
    if (!_getws_s(buffer)) return;
    const std::wstring second = buffer;
    const std::wstring result = first + second;
    _putws(result.c_str());
}

void testConio() {
    wchar_t buffer[0x1000];
    std::size_t numRead = 0;
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring first(buffer, numRead);
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second + L'\n';
    _cputws(result.c_str());
}

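void testWindows() {
    const HANDLE stdIn = GetStdHandle(STD_INPUT_HANDLE);
    WCHAR buffer[0x1000];
    // ReadConsoleW expects the buffer capacity in characters, not in bytes.
    const DWORD capacity = sizeof buffer / sizeof buffer[0];
    DWORD numRead = 0;
    if (!ReadConsoleW(stdIn, buffer, capacity, &numRead, NULL)) return;
    if (numRead < 2) return;
    // Drop the trailing "\r\n" of the first line (assumes line input mode).
    const std::wstring first(buffer, numRead - 2);
    if (!ReadConsoleW(stdIn, buffer, capacity, &numRead, NULL)) return;
    // Keep the trailing "\r\n" of the second line so the output ends with a line break.
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second;
    const HANDLE stdOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD numWritten = 0;
    WriteConsoleW(stdOut, result.c_str(), static_cast<DWORD>(result.size()), &numWritten, NULL);
}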
  • Edit 1: I've added a method based on conio.
  • Edit 2: I've messed around with _O_U16TEXT a bit as described in Michael Kaplan's blog, but that seemingly only had wgets interpret the (8-bit) data from ReadFile as UTF-16. I'll investigate this a bit further during the weekend.
Philipp
  • 48,066
  • 12
  • 84
  • 109
  • Thanks. Please also tell me how to write in command line in unicode? I can't! It ignores and writes in latin. – Narek Jul 12 '10 at 21:03
  • Also you might want to write "main" instead of "wmain", no? – Narek Jul 12 '10 at 21:11
  • If you want to read command line arguments, declare `wmain` as `int wmain(int argc, wchar_t** argv)` (the `w` is not a typo!). – Philipp Jul 13 '10 at 06:09
  • 1
    No, anyway, I can't write any damn letter from the Armenian or Russian alphabet on the command line! – Narek Jul 13 '10 at 07:29
  • What did you try? BTW, I think you had better ask a new question; the comments aren't a good substitute for a discussion forum. – Philipp Jul 13 '10 at 08:09
0

If you have actual text (i.e., a string of logical characters), then insert it into the wide streams instead. The wide streams will automatically encode your characters to match the bits expected by the locale encoding. (And if you have encoded bits instead, the streams will decode the bits, then re-encode them to match the locale.)

There is a lesser solution if you KNOW you have UTF-encoded bits (i.e., an array of bits intended to be decoded into a string of logical characters) AND you KNOW the target of the output stream is expecting that very same bit-format, then you can skip the decoding and re-encoding steps and write() the bits as-is. This only works when you know both sides use the same encoding format, which may be the case for small utilities not intended to communicate with processes in other locales.
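The pass-through approach can be sketched as follows, assuming both the input and the output side use the same byte encoding (for instance UTF-8). The names `appendAndWrite` and `concatDemo` are hypothetical. The bytes are never decoded: they are read, concatenated, and handed to `write()` unchanged.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: read one line of encoded bytes, append `suffix`
// (assumed to be in the same encoding) and write the bytes out as-is.
// Returns false if no line could be read.
bool appendAndWrite(std::istream& in, std::ostream& out, const std::string& suffix) {
    std::string line;
    if (!std::getline(in, line)) return false;
    line += suffix;
    out.write(line.data(), static_cast<std::streamsize>(line.size()));
    out.put('\n');
    return true;
}

// Demonstration on in-memory streams, with the UTF-8 bytes spelled out as
// hex escapes so the result does not depend on the source file encoding.
std::string concatDemo() {
    std::istringstream in("Polyn\xC3\xA9sie\n");
    std::ostringstream out;
    appendAndWrite(in, out, " fran\xC3\xA7" "aise");
    return out.str();
}
```

In a real program the call would be `appendAndWrite(std::cin, std::cout, suffix)`. Whether the result displays correctly still depends on the terminal expecting that same encoding.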

John
  • 289
  • 1
  • 2
-1

It depends on the OS. If your OS understands UTF-8, you can simply send it UTF-8 sequences.

Edward Strange
  • 40,307
  • 7
  • 73
  • 125
  • He's on Windows, which uses UTF-16, but requires special API functions (`ReadConsole`/`WriteConsole`) to work with Unicode text. – Philipp Jul 10 '10 at 07:41