setw() imbues wrong output on strings containing UTF-8 multi-byte characters/code points

Question

I need to output some data that may be UTF-8 multi-byte and I need to keep them formatted using setw().

When the characters are multi-byte sequences, alignment is lost and setw() no longer pads correctly.

//#include <stdio.h>
#include <locale>
#include <iostream>
//#include <fstream>
#include <iomanip>
//#include <sstream>

int main(int argc, char **argv)
{ 
    std::locale l=std::locale("en_US.utf8");
    std::locale::global(l); 
    std::cout.imbue(l);
    std::cout<<std::endl;
    std::cout<<std::setw(40)<<std::right<<"hi “my” friend"<<std::endl;
    std::cout<<std::setw(40)<<std::right<<"hi -my- friend"<<std::endl;
    return 0;
}

The output is:

                  hi “my” friend
                      hi -my- friend

What am I missing?

Note that the characters “ and ” are not the ordinary ASCII " but typographic quotation marks, each of which takes three bytes in UTF-8.

Sadly, imbuing a UTF-8 locale won't make formatting functions UTF-8 aware. The easiest way to accomplish your task is to convert everything to wchar_t and use wide character streams. — n. m. could be an AI, Mar 06 '16 at 21:29
[Wide characters work](https://godbolt.org/z/M7KvsYKP7) on POSIX, but fail on Windows :/ — Marek R, Sep 13 '22 at 10:11

score 3 · Answer 1 · answered Mar 06 '16 at 20:26

The string literal "hi -my- friend" contains 14 characters, each one byte long. The string literal "hi “my” friend" contains 18 bytes: the symbols “ and ” are each encoded as 3 bytes. cout writes those bytes as-is; it is the target terminal that renders each 3-byte sequence as a single symbol.

So, from the stream's point of view everything is fine: it outputs (width - strlen(literal)) fill characters, then strlen(literal) characters, for width in total. It does not handle multibyte sequences and does not know that the target terminal turns several bytes into one symbol.
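The arithmetic described above can be checked by formatting into a string instead of the terminal. This is only a sketch: `format_right` and `leading_spaces` are hypothetical helper names, and the curly quotes are written as explicit byte escapes so the example does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats `s` right-aligned in a field of `width`,
// the same way the question does, but into a string that can be inspected.
std::string format_right(const std::string& s, int width) {
    std::ostringstream out;
    out << std::setw(width) << std::right << s;
    return out.str();
}

// Hypothetical helper: number of leading fill characters (spaces).
std::size_t leading_spaces(const std::string& formatted) {
    const std::size_t n = formatted.find_first_not_of(' ');
    return n == std::string::npos ? formatted.size() : n;
}

// The two literals from the question; U+201C and U+201D are spelled out
// as their UTF-8 bytes (E2 80 9C and E2 80 9D).
const std::string curly = "hi \xE2\x80\x9Cmy\xE2\x80\x9D friend";  // 18 bytes
const std::string plain = "hi -my- friend";                        // 14 bytes
```

With a width of 40, `leading_spaces(format_right(curly, 40))` is 22 and `leading_spaces(format_right(plain, 40))` is 26. Both results are exactly 40 bytes long, but the first one occupies only 36 columns once the terminal has combined the two 3-byte sequences.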

Revolver_Ocelot

One would expect a stream that has the locale knowledge to handle the necessary conversions. If it does not, "setw" is of no use, since it doesn't do what the user expects. What, then, is the meaning of "imbue"? Obviously the need isn't limited to the terminal; it applies to files as well, since they may contain UTF-8 text (or whatever encoding is selected). – George Kourtis Mar 06 '16 at 20:33
@GeorgeKourtis If you look through the `locale` class, you will find that it has literally nothing for dealing with multibyte encodings. The whole localisation library and all standard streams expect fixed-width encodings. The only things it does provide are `wstring_convert` and the `codecvt_*` classes, which convert between encodings. You are expected to convert your data into a fixed-width encoding before passing it to standard library facilities. In short: you are feeding it data it cannot handle. Either convert the data to a fixed-width encoding or do not rely on anything beyond raw character output. – Revolver_Ocelot Mar 06 '16 at 20:40
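
The "convert to a fixed-width encoding first" advice can be sketched without any locale machinery by decoding UTF-8 by hand into a `std::u32string`. The function name `utf8_to_utf32` is hypothetical, and the decoder is deliberately minimal: it rejects malformed lead bytes, bad continuation bytes and truncated input, but it does not check for overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into one char32_t per code point.
// Minimal sketch; overlong forms and surrogates are not rejected.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (s.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

In the decoded string every element is one code point, so `size()` of the decoded curly-quote literal is 14 rather than 18. Standard streams cannot print a `std::u32string` directly, so this is useful for measuring and slicing rather than for output.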

Eric Sokolowsky · Answer 2 · 2020-03-25T03:30:35.560

You can accomplish this formatting by counting how many characters your string would have in a wide representation, taking the difference between the byte length of your string and that count, and adding the difference to the width you pass to setw. For example:

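// Needs <cwchar>, <iomanip>, <iostream> and <string>. The C locale must already
// be a UTF-8 one, e.g. set through std::locale::global as in the question.
std::mbstate_t state = std::mbstate_t();
std::string s = "hi “my” friend";
const char *cp = s.c_str();
// With a null destination, mbsrtowcs only counts the wide characters needed.
std::size_t len = std::mbsrtowcs(nullptr, &cp, s.size(), &state);
std::cout << std::setw(static_cast<int>(40 + (s.size() - len))) << std::right << s << std::endl;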

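You could wrap this in a function that takes the string as a parameter and returns the difference to add to the setw argument:

size_t f(const std::string &s)
{
  std::mbstate_t state = std::mbstate_t();
  const char *cp = s.c_str();
  size_t len = std::mbsrtowcs(nullptr, &cp, s.size(), &state);
  // mbsrtowcs returns (size_t)-1 on an invalid sequence, for example when the
  // locale is not UTF-8; add no padding then instead of letting the
  // subtraction below wrap around.
  if (len == static_cast<size_t>(-1)) return 0;
  return s.size() - len;
}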
...

std::string s = "hi “my” friend";
std::cout << std::setw(40 + f(s)) << std::right << s << std::endl;

"Wide" characters are not required to be UTF-16, UTF-32, or any Unicode encoding. So there is no guarantee that this code will produce the intended result. Even if it did, if "wide" characters were UTF-16, it would only produce useful results for codepoints that fit into a single UTF-16 code unit. — Nicol Bolas, Mar 25 '20 at 03:31
This produces 19, where only 4 is needed to correct the alignment. That seems to be caused by `len` being the max ulong64_t value, which triggers under- or overflow and all that fun stuff. It looks like a locale has to be set explicitly for this to work — Zoe, Jul 28 '20 at 15:25
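
Both comments point at the locale and wchar_t dependence of `mbsrtowcs`. If the input is known to be UTF-8, the same correction can be computed without any locale by counting continuation bytes. This is a sketch under that assumption; `utf8_extra_bytes` and `right_align_utf8` are hypothetical names, and the input is not validated.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Counts UTF-8 continuation bytes (bit pattern 10xxxxxx). For valid UTF-8
// this equals the byte length minus the number of code points.
std::size_t utf8_extra_bytes(const std::string& s) {
    std::size_t extra = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) == 0x80) ++extra;
    }
    return extra;
}

// Hypothetical helper: right-aligns `s` in `width` code points by widening
// the field by the number of bytes the stream would otherwise miscount.
std::string right_align_utf8(const std::string& s, int width) {
    std::ostringstream out;
    out << std::setw(width + static_cast<int>(utf8_extra_bytes(s)))
        << std::right << s;
    return out.str();
}
```

Usage would look like `std::cout << right_align_utf8(s, 40) << std::endl;`. For the curly-quote string from the question the correction is 4, which matches the value Zoe's comment expects. This counts code points, not display columns, so combining marks and double-width characters can still misalign.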
