Using boost::format %s specifier with UTF-8 strings

Question

We are adding support for UTF-8 to an existing application with a large code base. This application uses boost::format(), and output containing non-ASCII characters does not align properly. Specifically, when using the %{width}.{length}s specifier, boost::format() counts chars, which does not "do the right thing" with UTF-8 strings. I think it should be possible to change the string length code (which is probably string::size()) to use utf8len() or something analogous, based on ... something?

In this case, it is not practical to change the existing code base to use UCS2 (or UCS4, or UTF-16, etc), but it is possible to modify boost::format() if necessary. I was hoping someone else had run across this need, and can point me to a possible solution.

Note: I found some web pages on using locales with utf8, but most of that seemed more applicable to converting to/from utf8 and UCS4 in streams.
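
The utf8len() idea from the question can be sketched without touching boost::format() at all, by truncating and padding the string before it is handed to the formatter. This is only a minimal sketch: utf8_length, utf8_truncate and utf8_pad are hypothetical helper names (not part of Boost), the input is assumed to be valid UTF-8, and every code point is assumed to occupy exactly one column.

```cpp
#include <cstddef>
#include <string>

// Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumption: `s` is valid UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Keep at most `max_cp` code points without splitting a multi-byte sequence.
std::string utf8_truncate(const std::string& s, std::size_t max_cp) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {            // start of a new code point
            if (seen == max_cp) return s.substr(0, i);
            ++seen;
        }
    }
    return s;
}

// Hypothetical stand-in for "%{width}.{length}s": truncate to `precision`
// code points, then right-align in `width` columns using spaces.
std::string utf8_pad(const std::string& s, std::size_t width, std::size_t precision) {
    std::string t = utf8_truncate(s, precision);
    std::size_t len = utf8_length(t);
    return len >= width ? t : std::string(width - len, ' ') + t;
}
```

The padded result could then be passed to a plain %s, so the formatter's byte-based width logic never gets involved. Characters that are not one column wide (zero-width or fullwidth ones) would still misalign under this scheme.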

You mean that `boost::format()` counts _bytes_ when it should count chars, don't you? — cjm, May 09 '11 at 21:51
What does "aligning" even mean in Unicode? It's fundamentally an ASCII thing, not unicode. See e.g. http://mothy.org/hacks/unicodewidth/UnicodeWidth.html. Take a very simple example: "Ａ". That should probably be 2 characters wide. Mind you, that's no "A". It's U+FF21, Fullwidth Latin Capital A. — MSalters, May 10 '11 at 09:07
@cjm - Yes, bytes is correct, although I was referring to "chars" as in the plural of "char", and not characters as in the logical printed unit. — Grognard61, May 10 '11 at 13:15
@MSalters - We use a monospace font for rendering, so we cheat: all characters are one "printed unit" wide. It would be better if we used a variable pitch font and did alignment calculations in pels, but we don't. "Old code base" in an embedded system precludes making these changes. — Grognard61, May 10 '11 at 13:17
@Grognard61: Even a zero-width space?! Unicode has really funny characters. Oh, and the reason for full-width A is so it can align with Kanji characters, which take twice the width of ASCII characters. — MSalters, May 10 '11 at 14:49
@all - Thank you for all your comments. I understand that there are many different issues with the rich set of characters in Unicode. My interest is in modifying boost::format() so that it deals with at least *some* of the other characters, even if not all. In the real world of adding multiple language support to a product, we have control over the translation and we don't want to re-write all the output code. An 80% general solution that covers 100% of our cases is a good solution. (continued) — Grognard61, May 12 '11 at 15:24
I have modified boost::format() to handle mono-space-only cases. I am looking into using the code from mothy.org in boost::format() to handle wide & narrow chars. My questions are "Has anyone _else_ tried to modify boost::format() to handle utf-8? How did you do it, and what issues did you encounter?" When I am done, I will post the results (such as they are) for others. — Grognard61, May 12 '11 at 15:26
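
The wide, narrow and zero-width cases raised in the comments above can be illustrated with a small column-width function. This is a rough sketch, not the mothy.org code: utf8_next, column_width and display_width are hypothetical names, the input is assumed to be valid UTF-8, and the range table is deliberately tiny (zero-width space, combining diacritical marks, CJK unified ideographs, fullwidth forms). A real implementation would need the full East Asian Width data.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// Assumption: valid UTF-8; malformed input is not detected.
char32_t utf8_next(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    int extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
    char32_t cp = c;
    if (extra > 0) cp = c & (0x3F >> extra);   // payload bits of the lead byte
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Columns occupied by one code point. Illustrative subset only.
int column_width(char32_t cp) {
    if (cp == 0x200B) return 0;                  // zero width space
    if (cp >= 0x0300 && cp <= 0x036F) return 0;  // combining diacritical marks
    if (cp >= 0x4E00 && cp <= 0x9FFF) return 2;  // CJK unified ideographs
    if (cp >= 0xFF01 && cp <= 0xFF60) return 2;  // fullwidth forms
    return 1;                                    // everything else: one column
}

// Total columns for a UTF-8 string on a monospace display.
std::size_t display_width(const std::string& s) {
    std::size_t w = 0;
    for (std::size_t i = 0; i < s.size();) {
        w += static_cast<std::size_t>(column_width(utf8_next(s, i)));
    }
    return w;
}
```

With this table, the fullwidth A (U+FF21) measures two columns and a zero-width space measures none, which is the distinction the comments are pointing at.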

score 1 · Answer 1 · answered Aug 21 '15 at 11:49

This is probably too late for you, but maybe it will help someone else. boost::format accepts a std::locale as an optional constructor argument (see http://www.boost.org/doc/libs/1_55_0/libs/format/doc/format.html). If you pass it a Unicode-aware locale, such as one generated by Boost.Locale for "en_US.UTF-8", you should get the desired behavior.

Instead of passing a locale to each boost::format constructor, you could also set the default (global) locale of your application, which might help you avoid other problems. If you take this route, I would recommend a Boost.Locale-generated locale over a plain std::locale, as the Boost.Locale ones won't modify your numeric formatting unless you explicitly ask for it.

In general, this is the go-to approach for making a C++ application work nicely with Unicode. If the functionality can use a locale (std::regex, std::sort, boost::format), give it a Unicode-aware locale, and you should be safe (and if you aren't, please tell me; I want to know).

If you are making a small, lightweight application and only care about the 80% case, you may not want to pay the price of including ICU (International Components for Unicode), which is the default engine Boost.Locale wraps when providing Unicode support. In this case, build Boost using your OS's or POSIX Unicode support; your application will remain small and light, but you will lose some Unicode features, such as multiple collation levels.

For the problem you are describing, POSIX support is likely sufficient.

score 1 · Answer 2 · answered Jun 24 '20 at 22:46

AFAIK Boost Format measures everything in code units even when a UTF-8 based locale is used.

If you can switch to another library, then consider C++20 std::format or the {fmt} formatting library, which count width in display width units (similarly to wcswidth), so the alignment is correct. For example:

fmt::print("┌{0:─^{2}}┐\n"
           "│{1: ^{2}}│\n"
           "└{0:─^{2}}┘\n", "", "Hello, world!", 20);

prints:

┌────────────────────┐
│   Hello, world!    │
└────────────────────┘

Disclaimer: I'm the author of {fmt} and C++20 std::format.
