How does the behavior of std::tolower change in different locales?

Question

I was reading the documentation for std::tolower at cppreference.com:

Converts the given character to lowercase according to the character conversion rules defined by the currently installed C locale.

In the default "C" locale, the following uppercase letters ABCDEFGHIJKLMNOPQRSTUVWXYZ are replaced with respective lowercase letters abcdefghijklmnopqrstuvwxyz.

How might that behavior change in different locales?

By transforming characters that don't exist in the default locale? – juanchopanza Aug 11 '14 at 06:05
@juanchopanza Oh... well that does seem obvious now that I think of it. – Michael Dorst Aug 11 '14 at 06:06
Tip: in general, support for locales in C and C++ is flaky and inconsistent between compilers and systems. Moreover, most parsers out there are written with the assumption of the "C" locale, and are going to break if you use a locale that does anything fancy. Just avoid locales; they are not worth the effort. – Matteo Italia Aug 11 '14 at 06:28

score 5 · Accepted Answer · answered Aug 11 '14 at 06:18

Actually, the very example on the site shows a difference:

#include <iostream>
#include <cctype>
#include <clocale>

int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                              // but ´ (acute accent) in ISO-8859-1 

    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}

Output:

in iso8859-1, tolower('0xb4') gives 0xb4
in iso8859-15, tolower('0xb4') gives 0xb8

Because the C language has no notion of encoding, a char (and thus the contents of a char const*) is just a byte. Switching locale switches how those bytes are interpreted. Here, the byte 0xb4 (180) lies outside the ASCII range (0-127), so its meaning depends on the locale in effect:

In ISO-8859-1, it means ´ (acute accent), which has no uppercase or lowercase form, so tolower returns it unchanged.
In ISO-8859-15, it means Ž, so tolower maps it to ž (0xb8 in that encoding).

You would think that in a post-Unicode world, this would be irrelevant, but many have not yet transitioned to Unicode...

Matthieu M.

:sigh: They've only had 20 years – Billy ONeal Aug 11 '14 at 06:21
So is there a better function to use if you are using unicode? – Michael Dorst Aug 11 '14 at 06:23
@anthropomorphic: If you're using Unicode, you should only use Unicode locales. – celtschk Aug 11 '14 at 06:24
@celtschk So the default locale for C++ is not a Unicode locale? – Michael Dorst Aug 11 '14 at 06:26
@anthropomorphic: A program always starts in the `C` locale. The user's preferred locale is determined by your OS settings (on Unix-like systems, usually through environment variables such as `LC_ALL`), but it only takes effect once the program calls `std::setlocale(LC_ALL, "")`. – celtschk Aug 11 '14 at 06:28
@anthropomorphic: actually, Unicode brings some new cases as well. Many locales were crafted so that one "character" (whatever that means) is encoded on a single byte, which is what `tolower` assumes. Unfortunately, with Unicode, this no longer holds, and thus `tolower` does not work on Unicode strings, no matter the particular encoding (utf-8, utf-16, utf-32, ...) – Matthieu M. Aug 11 '14 at 06:57
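
The point about multi-byte encodings can be illustrated with a minimal sketch. The helper name lower_bytes is hypothetical, and the sketch assumes the program is still in the default "C" locale. It applies std::tolower to each byte in turn, which is the only thing the function can do with a char string:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercases a string one byte at a time with
// std::tolower, using whatever C locale is currently installed.
std::string lower_bytes(std::string s)
{
    for (char& ch : s) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behavior.
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    return s;
}
```

In UTF-8, "Ä" is the two bytes 0xC3 0x84 and "ä" is 0xC3 0xA4. In the "C" locale, lower_bytes("\xC3\x84" "B") lowercases the trailing "B" but leaves both bytes of "Ä" untouched, because neither byte is an uppercase letter on its own. What other locales do with such bytes is implementation-specific, but a per-byte function cannot see the two bytes as one character.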

celtschk · Answer 2 · 2014-08-11T06:26:46.067

It may change in two ways:

Characters outside that set may also be converted in a non-C locale. For example, in a German locale, the letter "Ä" will be converted to "ä".
Even for characters in that set, the lowercase version may be different. For example, in Turkish locales, the lowercase version of "I" should not be "i", but "ı", while "i" will be produced as lowercase equivalent of "İ".

Also note that the position of the non-ASCII characters in the character set may depend on the locale, since the locale also determines which character set is in use. However, even if you work exclusively in Unicode (e.g. use only UTF-8 locales), the differences listed above still apply.
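
One way to try the two cases above without changing the global C locale is the std::tolower overload from <locale>, which takes a std::locale argument. The following is a sketch only: the helper name lower_in is hypothetical, and whether a given locale name exists depends entirely on the system.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: lowercases c under the named locale and stores the
// result in out. Returns false if that locale is not available here.
bool lower_in(const char* locale_name, char c, char& out)
{
    try {
        // Throws std::runtime_error if the name is not recognized.
        const std::locale loc(locale_name);
        // The <locale> overload, not the single-argument <cctype> one.
        out = std::tolower(c, loc);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

With the always-available "C" locale, lower_in("C", 'I', out) yields 'i' and the byte 0xC4 comes back unchanged. On a system that provides a German single-byte locale (the name "de_DE.ISO-8859-1" is an assumption, as locale names are platform-specific), 0xC4 ("Ä") would be expected to map to 0xE4 ("ä"). With a Turkish ISO-8859-9 locale, 'I' would be expected to map to "ı" instead of 'i'.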
