C++ std::string capitalize in non-latin language (without third-party libraries)

Question

Considering the method:

void Capitalize(std::string &s)
{
    bool shouldCapitalize = true;

    for(size_t i = 0; i < s.size(); i++)
    {
        if (iswalpha(s[i]) && shouldCapitalize == true)
        {
            s[i] = (char)towupper(s[i]);
            shouldCapitalize = false;
        }
        else if (iswspace(s[i]))
        {
            shouldCapitalize = true;
        }
    }
}

It works perfectly for ASCII characters, e.g.

"steve" -> "Steve"

However, once I use non-Latin characters, e.g. from the Cyrillic alphabet, I no longer get that result:

"стив" -> "стив"

Why does this method fail for non-Latin alphabets? I've tried isalpha as well as iswalpha, but I get exactly the same result.

What would be a way to modify this method to capitalize non-latin alphabets?

Note: I need to solve this without a third-party library such as icu4c; with one, this would have been a very simple problem to solve.

Update:

This solution doesn't work (for some reason):

void Capitalize(std::string &s)
{
    bool shouldCapitalize = true;
    std::locale loc("ru_RU"); // Creating a locale that supports cyrillic alphabet

    for(size_t i = 0; i < s.size(); i++)
    {
        if (isalpha(s[i], loc) && shouldCapitalize == true)
        {
            s[i] = (char)toupper(s[i], loc);
            shouldCapitalize = false;
        }
        else if (isspace(s[i], loc))
        {
            shouldCapitalize = true;
        }
    }
}

Have you set the locale to one that uses Cyrillic alphabets? — Ranoiaetep, Feb 07 '23 at 12:03
No, how to do that? Will that work for the latin locales, e.g. en_US ? — Richard Topchii, Feb 07 '23 at 12:03
You have to use `std::locale` and friends, which will be painful. The no-additional-libraries requirement is quite a limiting one. This is what libraries are for: they're there to be used. Multilingual and i18n support in the C++ library is quite lacking. Hauling this task yourself is a pain. — Sam Varshavchik, Feb 07 '23 at 12:05
I've added an example that I've also tried, but it doesn't work. Moreover, that requirement works on latin characters though. — Richard Topchii, Feb 07 '23 at 12:06
Note that `std::toupper(std::locale)` returns the same character type as you put in so the `(char)` cast is not needed. — Ted Lyngmo, Feb 07 '23 at 12:17
Capitalization is a _real_ hard problem. Do you drop diacriticals? Do you have a dictionary available? (the capitalization of ß is not unique - and also depends on the country/locale) — MSalters, Feb 07 '23 at 12:25
Hmm, I just tested a few things and I got `Stiv` printed, which seems to be the Englishification of the name `стив` .... odd — Ted Lyngmo, Feb 07 '23 at 12:26
@TedLyngmo that's probably what we want to avoid here... "стив" should be changed to "Стив", i.e. just the capitalization. — Richard Topchii, Feb 07 '23 at 12:27
@RichardTopchii Yeah, I'm just surprised. Where did it get _that_ from? :-) If I do `std::wstring in = L"стив"; std::wcout << in << L'\n';` it prints `stiv` (and I did `std::wcout.imbue(loc);` first thing in the program). — Ted Lyngmo, Feb 07 '23 at 12:28
That's "transliteration", very likely to give an ability to print cyrillic text in latin output. Might be similar for French/German, although it's less noticeable due to the fact that those languages have ~90% similar alphabet to ASCII. — Richard Topchii, Feb 07 '23 at 12:31
@TedLyngmo Some implementations will transliterate if you don't set up the locale properly. https://godbolt.org/z/TzP3n7rr3 — n. m. could be an AI, Feb 07 '23 at 13:32
@n.m. Yes, I have very little experience with localization and tend to forget what's needed every time I encounter it :-) I made a demo of skott's answer below that seems to do the right thing and should work for OP if coming from a UTF8 encoded `std::string`. — Ted Lyngmo, Feb 07 '23 at 13:50

score 2 · Answer 1 · answered Feb 07 '23 at 12:38

std::locale works, at least where the locale is installed on the system. Also, you are using it incorrectly.
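
One likely reason the byte-wise versions in the question fail: in a UTF-8 encoded std::string each Cyrillic letter occupies two char values, so isalpha and toupper are handed lone bytes rather than whole letters. A minimal sketch of that, assuming UTF-8 input and the default "C" locale (the helper name CountAlphaBytes is made up for illustration):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes of s that std::isalpha accepts.
// Assumes the global C locale is still the default "C" locale.
std::size_t CountAlphaBytes(const std::string& s) {
    std::size_t count = 0;
    // Iterate as unsigned char: passing a negative char to std::isalpha
    // is undefined behavior.
    for (unsigned char byte : s) {
        if (std::isalpha(byte)) {
            ++count;
        }
    }
    return count;
}
```

Under these assumptions "steve" gives 5 alphabetic bytes, while the 8 bytes that encode "стив" give none, so the loop never finds anything to capitalize.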

The following code works as expected on Ubuntu with the Russian locale installed:

#include <iostream>
#include <locale>
#include <string>
#include <codecvt>

void Capitalize(std::wstring &s)
{
    bool shouldCapitalize = true;
    std::locale loc("ru_RU.UTF-8"); // Creating a locale that supports cyrillic alphabet

    for(size_t i = 0; i < s.size(); i++)
    {
        if (isalpha(s[i], loc) && shouldCapitalize == true)
        {
            s[i] = toupper(s[i], loc);
            shouldCapitalize = false;
        }
        else if (isspace(s[i], loc))
        {
            shouldCapitalize = true;
        }
    }
}

int main()
{
    std::wstring in = L"это пример текста";
    Capitalize(in);
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
    std::string out = conv1.to_bytes(in);
    std::cout << out << "\n";
    return 0;
}

It's possible that on Windows you need to use a different locale name; I'm not sure.
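
For input that arrives as a UTF-8 std::string, as in the question, the same loop can be wrapped in a conversion to std::wstring and back. A minimal sketch, assuming UTF-8 input; the names Utf8ToWide, WideToUtf8 and CapitalizeUtf8 are invented here, and std::wstring_convert has been deprecated since C++17, so compilers may warn:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> wide characters. Throws std::range_error on malformed input.
std::wstring Utf8ToWide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// Wide characters -> UTF-8 bytes.
std::string WideToUtf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
}

// Capitalizes the first letter of every whitespace-separated word,
// classifying characters with the locale passed in by the caller.
std::string CapitalizeUtf8(const std::string& utf8, const std::locale& loc) {
    std::wstring wide = Utf8ToWide(utf8);
    bool shouldCapitalize = true;

    for (wchar_t& c : wide) {
        if (shouldCapitalize && std::isalpha(c, loc)) {
            c = std::toupper(c, loc);
            shouldCapitalize = false;
        } else if (std::isspace(c, loc)) {
            shouldCapitalize = true;
        }
    }
    return WideToUtf8(wide);
}
```

A call such as CapitalizeUtf8(text, std::locale("ru_RU.UTF-8")) mirrors the setup above; constructing that locale throws std::runtime_error when it is not installed on the system.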

This is nice! Here's a [demo](https://godbolt.org/z/ovajfbxcf) of doing the roundtrip from `std::string` => `std::wstring` => `Capitalize` => `std::string` — Ted Lyngmo, Feb 07 '23 at 13:09
Hmm, it doesn't really work for my case in macOS environment. It doesn't capitalize the individual letters, i.e. "a" and "б", those are not capitalized if the letter is just the only component in the string — Richard Topchii, Feb 07 '23 at 14:29

score 0 · Answer 2 · TopchetoEU · 2023-02-07T12:21:29.623

Well, an external library would be the only practical choice IMHO. The standard functions work well with Latin, but any other locale would be a pain, and I wouldn't bother. Still, if you want support for Latin and Cyrillic without an external library, you can just write it yourself:

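#include <cwctype> // towupper

wchar_t to_upper(wchar_t c) {
    // Latin: a-z
    if (c >= L'a' && c <= L'z') return c - L'a' + L'A';
    // Cyrillic: а-я only; letters outside this range (e.g. ё) fall through
    if (c >= L'а' && c <= L'я') return c - L'а' + L'А';

    // Everything else is left to the C library, which depends on the current locale
    return towupper(c);
}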

Still, it's important to note that you would need to painstakingly implement support for every alphabet yourself, and not even all Latin characters are supported, so an external library is the best solution. Consider the given solution only if you're sure that only English and Russian are going to be used.
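
If a hand-written table really is the only option, the ranges need a little more care than a single а..я check, because ё and several letters used in other Cyrillic alphabets sit in a separate range. A rough sketch, using explicit code points so it does not depend on the source file encoding; ToUpperLatinCyrillic is a hypothetical name, and only basic Latin and the U+0400 to U+045F part of the Cyrillic block are covered:

```cpp
// Hypothetical helper: maps a single Unicode code point to upper case.
char32_t ToUpperLatinCyrillic(char32_t c) {
    // Basic Latin: a..z -> A..Z
    if (c >= U'a' && c <= U'z') return c - U'a' + U'A';
    // Cyrillic а..я (U+0430..U+044F) -> А..Я (U+0410..U+042F)
    if (c >= 0x0430 && c <= 0x044F) return c - 0x20;
    // Cyrillic ѐ, ё, ђ ... џ (U+0450..U+045F) -> Ѐ, Ё, Ђ ... Џ (U+0400..U+040F)
    if (c >= 0x0450 && c <= 0x045F) return c - 0x50;
    // Anything else is returned unchanged.
    return c;
}
```

Code points outside those ranges, including accented Latin letters and later Cyrillic extensions, are returned unchanged, so this remains a stopgap rather than full Unicode case mapping.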

edited Feb 07 '23 at 12:21

answered Feb 07 '23 at 12:10


Thanks a lot for your answer. Using a library is out of the question now. Please suggest improving this (or creating a different) function. I understand it's a bit more difficult, but the ICU library is 10+MB and I cannot take it as well due to some other concerns. – Richard Topchii Feb 07 '23 at 12:12
@RichardTopchii ok, i'll edit my answer to include an example function implementation – TopchetoEU Feb 07 '23 at 12:13
This solution won't work for other non-cyrillic and non-latin alphabets, e.g. Greek. – Richard Topchii Feb 07 '23 at 12:15
@RichardTopchii this is true, but as I mentioned in the answer, this is only for Latin and Cyrillic. If OP wants, he could painstakingly add all the alphabets he knows – TopchetoEU Feb 07 '23 at 12:18
In either case, `if (c >= L'а' && c <= L'я') return c - L'а' + L'А';` is needed – Ted Lyngmo Feb 07 '23 at 12:20
It doesn't even work for German or French - and those use the Latin alphabet. – MSalters Feb 07 '23 at 12:20
@RichardTopchii You mentioned _"the ICU library is 10+MB"_, but this is exactly the reason it's so large. Handling Unicode properly, for all languages, is a daunting task! – heap underrun Feb 07 '23 at 12:32
This doesn't handle `ё` `Ё` used in Russian. Other cyrillic languages might miss letters too. – HolyBlackCat Feb 07 '23 at 12:43
This solution doesn't even work for ALL Cyrillic symbols, for example `Ё` and `ё` have different difference, than `А` and `а`. – sklott Feb 07 '23 at 12:44
This is not needed. The wide-char version of `std::toupper(char, locale)` works (in non-brain-damaged implementations). Of course there are always things like the Turkish 'i' (you need to know somehow that it's Turkish!) or German ß (if you don't want to use the newfangled capital ẞ) but for the most part, it just works. https://godbolt.org/z/T6n7x38bT – n. m. could be an AI Feb 07 '23 at 13:13
