
I have this simple C++ code that converts uppercase characters to lowercase:

#include <iostream>
#include <fstream>
#include <cwctype>
#include <locale>
#include <string>

int main()
{
    std::wstring input_str = L"İiIı";
    std::locale loc("tr_TR.UTF-8");
    std::wofstream output_file("lowercase_turkish.txt");
    output_file.imbue(loc);

    for (wchar_t& c : input_str) {
        c = std::towlower(c);
    }

    output_file << input_str << std::endl;
    output_file.close();

    return 0;
}

Given the input İiIı, I expect the output iiıı, but instead I get the incorrect output İiiı.

Why is this happening, and how can I solve it with minimal changes to the code? I use this code to convert uppercase letters to lowercase in more than 10 languages, and it works well for all of them except Turkish.

I would prefer a solution that is not specific to Turkish.

user2401856

3 Answers


You imbue the locale on the output stream, so it applies to writing, but not to the conversion to lowercase. std::towlower (which is a C function) uses the current C locale, which is the default "C" locale unless std::setlocale has been called. https://en.cppreference.com/w/cpp/string/wide/towlower

To fix this, do:

for (auto& c : input_str)
    c = std::tolower<wchar_t>(c, loc); // defined in <locale>
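
A variant of the same fix, as a minimal sketch: instead of calling std::tolower(c, loc) for every character, the std::ctype<wchar_t> facet can be fetched once and applied to the whole buffer. The helper name to_lower_with is hypothetical, and the result still depends on how well the platform implements the locale that is passed in.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lowercase a wide string using the ctype facet of `loc`.
std::wstring to_lower_with(std::wstring text, const std::locale& loc)
{
    if (text.empty()) return text;
    const auto& facet = std::use_facet<std::ctype<wchar_t>>(loc);
    // The range overload of ctype::tolower converts [first, last) in place.
    facet.tolower(&text[0], &text[0] + text.size());
    return text;
}
```

In the question's program this would be called as input_str = to_lower_with(input_str, loc);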

However, the output would be:

iiiı

even though it should be iiıı.

The reason seems to be that Turkish uses the regular Latin I as the uppercase form of the Turkish-specific ı, so a generic conversion of I to lowercase yields i, which is wrong for Turkish (regular ASCII characters can be mixed with Turkish Unicode ones).

So you should use a Turkish-specific solution:

for (auto& c : input_str)
    c = c == L'I' ? L'ı' : std::tolower<wchar_t>(c, loc);
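
The conditional above only compensates for I. Since the asker reports in the comments that İ also stays unchanged on their platform, here is a minimal sketch that handles both letters by hand and leaves everything else to the locale. The names tolower_tr and to_lower_tr are hypothetical, and the sketch assumes wchar_t holds one Unicode code point per character for these letters (true for both, as they are in the BMP).

```cpp
#include <locale>
#include <string>

// Hypothetical helper: Turkish lowercase for one character, without relying
// on the platform's Turkish locale for the two problematic letters.
wchar_t tolower_tr(wchar_t c, const std::locale& loc)
{
    if (c == L'I') return L'\u0131';   // I -> dotless ı
    if (c == L'\u0130') return L'i';   // dotted İ -> i
    return std::tolower(c, loc);       // everything else: locale rules
}

// Hypothetical helper: apply tolower_tr to every character of a string.
std::wstring to_lower_tr(std::wstring text, const std::locale& loc)
{
    for (wchar_t& c : text) c = tolower_tr(c, loc);
    return text;
}
```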
guard3
  • I like this answer better than mine, because this one is "surgical" and just targets the specific `tolower`, whereas my `std::setlocale` is global and affects everything. – Eljay Feb 12 '23 at 15:34
  • "the output would be:" It works for me just fine and outputs `iiıı` as expected. – n. m. could be an AI Feb 12 '23 at 16:02
  • Oh? I tried it and got `iiiı`. And OP commented that they got `İiiı`. I guess different implementations handle it differently. – guard3 Feb 12 '23 at 16:04
  • What compiler and OS? – n. m. could be an AI Feb 12 '23 at 16:05
  • macOS with Apple clang – guard3 Feb 12 '23 at 16:06
  • `İiiı` is evidence that Unicode beyond ASCII is not being processed at all (which is expected, given that OP is calling the wrong function). `iiiı` is evidence that Unicode is being processed according to the generic algorithm rather than the locale-specific one. So the Turkish locale on a Mac is broken in this respect. I am on Linux and it works just fine there (and I also see in the locale source that Turkish case conventions are respected). – n. m. could be an AI Feb 12 '23 at 16:16
  • The problem is with the first character `İ`: your solution never has any impact on this char, it always stays the same after lowercasing – user2401856 Feb 12 '23 at 17:13
  • @user2401856 What is your platform? The Turkish locale on your machine is buggy, just like in my case. You should manually check for those `i`s to compensate. – guard3 Feb 12 '23 at 18:33
  • Also the problem is not just the first character, your expected output should be `iiıı` – guard3 Feb 12 '23 at 18:34
  • @guard3 I'm using Windows 11, VS 2022 Community v17.4.3 – user2401856 Feb 16 '23 at 00:23

You need to have the Turkish locale used for std::towlower. Otherwise it's using the C locale, which is rather ASCII-centric.

#include <clocale>
#include <cwctype>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::wstring input_str = L"İiIıÇç";
    std::setlocale(LC_ALL, "tr_TR.UTF-8"); // This should impact std::towlower
    std::locale loc("tr_TR.UTF-8");
    std::wofstream output_file("lowercase_turkish.txt");
    output_file.imbue(loc);

    for (wchar_t& c : input_str) {
        c = std::towlower(c);
    }

    output_file << input_str << std::endl;
    output_file.close();
}
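
Locale names are platform-specific, so a name such as "tr_TR.UTF-8" may not be available on every system. As a hedged sketch, the result of std::setlocale can be checked before relying on it; the helper name try_set_ctype_locale is hypothetical.

```cpp
#include <clocale>

// Hypothetical helper: try to install `name` as the C locale for character
// classification and report whether the platform accepted it.
bool try_set_ctype_locale(const char* name)
{
    // std::setlocale returns a null pointer when the request cannot be
    // honored; in that case the current locale is left unchanged.
    return std::setlocale(LC_CTYPE, name) != nullptr;
}
```

If try_set_ctype_locale("tr_TR.UTF-8") returns false, std::towlower keeps using whatever C locale was active before, which would explain unchanged non-ASCII letters.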
Eljay
  • it still gives me `İiiıçç` :( – user2401856 Feb 12 '23 at 17:11
  • The problem is with the first character `İ`: your solution never has any impact on this char, it always stays the same after lowercasing – user2401856 Feb 12 '23 at 17:13
  • I tried it on my Mac, and I'm getting `iiiıçç`, which is still wrong on the third character. On Linux, I'm getting `iiııçç`. Seems like a bug in macOS, at least on my Mac. The streams, locales and Unicode handling are (in my opinion) not very robust. I've had better luck with [ICU](https://icu.unicode.org/design/cpp) with C++, but I've not had to use ICU with Turkish localization. I have fixed bugs with SQL and Turkish localization: the user locale should not be used to interpret the SQL statements (they should be in C locale... but they're not). – Eljay Feb 12 '23 at 17:48

References from Wikipedia

Dotless I = I, ı / U+0049, U+0131 / LATIN CAPITAL LETTER I, LATIN SMALL LETTER DOTLESS I

Dotted İ = İ, i / U+0130, U+0069 / LATIN CAPITAL LETTER I WITH DOT ABOVE, LATIN SMALL LETTER I

Latin I = I, i / U+0049(LATIN CAPITAL LETTER I), U+0069(LATIN SMALL LETTER I)

The Latin alphabet, largely unaltered except for extensions (such as diacritics), is used to write English and other modern European languages.

Check Dotted and dotless I in computing

"İiIı" lowercased with a Latin (non-Turkish) locale is "iiiı": both uppercase I and İ are lowered to i.

"İiIı" lowercased with the Turkish locale is "iiıı": İ is lowered to i and I is lowered to ı.
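
The two mappings above can be sketched as a small data-driven table keyed by language code, so that the calling code stays the same for every language. This is only an illustration under stated assumptions: the names CaseOverrides, lower_overrides and lower_for_language are hypothetical, the language codes are plain strings chosen for the example, and "az" is included because Unicode's special-casing data lists the same dotted/dotless rule for Azerbaijani.

```cpp
#include <locale>
#include <map>
#include <string>

// Hypothetical per-language lowercase overrides, consulted before the
// generic mapping. Languages without special rules get an empty table.
using CaseOverrides = std::map<wchar_t, wchar_t>;

inline const CaseOverrides& lower_overrides(const std::string& lang)
{
    static const CaseOverrides turkic = {
        {L'I', L'\u0131'},   // U+0049 -> U+0131 (dotless ı)
        {L'\u0130', L'i'},   // U+0130 -> U+0069 (dotted i)
    };
    static const CaseOverrides none;
    return (lang == "tr" || lang == "az") ? turkic : none;
}

// Lowercase `text` for `lang`; characters without an override fall back to
// std::tolower with `loc` (the classic "C" locale by default).
inline std::wstring lower_for_language(std::wstring text, const std::string& lang,
                                       const std::locale& loc = std::locale::classic())
{
    const CaseOverrides& overrides = lower_overrides(lang);
    for (wchar_t& c : text) {
        auto it = overrides.find(c);
        c = (it != overrides.end()) ? it->second : std::tolower(c, loc);
    }
    return text;
}
```

For real applications a Unicode library such as ICU covers far more cases than a hand-written table.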


Test Code - C++

using ICU on Windows

The code was compiled with Microsoft Visual C++ compiler.

To use this code

  1. Install PowerShell 7.x
  2. Run the script "Compile.ps1". It downloads the ICU library and compiles the code.
  3. Run the script "Run.ps1". It runs the generated program in dist folder.

You can clone/test/run the full source code from https://github.com/JomaStackOverflowAnswers/ToLowerTurkish

#include <cstdlib>  // EXIT_SUCCESS
#include <iostream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/locid.h>
#ifdef _WIN32
#include <Windows.h>
#endif


using namespace std::string_literals;
int main()
{
    using namespace icu;
    #ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);
    #endif

    std::u16string data = u"İstanbul, Diyarbakır, DİYARBAKIR, Türkiye  İiIı  \u0130\u0069\u0049\u0131 - Default locale = "s;
    std::string data2 =  u8"İstanbul, Diyarbakır, DİYARBAKIR, Türkiye  İiIı  \u0130\u0069\u0049\u0131 - Custom Locale  = "s;
    
    UnicodeString localeName;

    UnicodeString uni_str(data.c_str(), data.length());
    uni_str.toLower();
    uni_str += Locale::getDefault().getDisplayName(localeName);

    UnicodeString uni_str2 = UnicodeString::fromUTF8(StringPiece(data2));
    Locale turkishLocale("tr", "TR");
    uni_str2.toLower(turkishLocale);
    uni_str2 += turkishLocale.getDisplayName(localeName);
    

    std::string str;
    uni_str.toUTF8String(str);

    std::string str2;
    uni_str2.toUTF8String(str2);

    std::cout << str << std::endl;
    std::cout << str2 << std::endl;
    
    return EXIT_SUCCESS;//0
}

Output

i̇stanbul, diyarbakır, di̇yarbakir, türkiye  i̇iiı  i̇iiı - default locale = English (United States)
istanbul, diyarbakır, diyarbakır, türkiye  iiıı  iiıı - custom locale  = Turkish (Turkey)


Test Code - C#

You can test/check from https://replit.com/@JomaCorpFX/ToLowerTurkish#main.cs

using System;
using System.Globalization;
                    
public class Program
{
    public static void Main()
    {
        string data = "İstanbul, Diyarbakır, DİYARBAKIR, Türkiye  İiIı  \u0130\u0069\u0049\u0131";
        CultureInfo culture = CultureInfo.CurrentCulture;
        Console.WriteLine($"System Culture {culture.Name}");
        Console.WriteLine(data.ToLower(culture));

        CultureInfo turkishCulture = new CultureInfo("tr-TR");
        Console.WriteLine($"Custom Culture {turkishCulture.Name}");
        Console.WriteLine(data.ToLower(turkishCulture));
    }
}

Output

System Culture en-US
istanbul, diyarbakır, diyarbakir, türkiye  iiiı  iiiı
Custom Culture tr-TR
istanbul, diyarbakır, diyarbakır, türkiye  iiıı  iiıı

Test Code - Java

You can test/check from https://replit.com/@JomaCorpFX/ToLowerTurkish-1#Main.java

import java.util.Locale;

public class Main {
  public static void main(String[] args) {
    String data = "İstanbul, Diyarbakır, DİYARBAKIR, Türkiye  İiIı  \u0130\u0069\u0049\u0131";
    Locale current = Locale.getDefault();
    System.out.println("Current Locale: " + current);
    System.out.println(data.toLowerCase(current));

    Locale turkishLocale = new Locale("tr", "TR");
    System.out.println("Custom Locale: " + turkishLocale);
    System.out.println(data.toLowerCase(turkishLocale));
  }
}

Output

Current Locale: en_US
i̇stanbul, diyarbakır, di̇yarbakir, türkiye  i̇iiı  i̇iiı
Custom Locale: tr_TR
istanbul, diyarbakır, diyarbakır, türkiye  iiıı  iiıı

Test Code - Python ❌

WARNING. On Windows, locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8") does not change the result: the locale is reported as changed, but str.lower() is locale-independent, so the output remains the same.

import locale

data = "İstanbul, Diyarbakır, DİYARBAKIR, Türkiye  İiIı  \u0130\u0069\u0049\u0131"
defaultLocale = locale.getdefaultlocale()
print("Default Locale: " + str(defaultLocale))
print(data.lower())

turkishlocale = locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")
print("Custom Locale: " + str(turkishlocale))
print(data.lower())

Output

PS C:\Users\Megam\Downloads\icu4c-72_1-data-bin-b\TestIcu\ToLowerTurkish> python
Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>>
>>> data = "İstanbul, Diyarbakır, DİYARBAKIR, Türkiye  İiIı  \u0130\u0069\u0049\u0131"
>>> defaultLocale = locale.getdefaultlocale()
>>> print("Default Locale: " + str(defaultLocale))
Default Locale: ('en_US', 'cp1252')
>>> print(data.lower())
i̇stanbul, diyarbakır, di̇yarbakir, türkiye  i̇iiı  i̇iiı
>>>
>>> turkishlocale = locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")
>>> print("Custom Locale: " + str(turkishlocale))
Custom Locale: tr_TR.UTF-8
>>> print(data.lower())
i̇stanbul, diyarbakır, di̇yarbakir, türkiye  i̇iiı  i̇iiı
>>>
Joma