19

In Turkish, there's a letter İ which is the uppercase form of i. When I convert it to lowercase, I get a weird result. For example:

var string_tr = "İ".toLowerCase();
var string_en = "i";

console.log( string_tr == string_en );  // false
console.log( string_tr.split("") );     // ["i", "̇"]
console.log( string_tr.charCodeAt(1) ); // 775
console.log( string_en.charCodeAt(0) ); // 105

"İ".toLowerCase() returns an extra character, and if I'm not mistaken, it's COMBINING DOT ABOVE (U+0307).

How do I get rid of this character?

I could just filter the string:

var string_tr = "İ".toLowerCase();

string_tr = string_tr.split("").filter(function (item) {
    if (item.charCodeAt(0) != 775) {
        return true;
    }
}).join("");

console.log(string_tr.split(""));

but am I handling this correctly? Is there a preferable way? Furthermore, why does this extra character appear in the first place?

There's some inconsistency. For example, in Turkish, there's a lowercase form of I: ı. How come the following comparison returns true

console.log( "ı".toUpperCase() == "i".toUpperCase() ) // true

while

console.log( "İ".toLowerCase() == "i" ) // false

returns false?

akinuri
  • 9
    Have you tried `String.toLocaleLowerCase()`? https://stackoverflow.com/questions/1850232/turkish-case-conversion-in-javascript – Tobias Timm Oct 16 '17 at 12:47
  • 3
    You can read more about this here: https://msdn.microsoft.com/en-us/library/ms973919.aspx#stringsinnet20_topic5 – JOSEFtw Oct 16 '17 at 12:48
  • @JOSEFtw I'm curious why JS converts `"ı".toUpperCase()` correctly, but not `"İ".toLowerCase()`. – akinuri Oct 16 '17 at 13:10
  • 1
    @akinuri, because the mapping for `ı (U+0131)` and `i (U+0069)` are the same: `I (U+0049)` – MinusFour Oct 16 '17 at 13:14
  • @MinusFour Well, can't they just map `İ` to `i` instead of `i + COMBINING DOT ABOVE`? Current mapping seems a bit ridiculous. – akinuri Oct 16 '17 at 13:35
  • 1
    @akinuri, it would break code for people who depend on that behavior. It's not that ridiculous, to be honest... In any case, that's why Unicode added special casing rules for the Turkish language, and that's why you need to use `.toLocaleLowerCase` – MinusFour Oct 16 '17 at 14:22
  • @akinuri Turkish is specifically used as a classic example of needing to take into account locale in string comparison instead of doing a brute force `toLowerCase` - one example of an article written on it http://www.i18nguy.com/unicode/turkish-i18n.html – McAden Oct 16 '17 at 22:44
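
The asymmetry discussed in the comments above can be sketched as follows. This is only an illustration, assuming a JavaScript engine with ICU-backed locale support (current browsers and Node.js builds); the variable names are made up for the example.

```javascript
// Default (locale-independent) case mappings.
const upperDotless = "ı".toUpperCase(); // U+0131 maps to "I" (U+0049)
const upperDotted = "i".toUpperCase();  // U+0069 also maps to "I" (U+0049)
const lowerDefault = "İ".toLowerCase(); // U+0130 maps to "i" followed by U+0307

// List the code points (hex) of the default lowercase result.
const codePoints = Array.from(lowerDefault, ch => ch.codePointAt(0).toString(16));

console.log(upperDotless === upperDotted); // true
console.log(codePoints.join(" "));         // 69 307

// Keeping the dot means the round trip is still canonically equivalent to İ.
const roundTrip = lowerDefault.toUpperCase().normalize("NFC");
console.log(roundTrip === "İ"); // true

// With the Turkish tailoring, İ lowercases to a plain "i".
const lowerTurkish = "İ".toLocaleLowerCase("tr-TR");
console.log(lowerTurkish === "i"); // true
```

One reading of the default mapping is that it keeps the dot so that no information is lost when the language is unknown; only the Turkish (or Azerbaijani) tailoring can safely collapse İ to a bare i.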

2 Answers

33

You’ll need a Turkish-specific case conversion, available with String#toLocaleLowerCase:

let s = "İ";

console.log(s.toLowerCase().length);              // 2: "i" followed by U+0307
console.log(s.toLocaleLowerCase('tr-TR').length); // 1: plain "i"
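
As a rough sketch of passing the locale in explicitly instead of relying on the default mapping: `lowerCaseFor` and its `locale` parameter are hypothetical names, not part of any library, and the example assumes an engine with ICU-backed locale support.

```javascript
// Hypothetical helper: lowercase text using the casing rules of an explicit locale.
function lowerCaseFor(text, locale) {
  return text.toLocaleLowerCase(locale);
}

// Turkish rules: İ becomes i, and I becomes dotless ı.
console.log(lowerCaseFor("İSTANBUL", "tr-TR")); // "istanbul"
console.log(lowerCaseFor("ISPARTA", "tr-TR"));  // "ısparta"

// English rules: I becomes i, and İ becomes i + U+0307 (one extra code unit).
console.log(lowerCaseFor("ISPARTA", "en-US"));         // "isparta"
console.log(lowerCaseFor("İSTANBUL", "en-US").length); // 9
```

The same input gives different results per locale, so the locale has to come from what is known about the text, not from a fixed default.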
Ry-
  • 1
    Wouldn't that only be useful in situations where I know the locale of the string? For example, a user inputs a string on a form, but I don't have a way of knowing the locale of the string. What should I do then? Use `.toLocaleLowerCase('tr-TR')` anyway, just to be safe? In that case, is it safe to use `.toLocaleLowerCase('tr-TR')` on every string? – akinuri Oct 16 '17 at 13:04
  • 9
    @akinuri: No, it’s not safe (try lowercasing `I`). You have to know the locale of the string to transform it correctly in general. For specific situations, there might be workarounds – what’s your reason for lowercasing a string? – Ry- Oct 16 '17 at 13:39
  • Currently, I have none, but in the past, I had to do this a few times. One that I can think of was storing artist names (Turkish and foreign) in a database. Using PHP, I had to map the correct chars manually. – akinuri Oct 16 '17 at 13:47
  • @akinuri Your app should know what the current locale is, shouldn't be too hard to pass in. – Ruan Mendes Oct 16 '17 at 15:14
  • 6
    @akinuri because [there's no way to do universal case mapping](https://blogs.msdn.microsoft.com/oldnewthing/20030905-00/?p=42643) so you have to know which language that is. Same thing with [sorting](https://learn.microsoft.com/en-us/globalization/locale/sorting-and-string-comparison) because the same strings may sort into [different orders](http://www.unicode.org/reports/tr10/#Introduction) in different languages – phuclv Oct 16 '17 at 15:55
  • 2
    @akinuri: Artist names? Would you need to lowercase those, or would a case-insensitive comparison be enough? But yeah, language is one of those exceedingly tricky problems. – Ry- Oct 16 '17 at 16:03
  • 2
    @Ryan Doesn't case-insensitive comparison also require specifying the locale? – Barmar Oct 16 '17 at 18:36
  • I wasn't just making comparisons back then, also creating URL friendly names. Is user's locale always trustable? Because, for example, my OS is English, though my keyboard is Turkish. – akinuri Oct 16 '17 at 18:51
  • @LưuVĩnhPhúc, for case-insensitive comparison, there is a way, that is what the Invariant (or Ordinal) comparisons in .NET are: https://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo.invariantculture%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396. Your article simply said it was difficult, it isn't actually impossible. – NH. Oct 16 '17 at 22:53
  • @NH. that doesn't solve the problem. Invariant culture is English only. What if one uses some other language and expects that "i" converts to "İ" but the culture converts it to "I"? How about the order is correct in invariant culture but not in another language? Good luck doing Turkish case comparison in invariant culture. Of course invariant culture is better when no language information is available but it's not a magic to make everyone happy. The problem is still **impossible to solve** – phuclv Oct 17 '17 at 01:39
  • @Barmar: Yep, but for some situations you can cheat more reliably =D Like strings that have to be unique – you can strip out some characters in a case-folded version for the uniqueness test without ever actually displaying the mangled version of the input. – Ry- Oct 17 '17 at 02:45
  • @LưuVĩnhPhúc you didn't read my comment. For case-insensitive *comparison*. – NH. Oct 18 '17 at 15:16
  • @NH. you didn't read my comment. I talked about **case-insensitive comparison** – phuclv Oct 18 '17 at 15:25
  • 1
    @NH. no, it's relevant to *case comparison*. Compare `gif` and `GIF` and regardless of the result (equal or not) you'll at least make a Turkey or a non-Turkey man angry. Same with `Maße`, `MASSE` and `Masse`, would they compare the same? Ask a Swiss and a German to see. Compare [`DZ`](http://www.fileformat.info/info/unicode/char/01f1/index.htm) equal to `dz` will also make many people unhappy. As I said, **there's no way to make everyone happy** because there's no such thing as universal locale – phuclv Oct 18 '17 at 15:35
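
The "cheat" mentioned above for strings that only have to be unique could look something like the sketch below. `uniquenessKey` is a hypothetical name, and the folding rules are an assumption chosen for illustration; the result is deliberately mangled and is meant for comparison only, never for display.

```javascript
// Hypothetical helper: builds a comparison-only key that treats the
// dotted and dotless forms of i/I as the same letter.
function uniquenessKey(name) {
  return name
    .normalize("NFC")          // compose sequences such as I + U+0307 into İ first
    .toLowerCase()             // default, locale-independent lowercase
    .replace(/\u0307/g, "")    // drop the COMBINING DOT ABOVE left over from İ
    .replace(/\u0131/g, "i");  // fold dotless ı onto i
}

const keys = ["İstanbul", "ISTANBUL", "ıstanbul", "istanbul"].map(uniquenessKey);
console.log(keys.every(key => key === "istanbul")); // true
```

This trades correctness for predictability: a Turkish reader would consider some of these inputs different words, which is why the key should stay internal to the uniqueness check.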
0

You can use toLocaleLowerCase or toLocaleUpperCase with an explicit locale for Turkish and for other languages whose alphabets have dotted and dotless i, such as Azerbaijani, Kazakh, Tatar, and Crimean Tatar.

var string_tr = "İ".toLocaleLowerCase("tr-TR");
var string_en = "i";

console.log( string_tr == string_en );  // true
console.log( string_tr.split("") );     // ["i"]
console.log( string_tr.charCodeAt(1) ); // NaN (there is no second character)
console.log( string_en.charCodeAt(0) ); // 105