
While checking the implementation of CaseInsensitiveComparator, which is a private static nested class of String, I found something strange.

private static class CaseInsensitiveComparator
        implements Comparator<String>, java.io.Serializable {
    ...
    public int compare(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 != c2) {
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) {
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) {
                        // No overflow because of numeric promotion
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }
    ...
}

What I'm curious about is this: in the for loop, once the upper-cased characters have been compared, why compare the lower-cased characters as well? When Character.toUpperCase(c1) and Character.toUpperCase(c2) are different, is it possible for Character.toLowerCase(c1) and Character.toLowerCase(c2) to be equal?

Couldn't it be simplified like this?

public int compare(String s1, String s2) {
    int n1 = s1.length();
    int n2 = s2.length();
    int min = Math.min(n1, n2);
    for (int i = 0; i < min; i++) {
        char c1 = s1.charAt(i);
        char c2 = s2.charAt(i);
        if (c1 != c2) {
            c1 = Character.toUpperCase(c1);
            c2 = Character.toUpperCase(c2);
            if (c1 != c2) {
                // No overflow because of numeric promotion
                return c1 - c2;
            }
        }
    }
    return n1 - n2;
}

Did I miss something?

ntalbs

1 Answer


There are Unicode characters that differ in lowercase but share the same uppercase form. For example, the Greek letter sigma has two lowercase forms (σ, and ς, which is used only at the end of a word) but only one uppercase form (Σ).
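
The Sigma collision can be checked directly with the `char` overloads of `Character.toUpperCase` and `Character.toLowerCase`. This is only a minimal sketch; the class name `SigmaCaseDemo` is made up for illustration.

```java
public class SigmaCaseDemo {
    public static void main(String[] args) {
        char sigma = '\u03C3';      // σ, GREEK SMALL LETTER SIGMA
        char finalSigma = '\u03C2'; // ς, GREEK SMALL LETTER FINAL SIGMA

        // The two lowercase forms are distinct characters ...
        System.out.println(sigma == finalSigma); // false

        // ... but both map to the same uppercase character Σ (U+03A3).
        System.out.println(
            Character.toUpperCase(sigma) == Character.toUpperCase(finalSigma)); // true

        // Lowercasing leaves each of them unchanged, so they stay different.
        System.out.println(
            Character.toLowerCase(sigma) == Character.toLowerCase(finalSigma)); // false

        // The JDK comparator treats them as equal because of the uppercase step.
        System.out.println(
            String.CASE_INSENSITIVE_ORDER.compare("\u03C3", "\u03C2")); // 0
    }
}
```

Because the two characters already agree after upper-casing, the lower-case step in the loop is never reached for this pair.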

I could not find any examples of the reverse, but if such a situation happened in the future, the current Java implementation is already prepared for this. Your version of the Comparator would definitely handle the Sigma case correctly.
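
For what it is worth, the comments mention Turkish `İ` as a candidate for the reverse direction, and the Kelvin sign (U+212A) appears to be another: it is its own uppercase form but lowercases to the ordinary Latin `k`. The sketch below compares the JDK comparator with an upper-case-only variant modelled on the simplified code from the question. The names `LowerCaseCollisionDemo` and `UPPER_ONLY` are hypothetical, and the results depend on the Unicode tables shipped with the JDK in use.

```java
import java.util.Comparator;

public class LowerCaseCollisionDemo {
    // Hypothetical comparator modelled on the simplified version from the
    // question: it stops after the upper-case comparison.
    static final Comparator<String> UPPER_ONLY = (s1, s2) -> {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = Character.toUpperCase(s1.charAt(i));
            char c2 = Character.toUpperCase(s2.charAt(i));
            if (c1 != c2) {
                // No overflow because of numeric promotion
                return c1 - c2;
            }
        }
        return n1 - n2;
    };

    public static void main(String[] args) {
        String latinK = "K";      // LATIN CAPITAL LETTER K
        String kelvin = "\u212A"; // KELVIN SIGN

        // Upper-case forms differ, lower-case forms are both 'k'.
        System.out.println(Character.toUpperCase('K') == Character.toUpperCase('\u212A'));
        System.out.println(Character.toLowerCase('K') == Character.toLowerCase('\u212A'));

        // The JDK comparator reaches the lower-case step and reports equality;
        // the upper-case-only variant reports a difference.
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare(latinK, kelvin));
        System.out.println(UPPER_ONLY.compare(latinK, kelvin));
    }
}
```

If this holds on your JDK, dropping the lower-case step would change the ordering for such pairs, which is presumably why the check is there.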

You can find more information in the Case Mapping FAQ on the Unicode website.

Glorfindel
  • The common German character `ß` only exists in its lower case form, and would be `SS` in upper case. – maja Jul 29 '15 at 09:50
  • @maja "SS" are two characters. This conversion is only performed on strings, not on single characters. – xehpuk Jul 29 '15 at 10:26
  • Actually, uppercase ß (ẞ) was added to the Universal Character Set in 2008: https://wikipedia.org/wiki/Capital_ẞ However, even though it became mandatory for names of geographical places in official government documents in 2010, I haven't seen a single one in the wild. – Jörg W Mittag Jul 29 '15 at 11:00
  • +1, although you actually explain the opposite problem from the one the question asked about: unequal upper case characters may also map to the same lower case character, which is why this check is present in CaseInsensitiveComparator. – Hulk Jul 29 '15 at 11:34
  • Yes, I realized that afterwards (and updated my answer). However, the Sigma-case is probably the most elegant and well-known example of the non-injectiveness of the uppercase function. For the lowercase function, I couldn't find a good example. – Glorfindel Jul 29 '15 at 11:49
  • (I remember having heard stories about Turkish `İ` and Latin `I` both mapping to Latin `i`, and I think something about the mathematical symbol mu vs. Greek) – Hulk Jul 29 '15 at 11:51
  • There is also Turkish lowercase dotless i vs. Latin lowercase dotted i. They are different, but both are uppercased to uppercase dotless i. – gnasher729 Jul 29 '15 at 12:07
  • German ß is tricky. It should compare equal to ss for most purposes. Its use in uppercase form should be avoided, so for example with Main Street = Hauptstraße you should avoid HAUPTSTRASSE. What's worse, there are a few cases where there is one word with ß and another with ss with a totally different meaning; in that case ß should be capitalised to SZ. But with purely case insensitive comparison, it should be different from any other character. – gnasher729 Jul 29 '15 at 12:14
  • @gnasher729 AFAIR the SZ rule was dropped around 2006 or so, when we were blessed with the new orthography. – glglgl Jul 29 '15 at 14:11
  • @Hulk There are Turkish locales which make the upper/lower case conversion behave differently. – glglgl Jul 29 '15 at 14:12
  • @glglgl I didn't know that, but it doesn't surprise me. I am aware that the mapping has "collisions" in both directions and that it does depend on the locale and other things - as Glorfindel has pointed out, position within the word and neighboring chars may have an effect as well. – Hulk Jul 29 '15 at 14:18