
Consider the following Java code, which compares a short string containing the German letter ß:

String a = "ß";
String b = a.toUpperCase();

assertTrue(a.equalsIgnoreCase(b));

The comparison fails because "ß".toUpperCase() returns "SS", which then fails a check in equalsIgnoreCase(). The Javadoc for toUpperCase() mentions this case explicitly, but I don't understand why the result is not ẞ, the capital variant of ß.

More generally, how should we do case-insensitive comparisons, potentially across different locales? Should we always use either toUpperCase() or equalsIgnoreCase(), but never both?

The problem seems to be that the implementation of equalsIgnoreCase() includes the check anotherString.value.length == value.length, which seems incompatible with the Javadoc for toUpperCase(), which states:

Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String.
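The mismatch can be reproduced in isolation. The following is a minimal sketch, not taken from the JDK sources: the class name SharpSDemo and the helper lengthPreservedByUpperCase are made up for illustration, and Locale.ROOT is assumed so the result does not depend on the default locale. It shows that the String-level mapping expands ß to two characters, while the char-level mappings leave ß unchanged and map ẞ (U+1E9E) down to ß.

```java
import java.util.Locale;

public class SharpSDemo {
    // Hypothetical helper: true if upper-casing keeps the number of chars.
    static boolean lengthPreservedByUpperCase(String s) {
        return s.toUpperCase(Locale.ROOT).length() == s.length();
    }

    public static void main(String[] args) {
        String lower = "\u00DF";                        // sharp s (U+00DF)
        String upper = lower.toUpperCase(Locale.ROOT);  // full case mapping

        System.out.println(upper);                          // SS
        System.out.println(upper.length());                 // 2
        System.out.println(lower.equalsIgnoreCase(upper));  // false: lengths differ

        // char-level (simple) mappings are always one char to one char:
        System.out.println(Character.toUpperCase('\u00DF') == '\u00DF'); // true
        System.out.println(Character.toLowerCase('\u1E9E') == '\u00DF'); // true
    }
}
```

So the two methods work at different levels: toUpperCase() applies the full, possibly length-changing mapping, whereas equalsIgnoreCase() compares char by char.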

Oleksi
  • You would need to use a [`Collator`](http://docs.oracle.com/javase/7/docs/api/java/text/Collator.html) instead of the built-in methods of `String`. – Mick Mnemonic May 15 '17 at 21:15
  • `SS` is the uppercase because [it's defined to be in Unicode](http://www.fileformat.info/info/unicode/char/00df/index.htm). – Andy Turner May 15 '17 at 21:16
  • @AndyTurner that's weird, because there is a Unicode code point for the upper-case character, and it defines this character as its lower-case character: http://www.fileformat.info/info/unicode/char/1e9e/index.htm – Oleksi May 15 '17 at 21:25
  • @AndyTurner Is this just to do with the fact that the capital was introduced in 2008, and the original in 1993? – Oleksi May 15 '17 at 21:44

1 Answer


Java's Collator class is designed for locale-sensitive text comparison. Since the concept of "upper case" varies quite a bit between locales, Collator uses a more fine-grained model called comparison strength. It provides four strength levels (PRIMARY, SECONDARY, TERTIARY and IDENTICAL), and how they affect comparisons is locale-dependent.

Here's an example of using Collator with the German locale for case-insensitive comparison of the letter ß:

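import java.text.Collator;
import java.util.Locale;

Collator germanCollator = Collator.getInstance(Locale.GERMAN);
// The four strength levels, from least to most strict.
int[] strengths = new int[] {Collator.PRIMARY, Collator.SECONDARY,
                             Collator.TERTIARY, Collator.IDENTICAL};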

String a = "ß";
String b = "ß".toUpperCase();

for (int strength : strengths) {
    germanCollator.setStrength(strength);
    if (germanCollator.compare(a, b) == 0) {
        System.out.println(String.format(
                "%s and %s are equivalent when comparing differences with "
                + "strength %s using the GERMAN locale.",
                a, b, String.valueOf(strength)));
    }
}

The code prints out

ß and SS are equivalent when comparing differences with strength 0 using the GERMAN locale.
ß and SS are equivalent when comparing differences with strength 1 using the GERMAN locale.

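Strengths 0 and 1 are the values of Collator.PRIMARY and Collator.SECONDARY, so the German locale considers these two strings equal in PRIMARY and SECONDARY strength comparisons, but not in TERTIARY or IDENTICAL ones.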
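For everyday use, the strength can be fixed once and wrapped in a small helper. This is only a sketch under assumptions: the class CollatorCompare and the method equalsIgnoringCase are hypothetical names, not JDK APIs, and SECONDARY is chosen because it is the strictest level at which the output above still reports ß and SS as equal. SECONDARY still distinguishes accents in most locales, so it tends to behave like "ignore case, keep diacritics".

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorCompare {
    // Hypothetical helper, not part of the JDK: compares two strings
    // with a locale-specific Collator at SECONDARY strength.
    static boolean equalsIgnoringCase(String a, String b, Locale locale) {
        // A fresh Collator per call, so changing the strength here
        // cannot affect a Collator used elsewhere.
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Sharp s (U+00DF) against its full upper-case mapping "SS".
        System.out.println(equalsIgnoringCase("\u00DF", "SS", Locale.GERMAN));
        // Plain ASCII case difference.
        System.out.println(equalsIgnoringCase("abc", "ABC", Locale.GERMAN));
        // Different base letters.
        System.out.println(equalsIgnoringCase("a", "b", Locale.GERMAN));
    }
}
```

If the comparison runs in a hot path, it may be worth creating the Collator once and reusing it, keeping in mind that setStrength() mutates the instance.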

Mick Mnemonic