Does Java's toLowerCase() preserve original string length?

Question

Assume two Java String objects:

String str = "<my string>";
String strLower = str.toLowerCase();

Is it then true that for every value of <my string> the expression

str.length() == strLower.length()

evaluates to true?

In other words, does String.toLowerCase() preserve the original string length for every possible String value?

score 46 · Accepted Answer · edited Jun 18 '18 at 07:53

Surprisingly, it does not!

From the Javadoc of String.toLowerCase(Locale):

Converts all of the characters in this String to lower case using the rules of the given Locale. Case mapping is based on the Unicode Standard version specified by the Character class. Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String.

Example:

package com.stackoverflow.q2357315;

import java.util.Locale;

public class Test {
    public static void main(String[] args) throws Exception {
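        // toLowerCase() without arguments uses the default locale. Lithuanian ("lt")
        // keeps the dot of a lowercase i when accents follow, so one char becomes three.
        Locale.setDefault(new Locale("lt"));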
        String s = "\u00cc";
        System.out.println(s + " (" + s.length() + ")"); // Ì (1)
        s = s.toLowerCase();
        System.out.println(s + " (" + s.length() + ")"); // i̇̀ (3)
    }
}
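
The length change is not limited to the Lithuanian locale. As a minimal sketch (the class name `DotCapitalIDemo` and the helper `lowerLength` are made up for illustration), U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) lower-cases to `i` followed by U+0307 (COMBINING DOT ABOVE) in non-Turkic locales such as `Locale.ROOT` or `Locale.US`, so one char becomes two, while the Turkish locale maps it to a plain `i`:

```java
import java.util.Locale;

public class DotCapitalIDemo {
    // Hypothetical helper: number of chars in the lower-cased form of s for the given locale.
    static int lowerLength(String s, Locale locale) {
        return s.toLowerCase(locale).length();
    }

    public static void main(String[] args) {
        String s = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE
        System.out.println(s.length());                                  // 1
        System.out.println(lowerLength(s, Locale.ROOT));                 // 2
        System.out.println(lowerLength(s, Locale.US));                   // 2
        System.out.println(lowerLength(s, Locale.forLanguageTag("tr"))); // 1
    }
}
```

Passing the locale explicitly, as done here, also avoids depending on whatever default locale the JVM happens to run with.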

edited Jun 18 '18 at 07:53 by Joachim Sauer

answered Mar 01 '10 at 16:38 by codaddict

Can you name some examples? I know several examples which would make the uppercased variant different sized than the lowercased one, e.g. `ß` would become `SS`, but not the other way round. – BalusC Mar 01 '10 at 16:42

@BalusC: There are some fancy rules regarding combining characters in locales AZ, LT and TR, see `java/lang/ConditionalSpecialCasing.java`. For example, `"\u00cc".toLowerCase(new Locale("lt")).length() == 3` – axtavt Mar 01 '10 at 17:58

Cool, thanks for the pointer. I'll be so free to edit an SSCCE in this answer. – BalusC Mar 01 '10 at 18:04

`java/lang/ConditionalSpecialCasing.java` handles special Unicode cases listed here http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt – Arend v. Reinersdorff Jan 02 '16 at 10:04

Do we have any examples for this in English locale (en_US)? – Steev James Nov 15 '22 at 12:11

Joachim Sauer · Answer 2 · 2010-03-01T18:22:14.140

First of all, I'd like to point out that I absolutely agree with the (currently highest-rated) answer of @codaddict.

But I wanted to do an experiment, so here it is:

~~It's not a formal proof, but this code ran for me without ever reaching the inside of the if (using JDK 1.6.0 Update 16 on Ubuntu):~~

Edit: Here's some updated code that handles Locales as well:

import java.util.Locale;

public class ToLowerTester {
    public final Locale locale;

    public ToLowerTester(final Locale locale) {
        this.locale = locale;
    }

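    // Brute-force search over two-char strings (each char from 0 up to, but excluding,
    // Character.MAX_VALUE). Returns the first string whose lower-cased form has a
    // different length in this locale, or null if none is found.
    public String findFirstStrangeTwoLetterCombination() {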
        char[] b = new char[2];
        for (char c1 = 0; c1 < Character.MAX_VALUE; c1++) {
            b[0] = c1;
            for (char c2 = 0; c2 < Character.MAX_VALUE; c2++) {
                b[1] = c2;
                final String string = new String(b);
                String lower = string.toLowerCase(locale);
                if (string.length() != lower.length()) {
                    return string;
                }
            }
        }
        return null;
    }

    public static void main(final String[] args) {
        Locale[] locales;
        if (args.length != 0) {
            locales = new Locale[args.length];
            for (int i=0; i<args.length; i++) {
                locales[i] = new Locale(args[i]);
            }
        } else {
            locales = Locale.getAvailableLocales();
        }
        for (Locale locale : locales) {
            System.out.println("Testing " + locale + "...");
            String result = new ToLowerTester(locale).findFirstStrangeTwoLetterCombination();
            if (result != null) {
                String lower = result.toLowerCase(locale);
                System.out.println("Found strange two letter combination for locale "
                    + locale + ": <" + result + "> (" + result.length() + ") -> <"
                    + lower + "> (" + lower.length() + ")");
            }
        }
    }
}

Running that code with the locale names mentioned in the comments on the accepted answer (passed as command-line arguments, for example lt) will print some examples. Running it without an argument will try all available locales (and take quite a while!).

~~It's not extensive, because theoretically there could be multi-character Strings that behave differently, but it's a good first approximation.~~

Also note that many of the two-character combinations produced this way are probably invalid UTF-16 (unpaired surrogates, for example), so the fact that nothing explodes in this code can only be credited to a very robust String API in Java.

And last but not least: even if the assumption is true for the current implementation of Java, that can easily change once future versions of Java implement future versions of the Unicode standard, in which the rules for new characters may introduce situations where this no longer holds true.

So depending on this is still a pretty bad idea.
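
One way to avoid depending on it, when the goal is a case-insensitive search whose result indexes the original string, is to skip the lower-cased copy entirely. The following is only a sketch: the class `IgnoreCaseSearch` and the method `indexOfIgnoreCase` are hypothetical names, and it relies on `String.regionMatches(boolean, int, String, int, int)`, which compares char by char and therefore does not treat one-to-many mappings such as `ß` and `SS` as equal.

```java
import java.util.Locale;

public class IgnoreCaseSearch {
    // Returns the index in haystack (the ORIGINAL string) of the first
    // case-insensitive occurrence of needle, or -1 if there is none.
    static int indexOfIgnoreCase(String haystack, String needle) {
        int last = haystack.length() - needle.length();
        for (int i = 0; i <= last; i++) {
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "\u0130stanbul and ANKARA";

        // Index into the original string.
        int idx = indexOfIgnoreCase(text, "ankara");
        System.out.println(idx + " " + text.substring(idx, idx + 6)); // 13 ANKARA

        // Index into the lower-cased copy: shifted by one, because U+0130
        // became two chars, so it no longer lines up with the original.
        System.out.println(text.toLowerCase(Locale.ROOT).indexOf("ankara")); // 14
    }
}
```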

You should be aware that the code you have written is default locale dependent. Not obvious, but nasty. — Tom Hawtin - tackline, Mar 01 '10 at 17:50

score 2 · Answer 3 · answered Feb 09 '11 at 12:00

Also remember that toUpperCase() does not preserve the length either. Example: “straße” becomes “STRASSE” (Java applies the ß → SS mapping regardless of locale, not only for German). So you're more or less screwed if you change the case of a string and still rely on character indexes stored against the original.
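
A minimal sketch of that example (the class name `SharpSDemo` and the helper `roundTrip` are made up for illustration). It also shows that upper-casing and then lower-casing again does not give back the original string:

```java
import java.util.Locale;

public class SharpSDemo {
    // Hypothetical helper: upper-cases s and lower-cases the result again, both with the given locale.
    static String roundTrip(String s, Locale locale) {
        return s.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        String s = "stra\u00dfe"; // "straße"
        String upper = s.toUpperCase(Locale.GERMAN);
        System.out.println(upper + " " + s.length() + " -> " + upper.length()); // STRASSE 6 -> 7
        System.out.println(roundTrip(s, Locale.GERMAN).equals(s)); // false: the result is "strasse"
    }
}
```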

answered Feb 09 '11 at 12:00 by User

Since both straße and strasse are correct spellings (ignoring the fact that they should have a capital initial S because they are nouns), I assume that it will result in the interesting side effect that going to uppercase and back will result in a different string? Have you tried it? – Fredrik Feb 09 '11 at 12:10
