-4

This code is about 3 times faster than the standard String.toUpperCase() function:

public static String toUpperString(String pString) {
    if (pString != null) {
        char[] retChar = pString.toCharArray();
        for (int idx = 0; idx < pString.length(); idx++) {
            char c = retChar[idx];
            if (c >= 'a' && c <= 'z') {
                retChar[idx] = (char) (c & -33);
            }
        }
        return new String(retChar);
    } else {
        return null;
    }
}

Why is it so much faster? What other work is String.toUpperCase() also doing? In other words, are there cases in which this code will not work?

Benchmark results for a random long string (plain text) executed 2,000,000 times:

toUpperString(String) : 3514.339 ms - about 3.5 seconds
String.toUpperCase() : 9705.397 ms - almost 10 seconds

** UPDATE

I have added the "latin" check and used this as benchmark (for those who don't believe me):

import java.math.BigInteger;
import java.security.SecureRandom;

public class BenchmarkUpperCase {

    public static String[] randomStrings;

    public static String nextRandomString() {
        SecureRandom random = new SecureRandom();
        return new BigInteger(500, random).toString(32);
    }

    public static String customToUpperString(String pString) {
        if (pString != null) {
            char[] retChar = pString.toCharArray();
            for (int idx = 0; idx < pString.length(); idx++) {
                char c = retChar[idx];
                if (c >= 'a' && c <= 'z') {
                    retChar[idx] = (char) (c & -33);
                } else if (c >= 192) { // now catering for other than latin...
                    retChar[idx] = Character.toUpperCase(c);
                }
            }
            return new String(retChar);
        } else {
            return null;
        }
    }

    public static void main(String... args) {
        long timerStart, timePeriod = 0;
        randomStrings = new String[1000];
        for (int idx = 0; idx < 1000; idx++) {
            randomStrings[idx] = nextRandomString();
        }
        String dummy = null;

        for (int count = 1; count <= 5; count++) {
            timerStart = System.nanoTime();
            for (int idx = 0; idx < 20000000; idx++) {
                dummy = randomStrings[idx % 1000].toUpperCase();
            }
            timePeriod = System.nanoTime() - timerStart;
            System.out.println(count + " String.toUpper() : " + (timePeriod / 1000000));
        }

        for (int count = 1; count <= 5; count++) {
            timerStart = System.nanoTime();
            for (int idx = 0; idx < 20000000; idx++) {
                dummy = customToUpperString(randomStrings[idx % 1000]);
            }
            timePeriod = System.nanoTime() - timerStart;
            System.out.println(count + " customToUpperString() : " + (timePeriod / 1000000));
        }
    }

}

I get these results:

1 String.toUpper() : 10724
2 String.toUpper() : 10551
3 String.toUpper() : 10551
4 String.toUpper() : 10660
5 String.toUpper() : 10575
1 customToUpperString() : 6687
2 customToUpperString() : 6684
3 customToUpperString() : 6686
4 customToUpperString() : 6693
5 customToUpperString() : 6710

Which is still about 60% faster.

EarlB
  • 10
    Your code doesn't work for symbols other than those in the Latin alphabet; doesn't pay attention to locale etc. – Andy Turner Feb 11 '16 at 09:50
  • 5
    And how did you figure out that this code is 3 times faster? – Sнаđошƒаӽ Feb 11 '16 at 09:51
  • Because `String#toUpperCase()` has to handle the entirety of Unicode, not just the standard latin a-z – JonK Feb 11 '16 at 09:51
  • You can find `String`'s source code online or download it from Oracle. – Jonny Henly Feb 11 '16 at 09:51
  • 1
    Ěščřžýáíéöťďň etc. Maybe there will be a fast path for Java 9 (where ASCII-only strings will be encoded more effectively), but currently there cannot be. You can only do your code for ASCII encoding. If you know you can use that, do so, there are libraries supporting such cases: [Guava's `Ascii#toUpperCase()`](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/base/Ascii.html#toUpperCase%28java.lang.String%29) – Petr Janeček Feb 11 '16 at 09:52
  • True, it does not cater for the Latin alphabet, but catering for that would be a simple if statement for all characters > 192 that handles the special uppercase cases; that should make it about 10% slower for normal plain text. – EarlB Feb 11 '16 at 10:01
  • 2
    This code is wrong in the Turkish Locale. The upper case of 'i' (i with dot) is not 'I' (I without dot) in Turkish, instead it is Unicode U+0130 (Latin Capital letter I with dot above) – greg-449 Feb 11 '16 at 10:03

3 Answers

5

Examining the source code for java.lang.String is instructive:

  1. The standard version goes to considerable lengths to avoid creating a new string when it doesn't have to. This entails making two passes over the string.

  2. The standard version uses a locale object to do the case conversion for all characters. You are only doing that for characters greater than 192. While that probably works for common locales, it is possible that some locales (current or future ... or custom) will have "interesting" capitalization rules that apply to characters less than 192 as well.

  3. The standard version is doing a proper job of converting to uppercase by Unicode code-point rather than code-unit. (Converting by code-unit is liable to break or give the wrong answer if the string contains surrogate characters.)
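Point 3 can be illustrated with a small sketch. It is not part of the original answer, and the class and method names are made up for illustration; `perCharUpper` reproduces the per-`char` loop from the question. The sketch uses U+10428 (DESERET SMALL LETTER LONG I), which lies outside the Basic Multilingual Plane and is therefore stored as a surrogate pair:

```java
import java.util.Locale;

class CodePointDemo {

    // Hypothetical helper: same per-char logic as customToUpperString in the question.
    static String perCharUpper(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= 'a' && c <= 'z') {
                chars[i] = (char) (c & -33);
            } else if (c >= 192) {
                // Works on a single UTF-16 code unit, so each half of a
                // surrogate pair is seen in isolation and left unchanged.
                chars[i] = Character.toUpperCase(c);
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // One code point, two chars (a high and a low surrogate).
        String lower = new String(Character.toChars(0x10428));

        String standard = lower.toUpperCase(Locale.ROOT);
        String custom = perCharUpper(lower);

        System.out.println(lower.length());                               // 2
        System.out.println(Integer.toHexString(standard.codePointAt(0))); // 10400
        System.out.println(custom.equals(lower));                         // true
    }
}
```

The standard method maps the code point to its uppercase form U+10400, while the per-`char` loop returns the input unchanged.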

The penalty for "doing it correctly" is that the standard version of toUpperCase is slower than your version [1]. But it will give the correct answer in cases where your version won't.

Note that since you are testing on strings that are ASCII, you won't encounter the cases where your version of toUpperCase gives the wrong answer.


[1] According to your benchmark ... but see the other answers!

Stephen C
  • Thank you for such a complete answer. I did check UTF-8 and UTF-16; both should work with my version as well. So it seems to be "compatible" with ASCII, UTF-8 and UTF-16, which should cover 99% of the cases where I want to use it. I will have to check the memory usage though, as it does make 2 copies with each conversion. – EarlB Feb 11 '16 at 13:43
  • 1
    It isn't compatible with UTF-8 or UTF-16 if you use characters outside of code plane zero. – Stephen C Feb 11 '16 at 16:43
4

I ran a simple JMH benchmark to compare the two methods, the custom #toUpperString and the default Java 8 #toUpperCase, and the results are as follows:

Benchmark                    Mode  Cnt     Score    Error  Units
MyBenchmark.customToString   avgt   20  3307.137 ± 81.192  ns/op
MyBenchmark.defaultToString  avgt   20  3384.921 ± 75.357  ns/op

The test implementation is:

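import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)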
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 1, warmups = 1)
@Threads(1)
public class MyBenchmark {

    public static String toUpperString(String pString) {
        if (pString != null) {
            char[] retChar = pString.toCharArray();
            for (int idx = 0; idx < pString.length(); idx++) {
                char c = retChar[idx];
                if (c >= 'a' && c <= 'z') {
                    retChar[idx] = (char) (c & -33);
                }
            }
            return new String(retChar);
        } else {
            return null;
        }
    }

    private SecureRandom random = new SecureRandom();

    public String nextSessionId() {
        return new BigInteger(130, random).toString(32);
    }


    @Setup
    public void init() {

    }

    @Benchmark
    public Object customToString() {
        return toUpperString(nextSessionId());
    }

    @Benchmark
    public String defaultToString() {
        return nextSessionId().toUpperCase();
    }

}

According to the scores of this test, the custom method is not 3 times faster than the default.

hahn
  • Sorry, but in my benchmarks, running your nextSessionId() 2,000,000 times, I get these: toUpperString(String) = 26957.152 ms and String.toUpperCase() = 32568.200 ms. That means the generation of your random string is using up most of the time and is distorting your results. Put the strings in a pre-generated array; that will add very little time. – EarlB Feb 11 '16 at 10:55
  • 1
    Could you share the test that you are executing? – hahn Feb 11 '16 at 10:58
0

In other words, are there cases in which this code will not work?

Yes. Even your updated code will not work correctly for German, as it doesn't cover the special case of 'ß'. This letter only exists in lower case and is converted to a double S in upper case:

String bla = "blöße";
System.out.println(customToUpperString(bla)); // BLÖßE <- wrong
System.out.println(bla.toUpperCase(Locale.GERMANY)); // BLÖSSE <- right

I'm sure there are a lot more special cases like this in other languages.
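As a rough sketch of two such cases (an illustration only; the class and method names are hypothetical), the snippet below shows the Turkish dotted 'i' mentioned in the comments under the question, and the fact that uppercasing 'ß' changes the string's length, which an in-place `char[]` rewrite cannot express:

```java
import java.util.Locale;

class LocaleCaseDemo {

    // Hypothetical helper: uppercase using Turkish casing rules.
    static String upperTurkish(String s) {
        return s.toUpperCase(Locale.forLanguageTag("tr-TR"));
    }

    public static void main(String[] args) {
        String word = "title";

        // Locale-neutral rules: 'i' becomes 'I'.
        System.out.println(word.toUpperCase(Locale.ROOT).equals("TITLE"));  // true

        // Turkish rules: 'i' becomes U+0130 (capital I with dot above).
        System.out.println(upperTurkish(word).equals("T\u0130TLE"));        // true

        // 'ß' (U+00DF) expands to "SS", so the result is one char longer.
        String german = "stra\u00dfe";
        System.out.println(german.length());                                // 6
        System.out.println(german.toUpperCase(Locale.ROOT).length());       // 7
    }
}
```

Both cases need information that a fixed per-character mapping does not have: the first depends on the locale, the second on a one-to-many mapping.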

  • Okay, it looks like both versions are correct and they are trying to establish a capitalized version of ß. But honestly, I have never seen that capital "ß" anywhere, and the fact that it doesn't exist on a standard German keyboard layout should tell you everything. https://en.wiktionary.org/wiki/%C3%9F https://en.wikipedia.org/wiki/Capital_%E1%BA%9E – 911DidBush Feb 11 '16 at 13:06