
I'm currently looking into String concatenation options and the performance penalty they carry. My test case produces results that blow my mind, and I'm not sure if I'm overlooking something.

Here is the deal: doing "something"+"somethingElse" in Java will (at compile time) create a new StringBuilder every time the expression is evaluated (unless both operands are compile-time constants, in which case the compiler folds them into a single literal).
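
For illustration, here is a minimal sketch of that desugaring (the helper method name is mine; the exact bytecode varies by javac version, and Java 9+ compiles + differently):

static String concatOnce(String everything, String line) {
    // The compiler turns "everything + line" into roughly this:
    // a brand-new StringBuilder, plus a full copy of both operands.
    return new StringBuilder().append(everything).append(line).toString();
}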

For my test case, I'm loading a file from my HDD that has 1661 lines of example data (classic "Lorem Ipsum"). This question is not about I/O performance, but about the performance of the different string-concatenation methods.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class InefficientStringConcat {

    public static void main(String[] args) throws Exception {
        // Get a file with example data:

        System.out.println("Starting benchmark");
        // Read and measure:
        for (int i = 0; i < 10; i++){
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(new File("data.txt")))
            );

            long start = System.currentTimeMillis();
            // Un-comment method to test:
            //inefficientRead(in);
            //betterRead(in);
            long end = System.currentTimeMillis();
            System.out.println("Took "+(end-start)+"ms");

            in.close();
        }
    }

    public static String betterRead(BufferedReader in) throws IOException {
        StringBuilder b = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null){
            b.append(line);
        }
        return b.toString();
    }

    public static String inefficientRead(BufferedReader in) throws IOException {
        String everything = "", line;
        while ((line = in.readLine()) != null){
            everything += line;
        }
        return everything;
    }
}

As you can see, the setup is the same for both tests. Here are the results:

Using inefficientRead()-method:

Starting benchmark
#1 Took 658ms
#2 Took 590ms
#3 Took 569ms
#4 Took 567ms
#5 Took 562ms
#6 Took 570ms
#7 Took 563ms
#8 Took 568ms
#9 Took 560ms
#10 Took 568ms

Using betterRead()-method

Starting benchmark
#1 Took 42ms
#2 Took 10ms
#3 Took 5ms
#4 Took 7ms
#5 Took 16ms
#6 Took 3ms
#7 Took 4ms
#8 Took 5ms
#9 Took 5ms
#10 Took 13ms

I'm running the tests with no extra parameters to the java command, on a MacMini3,1 from early 2009, with Sun JDK 7:

[luke@BlackBox ~]$ java -version
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) Client VM (build 23.5-b02, mixed mode)

This strikes me as a huge difference. Am I doing something wrong in measuring this, or is this supposed to happen?

Lukas Knuth
  • Well, 10 iterations is not nearly enough for a benchmark, but apart from that it *seems* fine. Try increasing the number of iterations to, say, 100,000 and add a "warm-up" phase where you call the method 10,000 times (without timing it) to let the JIT do some inlining. – Cameron Skinner Mar 02 '13 at 18:43
  • @CameronSkinner Apparently you don't take into account that each high-level iteration already consists of 1661 method invocations. – Marko Topolnik Mar 02 '13 at 18:46
  • @MarkoTopolnik: The benchmark is being measured across the `betterRead` or `inefficientRead` methods, so those are the methods that you should call multiple times. So you are correct: You **don't** take into account the implementation details of the methods under test. – Cameron Skinner Mar 02 '13 at 18:54
  • @CameronSkinner At least the advice to call those methods 10,000 times *just to warm up* is unequivocally wrong: the JVM JITs according to the number of passes across *any single line of code*, which here happens before the tenth outer iteration. Starting from that, your request for 100,000 executions of the complete file-loading procedure is quite irrational. A sample size of a maximum of 100 would satisfy even the strictest statistician. – Marko Topolnik Mar 02 '13 at 19:03
  • @MarkoTopolnik: We can agree to disagree. But I'd recommend that you read more into what the JIT can do in terms of inlining entire method calls and whatnot. – Cameron Skinner Mar 02 '13 at 19:17
  • @CameronSkinner I have already read quite a lot about it. Hence my comment. – Marko Topolnik Mar 02 '13 at 19:18
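
For reference, a sketch of the warm-up approach suggested in the comments above (class name and iteration counts are arbitrary; it also loads the file into memory once, so I/O stays out of the timing):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WarmedUpBenchmark {

    // Concatenate the already-loaded lines, so I/O is not part of the measurement.
    static String concatAll(List<String> lines) {
        StringBuilder b = new StringBuilder();
        for (String line : lines) {
            b.append(line);
        }
        return b.toString();
    }

    public static void main(String[] args) throws IOException {
        // Load the file once, up front.
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        in.close();

        // Warm-up phase: untimed calls so the JIT can compile and inline.
        for (int i = 0; i < 10000; i++) {
            concatAll(lines);
        }

        // Timed phase.
        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            concatAll(lines);
            long end = System.currentTimeMillis();
            System.out.println("#" + (i + 1) + " Took " + (end - start) + "ms");
        }
    }
}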

2 Answers


Am I doing something wrong in measuring this, or is this supposed to happen?

It's supposed to happen. Constructing a long string using repeated string concatenation is a well-known performance anti-pattern: every concatenation has to create a new string containing a copy of the original string plus a copy of the additional string, so you end up with O(N^2) performance. When you use StringBuilder, most of the time you're just copying the additional string into a buffer. Occasionally the buffer will run out of space and need to be expanded (by copying the existing data into a new buffer), but that doesn't happen often, thanks to the buffer-expansion strategy.
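
As a hedged sketch of that expansion strategy (the exact policy is an implementation detail; JDK 7's AbstractStringBuilder grows to roughly double the old capacity):

// Sketch of amortized buffer growth, assuming a roughly-doubling policy.
static char[] grow(char[] buffer, int needed) {
    int newCapacity = (buffer.length + 1) * 2;  // roughly double
    if (newCapacity < needed) {
        newCapacity = needed;
    }
    return java.util.Arrays.copyOf(buffer, newCapacity);
}

Because the capacity doubles each time, any single character is re-copied only a constant number of times on average, which keeps the whole build linear.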

See my article on string concatenation for details - it's a very old article, so it predates StringBuilder, but the fundamentals haven't changed. (Basically, StringBuilder is like StringBuffer, but without synchronization.)

Jon Skeet

This is exactly what should happen. betterRead takes linear time; inefficientRead takes quadratic time.
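
To observe both growth rates directly, here is a small self-contained sketch (class name, sizes, and the repeated text are all arbitrary): doubling n should roughly quadruple the += time but only roughly double the StringBuilder time.

public class ScalingDemo {

    // Quadratic: repeated += copies the whole accumulated string every pass.
    static long timeConcat(int n) {
        long start = System.currentTimeMillis();
        String s = "";
        for (int i = 0; i < n; i++) {
            s += "lorem ipsum ";
        }
        return System.currentTimeMillis() - start;
    }

    // Linear: StringBuilder appends into an amortized buffer.
    static long timeBuilder(int n) {
        long start = System.currentTimeMillis();
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < n; i++) {
            b.append("lorem ipsum ");
        }
        b.toString();  // result discarded; we only care about the timing
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        for (int n = 2000; n <= 16000; n *= 2) {
            System.out.println(n + " appends: += " + timeConcat(n)
                    + "ms, StringBuilder " + timeBuilder(n) + "ms");
        }
    }
}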

Louis Wasserman
  • @MarkoTopolnik: You misunderstand what big-O complexity means. It measures the rate of increase of cost, not the absolute cost. If you added `Thread.sleep(1000)` to the `betterRead` method it is still O(n) but will take much longer in wall-clock time than the O(n^2) method, at least until n gets larger. – Cameron Skinner Mar 02 '13 at 18:56
  • 1
    @MarkoTopolnik: I certainly would imply that the constant factor is just irrelevant. – Louis Wasserman Mar 02 '13 at 19:08
  • 3
    From first caring about the asymptotics, and only much later optimizing constant factors. – Louis Wasserman Mar 02 '13 at 19:41
  • No two different algorithms end up with exactly the same constant factors, and I'm entirely comfortable leaving out a discussion of why they differ here. – Louis Wasserman Mar 03 '13 at 00:21