57

I observed some strange behaviour in one of my Java programs. I have tried to strip the code down as much as possible while still being able to replicate the behaviour. Code in full below.

public class StrangeBehaviour {

    static boolean recursionFlag = true;

    public static void main(String[] args) {
        long startTime = System.nanoTime();
        for (int i = 0; i < 10000; i ++) {
            functionA(6, 0);
        }
        long endTime = System.nanoTime();
        System.out.format("%.2f seconds elapsed.\n", (endTime - startTime) / 1000.0 / 1000 / 1000);
    }

    static boolean functionA(int recursionDepth, int recursionSwitch) {
        if (recursionDepth == 0) { return true; }
        return functionB(recursionDepth, recursionSwitch);
    }

    static boolean functionB(int recursionDepth, int recursionSwitch) {
        for (int i = 0; i < 16; i++) {
            if (StrangeBehaviour.recursionFlag) {
                if (recursionSwitch == 0) {
                    if (functionA(recursionDepth - 1, 1 - recursionSwitch)) return true;
                } else {
                    if (!functionA(recursionDepth - 1, 1 - recursionSwitch)) return false;
                }
            } else {
                // This block is never entered into.
                // Yet commenting out one of the lines below makes the program run slower!
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
                System.out.println("...");
            }
        }
        return false;
    }
}

I have two functions, functionA() and functionB() that call each other recursively. Both functions take a recursionDepth parameter that controls the termination of the recursion. functionA() calls functionB() a maximum of once with recursionDepth unchanged. functionB() calls functionA() 16 times with recursionDepth - 1. The recursion terminates when functionA() is called with a recursionDepth of 0.

functionB() has a code block with a number of System.out.println() calls. This block is never entered into, as entry is controlled by a boolean recursionFlag variable that is set to true and never changed during program execution. However, commenting out even one of the println() calls causes the program to run slower. On my machine, execution time is <0.2s with all the println() calls present, and >2s when one of the calls is commented out.

What could be causing this behaviour? My only guess is that there is some naive compiler optimisation that is being triggered by a parameter related to the length of the code block (or the number of function calls etc.). Any further insight into this will be much appreciated!

Edit: I'm using JDK 1.8.

J3D1
  • 759
  • 6
  • 10
  • Actually, just changing the boolean to `final` does the exact same thing. Seems like a bug in the compiler optimizer perhaps. – Sylwester Dec 31 '16 at 18:32
  • If you create a new function `f(String s)`, and change `System.out.println("...")` -> `f("...")` then you need a lot more calls to get the speed increase. – Ryan Amos Dec 31 '16 at 18:33
  • Marking the variable as final, removing else block or just performing `if(true)` (compiler hints it's unreachable code) are all taking longer to execute. Surprised with this behavior. – Prathap Dec 31 '16 at 18:37
  • 3
    I started looking at this with the idea that another newbie had stumbled across the dangers of micro-optimization or something. But this is *weird*. Commenting out one of the println statements in a block that is never executed makes a tenfold difference in execution time! Still looking at it... – arcy Dec 31 '16 at 18:45
  • @RyanAmos - indeed, I had tried that myself! – J3D1 Dec 31 '16 at 19:04
  • i don't know about yours , but on my machine its around `5:20` with all `println` and with removing all `printn` it's `4:85` and with commenting out one `println` its `5:34` , which java version are you using ? – Pavneet_Singh Dec 31 '16 at 19:15
  • @PavneetSingh I'm on JDK 1.8. – J3D1 Dec 31 '16 at 19:20
  • 2
    Although you did find something interesting, your microbenchmark has issues. You're not doing a warm up loop to trigger HotSpot compilation before running the actual timing tests, so it's not clear if hotspot compilation or CPU runtime is slower. Slower hotspot compilation isn't relevant in real world applications as it only happens once (mostly) – Erwin Bolwidt Jan 02 '17 at 02:32
  • I'd be curious what happens when you start this with `-XX:MaxInlineSize=400`. (Default is 35). It might still be a tad slower, because I think that inlining a large method may not always be beneficial (there's certainly a *reason* for this limit - otherwise, you could set this parameter to 100000 or so...), but as a comparison and possible (additional) confirmation of the current answers, it might be interesting. – Marco13 Jan 02 '17 at 12:56

3 Answers3

41

The commented code affects how inlining is handled. If functionB gets longer/bigger (more bytecode instructions) it will not be inlined into functionA.

So @J3D1 was able to use VMOptions to manually switch off inlining for functionB(): -XX:CompileCommand=dontinline,com.jd.benchmarking.StrangeBeh‌​aviour::functionB This appears to eliminate the delay with the shorter function.

with the vm options you can display inlining -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining

The bigger Version, wont inline functionB

@ 8   StrangeBehaviour::functionB (326 bytes)   callee is too large
@ 21   StrangeBehaviour::functionA (12 bytes)
  @ 8   StrangeBehaviour::functionB (326 bytes)   callee is too large
@ 35   StrangeBehaviour::functionA (12 bytes)
  @ 8   StrangeBehaviour::functionB (326 bytes)   callee is too large

The shorter Version will try to inline functionB, causing a few of further attempts.

@ 8   StrangeBehaviour::functionB (318 bytes)   inline (hot)
 @ 21   StrangeBehaviour::functionA (12 bytes)   inline (hot)
   @ 8   StrangeBehaviour::functionB (318 bytes)   inline (hot)
     @ 35   StrangeBehaviour::functionA (12 bytes)   recursive inlining is too deep
 @ 35   StrangeBehaviour::functionA (12 bytes)   inline (hot)
   @ 8   StrangeBehaviour::functionB (318 bytes)   inline (hot)
     @ 21   StrangeBehaviour::functionA (12 bytes)   recursive inlining is too deep
     @ 35   StrangeBehaviour::functionA (12 bytes)   recursive inlining is too deep
@ 21   StrangeBehaviour::functionA (12 bytes)   inline (hot)
 @ 8   StrangeBehaviour::functionB (318 bytes)   inline (hot)
   @ 35   StrangeBehaviour::functionA (12 bytes)   inline (hot)
     @ 8   StrangeBehaviour::functionB (318 bytes)   recursive inlining is too deep
@ 35   StrangeBehaviour::functionA (12 bytes)   inline (hot)
 @ 8   StrangeBehaviour::functionB (318 bytes)   inline (hot)
   @ 21   StrangeBehaviour::functionA (12 bytes)   inline (hot)
    @ 8   StrangeBehaviour::functionB (318 bytes)   recursive inlining is too deep
   @ 35   StrangeBehaviour::functionA (12 bytes)   inline (hot)
     @ 8   StrangeBehaviour::functionB (318 bytes)   recursive inlining is too deep

Mostly guessing, but the bigger/inlined bytecode will cause problems with branch prediction and caching

Moshe Katz
  • 15,992
  • 7
  • 69
  • 116
k5_
  • 5,450
  • 2
  • 19
  • 27
  • That's a very interesting point. However, will the inlining/not inlining affect the compile time rather than the execution time? – J3D1 Dec 31 '16 at 19:14
  • Bigger machinecode can/will fill up processor caches, causing cache misses. It might also affect branch predictions. – k5_ Dec 31 '16 at 19:16
  • The problem with this theory is that taking out some of the lines makes it run longer. So making it smaller is somehow still making it run longer, and that's not consistent with being too large. Thanks much for the tip on the extra debugging info, though, will have to check that out. – arcy Dec 31 '16 at 19:20
  • @arcy the inlining behavior changes at exactly when you comment out that line. I see that as a strong indication that inlining is at least involved in that performance change. It changes how machine code and jumps/returns are divided between functionA and functionB. Additional jumps, broken branch prediction and cache misses lead to lower performance. – k5_ Dec 31 '16 at 19:38
  • I really like this theory. Will investigate it a bit more. Thanks for the pointer! – J3D1 Dec 31 '16 at 19:42
  • 1
    How about this? If it is inlined, it runs slower because of the recursion! Our normal model of recursion *depends* on not inlining, in order to get a new stack frame each call. But an inlined call doesn't have that, so the runtime has to implement the parts of the recursion that are normally handled by having a stack frame. According to the 'enhanced' output above, there is a maximum depth to the "inlined recursion" calls, and at that point the runtime has to do something more drastic in order to execute correctly. So I think it is a combination of the size of the routine and the recursion. – arcy Jan 01 '17 at 03:54
  • Something I read recently, adding -XX:+UnlockDiagnosticVMOptions makes your java licence commercial and chargeable depending on where you are using it. – prajeesh kumar Jan 01 '17 at 06:51
  • 3
    So I was able to use VMOptions to manually switch off inlining for `functionB()`: `-XX:CompileCommand=dontinline,com.jd.benchmarking.StrangeBehaviour::functionB` This appears to eliminate the delay with the shorter function. So at least we know that it's definitely the inlining that causes the issue. @k5_ Will you include this in your answer? I'll accept your answer. – J3D1 Jan 01 '17 at 12:02
  • Also, for information, `-XX:CompileCommand=inline` can be used to force inlining, even for the longer `functionB()`. – J3D1 Jan 01 '17 at 12:21
  • @J3D1 thanks for finding that command. I added it to the answer – k5_ Jan 01 '17 at 16:55
21

The complete answer is a combination of k5_ and Tony's answers.

The code that the OP posted omits a warmup loop to trigger HotSpot compilation before doing the benchmark; hence the 10-fold (on my computer) speedup when the print statements are included, combines both the time spent in HotSpot to compile the bytecode to CPU instructions, as well as the actual running of the CPU instructions.

If I add a separate warmup loop before the timing loop, there is only a 2.5-fold speedup with the print statement.

That indicates that both the HotSpot/JIT compilation takes longer when the method is inlined (as Tony explained) as well as that the running of the code takes longer, probably because of worse cache or branch-prediction/pipelining performance, as k5_ showed.

public static void main(String[] args) {
    // Added the following warmup loop before the timing loop
    for (int i = 0; i < 50000; i++) {
        functionA(6, 0);
    }

    long startTime = System.nanoTime();
    for (int i = 0; i < 50000; i++) {
        functionA(6, 0);
    }
    long endTime = System.nanoTime();
    System.out.format("%.2f seconds elapsed.\n", (endTime - startTime) / 1000.0 / 1000 / 1000);
}
Erwin Bolwidt
  • 30,799
  • 15
  • 56
  • 79
18

I am with @k5_, it seems that there exists a threshold to determine whether to inline a function. And if JIT compiler decided to inline it, it will cause a lot of work and time to do it as -XX:+PrintCompilation shows:

  task-id
    158   32       3       so_test.StrangeBehaviour::functionB (326 bytes)   made not entrant
    159   35       3       java.lang.String::<init> (82 bytes)
    160   36  s    1       java.util.Vector::size (5 bytes)
    1878   37 %     3       so_test.StrangeBehaviour::main @ 6 (65 bytes)
    1898   38       3       so_test.StrangeBehaviour::main (65 bytes)
    2665   39       3       java.util.regex.Pattern::has (15 bytes)
    2667   40       3       sun.misc.FDBigInteger::mult (64 bytes)
    2668   41       3       sun.misc.FDBigInteger::<init> (30 bytes)
    2668   42       3       sun.misc.FDBigInteger::trimLeadingZeros (57 bytes)
    2.51 seconds elapsed.

The upper is info without comment, the following is with a comment which reduce the method size from 326 byte to 318 byte. And you can notice the task id in the column 1 of output, is so much larger in latter, which causes more time.

  task-id
    126   35       4       so_test.StrangeBehaviour::functionA (12 bytes)
    130   33       3       so_test.StrangeBehaviour::functionA (12 bytes)   made not entrant
    131   36  s    1       java.util.Vector::size (5 bytes)
    14078   37 %     3       so_test.StrangeBehaviour::main @ 6 (65 bytes)
    14296   38       3       so_test.StrangeBehaviour::main (65 bytes)
    14296   39 %     4       so_test.StrangeBehaviour::functionB @ 2 (318 bytes)
    14300   40       4       so_test.StrangeBehaviour::functionB (318 bytes)
    14304   34       3       so_test.StrangeBehaviour::functionB (318 bytes)   made not entrant
    14628   41       3       java.util.regex.Pattern::has (15 bytes)
    14631   42       3       sun.misc.FDBigInteger::mult (64 bytes)
    14632   43       3       sun.misc.FDBigInteger::<init> (30 bytes)
    14632   44       3       sun.misc.FDBigInteger::trimLeadingZeros (57 bytes)
    14.50 seconds elapsed.

And if you change the code into following (add two line and commnet out a print line), you can see that the code size change into 326 byte, and runs faster now:

        if (StrangeBehaviour.recursionFlag) {
            int a = 1;
            int b = 1;
            if (recursionSwitch == 0) {
                if (functionA(recursionDepth - 1, 1 - recursionSwitch)) return true;
            } else {
                if (!functionA(recursionDepth - 1, 1 - recursionSwitch)) return false;
            }
        } else {
            // This block is never entered into.
            // Yet commenting out one of the lines below makes the program run slower!
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
          //System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
            System.out.println("...");
        }

New time and JIT compiler info:

    140   34       3       so_test.StrangeBehaviour::functionB (326 bytes)   made not entrant
    145   36       3       java.lang.String::<init> (82 bytes)
    148   37  s    1       java.util.Vector::size (5 bytes)
    162   38       4       so_test.StrangeBehaviour::functionA (12 bytes)
    163   33       3       so_test.StrangeBehaviour::functionA (12 bytes)   made not entrant
    1916   39 %     3       so_test.StrangeBehaviour::main @ 6 (65 bytes)
    1936   40       3       so_test.StrangeBehaviour::main (65 bytes)
    2686   41       3       java.util.regex.Pattern::has (15 bytes)
    2689   42       3       sun.misc.FDBigInteger::mult (64 bytes)
    2690   43       3       sun.misc.FDBigInteger::<init> (30 bytes)
    2690   44       3       sun.misc.FDBigInteger::trimLeadingZeros (57 bytes)
    2.55 seconds elapsed.

In conclusion:

  • When method size exceeds some limits, JIT will not inline this function;
  • And if we comment out a line, which reduce to a size below the threshold, JIT decide to inline it;
  • Inlining that function causes a lot of JIT tasks which slows down the program.

update:

Accroding to my latest trial, the answer for this question is not so easy:

As my code sample shows, a normal inline optimization will

  • accelerates program
  • and not costs much compiler work (in my test, it even cost less work when inline happens).

But in this problem, the code causes much JIT work and slows down program which seems a bug of JIT. And it is still not clear why it causes so much work of JIT.

Tony
  • 5,972
  • 2
  • 39
  • 58
  • Does this explain increased runtime, rather than increased compile time? – J3D1 Jan 01 '17 at 11:41
  • 1
    @J3D1 The compiler here I say is JIT, which runs when your program run and cause more running time rather than compile time. I will update and make it clear. – Tony Jan 02 '17 at 02:12