8

I was doing a little speed testing in C++ (MSVS) and got a strange result. I was testing the speed of a single for loop versus multiple nested for loops. Here is the code:

double testX = 0;
// Single loop executes in roughly 0.04 seconds
for( int i = 0; i < 27000000; i++ ){
    testX += 1;
}

// Nested loop executes in roughly 0.03 seconds
for( int x = 0; x < 300; x++ ){
    for( int y = 0; y < 300; y++ ){
        for( int z = 0; z < 300; z++ ){
            testX += 1;
        }
    }
}

As you can see, the speed difference is fairly obvious. I have run this many times, and those are the average times I am seeing (these are timed using glfwGetTime()).

So my question is: why? Is my test method inadequate? Am I using too few loops? I have tried searching Google, and the only similar question I could find attributed the problem to cache coherency, but since these are empty for loops, I didn't think that would really have an effect.

Any help is appreciated :)

Edit: Thanks to the comments, I realized that using empty for loops probably wasn't the best way of testing things... So I have updated my code to do some (very) simple operations to a double. I also am compiling in release mode. However, though the two methods are a lot more similar in times, the second method is still slightly faster.

And yes, this is all the test code (minus the timing/output functions, but those aren't exactly specific to the question).

Jamie Syme

3 Answers

11

The compiler won't "optimize" the loops away when the testX variable is used somewhere later in the code. When I just add one line to the code to output testX, the results are as follows:

  • single for loop: 1.218 ms
  • nested for loop: 1.218 ms

This strongly suggests that the compiler converts the nested loop into a single loop whenever possible. Using the loop index in the loop body prevents that optimisation:

Modifying the code this way

for( int i = 0; i < 27000000; i++ ){
    testX += i;
}

and

for( int x = 0; x < 300; x++ ){
    testX += x;
    for( int y = 0; y < 300; y++ ){
        testX += y;
        for( int z = 0; z < 300; z++ ){
            testX += z;
        }
    }
}

will add a little overhead to the nested loop but the execution time shows

  • single for loop: 1.224 ms
  • nested for loop: 1.226 ms

The times given here are averaged over 30,000 loop runs.

Note: The testX += x; statement runs only once per 90000 executions of the innermost statement, and testX += y; runs only once per 300. Thus the two sections above remain comparable.
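The ratios in the note can be checked with a small counting sketch. It is only an illustration: `LoopCounts` and `countNestedAdditions` are hypothetical names that do not appear in the test code, and the counters stand in for the three `testX += ...` statements.

```cpp
#include <cstdint>

// Hypothetical helper type: one counter per addition in the nested loop.
struct LoopCounts {
    std::int64_t outer = 0;   // stands in for testX += x
    std::int64_t middle = 0;  // stands in for testX += y
    std::int64_t inner = 0;   // stands in for testX += z
};

// Counts how often each of the three statements executes for a given limit.
LoopCounts countNestedAdditions(int limit) {
    LoopCounts counts;
    for (int x = 0; x < limit; x++) {
        counts.outer++;
        for (int y = 0; y < limit; y++) {
            counts.middle++;
            for (int z = 0; z < limit; z++) {
                counts.inner++;
            }
        }
    }
    return counts;
}
```

With a limit of 300 the counters come out as 300, 90000 and 27000000, so the extra work in the outer two loops is a tiny fraction of the total.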

Nested loops are not much slower than single loops, but they are not faster either, so your observation does not hold up.

And: The times you show are about 40 times the times I observed. I'd suggest carefully inspecting the compiler settings, since I ran the test on medium-speed hardware. Maybe the results of glfwGetTime() are questionable and that is the main reason for your question. Have you tried another timing scheme?
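One possible alternative timing scheme, as a minimal sketch: it assumes a C++11 standard library that provides `std::chrono::steady_clock`, and `timeMs` is a hypothetical helper name, not part of GLFW or of the timing routines used for the numbers above.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed time in milliseconds,
// measured with the monotonic std::chrono::steady_clock.
template <typename Work>
double timeMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The loop under test would go inside a lambda, for example `double ms = timeMs([&] { /* loop */ });`. A single run is noisy, so averaging over many runs is still advisable.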

Edit: To prevent this compiler optimisation, the loop limit can be chosen to be non-constant:

int lmt = rand() % 2 + 300;      // random value 300 or 301 
int big_lmt = lmt * lmt * lmt;   // random value 27000000 or 27270901

for( int i = 0; i < big_lmt; i++ ){
    testX += i;
}

for( int x = 0; x < lmt; x++ ){
    testX += x;
    for( int y = 0; y < lmt; y++ ){
        testX += y;
        for( int z = 0; z < lmt; z++ ){
            testX += z;
        }
    }
}

This keeps the compiler from knowing the loop limit at compile time.

Results (for a lmt = 300 case to be comparable):

  • single for loop: 1.213 ms
  • nested for loop: 1.216 ms

Result:

  • Nested loops are not faster than single loops.
Arno
  • I wonder whether the compiler unrolls the innermost loop and, if it doesn't, whether unrolling would give a speed boost. A smart compiler could first unroll the innermost loop and notice that the result is a chain of constant additions which can be substituted with a single addition. The next outer loop only contains a single instruction now, so it would also be a good candidate for unrolling. When the compiler does this unroll-and-substitute recursively, it could substitute the whole three for loops with a single addition. – Philipp Sep 14 '12 at 11:48
  • @Philipp: True, but it would/could do the same with the single loop unless the compiler is for some reason scared/prohibited to unroll loops with large loop counters. The question was: Why is the nested structure faster? Taking your view into account one could believe that nested loops are more likely to be partially or fully unrolled by the compiler. In other words: Use nested loops to ease compiler optimisations. Is it that easy? – Arno Sep 14 '12 at 12:21
  • I would assume that loop unrolling has some upper limit. If the compiler unrolled a loop with 27 million iterations (and the result could not be substituted with a single addition like here), it would create a binary of over 100 MB, which would definitely be much more speed-for-binary-size optimization than reasonable. – Philipp Sep 14 '12 at 12:28
  • @Philipp: Yes! I just added a few lines to the code to prohibit unrolling in order to see possible differences. – Arno Sep 14 '12 at 12:48
  • Thank you for the answer! I have replicated your (edited) code and I am now getting the result that single loops are indeed faster than nested loops :) My tests were obviously not intensive enough, and some things were being compiled out. The one thing that still worries me is that my times are still around 35 and 37 ms, while your times are around 1ms... What compiler are you using? – Jamie Syme Sep 14 '12 at 23:51
  • Test was done with MS VC++ 2010 and MS VC++ 2008 Express Edition. No noticeable difference between those two. – Arno Sep 15 '12 at 15:45
  • P.S.: About the timing: There is reason to believe that `glfwGetTime()` is flawed. I'd recommend using something different to capture the times. I created my own suite of precise timing routines. Description and download are available [here](http://www.windowstimestamp.com). – Arno Sep 15 '12 at 17:02
  • Well I have tried several timing methods, but they were all less accurate than `glfwGetTime()`. The timing results with `glfwGetTime()` are quite consistent, and if I swap the methods (nested loops first, single loop second), the timing results also change accordingly, which leads me to believe that it is not the timing method. However, I have come to the conclusion that my tests have somehow been skewed, and it makes no sense that nested loops would be quicker than a single. :P. Thanks for all your help! – Jamie Syme Sep 15 '12 at 17:40
1

If you don't use your for variables (x,y,z) inside your for loop, a smart compiler could (and should) convert your second form into a single for loop without nesting. You can prevent that optimization by removing the static predictability, for example by having the user input the x,y,z limits at runtime from stdin, or by reading them from some stream.

Furthermore, if you don't do something with your testX variable (say, printing it to stdout), a smart compiler could (and should) optimize it away, i.e. remove the dead code altogether.

So what I'm saying is that the benchmark, as it is right now, is somewhat ill-formed.
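A minimal sketch of both suggestions, assuming the limit arrives as a single integer on an input stream. `nestedSumFromStream` is a hypothetical name, and the -1 return value for unreadable input is just a convention chosen for this example.

```cpp
#include <iostream>
#include <sstream>  // std::istringstream, handy for feeding a fixed limit

// Reads the loop limit from `in`, so the trip count is unknown at compile
// time, then runs the nested loop and returns the accumulated value.
// Returns -1 if no integer could be read (example convention only).
double nestedSumFromStream(std::istream& in) {
    int limit = 0;
    if (!(in >> limit)) {
        return -1.0;
    }
    double testX = 0;
    for (int x = 0; x < limit; x++) {
        for (int y = 0; y < limit; y++) {
            for (int z = 0; z < limit; z++) {
                testX += 1;
            }
        }
    }
    return testX;  // the caller should use this value, e.g. print it
}
```

Calling it as `std::cout << nestedSumFromStream(std::cin) << '\n';` both hides the limit from the compiler and uses the result, so the loop is not dead code. A compiler may still simplify a body this trivial, so checking the generated code remains worthwhile.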

Unai Vivi
0

Your best bet would be to look at the disassembly and check for differences in the generated code. I guess the compiler does some pretty heavy optimizations there.

Vorber