while (n > 1) is 25% faster than while (n)?

Question

I have two logically equivalent functions:

long ipow1(int base, int exp) {
    // HISTORICAL NOTE:
    // This wasn't here in the original question; I edited it in.
    if (exp == 0) return 1;

    long result = 1;

    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }

    return result * base;
}

long ipow2(int base, int exp) { 
    long result = 1;

    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }

    return result;
}

NOTICE:

These functions are equivalent because ipow1 returns result * base (handling the case where exp is, or has been reduced to, 1), whereas ipow2 returns result.

Strangely enough, with both -O3 and -O0, ipow1 consistently outperforms ipow2 by about 25%. How is this possible?

I'm on Windows 7, x64, gcc 4.5.2 and compiling with gcc ipow.c -O0 -std=c99.

And this is my profiling code:

int main(int argc, char *argv[]) {
    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick;
    LARGE_INTEGER start_ticks, end_ticks, cputime;

    double totaltime = 0;
    int repetitions = 10000;
    int rep = 0;
    int nopti = 0;

    for (rep = 0; rep < repetitions; rep++) {
        if (!QueryPerformanceFrequency(&ticksPerSecond)) printf("\tno go QueryPerformance not present");
        if (!QueryPerformanceCounter(&tick)) printf("no go counter not installed");  
        QueryPerformanceCounter(&start_ticks); 

        /* start real code */

        for (int i = 0; i < 55; i++) {
            for (int j = 0; j < 11; j++) {
                nopti = ipow1(i, j); // or ipow2
            }
        }

        /* end code */

        QueryPerformanceCounter(&end_ticks); 
        cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
        totaltime += (double)cputime.QuadPart / (double)ticksPerSecond.QuadPart;
    }   

    // nopti is an int, so it is printed with %d rather than %ld
    printf("\tTotal elapsed CPU time:   %.9f  sec  with %d repetitions - %d:\n", totaltime, repetitions, nopti);

    return 0;
}

@quasiverse: Read the code, the two functions are logically equal. — orlp, Sep 17 '11 at 02:23
`ipow1` performs two "steps" less than `ipow2` (`exp > 1` opposed to `exp != 0`), but returns `result * base` instead of `result` making them effectively equal. Try it out in your mind. — orlp, Sep 17 '11 at 02:26
With GCC 4.5.2 on Linux, (while n) vs. (while n > 1) produce the exact same set of assembler instructions. You can see this by writing a simple test case and using `gcc -S` to see the assembler. — Brian Roach, Sep 17 '11 at 02:31
@nightcracker: Dude, it's a simple mistake that a lot of people are going to make because of the way it's presented. Three people have made the same mistake already, and many more would have if not for the comments here. You DO NOT need to be smug about it. — jason, Sep 17 '11 at 02:31
While the two calculations give the same result, they're not the same -- ipow2 ends up doing two extra branches that the compiler is not smart enough to remove. — Chris Dodd, Sep 17 '11 at 02:35
Is that enough to be responsible for 25% of the performance? This might be possible though, because `exp` is usually quite low. — orlp, Sep 17 '11 at 02:41
@Brian: I'm quite sure the compiler doesn't generate the same assembly, because the results are sometimes different. — Ben Voigt, Sep 17 '11 at 03:19
@Ben - I was referring to the title of the question. If you write a short test case for only `(while n)` vs. `(while n > 1)` (and of course include a simple decrement in the loop) , there is *no* difference. — Brian Roach, Sep 17 '11 at 03:24
while(n>1) will either use twice the number of instructions or the same number of instructions as while(n) (for many of the known instruction sets) if you compare those two in isolation. Whether while(n) can save an instruction depends on what you do to n and where. The blanket statement that there is no difference is incorrect. The reason is simple: many processors have a Z (zero) flag to indicate whether a result is zero; they don't have a greater-than-one flag, so that test takes one or more flags plus an extra instruction or two to perform the compare with one. — old_timer, Sep 17 '11 at 23:06
The performance hit was related to while(n) vs while(n>1), but not directly tied to the instruction you save with the while(n) the way it was used here. So the question is both incorrect and correct depending on how you look at it. If the author knew exactly why there was a performance hit do you suppose they would be asking the question here? Not knowing the reason means you should give them a little freedom to make assumptions. — old_timer, Sep 17 '11 at 23:14
The up-front aggressiveness about whether the functions give equivalent results took this thing in the wrong direction, as it is a very good performance exercise. I might borrow these loops to use as interview questions for software engineer candidates if you don't mind: both the question of when you get the same results and whether you expect one to perform better than the other. — old_timer, Sep 17 '11 at 23:16
I edited in the if statement to prevent future discussion about this. I'm sorry for being sloppy, it should've been there from the start, but I don't see this as a reason for 7 downvotes. — orlp, Sep 17 '11 at 23:40

Ben Voigt · Answer 1 · 2011-09-17T03:23:03.800

10

No, really, the two ARE NOT equivalent. ipow2 returns correct results when ipow1 doesn't.

http://ideone.com/MqyqU

P.S. I don't care how many comments you leave "explaining" why they're the same, it takes only a single counter-example to disprove your claims.

P.P.S. -1 on the question for your insufferable arrogance toward everyone who already tried to point this out to you.
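The counter-example is easy to reproduce locally. The following is a minimal sketch, not the code behind the ideone link: `ipow1_unguarded` is a hypothetical name for `ipow1` as originally posted (before the `exp == 0` check was edited in), `count_mismatches` is a hypothetical helper, and both power functions use unsigned 64-bit types here, an assumption made so that overflow wraps instead of being undefined.

```c
#include <stdint.h>

/* ipow1 as originally posted, without the exp == 0 guard.
   Hypothetical name; unsigned 64-bit types are an assumption so that
   overflow wraps around instead of being undefined behaviour. */
uint64_t ipow1_unguarded(uint64_t base, unsigned exp) {
    uint64_t result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result * base;   /* for exp == 0 this returns base, not 1 */
}

/* ipow2 from the question, with the same retyping. */
uint64_t ipow2(uint64_t base, unsigned exp) {
    uint64_t result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}

/* Counts the (base, exp) pairs with base in [0, max_base] and
   exp in [min_exp, max_exp] on which the two functions disagree. */
unsigned count_mismatches(unsigned max_base, unsigned min_exp, unsigned max_exp) {
    unsigned mismatches = 0;
    for (unsigned b = 0; b <= max_base; b++)
        for (unsigned e = min_exp; e <= max_exp; e++)
            if (ipow1_unguarded(b, e) != ipow2(b, e)) mismatches++;
    return mismatches;
}
```

With these definitions, `ipow1_unguarded(7, 0)` yields 7 where `ipow2(7, 0)` yields 1, while `count_mismatches(54, 1, 10)` is 0: the two agree once `exp` is at least 1.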

edited Sep 17 '11 at 03:23

answered Sep 17 '11 at 03:18

Ben Voigt


Ah you're right, you must know that I had `if (exp == 0) return 1` in previous versions but I removed it because I thought it was obsolete, but it's not. The question still stands though. – orlp Sep 17 '11 at 10:13
...or maybe not, I have to test, that one branch might compensate for the difference of the rest of the code. Also, could you explain my "insufferable arrogance", because I didn't spot a comment yet that pointed out this difference to me. – orlp Sep 17 '11 at 10:20
1

@nightcracker, I have enough rep to see deleted answers and the comments you left on them. Jason's answer told you they were different for `exp = 0`. I quote: "These are not logically equivalent. You need `exp > 0` or `exp >=1` for them to be equivalent." He was exactly correct, if and only if `exp > 0` then `ipow1(base, exp) == ipow2(base, exp)`. But you responded "Wanna bet? Read the return statements and do some thinking. -1". – Ben Voigt Sep 17 '11 at 14:24
2

@Ben Voigt: Oh damn you're right, I mis-interpreted his answer, I thought he was talking about the `while` statement. Now that I re-read it I understand it. I have apologized to him for being so aggressive though. – orlp Sep 17 '11 at 17:44

score 3 · Answer 2 · answered Sep 17 '11 at 02:29

It's because with while (exp > 1) the loop keeps running only while exp is at least 2 (its last pass executes with exp = 2 or 3, shifts it down to 1, and then the loop ends). With while (exp), the loop keeps running while exp is at least 1 (its last pass executes with exp = 1, shifts it down to 0, and then the loop ends).

So with while (exp) you have an extra iteration, which takes the extra time to run.

EDIT: Even with the multiplication after the loop with the exp>1 while, keep in mind that the multiplication is not the only thing in the loop.

score 2 · Accepted Answer · answered Sep 17 '11 at 04:40

If you don't want to read all of this, skip to the bottom: I come up with a 21% difference just by analysis of the code.

Different systems, compiler versions, and even the same compiler version built by different people/distros will give different instruction mixes; this is just one example of what you might get.

long ipow1(int base, int exp) {
    long result = 1;

    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }

    return result * base;
}

long ipow2(int base, int exp) {
    long result = 1;

    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }

    return result;
}

0000000000000000 <ipow1>:
   0:   83 fe 01                cmp    $0x1,%esi
   3:   ba 01 00 00 00          mov    $0x1,%edx
   8:   7e 1d                   jle    27 <ipow1+0x27>
   a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  10:   40 f6 c6 01             test   $0x1,%sil
  14:   74 07                   je     1d <ipow1+0x1d>
  16:   48 63 c7                movslq %edi,%rax
  19:   48 0f af d0             imul   %rax,%rdx
  1d:   d1 fe                   sar    %esi
  1f:   0f af ff                imul   %edi,%edi
  22:   83 fe 01                cmp    $0x1,%esi
  25:   7f e9                   jg     10 <ipow1+0x10>
  27:   48 63 c7                movslq %edi,%rax
  2a:   48 0f af c2             imul   %rdx,%rax
  2e:   c3                      retq   
  2f:   90                      nop

0000000000000030 <ipow2>:
  30:   85 f6                   test   %esi,%esi
  32:   b8 01 00 00 00          mov    $0x1,%eax
  37:   75 0a                   jne    43 <ipow2+0x13>
  39:   eb 19                   jmp    54 <ipow2+0x24>
  3b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  40:   0f af ff                imul   %edi,%edi
  43:   40 f6 c6 01             test   $0x1,%sil
  47:   74 07                   je     50 <ipow2+0x20>
  49:   48 63 d7                movslq %edi,%rdx
  4c:   48 0f af c2             imul   %rdx,%rax
  50:   d1 fe                   sar    %esi
  52:   75 ec                   jne    40 <ipow2+0x10>
  54:   f3 c3                   repz retq

Isolating the loops:

    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }


//if exp & 1 not true jump to 1d to skip   
  10:   40 f6 c6 01             test   $0x1,%sil
  14:   74 07                   je     1d <ipow1+0x1d>
//result *= base  
  16:   48 63 c7                movslq %edi,%rax
  19:   48 0f af d0             imul   %rax,%rdx
//exp>>=1  
  1d:   d1 fe                   sar    %esi
//base *= base  
  1f:   0f af ff                imul   %edi,%edi
//while(exp>1) stay in the loop  
  22:   83 fe 01                cmp    $0x1,%esi
  25:   7f e9                   jg     10 <ipow1+0x10>

Comparing something to zero normally saves you an instruction, and you can see that here:

    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }


//base *= base  
  40:   0f af ff                imul   %edi,%edi
//if exp & 1 not true jump to skip  
  43:   40 f6 c6 01             test   $0x1,%sil
  47:   74 07                   je     50 <ipow2+0x20>
//result *= base  
  49:   48 63 d7                movslq %edi,%rdx
  4c:   48 0f af c2             imul   %rdx,%rax
//exp>>=1  
  50:   d1 fe                   sar    %esi
//no need for a compare  
  52:   75 ec                   jne    40 <ipow2+0x10>

Your timing method is going to generate a lot of error/chaos. Depending on the beat frequency of the loop and the accuracy of the timer you can create a lot of gain in one and a lot of loss in another. This method normally gives better accuracy:

starttime = ...
for (rep = bignumber; rep; rep--) {
    // code under test
    ...
}
endtime = ...
total = endtime - starttime;

Of course, if you are running this on an operating system, the timing is going to have a decent amount of error in it anyway.

Also, you want to declare your timer variables volatile; this helps keep the compiler from rearranging the order of execution (been there, seen that).
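As a rough illustration of timing the whole repetition loop with a single pair of timer reads, here is a minimal sketch. It assumes the standard `clock()` from `<time.h>` in place of `QueryPerformanceCounter`, so its resolution may be coarser; `time_passes`, `pow_fn`, `ipow2_wrapping` and `sink` are hypothetical names, and the function under test is retyped to unsigned arithmetic so that overflow wraps rather than being undefined.

```c
#include <time.h>

/* Hypothetical signature for the function under test. Unsigned types are
   an assumption so that overflow wraps instead of being undefined. */
typedef unsigned long long (*pow_fn)(unsigned long long base, unsigned exp);

/* Stand-in for the code under test: ipow2 from the question, retyped. */
unsigned long long ipow2_wrapping(unsigned long long base, unsigned exp) {
    unsigned long long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}

/* Each result is stored through a volatile so it is not discarded
   as dead code. */
static volatile unsigned long long sink;

/* Runs `reps` passes over the question's 55 x 11 input grid, reading the
   clock once before and once after, and returns the elapsed seconds. */
double time_passes(pow_fn fn, unsigned long reps) {
    clock_t start = clock();
    for (unsigned long rep = reps; rep; rep--) {
        for (unsigned i = 0; i < 55; i++)
            for (unsigned j = 0; j < 11; j++)
                sink = fn(i, j);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as `time_passes(ipow2_wrapping, 100000)` then gives one total for all repetitions, so the timer overhead is paid once rather than once per repetition.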

If we look at this from the perspective of the base multiplies:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int mults;

long ipow1(int base, int exp) {
    long result = 1;

    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
        mults++;
    }

    result *= base;

    return result;
}

long ipow2(int base, int exp) {
    long result = 1;

    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
        mults++;
    }

    return result;
}


int main ( void )
{
    int i;
    int j;

    mults = 0;
        for (i = 0; i < 55; i++) {
            for (j = 0; j < 11; j++) {
                ipow1(i, j); // or ipow2
            }
        }
    printf("mults %u\n",mults);

    mults=0;

        for (i = 0; i < 55; i++) {
            for (j = 0; j < 11; j++) {
                ipow2(i, j);
            }
        }
    printf("mults %u\n",mults);

}

There are:

mults 1045
mults 1595

That is about 50% more for ipow2(). Actually, it is not just the multiplies: you are going through the loop 50% more times.
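The two totals can be traced back to per-exponent loop counts: for every exp of at least 1, `while (exp)` runs exactly one more pass than `while (exp > 1)`. A minimal sketch of that bookkeeping follows; `iterations_gt1`, `iterations_nonzero` and `total_over_exponents` are hypothetical helper names that mirror only the loop conditions, not the multiplies.

```c
/* Number of times the body of `while (exp > 1)` runs (the ipow1 loop). */
unsigned iterations_gt1(unsigned exp) {
    unsigned n = 0;
    while (exp > 1) { exp >>= 1; n++; }
    return n;
}

/* Number of times the body of `while (exp)` runs (the ipow2 loop). */
unsigned iterations_nonzero(unsigned exp) {
    unsigned n = 0;
    while (exp) { exp >>= 1; n++; }
    return n;
}

/* Sums a per-exponent count over exp = 0 .. max_exp. */
unsigned total_over_exponents(unsigned (*count)(unsigned), unsigned max_exp) {
    unsigned total = 0;
    for (unsigned e = 0; e <= max_exp; e++) total += count(e);
    return total;
}
```

Over exp = 0 to 10 this gives 19 passes for the ipow1 loop and 29 for the ipow2 loop; multiplied by the 55 bases that is 1045 and 1595, a ratio of roughly 1.53.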

ipow1() gets a little back on the other multiplies:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int mults;

long ipow1(int base, int exp) {
    long result = 1;

    while (exp > 1) {
        if (exp & 1) mults++;
        exp >>= 1;
        base *= base;
    }
    mults++;

    return result;
}

long ipow2(int base, int exp) {
    long result = 1;

    while (exp) {
        if (exp & 1) mults++;
        exp >>= 1;
        base *= base;
    }

    return result;
}


int main ( void )
{
    int i;
    int j;

    mults = 0;
        for (i = 0; i < 55; i++) {
            for (j = 0; j < 11; j++) {
                ipow1(i, j); // or ipow2
            }
        }
    printf("mults %u\n",mults);

    mults=0;
        for (i = 0; i < 55; i++) {
            for (j = 0; j < 11; j++) {
                ipow2(i, j);
            }
        }
    printf("mults %u\n",mults);

}

ipow1() performs the result *= base multiply more times than ipow2():

mults 990
mults 935

Being a long * int multiply can make these more expensive, but not enough to make up for the losses around the loop in ipow2().

Even without disassembling, you can make a rough guess at the operations/instructions you hope the compiler uses. This accounts for processors in general, not necessarily x86; some processors will run this code better than others (from a number-of-instructions-executed perspective, not counting all the other factors).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int ops;

long ipow1(int base, int exp) {
    long result = 1;
    ops++; //result = immediate
    while (exp > 1) {
        ops++; // compare exp - 1
        ops++; // conditional jump
            //if (exp & 1)
        ops++; //exp&1
        ops++; //conditional jump
        if (exp & 1)
        {
            result *= base;
            ops++;
        }
        exp >>= 1;
        ops++;
        //ops+=?; //using a signed number can cost you this on some systems
        //always use unsigned unless you have a specific reason to use signed.
        //if this had been a short or char variable it might cost you even more
        //operations
        //if this needs to be signed it is what it is, just be aware of
        //the cost
        base *= base;
        ops++;
    }
    result *= base;
    ops++;
    return result;
}

long ipow2(int base, int exp) {
    long result = 1;
    ops++;
    while (exp) {
        //ops++; //cmp exp-0, often optimizes out;
        ops++; //conditional jump
        //if (exp & 1)
        ops++;
        ops++;
        if (exp & 1)
        {
            result *= base;
            ops++;
        }
        exp >>= 1;
        ops++;
        //ops+=?; //right shifting a signed number
        base *= base;
        ops++;
    }
    return result;
}



int main ( void )
{
    int i;
    int j;

    ops = 0;
        for (i = 0; i < 55; i++) {
            for (j = 0; j < 11; j++) {
                ipow1(i, j); // or ipow2
            }
        }
    printf("ops %u\n",ops);

    ops=0;
        for (i = 0; i < 55; i++) {
            for (j = 0; j < 11; j++) {
                ipow2(i, j);
            }
        }
    printf("ops %u\n",ops);

}

Assuming I counted all the major operations and didn't unfairly give one function more than another:

ops 7865
ops 9515

ipow2 is 21% slower using this analysis.
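For reference, the 21% is simply the ratio of the two totals. The sketch below restates the same counting model per exponent, assuming the weights used in the instrumented code (1 op for the initial assignment, 6 or 5 ops per loop pass, 1 op per result multiply); `ops_ipow1`, `ops_ipow2`, `ops_total` and `ops_ratio` are hypothetical names.

```c
/* Modelled op count for ipow1 at a given exponent: 1 for the initial
   assignment, 6 per loop pass (compare, jump, test, jump, shift, square),
   1 per result multiply inside the loop, and 1 for the final multiply. */
unsigned ops_ipow1(unsigned exp) {
    unsigned ops = 1;
    while (exp > 1) {
        ops += 6 + (exp & 1);
        exp >>= 1;
    }
    return ops + 1;
}

/* Modelled op count for ipow2: 1 for the initial assignment, 5 per loop
   pass (no separate compare), and 1 per result multiply. */
unsigned ops_ipow2(unsigned exp) {
    unsigned ops = 1;
    while (exp) {
        ops += 5 + (exp & 1);
        exp >>= 1;
    }
    return ops;
}

/* Sums a per-exponent op count over exp = 0 .. max_exp. */
unsigned ops_total(unsigned (*count)(unsigned), unsigned max_exp) {
    unsigned total = 0;
    for (unsigned e = 0; e <= max_exp; e++) total += count(e);
    return total;
}

/* Ratio of modelled ipow2 ops to modelled ipow1 ops over exp = 0 .. max_exp. */
double ops_ratio(unsigned max_exp) {
    return (double)ops_total(ops_ipow2, max_exp) /
           (double)ops_total(ops_ipow1, max_exp);
}
```

Over exp = 0 to 10 the model gives 143 ops per base for ipow1 and 173 for ipow2; times 55 bases that is 7865 and 9515, and 173/143 is about 1.21.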

I think the big killer is the 50% more times through the loop. Granted, it is data dependent; you might find inputs in a benchmark test that make the difference between the functions greater or smaller than the 25% you are seeing.

score 0 · Answer 4 · answered Sep 17 '11 at 02:43

0

Your functions are not "logically equal".

while (exp > 1){...}

is NOT logically equal to

while (exp){...}

Why do you say it is?

answered Sep 17 '11 at 02:43

user732933


2

Have you read the comments and the __bold__ part of my question? – orlp Sep 17 '11 at 02:45
2

*facepalm* ... DID YOU READ THE COMMENTS!? – flight Sep 17 '11 at 02:45

score 0 · Answer 5 · answered Sep 17 '11 at 03:00

Does this really generate the same assembly code? When I tried (with gcc 4.5.1 on OpenSuse 11.4, I will admit) I found slight differences.

ipow1.s:

cmpl    $1, -24(%rbp)
jg  .L4
movl    -20(%rbp), %eax
cltq
imulq   -8(%rbp), %rax
leave

ipow2.s:

cmpl    $0, -24(%rbp)
jne .L4
movq    -8(%rbp), %rax
leave

Perhaps the processor's branch prediction is just more effective with jg than with jne? It seems unlikely that one branch instruction would run 25% faster than another (especially when cmpl has done most of the heavy lifting).
