
Consider the following code:

typedef void (*Fn)();

volatile long sum = 0;

inline void accu() {
    sum += 4;
}

static const Fn map[4] = {&accu, &accu, &accu, &accu};

int main(int argc, char** argv) {
    static const long N = 10000000L;
    if (argc == 1)
    {
        for (long i = 0; i < N; i++)
        {
            accu();
            accu();
            accu();
            accu();
        }
    }
    else
    {
        for (long i = 0; i < N; i++)
        {
            for (int j = 0; j < 4; j++)
                (*map[j])();
        }
    }
}

When I compiled it with:

g++ -O3 test.cpp

I expected the first branch to run faster because the compiler could inline the calls to accu, while the second branch could not be inlined because accu is called through a function pointer stored in an array.

But the results surprised me:

time ./a.out 

real    0m0.108s
user    0m0.104s
sys 0m0.000s

time ./a.out 1

real    0m0.095s
user    0m0.088s
sys 0m0.004s

I don't understand why, so I did an objdump:

objdump -DStTrR a.out > a.s

and the disassembly doesn't seem to explain the performance result I got:

8048300 <main>:
8048300:       55                      push   %ebp
8048301:       89 e5                   mov    %esp,%ebp
8048303:       53                      push   %ebx
8048304:       bb 80 96 98 00          mov    $0x989680,%ebx
8048309:       83 e4 f0                and    $0xfffffff0,%esp
804830c:       83 7d 08 01             cmpl   $0x1,0x8(%ebp)
8048310:       74 27                   je     8048339 <main+0x39>
8048312:       8d b6 00 00 00 00       lea    0x0(%esi),%esi
8048318:       e8 23 01 00 00          call   8048440 <_Z4accuv>
804831d:       e8 1e 01 00 00          call   8048440 <_Z4accuv>
8048322:       e8 19 01 00 00          call   8048440 <_Z4accuv>
8048327:       e8 14 01 00 00          call   8048440 <_Z4accuv>
804832c:       83 eb 01                sub    $0x1,%ebx
804832f:       90                      nop
8048330:       75 e6                   jne    8048318 <main+0x18>
8048332:       31 c0                   xor    %eax,%eax
8048334:       8b 5d fc                mov    -0x4(%ebp),%ebx
8048337:       c9                      leave
8048338:       c3                      ret
8048339:       b8 80 96 98 00          mov    $0x989680,%eax
804833e:       66 90                   xchg   %ax,%ax
8048340:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048346:       83 c2 04                add    $0x4,%edx
8048349:       89 15 18 a0 04 08       mov    %edx,0x804a018
804834f:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048355:       83 c2 04                add    $0x4,%edx
8048358:       89 15 18 a0 04 08       mov    %edx,0x804a018
804835e:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048364:       83 c2 04                add    $0x4,%edx
8048367:       89 15 18 a0 04 08       mov    %edx,0x804a018
804836d:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048373:       83 c2 04                add    $0x4,%edx
8048376:       83 e8 01                sub    $0x1,%eax
8048379:       89 15 18 a0 04 08       mov    %edx,0x804a018
804837f:       75 bf                   jne    8048340 <main+0x40>
8048381:       eb af                   jmp    8048332 <main+0x32>
8048383:       90                      nop
...
8048440 <_Z4accuv>:
8048440:       a1 18 a0 04 08          mov    0x804a018,%eax
8048445:       83 c0 04                add    $0x4,%eax
8048448:       a3 18 a0 04 08          mov    %eax,0x804a018
804844d:       c3                      ret
804844e:       90                      nop
804844f:       90                      nop

It seems the direct-call branch is definitely doing less work than the function-pointer branch. But why does the function-pointer branch run faster?

And note that time is not the only measurement I used: I also measured with clock_gettime and got similar results.
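For reference, the clock_gettime measurement was along these lines (a minimal sketch, not the exact harness I used):

#include <stdio.h>
#include <time.h>

int main() {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);    /* start stamp */
    /* ... run one of the two loops here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);    /* end stamp */
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.6f s\n", s);
    return 0;
}

(On older glibc you may need to link with -lrt for clock_gettime.)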

Ming
  • Sorry for format: 'root@ubuntu:/cm/gt# g++ haha.cpp -o za.exx -O3 root@ubuntu:/cm/gt# time ./za.exx real 0m0.092s user 0m0.084s sys 0m0.004s root@ubuntu:/cm/gt# time ./za.exx 1 real 0m0.146s user 0m0.072s sys 0m0.000s' – MIKU_LINK May 11 '13 at 03:36
  • Both codes run in the same time on gcc 4.7. – mfontanini May 11 '13 at 03:47

1 Answer


It is not completely true that the second branch cannot be inlined. In fact, all of the function pointers stored in the array are visible at compile time, so the compiler can substitute the indirect function calls with direct calls, and it does so: the four direct call <_Z4accuv> instructions at 8048318-8048327 belong to the function-pointer branch, with the inner loop over j unrolled. In theory it could go further and inline them, in which case the two branches would become identical. But this particular compiler is not smart enough to do so.
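In other words, after constant propagation the function-pointer loop behaves as if it had been written like this (a sketch of the effective transformation, not literal compiler output):

// The compiler knows every element of map is &accu, so the inner loop
// is unrolled and each (*map[j])() becomes a direct, but still not
// inlined, call to accu.
for (long i = 0; i < N; i++)
{
    accu();  // was (*map[0])()
    accu();  // was (*map[1])()
    accu();  // was (*map[2])()
    accu();  // was (*map[3])()
}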

As a result, the first branch is optimized "better", with one exception: the compiler is not allowed to optimize away accesses to the volatile variable sum. As you can see from the disassembled code, this produces store instructions immediately followed by load instructions that depend on them:

mov    %edx,0x804a018
mov    0x804a018,%edx

Intel's Software Optimization Manual (section 3.6.5.2) warns against arranging instructions like this:

... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.

The second branch avoids this problem because the additional call/return instructions (plus the loop overhead) sit between each store to sum and the next load of it. So it performs better.
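Schematically, in the non-inlined loop (pieced together from the disassembly above), a ret and the next call separate every store from the following load:

call   8048440 <_Z4accuv>
       mov    0x804a018,%eax    # load sum
       add    $0x4,%eax
       mov    %eax,0x804a018    # store sum
       ret                      # ret + next call separate this store...
call   8048440 <_Z4accuv>
       mov    0x804a018,%eax    # ...from this load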

A similar improvement can be made to the first branch if we add some (not very expensive) calculation in between the calls:

long x1 = 0;
for (long i = 0; i < N; i++)
{
  x1 ^= i<<8;   // cheap independent work separating the previous
  accu();       // store to sum from the next load of it
  x1 ^= i<<1;
  accu();
  x1 ^= i<<2;
  accu();
  x1 ^= i<<4;
  accu();
}
sum += x1;      // keep x1 live so the xors cannot be optimized away
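The xor/shift operations are cheap and do not touch sum, so they can execute while the previous store is still in flight, and the final sum += x1 keeps x1 live so the compiler cannot delete the extra work.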
Evgeny Kluev
  • This effect can also be verified by changing things in the example program a bit so that there are 4 separate volatile variables and 4 separate inline functions. The inlined version takes half the time of the non-inlined version. – Michael Burr May 11 '13 at 08:38
  • 1
    I think the compiler writer's opinion is that either the variable is not volatile and it won't store and read immediately, or the variable is volatile and the user asked for it, so they won't make any effort to optimize. – Marc Glisse May 11 '13 at 08:48
  • The reason I used the volatile variable is to prevent the compiler from optimizing the whole function away. – Ming May 13 '13 at 19:39
  • the compiler I used was gcc 4.6.3 – Ming May 13 '13 at 19:39
  • I tried using a volatile array and the direct call is more than double the speed now. – Ming May 13 '13 at 19:44
  • oh, btw, I still don't understand why gcc doesn't inline the function pointer array branch. Any compiler that's smart enough to do this? – Ming May 13 '13 at 19:48
  • @Ming: I don't know it either. Version 4.6.3 is pretty old. Probably gcc 4.7 does this. Comment by mfontanini suggests such a conclusion, but I cannot check this because I also use gcc 4.6.3. Clang version 3.0 does this optimization. Intel's compiler version 12.1.0 does not do it. BTW there are other methods to prevent optimizing function away: you could add some cheap calculations there or maybe find compiler option that disables exactly this optimization (sorry, I don't know which one). – Evgeny Kluev May 13 '13 at 20:19
  • @Ming: Gcc does optimizations in a certain order (it doesn't cycle). By the time it discovers which function will be called, inlining is already done. – Marc Glisse May 20 '13 at 21:30