11

I have two files:

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

int main()
{
    unsigned int input;
    scanf("%u", &input);

    switch (input)
    {
        case 0: print0(); break;
        case 1: print1(); break;
        case 2: print2(); break;
        case 3: print3(); break;
        case 4: print4(); break;
    }
    return 0;
}

and

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

int main()
{
    unsigned int input;
    scanf("%u", &input);

    static void (*jt[])() = { print0, print1, print2, print3, print4 };
    jt[input]();
    return 0;
}

I expected them to be compiled to almost identical assembly code. In both cases jump tables are generated, but the calls in the first file are represented by jmp, while those in the second one by call. Why doesn't the compiler optimise the calls? Is it possible to hint to gcc that I would like to see jmps instead of calls?

Compiled with gcc -Wall -Winline -O3 -S -masm=intel, GCC version 4.6.2. GCC 4.8.0 produces slightly less code, but the problem still persists.

UPD: Defining jt as const void (* const jt[])() = { print0, print1, print2, print3, print4 }; and making the functions static const inline didn't help: http://ideone.com/97SU0

skink
  • Is there a performance difference? And I don't believe that the calls are inlined in the second case. Can you post the assembly? – Mysticial May 15 '12 at 13:56
  • Perhaps because in the latter case the functions are called indirectly? What happens if you make jt[] a const array of 5 const pointers to functions? – Alexey Frunze May 15 '12 at 13:57
  • @Alex - you beat me! A non-const pointer array can be modified at run-time. – Martin James May 15 '12 at 13:58
  • Can you try making jt const in the second example? Compiler has to assume it might change to something that requires call. @Alex was faster... – elmo May 15 '12 at 13:59
  • @Mysticial, this is going to be an emulator of the 6502 microprocessor. While there might be no serious performance difference, I would really like to know the explanation of this gcc behaviour. First file assembly: http://ideone.com/378wv second: http://ideone.com/DC6oh – skink May 15 '12 at 14:01
  • @Joulukuusi Yeah, that's what I thought. The calls are not inlined in the second case. (I don't see how they can be inlined without degenerating into the first case). So you might want to change your wording of that sentence. – Mysticial May 15 '12 at 14:03
  • @Alex, @Martin James, @elmo, didn't help, there's still a `call`. I'll update my question in a moment. – skink May 15 '12 at 14:11
  • TBH, it's a 6502 8-bit processor with one byte opcodes, so why not just implement the emulator in assembler anyway? A 256-byte jump table would be easy. – Martin James May 15 '12 at 14:40
  • @MartinJames, I think I'll do that eventually, thanks. Currently I'm just playing around to ensure that I can implement it. – skink May 15 '12 at 15:01

6 Answers

8

Compiler writers have a lot of work to do. Obviously they prioritize the work that has the biggest and fastest payoff.

Switch statements are common in all kinds of code, so any optimizations performed on them will have an effect on lots of programs.

This code

jt[input](); 

is a lot less common, and therefore sits a lot further down the compiler designers' TODO-list. Perhaps they haven't (yet) found it worth the effort to try to optimize it? Will that win them any known benchmarks? Or improve some widely used codebase?

Bo Persson
5

Because the array of function pointers is mutable, the compiler has decided it can't assume the pointers won't be changed. You might find the assembly different in C++, and/or after making jt const.
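
For reference, a minimal sketch of the fully-const form of the table (the clean version of what the question's update tries) is below; note that, as the comments here and the question's update report, this alone did not remove the call in GCC 4.6:

#include <stdio.h>

static void print0(void) { printf("Zero"); }
static void print1(void) { printf("One"); }
static void print2(void) { printf("Two"); }
static void print3(void) { printf("Three"); }
static void print4(void) { printf("Four"); }

int main(void)
{
    unsigned int input;
    if (scanf("%u", &input) != 1 || input > 4)
        return 1;

    /* const array of const function pointers: the table itself can live in read-only data */
    static void (* const jt[])(void) = { print0, print1, print2, print3, print4 };
    jt[input]();
    return 0;
}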

Matt Joiner
  • I don't particularly buy this argument. It's trivial to prove that the array will never change. It is declared local to `main` and is never modified within `main()` - nor is its address ever taken. Whether or not the compiler will do this data dependency analysis is a different story though. – Mysticial May 15 '12 at 14:07
  • That, and making `jt` `const` doesn't change anything. – Michael Burr May 15 '12 at 14:13
  • Thanks! Unfortunately, changing declaration of `jt` to `const void (* const jt[])()` didn't help. – skink May 15 '12 at 14:15
3

My guess is that this optimization has to do with the fact that you have a return statement immediately after your switch: the optimizer realizes that it can piggyback on the returns embedded in your print0..print4 functions, and reduces each call to a jmp; the ret the CPU hits inside the selected printN then serves as the return from main.

Try inserting some code after the switch to see if the compiler would replace jmp with call.

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

int main()
{
    unsigned int input;
    scanf("%u", &input);

    switch (input)
    {
        case 0: print0(); break;
        case 1: print1(); break;
        case 2: print2(); break;
        case 3: print3(); break;
        case 4: print4(); break;
    }
    /* Inserting this line should force the compiler to use call */
    printf("\nDone");
    return 0;
}

EDIT: Your code on ideone has a jmp for a different reason: it's equivalent to this:

static const char* LC0 ="Zero";
static const char* LC1 ="One";
static const char* LC2 ="Two";
static const char* LC3 ="Three";
static const char* LC4 ="Four";

int main()
{
    unsigned int input;
    scanf("%u", &input);

    switch (input)
    {
        case 0: printf(LC0); break;
        case 1: printf(LC1); break;
        case 2: printf(LC2); break;
        case 3: printf(LC3); break;
        case 4: printf(LC4); break;
    }
    printf("\nDone");
    return 0;
}
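
One way to keep the compiler from merging the calls into a single shared continuation (a sketch of the puts/fputs/fprintf variant also mentioned in the comments below) is to give each case a distinct ending:

#include <stdio.h>

int main(void)
{
    unsigned int input;
    if (scanf("%u", &input) != 1)
        return 1;

    switch (input)
    {
        case 0: puts("Zero"); break;            /* ends in puts */
        case 1: fputs("One", stdout); break;    /* ends in fputs */
        case 2: fprintf(stdout, "Two"); break;  /* ends in fprintf */
        case 3: printf("Three"); break;
        case 4: printf("%s", "Four"); break;
    }
    printf("\nDone");
    return 0;
}
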
Sergey Kalinichenko
  • Thanks! Though, the compiler outputs `jmp` even in this case: http://ideone.com/FBHuZ – skink May 15 '12 at 14:14
  • Oh, true. I modified the first file a little: http://ideone.com/GJPQi Still, there's `jmp` in the output: http://ideone.com/F9AMo – skink May 15 '12 at 14:36
  • @Joulukuusi But again all jumps lead to the common continuation (although it comes at an earlier point in the assembly output). – Sergey Kalinichenko May 15 '12 at 14:39
  • Could you please provide a minimal test, if possible? I can't seem to think of one with different endings - if I throw `printf` away, the compiler throws away the rest. – skink May 15 '12 at 14:52
  • @Joulukuusi Try replacing one `printf` with `puts`, another one with `fputs("...", stdout)`, and yet another one with `fprintf(stdout, "...")` to avoid the calls being shared and optimized together. – Sergey Kalinichenko May 15 '12 at 14:56
2

The first case (through the switch()) creates the following for me (Linux x86_64 / gcc 4.4):

  400570:       ff 24 c5 b8 06 40 00    jmpq   *0x4006b8(,%rax,8)
[ ... ]
  400580:       31 c0                   xor    %eax,%eax
  400582:       e8 e1 fe ff ff          callq  400468 <printf@plt>
  400587:       31 c0                   xor    %eax,%eax
  400589:       48 83 c4 08             add    $0x8,%rsp
  40058d:       c3                      retq
  40058e:       bf a4 06 40 00          mov    $0x4006a4,%edi
  400593:       eb eb                   jmp    400580 <main+0x30>
  400595:       bf a9 06 40 00          mov    $0x4006a9,%edi
  40059a:       eb e4                   jmp    400580 <main+0x30>
  40059c:       bf ad 06 40 00          mov    $0x4006ad,%edi
  4005a1:       eb dd                   jmp    400580 <main+0x30>
  4005a3:       bf b1 06 40 00          mov    $0x4006b1,%edi
  4005a8:       eb d6                   jmp    400580 <main+0x30>
[ ... ]
Contents of section .rodata:
[ ... ]
 4006b8 8e054000 [ ... ]

Note the .rodata contents @4006b8 are printed in network byte order (for whatever reason ...); the value is 40058e, which is within main above - where the arg-initializer/jmp block starts. Each entry in that table is an eight-byte address, hence the (,%rax,8) indirection. In this case, the sequence is therefore:

jmp <to location that sets arg for printf()>
...
jmp <back to common location for the printf() invocation>
...
call <printf>
...
retq

This means the compiler has actually optimized out the static call sites - and instead merged them all into a single, inlined printf() call. The table use here is the jmp ...(,%rax,8) instruction, and the table itself lives in the program's .rodata section.

The second one (with the explicitly-created table) does the following for me:

0000000000400550 <print0>:
[ ... ]
0000000000400560 <print1>:
[ ... ]
0000000000400570 <print2>:
[ ... ]
0000000000400580 <print3>:
[ ... ]
0000000000400590 <print4>:
[ ... ]
00000000004005a0 <main>:
  4005a0:       48 83 ec 08             sub    $0x8,%rsp
  4005a4:       bf d4 06 40 00          mov    $0x4006d4,%edi
  4005a9:       31 c0                   xor    %eax,%eax
  4005ab:       48 8d 74 24 04          lea    0x4(%rsp),%rsi
  4005b0:       e8 c3 fe ff ff          callq  400478 <scanf@plt>
  4005b5:       8b 54 24 04             mov    0x4(%rsp),%edx
  4005b9:       31 c0                   xor    %eax,%eax
  4005bb:       ff 14 d5 60 0a 50 00    callq  *0x500a60(,%rdx,8)
  4005c2:       31 c0                   xor    %eax,%eax
  4005c4:       48 83 c4 08             add    $0x8,%rsp
  4005c8:       c3                      retq
[ ... ]
 500a60 50054000 00000000 60054000 00000000  P.@.....`.@.....
 500a70 70054000 00000000 80054000 00000000  p.@.......@.....
 500a80 90054000 00000000                    ..@.....

Again, note the inverted byte order as objdump prints the data section - if you turn these around you get the function addresses for print[0-4]().

The compiler is invoking the target through an indirect call - i.e. the table usage is directly in the call instruction, and the table has explicitly been created as data.

Edit:
If you change the source like this:

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

void main(int argc, char **argv)
{
    static void (*jt[])() = { print0, print1, print2, print3, print4 };
    return jt[argc]();
}

the created assembly for main() becomes:

0000000000400550 <main>:
  400550:       48 63 ff                movslq %edi,%rdi
  400553:       31 c0                   xor    %eax,%eax
  400555:       4c 8b 1c fd e0 09 50    mov    0x5009e0(,%rdi,8),%r11
  40055c:       00
  40055d:       41 ff e3                jmpq   *%r11

which looks more like what you wanted?

The reason for this is that you need "stackless" funcs to be able to do this - tail-recursion (returning from a function via jmp instead of ret) is only possible if you have either done all stack cleanup already, or don't have to do any because there is nothing to clean up on the stack. The compiler can (but need not) choose to clean up before the last function call (in which case that last call can be made via jmp), but that's only possible if you either return the value you got from that function, or "return void". And, as said, if you actually use the stack (as your example does for the input variable), there's nothing that can force the compiler to undo this in such a way that tail-recursion results.
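
As a concrete sketch of those conditions (the dispatch helper below is hypothetical, not from the question): a void routine with no locals whose address escapes and nothing left to do after the call gives the compiler at least the opportunity to lower the indirect call to an indirect jmp - whether it actually does depends on the GCC version and options:

#include <stdio.h>

static void print0(void) { printf("Zero"); }
static void print1(void) { printf("One"); }
static void print2(void) { printf("Two"); }
static void print3(void) { printf("Three"); }
static void print4(void) { printf("Four"); }

/* const table at file scope, so it can sit in read-only data */
static void (* const jt[])(void) = { print0, print1, print2, print3, print4 };

/* nothing on the stack, nothing to do after the call, void return:
   the compiler is free to turn the indirect call into an indirect jmp */
void dispatch(unsigned int n)
{
    jt[n]();
}

int main(void)
{
    unsigned int input;
    if (scanf("%u", &input) == 1 && input <= 4)
        dispatch(input);
    return 0;
}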

Edit2:

The disassembly for the first example, with the same changes (argc instead of input, and forcing void main - no standard-conformance comments please, this is a demo), results in the following assembly:

0000000000400500 <main>:
  400500:       83 ff 04                cmp    $0x4,%edi
  400503:       77 0b                   ja     400510 <main+0x10>
  400505:       89 f8                   mov    %edi,%eax
  400507:       ff 24 c5 58 06 40 00    jmpq   *0x400658(,%rax,8)
  40050e:       66                      data16
  40050f:       90                      nop
  400510:       f3 c3                   repz retq
  400512:       bf 3c 06 40 00          mov    $0x40063c,%edi
  400517:       31 c0                   xor    %eax,%eax
  400519:       e9 0a ff ff ff          jmpq   400428 <printf@plt>
  40051e:       bf 41 06 40 00          mov    $0x400641,%edi
  400523:       31 c0                   xor    %eax,%eax
  400525:       e9 fe fe ff ff          jmpq   400428 <printf@plt>
  40052a:       bf 46 06 40 00          mov    $0x400646,%edi
  40052f:       31 c0                   xor    %eax,%eax
  400531:       e9 f2 fe ff ff          jmpq   400428 <printf@plt>
  400536:       bf 4a 06 40 00          mov    $0x40064a,%edi
  40053b:       31 c0                   xor    %eax,%eax
  40053d:       e9 e6 fe ff ff          jmpq   400428 <printf@plt>
  400542:       bf 4e 06 40 00          mov    $0x40064e,%edi
  400547:       31 c0                   xor    %eax,%eax
  400549:       e9 da fe ff ff          jmpq   400428 <printf@plt>
  40054e:       90                      nop
  40054f:       90                      nop

This is worse in one way (it does two jmps instead of one) but better in another (it eliminates the static functions and inlines the code). Optimization-wise, the compiler has pretty much done the same thing.

FrankH.
  • Thanks! I got very close results - the links to assembly outputs are in the OP post. However, the question is - why doesn't the compiler optimise the `call` in the second case into two `jmp`s (like in the first case)? This would eliminate stack manipulations, which are probably slower than two `jmp`s. – skink May 16 '12 at 15:24
  • I don't understand fully what you mean; in both cases, you're invoking the actual function via `call`. I.e. `main()` as created here isn't _tail recursive_ (which it'd be if it used a `jmp print...` instead of a `call print...; retq`). But the reason for that is the fact `main()` does a `return 0` - hence it _cannot be tail-recursive_. – FrankH. May 16 '12 at 16:36
  • But regarding stack usage, there's also the fact to consider that providing `&input` (the address of a local var) to `scanf()` forces stack allocation. That in turn, again, prevents the compiler from making the function tail-recursive even if you make the return type `void`. See my edit. – FrankH. May 16 '12 at 16:44
  • Sorry for my awful phrasing! In the first case the compiler calls any of the `printX` functions by `jmp [DWORD PTR L8[0+eax*4]]`. In the second case this is done by `call [DWORD PTR _jt.1677[0+eax*4]]`, and additionally each of the `printX` functions does `sub esp, 28` and `add esp, 28`. I think the `sub` and `add` would be redundant if the compiler changed the `call` to two `jmp`s - one into a printX label, one back to the place after the `call`. – skink May 16 '12 at 17:54
  • Your assembly is for 32bit ... in which case you _cannot_ have stack-free code because arguments are passed on the stack, and your `printX` passes an argument to `printf()`. The space allocation for that must be done somehow. In 64bit mode (with the mod to your source as shown above) both `main()` and `print[0-4]()` end with `jmp`. – FrankH. May 16 '12 at 18:46
  • What you're expecting there ("... one back to the place after call ") cannot happen if you're declaring a jump table variable; that's because the compiler in this case cannot determine that only exactly one caller of the `print[0-4]` functions exists. Determining this is only possible for the `switch()` case, and that's why the compiler eliminates / inlines the `static` funcs in this case. – FrankH. May 16 '12 at 18:56
  • I think I got everything but this - "the compiler in this case cannot determine that only exactly one caller of the `print[0-4]` functions exists". The jump table variable is local to `main()`, how could there be more than one caller? – skink May 16 '12 at 19:14
  • The jump table is not local - it's `static` (which is global - other functions can invoke `main`), and not constant. That means the compiler is forced to create code _as if the table can change_. In any case, as indicated, the `switch()` version inlines the code and creates what you want. – FrankH. May 17 '12 at 08:23
1

Have you profiled the two versions? I think an argument might be made that the indirect call is already well optimized. The following analysis was done with GCC 4.6.1 targeting an x64 platform (MinGW).

If you look at what happens when jt[input]() is used, a call results in the following sequence of code being executed:

  • the indirect call to one of the printX() functions
  • the printX() function sets up the argument for printf(), then
  • jumps to printf()
  • the printf() call will return directly to the site of the indirect call.

For a total of 3 branches.

When you use the switch statement what happens is:

  • an indirect jump to a bit of custom code for each case (inlined printX() calls)
  • the 'case handler' loads the appropriate argument for the printf() call
  • calls printf()
  • the printf() call will return to the 'case handler' which
  • jumps to the exit point of the switch (except for one case handler where the exit code is inlined - the other cases jump there)

For a total of 4 branches (in the general case).

In both situations you have:

  • an indirect branch (for one it's a call, in the other a jump)
  • a branch to the printf() (for one it's a jump, in the other a call)
  • a branch back to the call site

However, when the switch statement is used there's an additional branch to get to the 'end' of the switch (in most cases).

Now, it's possible that if you actually profiled things, the processor might handle an indirect jump faster than an indirect call, but I'd guess that even if that's the case, the additional branch used in the switch-based code would still tip the scales in favor of the call through the function pointer.
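
If you'd rather measure than count branches, a rough harness along these lines is one way to compare the two dispatch styles. This is only a sketch - it uses clock(), stand-in functions instead of the printX bodies, and is dominated by the rand() calls - so treat any numbers it produces as approximate:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* stand-ins for the printX bodies; the volatile sink keeps the calls
   from being optimized away entirely */
static volatile unsigned long sink;
static void f0(void) { sink += 0; }
static void f1(void) { sink += 1; }
static void f2(void) { sink += 2; }
static void f3(void) { sink += 3; }
static void f4(void) { sink += 4; }

static void (* const jt[5])(void) = { f0, f1, f2, f3, f4 };

int main(void)
{
    enum { N = 100000000 };
    long i;
    clock_t t0;

    srand(42);

    t0 = clock();
    for (i = 0; i < N; i++)
        jt[rand() % 5]();                  /* indirect-call dispatch */
    printf("table:  %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (i = 0; i < N; i++) {
        switch (rand() % 5) {              /* switch dispatch */
            case 0: f0(); break;
            case 1: f1(); break;
            case 2: f2(); break;
            case 3: f3(); break;
            case 4: f4(); break;
        }
    }
    printf("switch: %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return 0;
}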


For those interested, here's the assembler generated using jt[input](); (both examples compiled with GCC MinGW 4.6.1 targeting x64, options used were -Wall -Winline -O3 -S -masm=intel):

print0:
    .seh_endprologue
    lea rcx, .LC4[rip]
    jmp printf
    .seh_endproc

// similar code is generated for each printX() function
// ...

main:
    sub rsp, 56
    .seh_stackalloc 56
    .seh_endprologue
    call    __main
    lea rdx, 44[rsp]
    lea rcx, .LC5[rip]
    call    scanf
    mov edx, DWORD PTR 44[rsp]
    lea rax, jt.2423[rip]
    call    [QWORD PTR [rax+rdx*8]]
    xor eax, eax
    add rsp, 56
    ret

And here is the code generated for the switch-based implementation:

main:
    sub rsp, 56
    .seh_stackalloc 56
    .seh_endprologue
    call    __main
    lea rdx, 44[rsp]
    lea rcx, .LC0[rip]
    call    scanf
    cmp DWORD PTR 44[rsp], 4
    ja  .L2
    mov edx, DWORD PTR 44[rsp]
    lea rax, .L8[rip]
    movsx   rdx, DWORD PTR [rax+rdx*4]
    add rax, rdx
    jmp rax
    .section .rdata,"dr"
    .align 4
.L8:
    .long   .L3-.L8
    .long   .L4-.L8
    .long   .L5-.L8
    .long   .L6-.L8
    .long   .L7-.L8
    .section    .text.startup,"x"
.L7:
    lea rcx, .LC5[rip]
    call    printf
    .p2align 4,,10


.L2:
    xor eax, eax
    add rsp, 56
    ret

.L6:
    lea rcx, .LC4[rip]
    call    printf
    jmp .L2

     // all the other cases are essentially the same as the one above (.L6)
     // where they jump to .L2 to exit instead of simply falling through to it
     // like .L7 does
Michael Burr
  • Thanks! I'd like to profile pure instructions, but I couldn't find a way to do that. My assembly output is slightly different from yours - there's `call _printf` generated (instead of your `jmp printf`) when using `jt[input]()`. Also, there's some stack aligning before and after calling `printf`. Does this matter? – skink May 15 '12 at 17:35
  • @Joulukuusi: I think that you're generating 32-bit x86 code while I'm generating 64-bit x64 code. It looks like the x86 code generation is slightly less optimal than the x64 code generation for the indirect call case. However, I'm still not sure that it ends up being significantly less optimal than the switch statement (but maybe). I think the stack adjustments are done because of calling convention requirements (?). Those adjustments can be optimized away in the `switch` because it's able to inline the indirect call since there are separate indirect calls through particular pointers. – Michael Burr May 15 '12 at 17:53
  • I think that in principle, the compiler could 'expand' the `jt[input]()` call into five separate direct calls and end up with the same code as the switch statement, but as Bo Persson mentioned, that scenario is probably not common enough to have gotten the attention of the compiler maintainers. Note that the x86 code that calls functions that perform stack alignment still has a similar number of branches to the 'optimized' code the switch statement produces (4 branches either way by my count). So it still may perform pretty much the same as the `switch` on x86. – Michael Burr May 15 '12 at 18:22
  • I tried to profile both cases on my PC. Here are the links to the sources used: http://ideone.com/WE5N4 http://ideone.com/OQJa8 I ensured that both assembly outputs contained jump tables. The `sample.bin` was a random 250MB zip archive. According to `gprof`, the first program ran for 3.80s, and the second one for 4.69s. How do you think, can I trust this result? – skink May 15 '12 at 19:12
  • @Joulukuusi: 20% difference is pretty convincing; it's also pretty surprising that it's so big. But then again these kinds of things can often have surprising results. Did you run both cases several times (alternating between the two) just to see if the results are consistent? – Michael Burr May 15 '12 at 21:45
  • @Joulukuusi: I gave your examples some test runs on my computer (using MinGW 4.6.1 on Win7 x64), and found: The timing for 32-bit runs (with a 550MB zip file) were between 5.42 (switch) and 5.62 (indirect call) seconds - less than a 5% variation (the 64-bit times were actually slightly slower). The fastest indirect call in my small sample was 5.54 seconds - about 2.5% slower than the fastest switch based run. I think the test is simply too small/simple to draw too many conclusions, but I'm curious why you saw such a large difference and I didn't. – Michael Burr May 16 '12 at 06:38
  • Note: since there was no further function call in the static functions, the switch-based version (which inlined into the switch cases) kept `value` in a register; the indirect call version worked on the actual `value` memory variable. I assume that keeping `value` in the cache made the differences minimal. I think one might possibly conclude that the switch will gain you something if the inline/static functions are small & simple. 20%? Maybe, but maybe only a few percent. If they have much complexity (like call other functions) you'll probably see no significant gain one way or another. – Michael Burr May 16 '12 at 06:46
  • Yes, I did run both examples three times on my laptop. However, a few minutes ago I tried to run both compiled executables on a PC, then analysed the `.out` files on the laptop. The results were absolutely different - 5.08s for the switch version and 5.02s for the jump table version. Both machines run Win7 x86 on a single-core Intel Celeron CPU. Saying I'm confused is like saying nothing! – skink May 16 '12 at 13:07
  • I'm not sure why, but it seems like `gprof` reports the wrong time. I tried to manually measure the running time with `echo %time%` before and after the application, and I got these results: http://ideone.com/US2Ww Still, the switch version runs slower than the indirect jump version on the PC, but faster on the laptop. This could mean either that the compiler thinks the switch version is slower than the indirect call one (so it doesn't optimise the latter into the former), or that it just doesn't know how to optimise the latter (Bo Persson's answer). – skink May 16 '12 at 15:10
1

Does the code for the latter version do anything between the indirect call and the succeeding ret? I would not be surprised if the address computation for the indirect call makes use of a register whose value the called function is required to preserve (meaning it must save the value before the computation, and restore it some time after). While it might be possible to move the register-restore code before the indirect call, a compiler can only perform such code motion in those cases it has been programmed to recognize as legitimate opportunities.

Also, while I don't think it matters, I'd suggest that the routines shouldn't be declared inline, since the compiler won't be able to inline them when they are called through the table anyway.

supercat
  • Thanks! Yes, it uses `eax` for the address computation - `call [DWORD PTR _jt.1677[0+eax*4]]`. Right after that (i.e. after the called function returns) follows `xor eax,eax`, `leave` and `ret`. Can't find why `eax` must be preserved. Btw, I have included the links to assembly outputs in the question. Regarding `inline` advice - could you please explain that? – skink May 15 '12 at 18:24
  • @Joulukuusi: The return value is in eax. Since the compiler cannot infer what value will be in eax when the called function returns, it must load eax with zero. If you were to make the indirect call from a `void` function or else call a function of type `int` and return the value returned from that function, the `xor` could be eliminated, perhaps allowing the use of an indirect `jmp` rather than `call`. – supercat May 15 '12 at 18:31
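
A sketch of what that last comment describes (the dispatch helper is hypothetical, not from the question): the printX functions return printf's result and the caller hands that value straight back, so nothing - in particular no xor eax,eax - is left to do after the call, and an indirect jmp becomes at least possible. Whether GCC actually emits one depends on the version and flags:

#include <stdio.h>

static int print0(void) { return printf("Zero"); }
static int print1(void) { return printf("One"); }
static int print2(void) { return printf("Two"); }
static int print3(void) { return printf("Three"); }
static int print4(void) { return printf("Four"); }

static int (* const jt[])(void) = { print0, print1, print2, print3, print4 };

/* returning the callee's value means nothing remains to do after the call,
   so a tail jump through the table is possible */
int dispatch(unsigned int input)
{
    return jt[input]();
}

int main(void)
{
    unsigned int input;
    if (scanf("%u", &input) != 1 || input > 4)
        return 1;
    dispatch(input);
    return 0;
}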