The first case (through the switch()) creates the following for me (Linux x86_64 / gcc 4.4):
400570: ff 24 c5 b8 06 40 00 jmpq *0x4006b8(,%rax,8)
[ ... ]
400580: 31 c0 xor %eax,%eax
400582: e8 e1 fe ff ff callq 400468 <printf@plt>
400587: 31 c0 xor %eax,%eax
400589: 48 83 c4 08 add $0x8,%rsp
40058d: c3 retq
40058e: bf a4 06 40 00 mov $0x4006a4,%edi
400593: eb eb jmp 400580 <main+0x30>
400595: bf a9 06 40 00 mov $0x4006a9,%edi
40059a: eb e4 jmp 400580 <main+0x30>
40059c: bf ad 06 40 00 mov $0x4006ad,%edi
4005a1: eb dd jmp 400580 <main+0x30>
4005a3: bf b1 06 40 00 mov $0x4006b1,%edi
4005a8: eb d6 jmp 400580 <main+0x30>
[ ... ]
Contents of section .rodata:
[ ... ]
4006b8 8e054000 [ ... ]
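Note that the .rodata contents at 4006b8 are printed byte by byte in memory order, which on x86 is little-endian, so the bytes 8e 05 40 00 stand for the value 0x40058e. That address is within main above, where the arg-initializer/jmp block starts. Each table entry is an eight-byte code address, hence the (,%rax,8) scaled index (the mov/jmp pairs themselves occupy seven bytes each, as the addresses 40058e, 400595, 40059c and 4005a3 show). In this case, the sequence is therefore: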
jmp <to location that sets arg for printf()>
...
jmp <back to common location for the printf() invocation>
...
call <printf>
...
retq
This means the compiler has actually optimized out the separate call sites and merged them all into a single printf() call inside main. The table is used by the jmp ...(,%rax,8) instruction, and the table itself was generated by the compiler and placed in .rodata, i.e. in the read-only part of the program image.
The second one (with the explicitly-created table) does the following for me:
0000000000400550 <print0>:
[ ... ]
0000000000400560 <print1>:
[ ... ]
0000000000400570 <print2>:
[ ... ]
0000000000400580 <print3>:
[ ... ]
0000000000400590 <print4>:
[ ... ]
00000000004005a0 <main>:
4005a0: 48 83 ec 08 sub $0x8,%rsp
4005a4: bf d4 06 40 00 mov $0x4006d4,%edi
4005a9: 31 c0 xor %eax,%eax
4005ab: 48 8d 74 24 04 lea 0x4(%rsp),%rsi
4005b0: e8 c3 fe ff ff callq 400478 <scanf@plt>
4005b5: 8b 54 24 04 mov 0x4(%rsp),%edx
4005b9: 31 c0 xor %eax,%eax
4005bb: ff 14 d5 60 0a 50 00 callq *0x500a60(,%rdx,8)
4005c2: 31 c0 xor %eax,%eax
4005c4: 48 83 c4 08 add $0x8,%rsp
4005c8: c3 retq
[ ... ]
500a60 50054000 00000000 60054000 00000000 P.@.....`.@.....
500a70 70054000 00000000 80054000 00000000 p.@.......@.....
500a80 90054000 00000000 ..@.....
Again, note the byte order: objdump prints the data section byte by byte in memory order (little-endian on x86), so if you turn each eight-byte group around you get the function addresses for print[0-4]().
The compiler is invoking the target through an indirect call - i.e. the table usage is directly in the call instruction, and the table has explicitly been created as data.
Edit:
If you change the source like this:
#include <stdio.h>
static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }
void main(int argc, char **argv)
{
static void (*jt[])() = { print0, print1, print2, print3, print4 };
return jt[argc]();
}
the created assembly for main() becomes:
0000000000400550 <main>:
400550: 48 63 ff movslq %edi,%rdi
400553: 31 c0 xor %eax,%eax
400555: 4c 8b 1c fd e0 09 50 mov 0x5009e0(,%rdi,8),%r11
40055c: 00
40055d: 41 ff e3 jmpq *%r11
which looks more like what you wanted?
The reason for this is that you need "stackless" funcs to be able to do this. A tail call (leaving a function via jmp to the callee instead of call followed by ret) is only possible if you either have done all stack cleanup already, or don't have to do any because you have nothing to clean up on the stack. The compiler can (but need not) choose to clean up before the last function call (in which case the last call can be made by jmp), but that's only possible if you either return the value you got from that function, or if you "return void". And, as said, if you actually use the stack (like your example does for the input variable), nothing forces the compiler to undo this in such a way that a tail call results.
Edit2:
The first example, with the same changes (argc instead of input and forcing void main - no standard-conformance comments please, this is a demo), results in the following assembly:
0000000000400500 <main>:
400500: 83 ff 04 cmp $0x4,%edi
400503: 77 0b ja 400510 <main+0x10>
400505: 89 f8 mov %edi,%eax
400507: ff 24 c5 58 06 40 00 jmpq *0x400658(,%rax,8)
40050e: 66 data16
40050f: 90 nop
400510: f3 c3 repz retq
400512: bf 3c 06 40 00 mov $0x40063c,%edi
400517: 31 c0 xor %eax,%eax
400519: e9 0a ff ff ff jmpq 400428 <printf@plt>
40051e: bf 41 06 40 00 mov $0x400641,%edi
400523: 31 c0 xor %eax,%eax
400525: e9 fe fe ff ff jmpq 400428 <printf@plt>
40052a: bf 46 06 40 00 mov $0x400646,%edi
40052f: 31 c0 xor %eax,%eax
400531: e9 f2 fe ff ff jmpq 400428 <printf@plt>
400536: bf 4a 06 40 00 mov $0x40064a,%edi
40053b: 31 c0 xor %eax,%eax
40053d: e9 e6 fe ff ff jmpq 400428 <printf@plt>
400542: bf 4e 06 40 00 mov $0x40064e,%edi
400547: 31 c0 xor %eax,%eax
400549: e9 da fe ff ff jmpq 400428 <printf@plt>
40054e: 90 nop
40054f: 90 nop
This is worse in one way (it does two jmp instead of one: first through the table, then to printf) but better in another (because it eliminates the static functions and inlines the code). Optimization-wise, the compiler has pretty much done the same thing.