7

When I disassembled my program, I saw that gcc was using jmp for the second pthread_barrier_wait call when compiled with -O3. Why is that?

What advantage does it gain by using jmp instead of call? What tricks is the compiler playing here? I guess it's performing tail-call optimization.

By the way I'm using static linking here.

__attribute__ ((noinline)) void my_pthread_barrier_wait( 
    volatile int tid, pthread_barrier_t *pbar ) 
{
    pthread_barrier_wait( pbar );
    if ( tid == 0 )
    {
        if ( !rollbacked )
        {
            take_checkpoint_or_rollback( ++iter == 4 );
        }
    }
    //getcontext( &context[tid] );
    SETJMP( tid );
    asm("addr2jmp:"); 
    pthread_barrier_wait( pbar );
    // My suspicion was right: gcc was performing tail call optimization,
    // which was interfering with my SETJMP/LONGJMP implementation, so here I
    // put a dummy function call to avoid that.
    dummy_var = dummy_func();
}
MetallicPriest

5 Answers

12

As you don't show an example, I can only guess: the called function has the same return type as the calling one, and the call site has the form

return func2(...)

or has no return type at all (void).

In this case, "we" leave "our" return address on the stack, leaving it to "them" to use it to return to "our" caller.

glglgl
6

Perhaps it was a tail-recursive call; GCC has a pass that optimizes tail calls.

But why should you bother? If the called function is an extern function, it is public, and GCC calls it following the ABI conventions (which include the calling convention).

You should not care if the function was called by a jmp.

And it might also be a call to a dynamic library function (i.e. one going through the PLT, for dynamic linking).

Basile Starynkevitch
2

jmp has less overhead than call: jmp just jumps, while call pushes the return address onto the stack and then jumps.

TJD
  • -1 for an incomplete answer. I also know jmp has less overhead. The question was how gcc is using jmp to perform the same functionality as call in the optimized version. – MetallicPriest Nov 02 '11 at 17:48
  • 2
    What you mention in your comment is not clearly stated in the question, maybe you should edit it. The only question statement I didn't answer was "What tricks the compiler is playing here?", which is unclear and broken English. – TJD Nov 02 '11 at 19:11
  • JMP has less overhead, sure. But the function you jump to rather than call still has to return at some point. Therefore, the return address has to be pushed on the stack before the JMP, or the end of the function needs another JMP back to where you came from, which, all in all, ends up being pretty much similar. You might save a couple of cycles with a double JMP since there is no stack manipulation, but you would have to store the return address in a register or something. – E.T Jul 26 '15 at 06:41
2

I'm assuming that this is a tail call, meaning either the current function returns the result of the called function unmodified, or (for a function that returns void) returns immediately after the function call. In either case, it is not necessary to use call.

The call instruction performs two functions. First, it pushes the address of the instruction after the call onto the stack as a return address. Then it jumps to the destination of the call. ret pops the return address off of the stack and jumps to that location.

Since the calling function returns the result of the called function, there is no reason for execution to return to it after the called function returns. Therefore, whenever possible and if the optimization level permits it, GCC destroys the calling function's stack frame before the call, so that the top of the stack holds the return address of its own caller, and then simply jumps to the called function. The result is that, when the called function returns, it returns directly to the original caller, skipping the intermediate function entirely.
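A pseudo-assembly sketch of the difference (illustrative x86 only, not gcc's exact output):

```
; ordinary call: caller keeps its frame and returns itself
caller:
        ...
        call  callee        ; pushes address of the next instruction
        ret                 ; pops the address caller was given, returns

; tail-call-optimized: caller's frame is gone before the jump
caller:
        ...                 ; stack frame already torn down
        jmp   callee        ; callee's own ret pops caller's return
                            ; address, going straight back to caller's caller
```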

ughoavgfhw
-1

You will never know, but one of the likely reasons is "cache" (among other reasons such as the already mentioned tail call optimization).

Inlining can make code faster and it can make code slower, because more code means less of it will be in the L1 cache at one time.

A JMP allows the compiler to reuse the same piece of code at little or no cost at all. Modern processors are deeply pipelined, and pipelines go over a JMP without problems (there is no possibility of a misprediction here!). In the average case, it will cost as little as 1-2 cycles, in the best cases zero cycles, because the CPU would have to wait on a previous instruction to retire anyway. This obviously depends totally on the respective, individual code.
The compiler could in principle even do that with several functions that have common parts.

Damon