8

While running some tests with gcc's -O2 optimization, I observed the following instruction in the disassembled code of a function:

data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

What does this instruction do?

To be more specific, I was trying to understand how the compiler optimizes useless recursion like the code below with -O2:

int foo(void)
{
   return foo();
}
int main (void)
{
   return foo();
}

The above code causes a stack overflow when compiled without optimization, but works when compiled with -O2.

I think that with -O2 it completely removed the stack pushes for the call to foo, but why is the data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1) needed?

0000000000400480 <foo>:
foo():
400480:       eb fe                   jmp    400480 <foo>
400482:       66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
400489:       1f 84 00 00 00 00 00

0000000000400490 <main>:
main():
400490:       eb fe                   jmp    400490 <main>
Peter Cordes
cmidi
  • It doesn't seem to be an instruction, but rather (random?) data, which the disassembler tried to parse nevertheless. (`66` is a "data size" prefix.) By the way, the optimization replaced the nested call with a tail-jump. That `jmp` is the only code in the function `foo`. – Jongware Apr 25 '15 at 23:41
  • @Jongware I think it is needed for pipeline optimization. – peterh Apr 25 '15 at 23:42
  • 2
    It's to align the functions to 16 bytes. See e.g. the *Alignment of code* section in http://www.agner.org/optimize/optimizing_assembly.pdf. Posting as a comment since someone might explain why those particular padding values were chosen. At the very least I guess you'd want some values that won't confuse a disassembler, so maybe that's all there is to it. – Ulfalizer Apr 25 '15 at 23:45
  • 1
    @peterh: wait a minute ... take a look at that `main`! Seems the most obvious optimization was to ditch the entire function `foo`. (But I still feel the data is not part of the would-have-been executed code.) – Jongware Apr 25 '15 at 23:45
  • @Ulfalizer No, it is explicitly a long no-op instruction, and I think it is intentional. If it were only for aligning the function, it would be padded with zeros. – peterh Apr 25 '15 at 23:46
  • @peterh: Guessing it might be a valid instruction just to not confuse disassemblers. Just a guess though... – Ulfalizer Apr 25 '15 at 23:48
  • @Ulfalizer I don't think that was ever a goal for gcc :-) – peterh Apr 26 '15 at 00:01
  • 2
  • @Jongware Yes, but `foo` still needs to be compiled and exist in the binary, because from the viewpoint of the C code it wasn't declared `static`. Thus there is no guarantee that nobody will call `foo()` from an external object. I don't know why a similar optimization after `main()` didn't happen - maybe it did but the OP didn't copy it here. – peterh Apr 26 '15 at 00:11

3 Answers

7

To answer the question in the title, the instruction

data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

is a 14-byte NOP (no operation) instruction that is used to pad the gap between the foo function and the main function to maintain 16-byte alignment: foo's jmp occupies 2 bytes starting at 0x400480, so 14 bytes of padding bring the start of main up to 0x400490, the next 16-byte boundary.

The x86 architecture has a large number of NOP instructions of different sizes that can be used to insert padding into an executable segment, such that the padding has no effect if the CPU ends up executing over it. The Intel optimization manual contains recommended NOP encodings for the different lengths that can be used as padding.
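
For illustration, here is a minimal GNU assembler sketch (assumptions of this sketch: x86-64, GAS/AT&T syntax). As the comments further down note, gcc emits a .p2align directive in its assembly output and the GNU assembler chooses the long-NOP encoding to fill the gap:

        .text
        .globl  foo
foo:
        jmp     foo             # 2 bytes (eb fe)
        .p2align 4              # align the next symbol to a 16-byte boundary;
                                # in a code section GAS fills the 14-byte gap
                                # with a long NOP like the one in the question
        .globl  main
main:
        jmp     main

Assembling this and disassembling the object file with objdump -d should show padding similar to the nopw in the question.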

In this specific case, it is completely irrelevant, as the NOP will never be executed (or even decoded, since it comes after an unconditional jump), so the compiler could pad with any random garbage it wanted to.

Chris Dodd
  • It will be speculatively decoded. If it were something that decoded slowly, it could slow down decoding of the `jmp`, since they're probably decoded together. (Most modern CPUs decode up-to-4 instructions from up-to-16B at a time (oversimplification, see http://agner.org/optimize/ for details).) – Peter Cordes Jul 18 '17 at 00:31
  • 1
    This should be the accepted answer. @PeterCordes - it actually isn't clear to me that the instruction will usually be decoded. The branch target predictor has to inform the front-end very early that there is a branch at `jmp`, and it can't go too far down the path of decoding those instructions. Almost certainly it doesn't leave the decoder into the IDQ (since that would then have to be undone), and it seems likely that it never enters the buffer between the pre-decoder and the decoder. Perhaps even the pre-decoder doesn't decode the bytes. – BeeOnRope Jul 18 '17 at 02:53
  • I'm not sure which part of that chain suffers the slowdown though, and I recall there was, at least at one point, a recommendation to put a `ud2` or similar at the start of code that wasn't really code, to tell the decoder "here there be dragons". – BeeOnRope Jul 18 '17 at 02:54
  • @BeeOnRope: I think the block of x86 code will hit the decoders, and maybe cause a stall, even if the decode results for everything after an unconditional branch are thrown away later. My guess is that any of the parallel decoders can set a discard-the-rest signal, and does so for `ud2`, `int *`, `jmp`, `call`, and so on. Agreed that they won't be added to the IDQ. But good question about where exactly branch prediction enters into it. Maybe once Core2 correctly detects the `jmp`, it will repeat the `jmp` instruction and feed the decoders a block of 4 `jmp`? Early P6 had no loop buffer. – Peter Cordes Jul 18 '17 at 17:23
  • @BeeOnRope: The `ud2` recommendation that springs to my mind is after an *indirect* `jmp *%rax`. The default prediction is that the branch target is the next instruction, and putting `ud2` there will block speculative execution from filling the pipe with useless work that may take time to abort (e.g. the divider might not be interruptible), or worse a TLB miss that evicts a useful entry and starts a page walk. – Peter Cordes Jul 18 '17 at 17:28
  • @PeterCordes - yeah we agree on how it works for stuff that is detected first by the decoder, but it's all about BP. If I remember correctly there are something like 4 pipeline stages split 2-2 between pre-decode and decode (with a queue in between those two). Since recent chips can handle 1 taken branch per cycle, it (probably?) means that a new BP is available every cycle, so the branch information about stuff fetched in cycle 0 is already available in cycle 1 when it enters stage 1 of the predecoder. So at this point you already _know_ nothing after the `jmp` should be used... – BeeOnRope Jul 18 '17 at 18:30
  • @BeeOnRope: Small-loop branches are special on a lot of architectures, especially when there's any kind of loop detection. The next-fetch prediction just has to predict what block to fetch next. It doesn't have to predict *where* the end of the predicted-taken branch instruction is inside the current block. (A lot of the hardware is probably the same for predicted-taken vs. unconditional). Anyway, I can easily imagine just throwing the same 16B block at the decoders every cycle, even if it could tell a taken branch is the first instruction. It's up to software not to stall the decoders... – Peter Cordes Jul 18 '17 at 18:36
  • In general, the BP is used to redirect fetch, and also to inform the later decode stages about what is valid, and those are probably available at the same time (very early). For example, it seems feasible that the follows-a-jump instructions never even enter the instruction queue (not the IDQ) between the pre-decoders and the decoders. That's an obvious and early place to "correct" or stitch together the actual branch-aware instruction stream, rather than trying to wait until decoding. – BeeOnRope Jul 18 '17 at 18:36
  • @PeterCordes - but why talk about small loops? Is it different there? Is your mental model fetch + decode? I'm breaking it into fetch -> pre-decode -> InsQueue -> decode. I think it is very likely that instructions following a predicted-taken or unconditional branch never hit the decode part of that, although they likely hit the pre-decode part. If that's the case, the chance of stalls depends on what part actually stalls for what instruction. Yes, it is true that _fetch_ doesn't strictly need to know where the branch was, but later stages do and this info comes from the same place. – BeeOnRope Jul 18 '17 at 18:40
  • @BeeOnRope: depending on the CPU, yes it's different. If the loop is handled "implicitly" by some kind of loop-buffer mode that recycles fetched instructions or decoded uops, branch-prediction results for the loop branch don't have to be available every cycle. Also, I'm imagining that maybe next-fetch-block predictions are available earlier than detailed here's-the-jmp predictions. I guess we should test with non-empty loops that contain taken branches, maybe on Core2 where there's no LSD, just a loop buffer of pre-decoded x86 machine code. – Peter Cordes Jul 18 '17 at 18:52
  • @BeeOnRope: actually, this may shed some light: Agner's uarch pdf says this for Core2 : "*A loop that can be completely contained in four aligned blocks of 16 bytes each can execute at a rate of up to 32 bytes of code per clock cycle. The four 16-bytes blocks do not even have to be consecutive.*" So it doesn't sound like instructions are packed head-to-tail trace-cache style in the loop buffer, and taken branches in the middle of a decode group leaves wasted decode throughput. (Like I said, that should be testable). – Peter Cordes Jul 18 '17 at 18:53
  • @PeterCordes - I guess I'm just not that interested in the question of legacy decoding interactions in small loops (also how did we start talking about that: the original question was about a forward unconditional jump, and now we are talking about small loops with backwards conditional jumps?), since on all recent archs these are going to hit in the LSD or uop cache which makes the decoding behavior irrelevant. I am interested more in the original claim that instructions following an unconditional branch will be decoded, and if they decode slowly they might slow down the decoders. – BeeOnRope Jul 18 '17 at 19:04
  • ... so you'd test it with something like a series of forward unconditional `jmp` instructions, with each one followed by some slow-to-decode stuff. I can only test on Skylake, though, and is there even any slow-to-decode stuff there? Do LCPs still hurt? – BeeOnRope Jul 18 '17 at 19:05
  • @BeeOnRope: This question was about `jmp`-to-itself, or tiny loops. I didn't realize you were talking about forward branches. But yeah, good point, this is much easier to test with something other than a tiny loop. Yes, LCP stalls are still a thing in SKL, with the same penalty as SnB. (But only for operand-size prefix, not address-size, and only for ALU, not MOV). SnB is different from Core2, though. Each LCP instruction causes an extra 2-3c penalty, instead of a 6c penalty regardless of how many LCP instructions. This per-instruction retry might let it skip the retry after a `jmp`. – Peter Cordes Jul 18 '17 at 19:47
  • @BeeOnRope: I'm making a test that I'll run on SKL and Core2, and maybe Pentium III. I'll try forward branches. Note that LCP stalls are in the pre-decode insn-length-marking phase, not in the actual decoders. In Core2, they only happen the first time through a loop. (But in SKL, a loop that busts the uop cache could suffer every time). Too-many-prefixes stalls are right in the decoders on other CPUs. – Peter Cordes Jul 18 '17 at 19:51
  • @PeterCordes - oops, I never looked carefully at the original (weird) code, which does indeed jump to itself due to recursion, as people already pointed out. I mentally just assumed it was "jump over `nop`". It may change things in many of the ways you already pointed out. Despite all my comprehension errors, the _general_ case I think is the interesting one, since infinitely recursive jumps have little value :) – BeeOnRope Jul 18 '17 at 20:01
  • 1
    @BeeOnRope: Right. The real version of this is a non-empty loop that ends with `jmp` followed by garbage (where the exit condition is a normally not-taken jcc). Forward branches are an interesting even-more-general case. – Peter Cordes Jul 18 '17 at 20:08
6

What you are seeing is an operand-forwarding optimization of the CPU pipeline.

Although it is an empty loop, gcc tries to optimize this as well :-).

The CPU you are running has a superscalar architecture. That means it has a pipeline, and different phases of the execution of consecutive instructions happen in parallel. For example, if there is a

mov eax, ebx ;(#1)
mov ecx, edx ;(#2)

then the loading and decoding of instruction #2 can already happen while #1 is being executed.

Pipelining has major problems to solve in the case of branches, even if they are unconditional.

For example, while the jmp is being decoded, the next instruction has already been prefetched into the pipeline. But the jmp changes the location of the next instruction. In such cases, the pipeline needs to be emptied and refilled, and a lot of valuable CPU cycles are lost.

It looks like this empty loop will run faster if the pipeline is filled with a no-op in this case, even though it will never be executed. It is actually an optimization for an uncommon feature of the x86 pipeline.

Earlier DEC Alphas could even segfault from such things, and empty loops had to contain a lot of no-ops. On x86 it would only be slower, because it has to stay compatible with the Intel 8086.

Here you can read a lot about the handling of branch instructions in pipelines.

peterh
  • For conditional branches I would agree, but unconditional branches are normally decoded at a very early stage (if possible) and result in fetching code from the jump target's address instead of the linear address. Also, some architectures have delayed branches. These execute the instruction(s) immediately following a branch even if the branch is taken, to avoid pipeline stalls. MIPS is a typical example of this, and the problem with Alphas could very well also result from this. – too honest for this site Apr 26 '15 at 01:26
  • Just another point: a superscalar architecture is not a requirement for a pipeline. The other way around: yes. However, a pipeline can be (and actually is) used in almost every CPU architecture, including some of the late-'70s 8/16-bitters. Superscalarity (roughly) refers to the capability of having more than one instruction in the _same_ execution stage at the same time. – too honest for this site Apr 26 '15 at 02:58
  • @Olaf Thank you very much for the information! BTW, is there a CPU which is superscalar but has no pipeline, or is that only a theoretical possibility? I think the concurrent execution of consecutive instructions requires a dependency analysis, which should happen _before_ the execution. – peterh Apr 26 '15 at 03:18
  • Superscalar without a pipeline is basically possible, but would result in horribly low clock speed and a very complicated fetch/decode/execute system. The former, however, is similar for single issue and not common even for very simple implementations. Most basic CPUs have had at least 2 pipeline stages since the late '70s at least, as I stated (the server CPUs back then had much longer pipelines I think - did not verify this, however). Also, remember a single pipeline stage can still require more than one clock to complete (Z80, 68000 e.g.). – too honest for this site Apr 26 '15 at 12:42
  • 1
    Operand forwarding has nothing to do with this. Also, almost all modern x86 CPUs can run short loops at 1 cycle per iteration, as a special case even if normal taken-branch throughput is less than one per clock. More recently than that, CPUs even have loop buffers that optimize decode when re-running the same block repeatedly. Intel since Nehalem doesn't re-decode at all, but recycles the decoded uops from the queue that feeds the issue stage (LSD). See http://agner.org/optimize/ – Peter Cordes Jul 18 '17 at 00:38
  • 1
    A `nop` following the `jmp` is not necessary to make this loop run fast. That just happens to be what the GNU assembler uses for `.p2align` directives, and gcc doesn't override it for the case where the padding is never executed. Some compilers use `int3` or similar faulting instructions for padding between functions, but of course still `nop` to align the tops of loops inside functions. **Anything that doesn't stall the decoder would be fine,** since the `nop` won't even issue (except maybe once on some CPUs before branch prediction detects the jmp) – Peter Cordes Jul 18 '17 at 00:46
  • On some CPUs, this many prefixes on one instruction (even if it's a NOP) could actually stall the decoders for that 16B chunk, unless they special-case detecting an unconditional branch and don't use the slow-path for that case. Intel Silvermont is like this, but has a decoded-uop loop buffer. AMD Bulldozer-family is very slow to decode instructions with more than 3 prefixes, and only Steamroller has a decoded-uop loop buffer. Agner Fog isn't clear on whether the decoder can output the instructions it did decode quickly, or if a slow-decode stalls the whole group. – Peter Cordes Jul 18 '17 at 00:53
  • Core2's loopback buffer recycles pre-decode instruction-length finding, so an LCP stall (e.g. on `add ax, imm16` in 32-bit mode) wouldn't impact it after the first trip. Pentium-M might actually stall on the OP's loop, because instructions with more than one prefix take 2+n cycles to decode. In Pentium III, there's no pre-decoded loop buffer, so an LCP stall would probably affect it every iteration. – Peter Cordes Jul 18 '17 at 01:01
3

The function foo() is an infinite recursion without termination. Without optimization, gcc generates normal subroutine calls, which at the very least push the return address onto the stack. As the stack is limited, this will create a stack overflow, which is _undefined behaviour_.

When optimizing, gcc detects that foo() does not require a stack frame at all (there are no arguments or local variables). It also detects that foo() immediately returns to its caller (which would also be foo()). This is called tail-chaining (tail-call optimization): a function call right at the end of a function (i.e. just before an explicit or implicit return) is converted into a jump to that function, so there is no need for a stack frame.
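
As a hedged illustration (hypothetical function names wrapper and bar; AT&T syntax and the SysV x86-64 ABI are assumed), a well-defined tail call typically compiles to something like this at -O2:

# int wrapper(int x) { return bar(x + 1); }   -- hypothetical example
wrapper:
        addl    $1, %edi        # the first integer argument is passed in %edi
        jmp     bar             # tail call: a jmp instead of call/ret, so
                                # bar's ret returns directly to wrapper's caller

In the question's foo(), the tail call targets foo itself, which collapses into the jmp 400480 <foo> self-loop shown in the disassembly.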

The infinitely recursive foo() is still undefined behaviour, but this time, nothing "bad" is observed.

Just remember: undefined behaviour includes fatal behaviour as well as the expected behaviour (the latter just by chance). Code which behaves differently at different optimization levels should always be regarded as erroneous. There is one exception: timing. This is not subject to the C language standard (nor to that of most other languages).

As others stated, the data32 ... is almost certainly padding to get a 16-byte alignment, which might be the size of the internal instruction bus and/or cache lines.

too honest for this site