4

I am trying to understand the principles of machine code alignment. I have an assembler implementation that generates machine code at run time. I align every branch destination to 16 bytes, but this does not seem to be the optimal choice: I've noticed that if I remove the alignment, the same code sometimes runs faster. I suspect it has something to do with cache line width: some instructions straddle a cache line boundary and the CPU stalls because of that. So if a few bytes of padding are inserted in one place, they can push instructions further along across a cache line boundary...

I was hoping to implement an automatic alignment procedure that processes the code as a whole and inserts padding according to the specification of the CPU (cache line width, 32/64 bits and so on)...
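
A rough sketch of what such a pass might look like, under stated assumptions: instruction lengths and the indices of branch targets are already known, the cache line is 64 bytes, and the names `layout` and `straddlers` are hypothetical. It only places padding and reports which instructions cross a line boundary; it does not decide which placement is fastest.

```python
CACHE_LINE = 64  # assumed cache line size in bytes


def layout(lengths, targets, align=16, base=0):
    """Place instructions one after another, padding before each branch target.

    lengths: byte length of each instruction, in program order
    targets: set of instruction indices that are branch destinations
    Returns (offsets, pads): start offset of each instruction and the
    number of padding bytes inserted in front of it.
    """
    offsets, pads = [], []
    pos = base
    for i, n in enumerate(lengths):
        # Bytes needed to reach the next multiple of `align` (0 if already aligned).
        pad = (-pos) % align if i in targets else 0
        pos += pad
        pads.append(pad)
        offsets.append(pos)
        pos += n
    return offsets, pads


def straddlers(offsets, lengths, line=CACHE_LINE):
    """Indices of instructions whose first and last byte lie in different lines."""
    return [i for i, (o, n) in enumerate(zip(offsets, lengths))
            if o // line != (o + n - 1) // line]


offsets, pads = layout([5, 3, 7, 2], targets={2})
print(offsets, pads)
print(straddlers([60], [7]))
```

A real pass would have to iterate, because padding changes branch distances, which can change jump encodings and therefore instruction lengths.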

Can someone give some hints about this procedure? As an example, the target could be an Intel Core i7 CPU on a 64-bit platform.

Thank you.

Sergei

4 Answers

3

I'm not qualified to answer your question because this is such a vast and complicated topic. There are probably many more mechanisms in play here, other than cache line size.

However, I would like to point you to Agner Fog's site and the optimization manuals for compiler makers that you can find there. They contain a plethora of information on these kinds of subjects: cache lines, branch prediction and data/code alignment.

Martin
  • +1 for mentioning Agner Fog. Additionally I would like to point out, that there is no simple answer to this question - it depends on the architecture, on the code size vs. cache size, on the predictability of the branches, so there is no alignment "according to the specification". – Gunther Piez Mar 07 '11 at 12:17
  • Yes, I thought about that, but it looks like I can exclude some effects, for example branch misprediction and data alignment. In my tests I have exactly the same data input and branching. The only thing I did was insert different alignment instructions at different places in the code. So, for instance, code without any alignment runs in 50 ns, aligning the branch targets shaves off 4 ns, giving 46 ns, and inserting padding at some specific places (just experimenting) shaves off another 2 ns. And sometimes too much alignment brings it back to 50 ns or even worse... – Sergei Mar 07 '11 at 12:56
  • This block of code will be executed many times and defines the overall throughput, so even reducing its execution time by 5 ns will be quite measurable in the end. Thank you for pointing to Agner; I know about this material already, but I never found any definitive suggestions about correct code alignment. – Sergei Mar 07 '11 at 13:00
2

Paragraph (16-byte) alignment is usually the best. However, the padding bloats the code, which can push some "local" JMP instructions out of range so that they are no longer local. It may also result in less code fitting in the cache. I would only align major segments of code; I would not align every tiny subroutine or JMP target.
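
To illustrate the jump-range concern, here is a minimal sketch. It assumes "local" means the x86 short jump, a 2-byte encoding whose signed 8-bit displacement is measured from the end of the jump instruction. The helper names `align_up` and `fits_rel8` and the offsets are made up for the example.

```python
def align_up(offset, align=16):
    """Round offset up to the next multiple of align."""
    return (offset + align - 1) // align * align


def fits_rel8(jmp_offset, target_offset, jmp_len=2):
    """True if a short jump located at jmp_offset can reach target_offset.

    The displacement is relative to the byte after the jump instruction
    and must fit in a signed 8-bit value.
    """
    disp = target_offset - (jmp_offset + jmp_len)
    return -128 <= disp <= 127


# Hypothetical layout: a jump at offset 0 and its target at offset 129.
unaligned_target = 129
aligned_target = align_up(unaligned_target)  # padding moves the target to 144

print(fits_rel8(0, unaligned_target))  # displacement 127, still a short jump
print(fits_rel8(0, aligned_target))    # displacement 142, needs the longer form
```

When the short form no longer fits, the assembler has to emit the longer near jump, which grows the code further and can shift other targets in turn.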

Brian Knoblauch
1

Not an expert, however... Branches to places that are not going to be in the instruction cache should benefit from alignment the most, because you'll read a whole cache line of instructions to fill the pipeline. Given that, forward branches will benefit on the first run of a function. Backward branches ("for" and "while" loops, for example) will probably not benefit, because the branch target and the following instructions have already been read into the cache. Do follow the links in Martin's answer.
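
As a small illustration of why a cold target near the end of a line gets less out of the first fill, this sketch counts how many bytes remain between a branch target and the end of its block. The 64-byte line size is an assumption, the address is made up, and `bytes_left_in_block` is a hypothetical helper name; passing 16 as the block size models a 16-byte fetch window instead.

```python
def bytes_left_in_block(addr, block=64):
    """Bytes from addr up to the end of the block-sized chunk containing it."""
    return block - (addr % block)


target = 0x103E                      # hypothetical unaligned branch target
aligned = (target + 15) // 16 * 16   # same target after 16-byte alignment

print(bytes_left_in_block(target))   # only the tail of the line is useful
print(bytes_left_in_block(aligned))  # the whole line follows the target
```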

phkahler
1

As mentioned previously, this is a very complex area. Agner Fog seems like a good place to visit. As for the complexities, I ran across an article by Torbjörn Granlund, "Improved Division by Invariant Integers". In the code he uses to illustrate his new algorithm, the first instruction at (I guess) the main label is a nop (no operation). According to the commentary, it improves performance significantly. Go figure.

Olof Forshell
  • Thanks for the resource, there are quite interesting articles there. It looks like the nop is just aligning a branch destination. – Sergei Mar 09 '11 at 09:01
  • Unfortunately, Mr Granlund has never revealed his methods of squeezing the most out of a processor - at least that I know of. His software GMPLIB might be of interest to you, if only to study his instruction usage. It's an arbitrary precision math library and has been used to calculate hundreds of millions of decimals of pi, among other things (gmplib.org). His PDF summary on instruction latencies per processor family is also very informative. – Olof Forshell Mar 09 '11 at 11:23