
At the bottom of Demystifying the Restrict Keyword is this curious advice:

Due to the order in which scheduling is done in GCC, it is always better to simplify expressions. Do not mix memory access with calculations. The code can be re-written as follows:

Then there is an example, which essentially transforms this

velocity_x[i] += acceleration_x[i] * time_step;

into this

const float ax  = acceleration_x[i];       // Then the same follows for y, z
const float vx  = velocity_x[i];           // etc for y, z
const float nvx = vx + ( ax * time_step ); // etc
velocity_x[i]   = nvx;                     // ...

Really? I would have thought this sort of transformation was trivial compared to other stuff optimising compilers have to do, such as lambda arguments to std::for_each and so on.

Is this just stale, silly advice? Or is there a good reason why GCC can't or won't do this? (It makes me worry about writing the above as velocity += acceleration * time_step using my Vector3f class!)

spraff
  • You can easily try this yourself by inspecting the generated assembler code. This you can do almost interactively using a tool like the gcc explorer: http://gcc.godbolt.org/ – PlasmaHH Feb 26 '13 at 15:46
  • Depends. But the former is more readable to me. –  Feb 26 '13 at 15:47
  • @PlasmaHH sure I can dig through the assembly for a **single example**, but I would like to understand the guiding principles. – spraff Feb 26 '13 at 15:51
  • @spraff: Then read the gcc source code. They don't have a principle that says "oh we make optimizations worse for when people intermix memory access and calculations". It just happens and depends on so many parameters that all you can do is to actually check for the examples that you are interested in. – PlasmaHH Feb 26 '13 at 15:53
  • @PlasmaHH Sorry but that’s shitty advice. Might as well close this site. Questions redundant. Read the source. – Konrad Rudolph Feb 26 '13 at 15:54
  • @KonradRudolph: Prove me wrong and show the documentation of gcc where they document their possible optimizations and general design guidelines in this regard. – PlasmaHH Feb 26 '13 at 15:56
  • I think it's just stale advice. The compiler version he's using is very old, and he's talking about micro-optimizations. I don't have an assembly listing to prove it, but I'd bet the compiler backend has improved considerably since the article was written, and that GCC emits more optimal code in the straightforwardly-written case without worrying about this stuff. – GManNickG Feb 26 '13 at 16:03
  • Are we using a Cell processor and a compiler from 2006? If so, that might be the problem and not the way we write the code. – Bo Persson Feb 26 '13 at 17:14
  • That assertion assumes that a target processor can't pipeline memory access (quite probably processor cached memory) and simple math. That's quite an assumption isn't it? – mark Feb 26 '13 at 17:56
  • @mark, _and_ the compiler doesn't know about that, or generates naïve code. – vonbrand Feb 26 '13 at 18:54
  • @GManNickG, I'd bet that even back then (when gcc was quite bad) this already was ludicrous. I still remember poring over gcc output in 1985 (!) and being surprised at the transformations done (no, not an expert, but still). And I'd go with 0A0D, readability trumps all unless proven otherwise. – vonbrand Feb 26 '13 at 18:58

2 Answers


Edit: (I'm removing details about restrict because it deviates from the actual question being asked and is causing confusion. The OP is assuming restrict is used.)

The transformation in your question is indeed trivial for an optimizing compiler, but that is not what Acton's paper is suggesting.

Here is the transformation done in the paper:

This code...

  for (size_t i=0;i<count*stride;i+=stride)
  {
    velocity_x[i] += acceleration_x[i] * time_step;
    velocity_y[i] += acceleration_y[i] * time_step;
    velocity_z[i] += acceleration_z[i] * time_step;
    position_x[i] += velocity_x[i]     * time_step;
    position_y[i] += velocity_y[i]     * time_step;
    position_z[i] += velocity_z[i]     * time_step;
  }

... was transformed into this code:

  for (size_t i=0;i<count*stride;i+=stride)
  {
    const float ax  = acceleration_x[i];
    const float ay  = acceleration_y[i];
    const float az  = acceleration_z[i];
    const float vx  = velocity_x[i];
    const float vy  = velocity_y[i];
    const float vz  = velocity_z[i];
    const float px  = position_x[i];
    const float py  = position_y[i];
    const float pz  = position_z[i];

    const float nvx = vx + ( ax * time_step );
    const float nvy = vy + ( ay * time_step );
    const float nvz = vz + ( az * time_step );
    const float npx = px + ( vx * time_step );
    const float npy = py + ( vy * time_step );
    const float npz = pz + ( vz * time_step );

    velocity_x[i]   = nvx;
    velocity_y[i]   = nvy;
    velocity_z[i]   = nvz;
    position_x[i]   = npx;
    position_y[i]   = npy;
    position_z[i]   = npz;
  }

What is the optimization?

The optimization is not - as suggested - the separation of 1 expression into 3 expressions.

The optimization is the insertion of useful instructions between the instructions that operate on any particular piece of data.

If you follow the data moving from velocity_x[i] to vx to nvx back to velocity_x[i], the CPU is doing other work between each of those steps.
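To make the load/compute/store grouping concrete, here is a minimal, self-contained sketch in C99. It is an illustration, not the paper's code: the names integrate_simple, integrate_grouped and check_forms_agree are made up, the stride is dropped, and only the x and y velocity updates are kept. One thing worth noticing in the full loops above is that the second loop computes npx from the old vx, whereas the first loop's position update reads the already-updated velocity_x[i], so the position results of the two are not identical. The sketch sidesteps that by covering the velocity update only.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: each statement loads, computes and stores in one go. */
void integrate_simple(float *restrict velocity_x, float *restrict velocity_y,
                      const float *restrict acceleration_x,
                      const float *restrict acceleration_y,
                      float time_step, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        velocity_x[i] += acceleration_x[i] * time_step;
        velocity_y[i] += acceleration_y[i] * time_step;
    }
}

/* Hand-scheduled form: all loads first, then the arithmetic, then the stores. */
void integrate_grouped(float *restrict velocity_x, float *restrict velocity_y,
                       const float *restrict acceleration_x,
                       const float *restrict acceleration_y,
                       float time_step, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        const float ax = acceleration_x[i];
        const float ay = acceleration_y[i];
        const float vx = velocity_x[i];
        const float vy = velocity_y[i];

        const float nvx = vx + (ax * time_step);
        const float nvy = vy + (ay * time_step);

        velocity_x[i] = nvx;
        velocity_y[i] = nvy;
    }
}

/* Runs both forms on the same small input and checks that the results match.
   The inputs are chosen so every intermediate value is exactly representable,
   which means rounding cannot make the two forms differ. */
void check_forms_agree(void)
{
    const float ax[4] = {2.0f, 4.0f, 6.0f, 8.0f};
    const float ay[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float vx1[4] = {1.0f, 2.0f, 3.0f, 4.0f}, vy1[4] = {0};
    float vx2[4] = {1.0f, 2.0f, 3.0f, 4.0f}, vy2[4] = {0};

    integrate_simple(vx1, vy1, ax, ay, 0.5f, 4);
    integrate_grouped(vx2, vy2, ax, ay, 0.5f, 4);

    for (size_t i = 0; i < 4; ++i) {
        assert(vx1[i] == vx2[i]);
        assert(vy1[i] == vy2[i]);
    }
}
```

Both functions compute the same velocities; what may differ is the order of the instructions the compiler emits. Compiling the file with something like gcc -std=c99 -O3 -S and reading the two loops side by side is one way to see whether your compiler version schedules them differently.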

Why is this an optimization?

Modern CPUs typically have a pipelined architecture.

Because instructions execute in stages, the CPU can have several instructions in flight at the same time. However, when an instruction needs the result of an earlier instruction that hasn't finished executing, the pipeline stalls. On an in-order core, no further instructions are executed until the stalled instruction can run.
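A generic way to picture such a dependency is a sum with one accumulator versus two. This is not from the paper; the function names sum_single_chain and sum_two_chains are hypothetical, and the sketch only shows what "depends on the previous result" looks like in source form. Whether the second version is actually faster depends on the compiler and CPU, and an optimizing compiler may well transform both anyway.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition needs the result of the previous one. */
uint32_t sum_single_chain(const uint32_t *data, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
}

/* Two accumulators: the additions into s0 and s1 are independent of each
   other, so the hardware may overlap them instead of waiting. */
uint32_t sum_two_chains(const uint32_t *data, size_t count)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < count; i += 2) {
        s0 += data[i];
        s1 += data[i + 1];
    }
    if (i < count)      /* one element left over when count is odd */
        s0 += data[i];
    return s0 + s1;
}
```

Unsigned integers are used so that both functions return exactly the same value; with floats, reordering the additions could change the rounding.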

Why isn't my optimizing compiler doing this automatically?

Some do.

GCC stands out as being relatively poor with this optimization.

I disassembled both loops above using gcc 4.7 (x86-64 architecture, optimization at -O3). Similar assembly was produced, but the order of the instructions was different and the first version produced significant stalls where a single float would be loaded, changed, and stored within the span of a few instructions.

You can read a little about gcc's instruction scheduling here, or just search the web for gcc instruction scheduling to see a lot of frustrated articles about this issue.

Drew Dormann
  • The question is not "why does this transformation work?", it is "why doesn't this transformation happen in the compiler?" It looks simple enough, and assuming `velocity_x` etc have `restrict` applied it looks like exactly what the optimiser should be doing anyway. I don't see how manually separating the steps is adding any kind of hint. – spraff Feb 26 '13 at 16:10
  • That article is from 2006. That is eons in GCC evolution. Is that still true? – wilx Feb 26 '13 at 16:10
  • I suppose the real point is that **if I have to write the full `const float...` version myself, why bother with `restrict` at all?** – spraff Feb 26 '13 at 16:11
  • @spraff: I personally would argue that this happens in the compiler, and would like to see anything that backs up the claim given in this answer. – PlasmaHH Feb 26 '13 at 16:31
  • @LieRyan, not at all - aliasing is a well known problem for C and C++ optimizers and has been for a long time. It's not something that can be fixed by better optimizers, it's inherent to the code structure. – Mark Ransom Feb 26 '13 at 16:59
  • @MarkRansom: maybe it is, but it's still groundless to make any performance claim without an actual benchmark. – Lie Ryan Feb 26 '13 at 17:02
  • Hang on, isn't the WHOLE POINT of the `restrict` keyword that I don't have to do this separation manually? – spraff Feb 26 '13 at 18:11
  • I agree with spraff here: "it would break the code when position_x and velocity_x point to the same array" - if we apply `restrict` we tell the compiler explicitly that this can't happen, so where's the problem? – Voo Feb 27 '13 at 06:19
  • The post http://blog.stuffedcow.net/2012/07/compiling-a-contrived-chunk-of-code/ seems to say that gcc 4.7 is reasonable for instruction scheduling, on one test. – Joseph Quinsey Feb 25 '14 at 18:26

In my opinion, this is stale/silly advice. That level of detail is specific to the compiler, compiler version, processor, processor version, etc. I'd stick with readability and let the compiler do its job. If someone is that worried about a possible clock cycle or two on a certain target, write some #ifdef'd assembly for that target and leave the higher-level code in place for other targets and as a reference.

mark