Edit: (I'm removing the details about restrict because they deviate from the actual question being asked and are causing confusion. The OP is assuming restrict is used.)
The transformation in your question is indeed trivial for an optimizing compiler, but that is not what Acton's paper is suggesting.
Here is the transformation done in the paper:
This code...
for (size_t i=0;i<count*stride;i+=stride)
{
    velocity_x[i] += acceleration_x[i] * time_step;
    velocity_y[i] += acceleration_y[i] * time_step;
    velocity_z[i] += acceleration_z[i] * time_step;
    position_x[i] += velocity_x[i] * time_step;
    position_y[i] += velocity_y[i] * time_step;
    position_z[i] += velocity_z[i] * time_step;
}
... was transformed into this code:
for (size_t i=0;i<count*stride;i+=stride)
{
    const float ax = acceleration_x[i];
    const float ay = acceleration_y[i];
    const float az = acceleration_z[i];
    const float vx = velocity_x[i];
    const float vy = velocity_y[i];
    const float vz = velocity_z[i];
    const float px = position_x[i];
    const float py = position_y[i];
    const float pz = position_z[i];
    const float nvx = vx + ( ax * time_step );
    const float nvy = vy + ( ay * time_step );
    const float nvz = vz + ( az * time_step );
    const float npx = px + ( vx * time_step );
    const float npy = py + ( vy * time_step );
    const float npz = pz + ( vz * time_step );
    velocity_x[i] = nvx;
    velocity_y[i] = nvy;
    velocity_z[i] = nvz;
    position_x[i] = npx;
    position_y[i] = npy;
    position_z[i] = npz;
}
What is the optimization?
The optimization is not, as suggested, the separation of one expression into three expressions.
The optimization is the insertion of useful instructions between the instructions that operate on any particular piece of data.
If you follow the data moving from velocity_x[i] to vx to nvx and back to velocity_x[i], the CPU is doing other work between each of those steps.
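To make that concrete, here is a trimmed sketch of the transformed loop, reduced to the x and y components, with my own annotations (this is not from the paper) marking the steps of the velocity_x chain and the independent work that sits between them:

#include <stddef.h>

void integrate_xy(float *velocity_x, float *velocity_y,
                  float *position_x, float *position_y,
                  const float *acceleration_x, const float *acceleration_y,
                  size_t count, size_t stride, float time_step)
{
    for (size_t i=0;i<count*stride;i+=stride)
    {
        const float ax = acceleration_x[i];
        const float ay = acceleration_y[i];
        const float vx = velocity_x[i];            /* step 1: velocity_x[i] -> vx      */
        const float vy = velocity_y[i];            /* unrelated loads fill the gap     */
        const float px = position_x[i];
        const float py = position_y[i];
        const float nvx = vx + ( ax * time_step ); /* step 2: vx -> nvx                */
        const float nvy = vy + ( ay * time_step ); /* independent arithmetic between   */
        const float npx = px + ( vx * time_step );
        const float npy = py + ( vy * time_step );
        velocity_x[i] = nvx;                       /* step 3: nvx -> velocity_x[i]     */
        velocity_y[i] = nvy;
        position_x[i] = npx;
        position_y[i] = npy;
    }
}

Between each of those numbered steps, the loads and arithmetic for the other streams give the CPU something useful to do while it waits.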
Why is this an optimization?
Modern CPUs typically have a pipelined architecture.
Because instructions are executed in stages, the CPU can have several instructions in flight at the same time. However, when an instruction needs the result of another instruction that has not yet completed, the pipeline stalls: no further instructions are issued until the waiting instruction can proceed.
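As a deliberately simple illustration (my own, unrelated to the paper's code): summing an array with a single accumulator forms one long dependency chain, so every add has to wait for the previous add to finish, while splitting the sum across several independent accumulators gives the pipeline other adds to issue while it waits.

#include <stddef.h>

float sum_serial(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];                    /* each add waits on the previous add */
    return s;
}

float sum_interleaved(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += a[i];                   /* four independent chains, so these  */
        s1 += a[i + 1];               /* adds can overlap in the pipeline   */
        s2 += a[i + 2];               /* instead of stalling one another    */
        s3 += a[i + 3];
    }
    for (; i < n; ++i)                /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

(A compiler will not normally make this particular change to floating-point code on its own, because regrouping the additions can change the rounding; the point here is only to show what a dependency stall looks like and how independent work hides it.)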
Why isn't my optimizing compiler doing this automatically?
Some do.
GCC stands out as being relatively poor with this optimization.
I disassembled both loops above using gcc 4.7 (x86-64, -O3). The assembly produced was similar, but the instructions were ordered differently: in the first version a single float was loaded, modified, and stored back within the span of a few instructions, which produces significant stalls.
You can read a little about gcc's instruction scheduling in the GCC documentation, or just search the web for gcc instruction scheduling to see a lot of frustrated articles about this issue.
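If you want to reproduce the comparison yourself, a minimal harness like the one below works: put each loop shape in its own function (trimmed here to the x component to keep it short; the file and function names are my own), compile with gcc -O3 -S, and compare the instruction order of the two functions in the generated .s file (or run objdump -d on the object file).

/* integrate.c -- build with:  gcc -O3 -S integrate.c  */
#include <stddef.h>

void integrate_original(float *velocity_x, float *position_x,
                        const float *acceleration_x,
                        size_t count, size_t stride, float time_step)
{
    for (size_t i=0;i<count*stride;i+=stride)
    {
        velocity_x[i] += acceleration_x[i] * time_step;
        position_x[i] += velocity_x[i] * time_step;
    }
}

void integrate_scheduled(float *velocity_x, float *position_x,
                         const float *acceleration_x,
                         size_t count, size_t stride, float time_step)
{
    for (size_t i=0;i<count*stride;i+=stride)
    {
        const float ax = acceleration_x[i];
        const float vx = velocity_x[i];
        const float px = position_x[i];
        const float nvx = vx + ( ax * time_step );
        const float npx = px + ( vx * time_step );
        velocity_x[i] = nvx;
        position_x[i] = npx;
    }
}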