I was refactoring some performance-critical code and, when I didn't get the results I was hoping for, I examined the ASM output (which I keep from each build) and noticed that accessing the structures in C results in 3 expensive multiplications - which effectively rendered my other optimizations useless, as MUL is very slow (~70 cycles).
To get a better look at how exactly the vbcc compiler handles structures, I created this simple test method:
void TestMethod (struct SCube * C, struct SVis * V)
{
    int i;
    char idx;
    struct SWall * R;

    for (i = 0; i < V->NumLeft; i++)
    {
        idx = V->Left [i];
        R = &C [idx].Left;
        R->IsVisible = 1;
    }
}
Here are the data structures (note that the sizes in the comments are just non-aligned estimates to give a rough idea of how big each struct is - I am aware of the 68000's alignment rules):
struct SWall /* (5+4*4+2*1,292) = 2,605 Bytes */
{
    char PotentiallyVisible;
    char IsVisible;
    char IsVertical;
    char MaterialID;
    char LightIdx;
    struct SPoint Point [4];
    struct SBresenham L1, L2;
};

struct SCube /* 6*2,605 + 1 + 8 = 15,639 Bytes */
{
    struct SWall Front, Back, Left, Right, Top, Bottom;
    bool IsPartialBack;
    short int Imgxp, Imgyp, Imgxl, Imgyl;
};

struct SVis
{
    int NumLeft, NumRight, NumTop, NumBottom;
    char Left [8], Right [9], Top [8], Bottom [8];
};
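For reference, a quick sanity-check sketch like the one below (the function name is made up, not part of my real code) prints the padded sizes the compiler actually uses - which is where the 15,648 in the ASM further down comes from, versus my rough 15,639 estimate:

#include <stdio.h>

/* Sanity-check sketch only: prints the padded struct sizes the compiler uses. */
void PrintStructSizes (void)
{
    printf ("sizeof(SWall) = %lu\n", (unsigned long) sizeof (struct SWall));
    printf ("sizeof(SCube) = %lu\n", (unsigned long) sizeof (struct SCube));
    printf ("sizeof(SVis)  = %lu\n", (unsigned long) sizeof (struct SVis));
}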
And here is the resulting ASM code, produced at the -O2 optimization level (I haven't examined the O3 output yet, but the performance difference at O3 is negligible (~2.5%), it takes about 10x longer to compile, and it introduces other issues). I added some comments to make it more readable:
public _SScreen_vis1
cnop 0,4
_SScreen_vis1
movem.l l7543,-(a7)
move.l (4+l7545,a7),a3
move.l (8+l7545,a7),a2 a2 = V (first member is NumLeft)
moveq #0,d1 d1 = i
tst.l (a2) Early Out (bypass the loop)
ble l7542
lea (16,a2),a1
l7541 Loop Start
move.b (a1)+,d0 d0 = idx = V->Left [i]
ext.w d0
ext.l d0
move.l #15648,d2 d2 = #15648 (sizeof SCube = 15,648)
move.l d0,d3 d3 = d0 = idx
move.l d2,d4 d4 = #15648
swap d3
swap d4
mulu.w d2,d3 d3 = (d3 * d2) = idx * #15648
mulu.w d0,d4 d4 = (d4 * d0) = #15648 * idx
mulu.w d2,d0 d0 = (d0 * d2) = idx * #15648
add.w d4,d3 d3 = (d3 + d4) = (idx * #15648) + (#15648 * idx)
swap d3
clr.w d3
add.l d3,d0
lea (a3,d0.l),a0 a0 = R
move.b #1,(5213,a0) R->IsVisible = 1
addq.l #1,d1 i++
cmp.l (a2),d1
blt l7541 Loop End
l7542
l7543 reg a2/a3/d2/d3/d4
movem.l (a7)+,a2/a3/d2/d3/d4
l7545 equ 20
rts
I went over the whole ASM listing, and at every single place where I access a structure there is that ~11-op combo with 3 MULs. I can understand 1 MUL, but not 3.
What options do I have to speed up access to the structs? I can think of these:
- Switch compiler to gcc (will happen eventually, just not now)
- Pointer arithmetic - I am hoping the compiler would use an ADD instruction when doing something like ptr++ (see the first sketch after this list)
- Use SoA instead of AoS (i.e. structure of arrays instead of array of structures) - the idea being that the pointers would point to primitive types (char, short int, int, bool), so the compiler should be able to avoid the MUL instruction (see the second sketch below)
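Here is a minimal sketch of the pointer-arithmetic idea. Note that it only applies directly to a linear pass over the cubes (the original loop indexes C by idx = V->Left [i], which is not sequential), and NumCubes is an assumed parameter, not something from my real code:

void TestMethodPtr (struct SCube * C, int NumCubes)
{
    struct SCube * pC = C;
    struct SCube * pEnd = C + NumCubes;

    /* Hope: the compiler turns pC++ into a single ADD of sizeof(struct SCube)
       instead of multiplying an index by the struct size on every iteration. */
    for (; pC < pEnd; pC++)
        pC->Left.IsVisible = 1;
}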
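And a sketch of the SoA idea for the same test: each hot field gets its own array, so indexing only scales by the element size (1 for char) and no MUL should be needed. MAX_CUBES and the field names are made up purely for illustration:

#define MAX_CUBES 64  /* assumed capacity, just for this sketch */

/* One array per field the hot loops touch, instead of an array of SCube. */
struct SCubesSoA
{
    char LeftIsVisible          [MAX_CUBES];
    char LeftPotentiallyVisible [MAX_CUBES];
    /* ... remaining per-wall fields split out the same way ... */
};

void TestMethodSoA (struct SCubesSoA * C, struct SVis * V)
{
    int i;

    for (i = 0; i < V->NumLeft; i++)
        C->LeftIsVisible [(unsigned char) V->Left [i]] = 1;  /* char array: scale by 1, no MUL */
}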
Are there any other proven approaches I could try (save for coding it in straight ASM) to speed up access to arrays of structures?