I was refactoring some performance-critical code and, when I didn't get the results I was hoping for, I examined the ASM code (which I keep from each build) and noticed that accessing structures in C results in 3 expensive multiplications (which effectively rendered the other optimizations in the code useless, as MUL is very slow - ~70 cycles).

To get a better look at how exactly the vbcc compiler handles structures, I created this simple test method:

void TestMethod (struct SCube * C, struct SVis * V)
{
    int i;
    char idx;
    struct SWall * R;

    for (i = 0; i < V->NumLeft; i++)
    {
        idx = V->Left [i];
        R = &C [idx].Left;

        R->IsVisible = 1;
    }
}

Here are the data structures (note that the sizes in the comments are just non-aligned estimates to give a rough idea how big each struct is; I am aware of the 68000's alignment rules):

struct SWall /* (5+4*4+2*1,292) = 2,605 Bytes */
{
    char PotentiallyVisible;    
    char IsVisible;             
    char IsVertical;            
    char MaterialID;            
    char LightIdx;              
    struct SPoint Point [4];    
    struct SBresenham L1, L2;
};

struct SCube /* 6*2,605 + 1 + 8 = 15,639 Bytes */
{
    struct SWall Front, Back, Left, Right, Top, Bottom;
    bool IsPartialBack;
    short int Imgxp, Imgyp, Imgxl, Imgyl;
};

struct SVis
{
    int NumLeft, NumRight, NumTop, NumBottom;
    char Left [8], Right [9], Top [8], Bottom [8];
};

And here is the resulting ASM code (produced at the -O2 optimization level; I haven't examined the -O3 output yet, but the performance difference at -O3 is negligible (~2.5%), and it takes about 10x longer to compile and introduces other issues). I added some comments to make it more readable:

    public  _SScreen_vis1
    cnop    0,4
_SScreen_vis1
    movem.l l7543,-(a7)

    move.l  (4+l7545,a7),a3     a3 = C
    move.l  (8+l7545,a7),a2     a2 = V

    moveq   #0,d1               d1 = i
    tst.l   (a2)                Early Out  (bypass the loop)
    ble l7542
    lea (16,a2),a1
l7541                           Loop Start
    move.b  (a1)+,d0            d0 = idx = V->Left [i]
    ext.w   d0
    ext.l   d0

    move.l  #15648,d2           d2 = #15648 (sizeof SCube = 15,648)
    move.l  d0,d3               d3 = d0 = idx
    move.l  d2,d4               d4 = #15648
    swap    d3
    swap    d4
    mulu.w  d2,d3               d3 = (d3 * d2) = idx * #15648
    mulu.w  d0,d4               d4 = (d4 * d0) = #15648 * idx
    mulu.w  d2,d0               d0 = (d0 * d2) = idx * #15648
    add.w   d4,d3               d3 = (d3 + d4) = (idx * #15648) + (#15648 * idx)
    swap    d3
    clr.w   d3
    add.l   d3,d0
    lea (a3,d0.l),a0            a0 = R

    move.b  #1,(5213,a0)        R->IsVisible = 1

    addq.l  #1,d1               i++
    cmp.l   (a2),d1

    blt l7541                   Loop End

l7542
l7543   reg a2/a3/d2/d3/d4
    movem.l (a7)+,a2/a3/d2/d3/d4
l7545   equ 20
    rts

I went over the whole ASM listing, and at every single place where I access a structure there is that ~11-instruction combo with 3 MULs. I can understand 1 MUL, but not 3.

What options do I have to speed up the access to the structs ? I can think of these:

  1. Switch compiler to gcc (will happen eventually, just not now)
  2. Pointer arithmetic - I am hoping the compiler would use an ADD instruction when doing something like ptr++
  3. Use SoA instead of AoS (i.e. structure of arrays instead of array of structures) - the idea here being that the pointers would then point to intrinsic types (char, short int, int, bool), so the compiler should avoid the MUL instruction (rough sketch below).
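
Here is roughly what I mean by option 3 - just an untested sketch with made-up names, not my actual code:

    /* SoA sketch: one array per hot field, indexed by cube number.
       Indexing a char array only needs an add, so no MUL. */
    #define MAX_CUBES 64                    /* made-up capacity */

    struct SCubesSoA
    {
        char LeftIsVisible [MAX_CUBES];     /* was C [i].Left.IsVisible */
        /* ...one array per field the inner loops touch... */
    };

    void TestMethodSoA (struct SCubesSoA * C, struct SVis * V)
    {
        int i;

        for (i = 0; i < V->NumLeft; i++)
            C->LeftIsVisible [(unsigned char) V->Left [i]] = 1;
    }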

Are there any other proven approaches I could try (save for coding it in straight ASM) to speed up the access to arrays of structures ?

3D Coder
    Pad your structs to powers of 2 size, or some other size that's easily calculated without multiplication. Possibly use pointers instead of embedding to keep size down. – Jester Oct 11 '15 at 13:21
  • Which CPU takes 70 clocks for a 16 bit mul? IIRC not the 68000. You can use pointer arithmetic for the first index. And/or use a better optimising compiler. – too honest for this site Oct 11 '15 at 13:23
  • @Olaf I got a PDF open (some Freescale Semiconductor, Inc.) in front of me that has this line in the table for the "*16-bit Instruction Execution times*": ' MULU (1/0)+ '. But each 68000 should have an identical design, right ? Shouldn't matter what device it really is, it's still a 68000, no ? – 3D Coder Oct 11 '15 at 13:34
  • Please post a link. It's been a long time since I used that. Huh? A 68010 is no 68000. Which device do you actually use? (IIRC, you cannot even get an original 68000 anymore.) – too honest for this site Oct 11 '15 at 13:35
  • Found a link (See table 8-4 at page 120): http://www.freescale.com/files/32bit/doc/ref_manual/MC68000UM.pdf – 3D Coder Oct 11 '15 at 13:38
  • Hmm, and you are really using the original 68000? Ok, these are **up to** 70 clocks. Note this is the **maximum** time (wow! ARM and others are faster even with soft-multiply - and this is just 16bit*16bit!). However, the best option is to use a better compiler, e.g. gcc. If the compiler does not even optimise this properly, I'd suspect it won't be better elsewhere. – too honest for this site Oct 11 '15 at 13:43
  • Uhm, what exactly do you mean by 'original' ? I'm still confused about whether all 68000s are manufactured the same (I don't mean the 68020-60, just the 68000). On closer reading of that MUL instruction, it does seem its execution time depends on the total count of 1 bits in the source operand, so it's not exactly a constant 70 cycles. – 3D Coder Oct 11 '15 at 13:49
  • `But, each 68000 should have identical design, right ? Shouldn't matter what device it really is, it's still 68000, no`. No, each microarchitecture has its own features with different computing blocks and different timing. Look at x86 and you'll see that each generation has different clocks for the same instruction. – phuclv Oct 11 '15 at 16:14
  • @Olaf The original 68000 was a CISC ISA implemented in microcode, so of course it's slow compared to any later technology (remember the 68000 was introduced in 1978, the first ARM in 1985?). You see a *rapid* decrease in clock cycles taken with each major generation (IIRC the 68020 (1984) takes less than half, about ~30 cycles; the 68060 (1992?) takes only 2 clocks for the mul). 70 cycles was pretty good at *that* time, although *now* it's considered a snail's pace. – Durandal Oct 11 '15 at 16:31
  • @Durandal: I know the MC68000 very well. Programmed it for years in Assembler myself. I was just a bit surprised it takes _that_ long. Note that the 68010, which was not that much of a new design (and CISC/microcode, too), reduced that to ~40 clocks already. – too honest for this site Oct 11 '15 at 18:25
  • @LưuVĩnhPhúc Do you mean the design/microcode implementation differences like there were between Intel/AMD/Cyrix during 80386/80486 times ? Meaning, even though they all supported the same instruction set, their performance characteristics differed greatly (for the same instructions) ? Because if that's the case here with the 68000, then the PDF I got may very well be completely useless, as it's just a generic PDF with timings I took from the net, and is not at all representative of the actual HW platform. – 3D Coder Oct 11 '15 at 20:50
  • My point was whether you really use a 68000, or a 68010, one of the 68300 SoCs, some with a 68000 core, some with CPU32(+), or even ColdFire or - yes - 68020 and above. IIRC, the original 68000/10 are not manufactured anymore (not sure about the 68040/60). That is a whole zoo and - yes - at least STMicroelectronics and Hitachi were 2nd sources back then. Memories ... – too honest for this site Oct 11 '15 at 22:56
  • @3DCoder Even in architectures without microcode there will be differences in performance. For example, if a newer generation has a better multiplier you'll get higher mul throughput. That's why there are `-march` and similar compiler options in gcc. – phuclv Oct 12 '15 at 07:04
  • @3DCoder I presume you took the clock cycles from the original *Motorola 68000 User's Manual* (now labeled Freescale). These might or might not apply exactly to 3rd-party manufactured 68000s. *But* the numbers can be off even for an MC68000 depending on the system it's in; if it were, for example, the CBM Amiga you're talking about, there are other constraints that may increase the cycle count. – Durandal Oct 12 '15 at 20:24
  • Instead of multiplication you can always use bit shifting to achieve the same results faster – Shady Programmer Oct 14 '15 at 10:55

3 Answers

Considering idx has a small value range, you can use a table lookup for the pointer computation.

static const size_t table[] =
{
  sizeof(struct x) * 0,
  sizeof(struct x) * 1,
  ...
};

...
R = (struct x*)((char*)C + table[idx]);

Also, it is possible to use a smaller table to compute the right pointer: since the offset is sizeof(struct x) * idx, it can be split into ((sizeof(struct x) * (idx >> 4)) << 4) + sizeof(struct x) * (idx & 15), so both parts come out of the same 16-entry table. For example, let's say we have an index range of [0..255], but want to use a 16-entry table:

static const size_t table[] =
{
  sizeof(struct x) * 0,
  sizeof(struct x) * 1,
  ...
  sizeof(struct x) * 15
};
...
R = (struct x*)((char*)C + (table[idx>>4] << 4) + table[idx&15]);
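
Adapted to the structures from the question, it could look like this (a sketch, untested; it assumes the cube index stays in [0..15] - size the table to the real index range):

    #include <stddef.h>     /* size_t, offsetof */

    /* precomputed byte offsets of C [idx] for idx in [0..15] */
    static const size_t cube_off [16] =
    {
        sizeof (struct SCube) * 0,  sizeof (struct SCube) * 1,
        sizeof (struct SCube) * 2,  sizeof (struct SCube) * 3,
        sizeof (struct SCube) * 4,  sizeof (struct SCube) * 5,
        sizeof (struct SCube) * 6,  sizeof (struct SCube) * 7,
        sizeof (struct SCube) * 8,  sizeof (struct SCube) * 9,
        sizeof (struct SCube) * 10, sizeof (struct SCube) * 11,
        sizeof (struct SCube) * 12, sizeof (struct SCube) * 13,
        sizeof (struct SCube) * 14, sizeof (struct SCube) * 15
    };

    /* replaces R = &C [idx].Left; - no multiplication involved */
    R = (struct SWall *) ((char *) C + cube_off [(unsigned char) idx]
                          + offsetof (struct SCube, Left));
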
Valeri Atamaniouk
  • Wow! It's been well over a decade since I last saw a C gem like that - that I have absolutely no idea what it's supposed to do without spending at least half a minute to decipher it :-) I'm gonna try it and see what kind of code compiler generates. I'd hazard a guess this is a compile-time precalc'ed address LUT, but will examine the code further. Thanks a lot for this ! – 3D Coder Oct 12 '15 at 13:31
  • Sorry for style - I've posted it from my phone. – Valeri Atamaniouk Oct 14 '15 at 10:52
Some performance can be gained if you store values that never change in local variables, avoiding the repeated pointer calculations. E.g.:

void TestMethod (struct SCube * C, struct SVis * V)
{
    int i, NumLeft;
    char idx;
    char * Left;
    struct SWall * R;
    /***************************/
    /* hoist the loop-invariant values into locals */
    Left = V->Left;
    NumLeft = V->NumLeft;
    /***************************/

    for (i = 0; i < NumLeft; i++)
    {
        idx = Left [i];
        R = &C [idx].Left;

        R->IsVisible = 1;
    }
}
milevyo
  • Just to add: a halfway recent compiler should do this itself. – too honest for this site Oct 11 '15 at 13:25
  • @Olaf I know for sure that gcc always catches that (though it was for a non-68000 target). The vbcc compiler catches that about 50% of the time, so I have already gotten used to taking all such constants out of loops every time (just to be on the safe side). – 3D Coder Oct 11 '15 at 13:28
  • @Olaf Since there's a memory write through a pointer in the loop (`R->IsVisible`), the optimizer can't be sure that it doesn't modify `V->Left` or `V->NumLeft` without a fairly sophisticated analysis. It's not something I would expect out of a compiler like vbcc. – Ross Ridge Oct 11 '15 at 17:06
  • @RossRidge: Not sure what you mean. `R` has just been set, so it is not constant anyway. And the type of `R->IsVisible` is not a pointer, so how should that modify `V->..`? – too honest for this site Oct 11 '15 at 18:29
  • @Olaf `R` is pointer, so writing to `R->IsVisible` writes to an address calculated using `R`. How does the compiler know that `R` and `V` don't point to overlapping objects? How does the compiler know that `R->IsVisible` and `V->Left` don't refer to the same location in memory? – Ross Ridge Oct 11 '15 at 20:15
  • There is an assignment to R (R = &C [idx].Left;), but V is constant in the loop, so we can hoist V out of the loop, but not R. – milevyo Oct 11 '15 at 20:20
The 11-instruction sequence is a 32-bit multiply: the 68000's MULU only multiplies 16x16 -> 32, so a full 32x32 product has to be synthesized from three 16-bit multiplies (low*low, low*high and high*low; the high*high term only affects bits above 32 and is dropped) - that is exactly the three mulu.w instructions in the listing. The compiler emits it because the structure's sizeof is treated as a 32-bit value (since it's a compile-time constant, the compiler should easily be able to determine that it fits in 16 bits, but... well).
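
The identity behind the sequence, written out as C (illustrative only, not the compiler's actual code):

    /* A 32x32 -> low-32-bit multiply built from three 16x16 multiplies,
       mirroring the mulu.w d2,d3 / mulu.w d0,d4 / mulu.w d2,d0 triple
       in the question's listing. */
    unsigned long mul32lo (unsigned long a, unsigned long b)
    {
        unsigned long aL = a & 0xFFFFUL, aH = a >> 16;
        unsigned long bL = b & 0xFFFFUL, bH = b >> 16;

        /* aH * bH would only contribute to bits >= 32, so it is dropped */
        return (aL * bL + ((aL * bH + aH * bL) << 16)) & 0xFFFFFFFFUL;
    }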

A smarter compiler may do better (a single mul would suffice here, since both the index and the structure size fit into 16 bits and MULU already yields a 32-bit result), but since the 68000 is so ancient there are probably few choices left; the first candidate to try is probably gcc. Otherwise there may still be commercial compilers available (there were plenty in the '80s and '90s, though I haven't worked with anything 68k-related since then).

You can also see that your choice of a byte-sized index (idx) introduces overhead, because it needs to be extended to a long:

l7541                           Loop Start
    move.b  (a1)+,d0            d0 = idx = V->Left [i]
    ext.w   d0
    ext.l   d0

That's also odd, because I would have expected char to be unsigned, in which case it should be zero-extended rather than sign-extended. Using byte-sized items is also pretty pointless for indices in general, since the alignment restrictions will force the compiler to add padding anyway. And a word-sized access is no slower than a byte-sized one. The smaller type means slower code here, as counterintuitive as that is.
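
If nothing else depends on the exact byte layout of SVis, widening the index arrays removes the extensions entirely - a sketch (untested):

    /* Sketch: int-sized indices already match the register width the
       address calculation needs, so the ext.w/ext.l pair disappears.
       The arrays grow, but the byte accesses were no cheaper anyway. */
    struct SVis
    {
        int NumLeft, NumRight, NumTop, NumBottom;
        int Left [8], Right [9], Top [8], Bottom [8];
    };

(and idx in TestMethod becomes an int accordingly).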

It may be better to avoid arrays of structures altogether and instead use pointer-based data structures (that is, linked lists and the like).
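
For the loop in the question, one way to apply that is to store wall pointers instead of byte indices - a sketch with an invented field name:

    /* Sketch (untested): if SVis held e.g. struct SWall * LeftWall [8];
       instead of char indices, marking a wall visible is just a load
       and a store - no address computation, no MUL. */
    for (i = 0; i < V->NumLeft; i++)
        V->LeftWall [i]->IsVisible = 1;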

Durandal
  • The sign extension is another thing I'm pulling my hair out over, as at this particular place at least the original type actually is signed (because it's a plain char), but I have at least a dozen other instances where I explicitly use unsigned *everywhere* in the method, yet the code is cluttered with ext.w and ext.l. I'm obviously missing something here - I'd hazard a guess it's that compiler feature called 'integer promotion', where everything is regarded as an integer, but I don't know. As for the 8-bit operands, even when padded (in structures), they should take a couple of cycles less than 16-bit ones. – 3D Coder Oct 11 '15 at 20:45
  • @3DCoder Byte and word accesses both take a single bus cycle (4 clocks with no wait states) on the 68000, and arithmetic runs at the same clock counts regardless of whether it is byte or word. Instructions that encode an immediate operand generally store the operand in an extension *word*, i.e. a byte occupies only half of it (with few exceptions, like moveq). It would be different for the 68008 because of its narrower bus, and for the 68030+ because of the data cache. For the 68000, byte and word are equally slow. – Durandal Oct 12 '15 at 15:51