Why is dbra so fast for a very large loop count in Motorola 68k?

Question

I'm learning Motorola 68k assembly, and I wrote the following time wasting loop:

    move.l #0x0fffffff,%d0
    bsr timewaster
    rts

timewaster:
    dbra %d0,timewaster
    rts

This time wasting loop finishes almost immediately. I stepped through the code in a debugger to make sure that it actually subtracts d0 down to 0 (which it does). However, this other time wasting loop takes forever to finish:

    move.l #0x0fffffff,%d0
    bsr timewaster
    rts

timewaster:
    sub.l #1,%d0
    bne timewaster
    rts

So why is the code using dbra so much faster?

I ran these in a TI-89 simulator.

AFAIK, Easy68k is not a cycle-accurate simulator; the speed at which it runs things is not expected to be proportional to the speed of a real *Motorola* 68k, which is what your question title asks about. I don't know specifically, but Easy68k is open source, so if you don't get an answer, you could go digging yourself to find out why simulating sub.l and bne is slower. Maybe flag-setting is implemented less efficiently? — Peter Cordes, Jun 25 '20 at 06:41
@PeterCordes The `DBcc` operates as word-sized, while the `SUB.L` used is long-sized. — Thomas Jager, Jun 25 '20 at 12:17
Side note: as of the MC68010, the `dbcc` instruction was implemented using a two-word instruction cache, so that instruction fetch would be avoided in 2-instruction loops, such as for memory copy: `loop: move.l (a0)+,(a1)+; dbra d0,loop` — Erik Eidt, Jun 25 '20 at 13:58
Update: maybe Easy68k is cycle-accurate; e.g. [this changelog](http://www.easy68k.com/EASy68Kforum/viewtopic.php?t=2) mentions bugfixes for cycle counts. Some games on 68k platforms like Atari ST did depend on cycle counts for timing and correctness. Although since @ThomasJager noticed that the loops are using different counts, that can explain the huge factor either way. Still, quantitative timing would be nice, not just "almost immediate" vs. "forever". — Peter Cordes, Jun 25 '20 at 19:10
I only put Easy68k because the assembly tag recommended me to add a processor, and I couldn't find a Motorola 68k tag. I ran the code in a TI-89 emulator. — Jason, Jun 25 '20 at 19:56
Ok, I fixed your question for you again by removing the easy68k tag. I assume the simulator you did use is supposed to be cycle-accurate? You could at least edit your question to link the simulator you used. (Not really important now that Thomas noticed that your loops have different trip counts, but without already knowing the answer that would have been better.) — Peter Cordes, Jun 25 '20 at 21:17
Another pitfall is writing `sub.l #1,dn` instead of `subq.l #1,dn`. Former is 3 words long, while latter is just 1 word. Alternatively, write `subq.w #1,dn` for a word-sized subtraction. — lvd, Jun 29 '20 at 08:36
`DBcc` is hardcoded to operate at word length, so having `#$0FFFFFFF` as your loop count is the same as having `#$0000FFFF`. The `DBcc` instruction only modifies the "low half" (the rightmost 4 hex digits) and only looks at those 4 digits to decide whether to loop or exit. — puppydrum64, Jun 29 '23 at 12:47

Thomas Jager · Accepted Answer · 2020-06-25T14:10:53.937

While there would be some improvement due to fewer instruction fetches on a real processor, the reason for such a big difference in timing is that the two methods use different operand sizes.

From the Programmer's Reference Manual, on the page for DBcc:

If the termination condition is not true, the low-order 16 bits of the counter data register decrement by one. If the result is -1, execution continues with the next instruction. If the result is not equal to -1, execution continues at the location indicated by the current value of the program counter plus the sign-extended 16-bit displacement.

So, the DBcc instruction only manipulates and checks the lower word of the loop count register. The SUB and Bcc version will therefore take ~4000 times longer than the DBcc one. If you use SUB.W instead of SUB.L I'd expect that you get more similar run times.
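As a rough sketch of that suggestion (not verified on real hardware or in the asker's emulator), the second loop with a word-sized subtract could look like the following. It uses `subq.w` rather than `sub.w #1`, as recommended in the comments, and the `|` comment character assumes GNU as syntax, which the `%d0` register names suggest.

```asm
    move.l #0x0fffffff,%d0
    bsr timewaster
    rts

timewaster:
    subq.w #1,%d0           | decrements only the low word of d0
    bne timewaster          | loop until the low word reaches 0
    rts
```

Starting from a low word of 0xFFFF, this loop runs 0xFFFF times, one fewer than the 0x10000 of the dbra version, and should exit with 0x0FFF0000 in D0.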

The DBcc instruction will execute 0x10000 times while the BNE instruction will execute 0xFFFFFFF times.

Note that the higher-order word of the loop counter is not affected by DBcc, so your loop should exit with 0x0FFFFFFF in D0. The SUB.L/BNE version should exit with 0 in D0.
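If a full 32-bit count is what was intended, one commonly used idiom keeps `dbra` for the low word and handles the upper word separately. The following is a sketch under the assumption of GNU as syntax (`|` comments), not code from the question; the label name is simply reused from the question.

```asm
timewaster:
    dbra %d0,timewaster     | inner loop: counts the low word down to -1 (0xFFFF)
    sub.l #0x10000,%d0      | borrow one from the upper word
    bcc timewaster          | no borrow out of bit 31: low word is 0xFFFF again, keep looping
    rts                     | upper word was already 0: done, d0 = 0xFFFFFFFF
```

With an initial value of N in D0, the `dbra` executes N + 1 times in total, so 0x0FFFFFFF gives 0x10000000 executions, in line with the usual "count minus one" convention for DBcc.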

This isn't particularly related to the question, but reading through the manuals, there seems to be a slight disagreement in some places on the exact behaviour of the DBcc instruction. Specifically, they differ on what happens when the loop counter is 0 and the condition is true. Both descriptions result in the branch not being taken, but they disagree on the final value left in the loop count register.

The Programmer's Reference Manual, Revision 1 (M68000PM/AD, REV. 1) indicates that the condition being true takes precedence, and the decremented value of the loop counter is not stored back, leaving 0 in the register. The following is from the manual:

If Condition False
    Then (Dn - 1 -> Dn; If Dn != -1 Then PC + d_n -> PC)

The M68000 Microprocessors User’s Manual, Ninth Edition (MC68000UM), Appendix A (MC68010 Loop Mode Operation), says that the subtraction-by-one result takes precedence, and the result being -1 causes the result to be stored back, leaving -1 in the register. The following is constructed from the description in the manual:

If Dn - 1 == -1
    Then Dn - 1 -> Dn
Else
    If Condition False
        Then (Dn - 1 -> Dn; PC + d_n -> PC)

Normally, an exit due to the count would leave -1, while a condition exit would leave a different value (assuming that the counter didn't start at 0xFFFF). The two sources disagree on the value in the register when both exit conditions hold at once, that is, when the condition is true and the counter is 0.

I'd assume that the PRM is correct, being the authoritative source for the behaviour, and since it matches the description earlier in the UM, but the UM might be hinting at how the instruction is implemented, at least on the MC68010.
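One way to see which description a particular emulator or chip follows is to execute a `DBcc` whose condition is already true while the counter word is 0, then inspect the register in a debugger. A minimal probe, assuming GNU as syntax and using hypothetical label names:

```asm
probe:
    moveq #0,%d0            | d0 = 0; moveq also sets the Z flag
    dbeq %d0,probe_done     | EQ is true and the counter is 0, so the branch is not taken
probe_done:
    rts                     | inspect d0 here: low word 0x0000 matches the PRM, 0xFFFF the UM appendix reading
```

The branch target is just the next instruction, so control flow is the same whichever way the instruction behaves; only the value left in D0 differs.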
