In the ARM NEON documentation, it says:

[...] some pairs of instructions might have to wait until the value is written back to the register file.

I haven't come across a list that defines the instruction pairs that can use forwarded results and the instruction pairs that have to wait for write back.

Does anyone know of a table or documentation that lists these pairs?

Anthony Blake

3 Answers


Integer multiply accumulates.

The section at the end of http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/ch16s06s03.html is helpful:

If a multiply-accumulate follows a multiply or another multiply-accumulate, and depends on the result of that first instruction, then if the dependency between both instructions are of the same type and size, the processor uses a special multiplier accumulator forwarding. This special forwarding means the multiply instructions can issue back-to-back because the result of the first instruction in N5 is forwarded to the accumulator of the second instruction in N4. If the size and type of the instructions do not match, then Dd or Qd is required in N3. This applies to combinations of the multiply-accumulate instructions VMLA, VMLS, VQDMLA, and VQDMLS, and the multiply instructions VMUL and VQDMUL

Don't assume that floating point multiply accumulates work in the same way. I haven't used floating point NEON instructions for anything performance-critical, so I can't offer any experience here, but make sure you read and understand the note at the end of http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/BCGDCECC.html
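
As a rough illustration, the rule quoted above can be written as a small predicate. This is only a sketch of the quoted text, not a model of the hardware: the function name `uses_mac_forwarding` and the helper `split_instr` are hypothetical, "same type and size" is approximated by comparing the data-type suffix exactly, and floating point suffixes are conservatively reported as not covered, since the quoted rule should not be assumed to apply to them. It also assumes the second instruction accumulates into the destination of the first.

```python
# Sketch of the quoted integer multiply-accumulate forwarding rule.
# Mnemonics are taken verbatim from the quoted documentation.
MULTIPLIES = {"VMUL", "VQDMUL"}
MULTIPLY_ACCUMULATES = {"VMLA", "VMLS", "VQDMLA", "VQDMLS"}


def split_instr(instr):
    """Split e.g. 'VMLA.I16' into ('VMLA', 'I16'). Hypothetical helper."""
    op, _, dtype = instr.upper().partition(".")
    return op, dtype


def uses_mac_forwarding(producer, consumer):
    """True if the quoted rule says the pair can issue back-to-back.

    Assumption: the consumer's accumulator is the producer's destination.
    Assumption: 'same type and size' means an identical suffix.
    """
    p_op, p_type = split_instr(producer)
    c_op, c_type = split_instr(consumer)
    if c_op not in MULTIPLY_ACCUMULATES:
        return False  # only a multiply-accumulate can receive the forward
    if p_op not in MULTIPLIES | MULTIPLY_ACCUMULATES:
        return False  # producer must be a multiply or multiply-accumulate
    if p_type.startswith("F") or c_type.startswith("F"):
        return False  # floating point: not covered by this sketch
    return p_type != "" and p_type == c_type


print(uses_mac_forwarding("VMUL.I16", "VMLA.I16"))
print(uses_mac_forwarding("VMUL.I16", "VMLA.I32"))
```

When the predicate is false, the quoted text says Dd or Qd is required in N3 instead, so the second instruction cannot issue back-to-back.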

robbie_c

Does anyone know of a table or documentation that lists these pairs?

There are thousands of such pairs, so they can't all be listed.
For example:

VADD.F32 q0,q0,q1
VMUL.F32 q3,q0,q2

The first instruction writes back its result in the 4th cycle, while the second instruction needs that result (q0) as a source in its 2nd cycle. Because the source is not ready yet, there is a stall (a pipeline "hole") between these two instructions.
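
The arithmetic behind such a stall can be sketched as follows. This is a deliberately simplified model, not the processor's actual behaviour: the function `stall_cycles` and its parameters are hypothetical, both cycle numbers are counted from each instruction's own issue, the value is assumed usable in the cycle it is written back, and forwarding paths are ignored.

```python
def stall_cycles(result_cycle, source_cycle, issue_gap=1):
    """Estimate stall cycles between a producer and a dependent consumer.

    result_cycle: cycle (from the producer's issue) in which the result is written
    source_cycle: cycle (from the consumer's issue) in which the source is read
    issue_gap:    cycles between issuing the producer and the consumer
    Simplified model: no forwarding, value usable in its write-back cycle.
    """
    # Cycle, relative to the producer's issue, at which the consumer reads.
    needed_at = issue_gap + source_cycle
    # If the result lands later than that, the consumer waits the difference.
    return max(0, result_cycle - needed_at)


# VADD.F32 q0,q0,q1 followed directly by VMUL.F32 q3,q0,q2,
# using the cycle numbers from the example above.
print(stall_cycles(result_cycle=4, source_cycle=2))  # 1 under this simplified model
```

Scheduling independent instructions between the two (a larger `issue_gap`) shrinks the stall to zero in this model.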

To calculate these stalls, you can use the following online tool:
http://pulsar.webshaker.net/ccc/result.php?lng=us

n0p
  • Yes, I realize there would be a huge space of *all* pairs, but I was hoping there was a table with general pairs. E.g., an arithmetic instruction has to wait until write back of a load instruction. – Anthony Blake Dec 07 '11 at 00:01
  • I wasn't asking about the cycle timings, because they are listed in the documentation. In the Cortex-A9 documentation, some instructions *don't* have to wait until write back -- they can use a forwarded result sooner. So I'm asking which pairs can use the result after write back, and which can use the forwarded results -- both timings are given in the A9 documentation, but it is unclear which timing is used with a given pair. – Anthony Blake Dec 07 '11 at 00:03
  • I think the OP is talking about A9, not A8. The A9 documentation states both "result" and "writeback", and the difference is not exactly declared. – Jake 'Alquimista' LEE Dec 07 '11 at 08:01

Broadly speaking, what you would reasonably expect to forward, forwards. vmul.f32 forwards to vadd.f32 and the like.

I don't believe that the exact forwarding paths are precisely documented anywhere in the manner you're looking for. I haven't found them, anyway. If you do find them, be sure to let us know where. It is, of course, not too hard to determine for any given pair of instructions whether or not forwarding occurs, but that's not a general solution. Sorry.

Stephen Canon