
I have a simple piece of assembly code that loads 12 NEON quad registers and interleaves a pairwise-add instruction with each load (to exploit the dual-issue capability). I have verified the code here:

http://pulsar.webshaker.net/ccc/sample-d3a7fe78

As one can see, the code takes around 13 cycles there. But when I run the code on the board, the load instructions seem to take more than one cycle per load. I checked and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one cycle. Why is that?
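
The loop follows this kind of interleaved pattern (a simplified sketch rather than the full 12-register loop; the register numbers and the pld offset shown here are only illustrative):

    @ simplified sketch of the interleaved pattern, not the full 12-register loop
    @ r0 = 16-byte-aligned source pointer, q12-q14 = accumulators
    vld1.64    {d0, d1}, [r0,:128]!    @ load q0 (128 bits, aligned)
    vpadal.u16 q12, q0                 @ pairwise add-accumulate, intended to dual-issue
    vld1.64    {d2, d3}, [r0,:128]!    @ load q1
    vpadal.u16 q13, q1
    vld1.64    {d4, d5}, [r0,:128]!    @ load q2
    vpadal.u16 q14, q2
    pld        [r0, #192]              @ the prefetch that was tried
    @ ...and so on for the remaining quad registers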

I have taken care of the following:

  1. The address is 16-byte aligned.
  2. Have provided the alignment hint in the instruction: vld1.64 {d0, d1}, [r0,:128]!
  3. Tried the preload instruction pld [r0, #192] at various places, but that seems to add cycles instead of actually reducing the latency.

Can someone tell me what I am doing wrong and why this latency occurs?

Other Details:

  • With reference to Cortex-A8
  • arm-2009q1 cross-compiler toolchain
  • Coding in assembly
nguns
  • Does this reflect reality better? http://pulsar.webshaker.net/ccc/beta-sample-d3a7fe78 (using the 'beta' simulator) – Aki Suihkonen Feb 14 '13 at 08:21
  • @AkiSuihkonen, how is that possible? VPADAL and VLD should be able to run in parallel, but it doesn't look like they do from the simulator link you gave. Also, why does NEON have to start so late? – nguns Feb 14 '13 at 08:44
  • Are you expecting loads to complete in one cycle? Depending on the distance (L1, L2, dynamic RAM) they can take many cycles. I believe it would be much more beneficial to issue the loads first and then do the adds. – auselen Feb 14 '13 at 09:14
  • @auselen, the TRM says 'VLD1 2-reg (@128)' should take only 1 cycle. Since I'm loading only 128 bits of data per cycle, have specified the alignment hint, and am making sure the address begins at a 128-bit boundary, the loads should complete in 1 cycle, right? – nguns Feb 14 '13 at 09:32
  • No way :) You are using DDR memory, right? Which TRM and page are you referring to? Maybe I can find a descriptive answer there for you. – auselen Feb 14 '13 at 09:41
  • If some manual says a load is 1 cycle, it probably means it takes 1 cycle to put the load into the load queue. After it is moved to the load queue, the CPU will make its best effort to load the data into your destination register. However, the time for the 'system' to fetch the data from wherever it is (cache, RAM, even disk :) ) can take a considerable number of cycles. – auselen Feb 14 '13 at 09:50
  • @auselen page: 16-28 of [DDI0344K_cortex_a8_r3p2_trm](http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/DDI0344K_cortex_a8_r3p2_trm.pdf) – nguns Feb 14 '13 at 09:51
  • Also, as you can see in the same table on the same page, it says the result registers (Dn and Dn+1) should be available in stage N1, which means in the same cycle, right? – nguns Feb 14 '13 at 10:04
  • Sorry, I couldn't find an easy copy-paste. However, the answer to your question/confusion is that CPUs cannot make timing guarantees on external memories, whether that is a cache (if it is not tightly integrated) or, worse, external memory. That's why people talk about DDR2, DDR3, etc.: they have different performance characteristics. You should read about your whole system at this stage to understand how much stall you can get from L1, L2 and RAM. – auselen Feb 14 '13 at 10:07
  • Just to add one more thing (I'll try to create a proper answer later if someone doesn't do it before me): the timing in the TRM is for "executing" / "issuing", so I believe it refers to putting your load request in the load/store queue. – auselen Feb 14 '13 at 10:18
  • Sure. I'd be glad if you could also quote text from the TRM or other related documents as a reference, as you always do. I'll be waiting for your answer. – nguns Feb 14 '13 at 10:42
  • The manual indeed claims that loading qX @128 is a single-cycle operation, but it has to presuppose something, e.g. that the address has been prefetched. – Aki Suihkonen Feb 14 '13 at 11:11
  • Pre-fetched into the cache? The pipeline? PLD should be able to pre-fetch, right? I did use PLD, but it doesn't seem to help. – nguns Feb 14 '13 at 11:26

1 Answer


Your code is executing much slower than expected because, as it's currently written, it's creating a perfect storm of pipeline stalls. On any modern pipelined CPU, an instruction can execute in one cycle only under ideal conditions: it isn't waiting for memory and it has no register dependencies. The way you've written the code, you're not allowing for the delay in reading from memory, and you're making the next instruction dependent on the result of that read. This causes the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:

    veor.u16 q12,q12,q12     @ clear accumulated sum
top_of_loop:
    vld1.u16 {q0,q1},[r0,:128]!    @ issue a batch of loads first...
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0              @ ...then consume the data after it has had time to arrive
    vpadal.u16 q12,q1
    vpadal.u16 q12,q2
    vpadal.u16 q12,q3
    vld1.u16 {q0,q1},[r0,:128]!    @ second batch of loads
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0
    vpadal.u16 q12,q1
    vpadal.u16 q12,q2
    vpadal.u16 q12,q3
    subs r1,r1,#8                  @ 8 quad registers (128 bytes) processed per iteration
    bne top_of_loop

Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to complete before you can use the target register.
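
Another variation to experiment with, sketched below under the same assumptions (u16 data, r1 counting eight quad registers per iteration), is spreading the accumulation across several registers so that consecutive vpadal instructions don't all wait on the same destination, then folding the partial sums once after the loop:

    veor.u16 q12,q12,q12     @ four independent accumulators
    veor.u16 q13,q13,q13
    veor.u16 q14,q14,q14
    veor.u16 q15,q15,q15
top_of_loop2:
    vld1.u16 {q0,q1},[r0,:128]!
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0        @ each vpadal writes a different accumulator
    vpadal.u16 q13,q1
    vpadal.u16 q14,q2
    vpadal.u16 q15,q3
    vld1.u16 {q0,q1},[r0,:128]!
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0
    vpadal.u16 q13,q1
    vpadal.u16 q14,q2
    vpadal.u16 q15,q3
    subs r1,r1,#8
    bne top_of_loop2
    vadd.u32 q12,q12,q13     @ vpadal.u16 widens to 32-bit lanes, so fold with 32-bit adds
    vadd.u32 q14,q14,q15
    vadd.u32 q12,q12,q14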

Note: Using Q4-Q7 is risky because they're non-volatile registers. On Android you will get random garbage appearing in these (especially Q4).
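
If you do need Q4-Q7, the usual way to stay safe (shown here as a sketch with a hypothetical routine name) is to preserve D8-D15, which alias Q4-Q7, with vpush/vpop around your code:

my_neon_routine:                 @ hypothetical name
    vpush   {d8-d15}             @ save the callee-saved registers Q4-Q7 (= D8-D15)
    @ ... NEON work that is now free to use q4-q7 ...
    vpop    {d8-d15}             @ restore them before returning
    bx      lr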

BitBank
  • Thanks for your answer. I have verified my code on the board, and the stalling seems to happen for requests over 64 bytes (one cache line on Cortex-A8), i.e. after four load requests in my code, but I'm not sure about the reason yet. So the `vpadal` immediately after the `vld` load request doesn't seem to be the problem. I tried your code and boy..!! the stalls gave me a mini heart attack..!! The problem is that using the same register to accumulate the result leads to heavy stalling; read @auselen's answer [here](http://stackoverflow.com/questions/12932940/using-neon-multiply-accumulate-on-ios) – nguns Feb 19 '13 at 05:58
  • Good to know about using the same register to accumulate the result; interleaving multiple registers will solve that. Another thing to note is not to use too many PLDs in a row, or you will also stall the memory reads, since the system will be waiting to fill all of those read requests. Memory speed is very dependent on your hardware. Lately I have been working with Qualcomm MSM8074s with DDR3 memory, and the memory throughput is quite impressive; instruction stalls and memory stalls are about equal on that system. – BitBank Feb 19 '13 at 15:34