
I wrote some NEON code in assembly, aiming for maximum optimization. The numbers seem satisfactory, but I wanted to understand whether it could be optimized further. I then came across an online tool that counts the cycles of each instruction.

Here is the link to my code: http://pulsar.webshaker.net/ccc/sample-115d4c29

It clearly marked the areas of concern, but I could not understand why those statements incur overhead.

The code is divided into 7 sections in the 'comment' area to make it easier to refer to.

Thanks in advance. :)

Anoop K. Prabhu
  • http://codereview.stackexchange.com/ ? – Jens Björnhager Dec 16 '11 at 14:03
  • It seems fairly self-explanatory if you read the key at the bottom - is it because it's in French that you are having difficulty understanding it, or is it that you're not familiar with the technical terms relating to pipelines, stalls, etc ? – Paul R Dec 16 '11 at 14:50
  • It's a problem with the technical terms: 'n.7-0 2c neon-a', 'n.43-0 2c n0 d16:7'. What is neon-a in the first case, and how can it take 7 half cycles in the latter? And what do 'red' and 'yellow' refer to? One thing I noticed is that the site is not that accurate: my optimized code profiles better than the sample, even though the tool reports more cycle overheads for it. Still, it's a worthwhile tool! – Anoop K. Prabhu Dec 16 '11 at 15:11
  • @JensBjörnhager: Yes, I was about to post there, but felt this is a much better place, as no topics or users related to ARM, Cortex, NEON etc. were found on codereview :) – Anoop K. Prabhu Dec 16 '11 at 15:13

1 Answer


You can try this link:

http://pulsar.webshaker.net/ccc/beta-sample-115d4c29

This uses beta version 0.9 of the cycle counter. The main difference is that the NEON simulator no longer uses 2 distinct pipelines, because the Cortex A9 cannot start 2 NEON instructions in the same cycle.

I have started to update some parts of the cycle counter.

The result is:

- The cycle information is more accurate for Cortex A9.

- The result is easier to read, because most of the NEON latency information was due to unpaired instructions.

Orange means latency due to waiting for a pipeline.

Red means latency due to a register conflict.

The number specified next to the register is not the number of lost cycles. It is the maximum number of instructions you could place before this instruction.
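The idea of "instructions you could place before this one" can be sketched with a toy model. This is not the cycle counter's actual algorithm; it assumes a single-issue, in-order unit, and the instruction strings, register names, and the 4-cycle result latency are illustrative assumptions chosen to mirror the VCGE case discussed in the comments.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only (hypothetical)
    dst: Optional[str]        # register written, if any
    srcs: Tuple[str, ...]     # registers read
    latency: int              # assumed cycles from issue until dst is readable


def schedule(program):
    """Toy single-issue, in-order model.

    Returns (text, issue_cycle, free_slots) per instruction, where
    free_slots is how many independent instructions could have been
    issued just before it without delaying it.
    """
    ready = {}       # register -> first cycle its value can be read
    next_slot = 0    # earliest cycle the next instruction may issue
    rows = []
    for ins in program:
        operands_ready = max((ready.get(r, 0) for r in ins.srcs), default=0)
        issue = max(next_slot, operands_ready)
        rows.append((ins.text, issue, issue - next_slot))
        if ins.dst is not None:
            ready[ins.dst] = issue + ins.latency
        next_slot = issue + 1
    return rows


producer = Instr("vcge.u8 q8, q0, q1", "q8", ("q0", "q1"), 4)
consumer = Instr("vbsl q8, q2, q3", "q8", ("q8", "q2", "q3"), 4)
fillers = [
    Instr("vadd.i8 q10, q4, q5", "q10", ("q4", "q5"), 4),
    Instr("vadd.i8 q11, q4, q5", "q11", ("q4", "q5"), 4),
    Instr("vadd.i8 q12, q4, q5", "q12", ("q4", "q5"), 4),
]

dependent = schedule([producer, consumer])
filled = schedule([producer, *fillers, consumer])

for row in dependent + filled:
    print(row)
```

In this model the consumer placed directly after the producer reports 3 free slots, and placing three independent instructions in between brings that to 0 without changing the cycle at which the consumer issues.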

I hope that helps!

webshaker
  • 467
  • 6
  • 17
  • Oh, I didn't know pipelining is not supported anymore. My BeagleBoard still supports pipelining and seems efficient when used wisely. The red color appeared in some situations where I couldn't see any register conflict, which kept me confused. The beta version is much clearer. Does "n.48-0 2c d16:3" mean 3 half cycles or complete cycles? I had noticed that 'vext' could execute in parallel with 'vld'; the old version of your calculator didn't seem to accept that. :) Your calculator was very useful for getting a good idea of NEON. Thanks a ton!! – Anoop K. Prabhu Dec 19 '11 at 09:17
  • One more question: what does "neon-a" stand for? – Anoop K. Prabhu Dec 19 '11 at 09:25
  • For d16:3: no, there are no half cycles anymore in the new version. 3 is the number of instructions you can put before the VLD. On Cortex A9, the best NEON performance you can get is 1 instruction per cycle, so 3 means 3 instructions or 3 cycles (as you prefer). – webshaker Dec 19 '11 at 09:32
  • neon-alu (neon-a) is a bug; it means that the ALU unit of NEON is in use. In this case the simulator should write 'n0' in orange rather than red, because it means that NEON has to wait for pipeline 0 due to the inability to pair the instruction with the previous one. I have some bugs to patch in the simulator, but I have no time at the moment. – webshaker Dec 19 '11 at 09:34
  • Last comment: pipelining is still supported on Cortex A9. What is no longer supported is dual-issue capability (on the NEON unit only). This is not the same thing! – webshaker Dec 19 '11 at 09:38
  • "vld" and "vadd" could take the same cycle. Isn't this what both dual instruction and pipelining refer to? – Anoop K. Prabhu Dec 19 '11 at 10:42
  • My last question: there seem to be 3 register conflicts and one pipeline latency in the beta calculator. Is there a chance for the 'yellow color' to occur there? Why is there a latency of 3 instructions in the last line of my code? A single instruction would do, right? – Anoop K. Prabhu Dec 19 '11 at 10:49
  • On the A8 you can START a VLD and a VADD in the same cycle. After that, both instructions execute in parallel (even if they need many cycles). On the A9 you can START only 1 instruction per cycle, but they can still execute in parallel if needed. Dual issue means STARTING 2 instructions in the same cycle. Pipelining means starting a new instruction while other (1, 2, 3, ... many more) instructions are still working. It's like a queue: one person can enter the queue every second, yet there can be 10 people in the queue! – webshaker Dec 19 '11 at 10:55
  • You have 3 cycles of latency because a VCGE can be issued every cycle, but its result is produced 4 cycles later. So the VLD must wait for q8 to be computed before it can use it. – webshaker Dec 19 '11 at 10:58
  • Read this post; it could help you: http://forums.arm.com/index.php?/topic/14646-does-cortex-a8-out-of-order/ – webshaker Dec 19 '11 at 11:00
  • Thanks for your time. Your replies were extremely useful to me. :) – Anoop K. Prabhu Dec 19 '11 at 15:25
  • The link in the comment does not work anymore. Use this instead: http://community.arm.com/message/7932#7932 – Anoop K. Prabhu Dec 12 '13 at 07:17
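
The distinction between dual issue and pipelining drawn in the comments can be illustrated with a toy count of how many instructions are executing in each cycle. This is only a sketch: it assumes every instruction is independent and has the same latency, and the function name, the 4-cycle latency, and the issue widths (2 for an A8-like pairing, 1 for an A9-like unit) are illustrative assumptions, not measured behaviour.

```python
def in_flight_per_cycle(n_instr, latency, issue_width):
    """Count instructions executing in each cycle of a toy pipeline.

    issue_width is how many instructions may START per cycle;
    latency is how many cycles each one stays in the pipeline.
    """
    # Cycle in which each instruction starts.
    starts = [i // issue_width for i in range(n_instr)]
    total_cycles = max(starts) + latency
    return [
        sum(1 for s in starts if s <= c < s + latency)
        for c in range(total_cycles)
    ]


single_issue = in_flight_per_cycle(n_instr=8, latency=4, issue_width=1)
dual_issue = in_flight_per_cycle(n_instr=8, latency=4, issue_width=2)

print("single issue:", single_issue, "cycles:", len(single_issue))
print("dual issue:  ", dual_issue, "cycles:", len(dual_issue))
```

With an issue width of 1, only one instruction starts per cycle, yet several are in flight at once: that is pipelining. Raising the issue width to 2 changes how many can start together, which is the dual-issue capability that the comments say the A9 NEON unit lacks.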