I wrote some NEON code in assembly, aiming for maximum performance. The numbers seem satisfactory, but I wanted to understand whether it could be optimized further. I then came across an online tool that counts the cycles of each instruction.
Here is the link to my code: http://pulsar.webshaker.net/ccc/sample-115d4c29
The tool clearly marked the areas of concern, but I could not understand why those instructions incur the overhead.
The code is divided into 7 sections, labeled in the 'comment' area, to make them easier to refer to.
Thanks in advance. :)