I have written a NEON-optimized box filter in assembler. It is running on an i.MX6 (Cortex-A9). I know about the memory bandwidth limitations of this machine, but that doesn't explain my observation:
My code (inline assembler):
"loopSlide: \n\t"
"vld1.16 {q0-q1}, [%[add]]! \n\t" // load 64 bytes of incoming samples
"vld1.16 {q2-q3}, [%[add]]! \n\t"
"vsra.u16 q6, q0, #5 \n\t" // accumulators q6-q9 += incoming >> 5
"vsra.u16 q7, q1, #5 \n\t"
"vsra.u16 q8, q2, #5 \n\t"
"vsra.u16 q9, q3, #5 \n\t"
"vld1.16 {q0-q1}, [%[sub]]! \n\t" // load 64 bytes of outgoing samples
"vld1.16 {q2-q3}, [%[sub]]! \n\t"
"vshr.u16 q0, q0, #5 \n\t" // accumulators -= outgoing >> 5
"vsub.u16 q6, q6, q0 \n\t"
"vshr.u16 q1, q1, #5 \n\t"
"vsub.u16 q7, q7, q1 \n\t"
"vst1.16 {q6-q7}, [%[sub]]! \n\t" // store first 32 result bytes
"vshr.u16 q2, q2, #5 \n\t"
"vsub.u16 q8, q8, q2 \n\t"
"vshr.u16 q3, q3, #5 \n\t"
"vsub.u16 q9, q9, q3 \n\t"
"vst1.16 {q8-q9}, [%[sub]]! \n\t" // store second 32 result bytes
"add %[dst], %[dst], %[inc] \n\t"
"pldw [%[dst]] \n\t" // preload next destination line for writing
"add %[add], %[add], %[inc] \n\t" // advance pointers to the next row
"add %[sub], %[sub], %[inc] \n\t"
"cmp %[src], %[end] \n\t" // loop until the source pointer reaches the end
"bne loopSlide \n\t"
This takes 105 ms for the whole picture, which works out to 25 CPU cycles per instruction!
Removing only the vst instructions speeds the algorithm up to 9.5 ms, which matches my expectation based on the memory bandwidth.
Now I tried exchanging the input and output buffers, and the same number of loads and stores took less than 17 ms! If anything, I would have expected a difference the other way around: the input buffer had been written to shortly before, so it might still be in the L2 cache and be read faster. Instead it is about six times faster to read the uncached data and store to the cached buffer...
Both buffers are 512-bit aligned and reside in the same memory region, with the same cache policy.
Do you have any idea what could be causing this, or what I could try in order to examine it further?