
I have written a NEON-optimized box filter in assembler. It runs on an i.MX6 (Cortex-A9). I know about the memory bandwidth limitations of this machine, but they don't explain my observation:

My code (inline assembler)

    "loopSlide: \n\t"
    "vld1.16 {q0-q1}, [%[add]]! \n\t"
    "vld1.16 {q2-q3}, [%[add]]! \n\t"
    "vsra.u16 q6, q0, #5 \n\t"
    "vsra.u16 q7, q1, #5 \n\t"
    "vsra.u16 q8, q2, #5 \n\t"
    "vsra.u16 q9, q3, #5 \n\t"
    "vld1.16 {q0-q1}, [%[sub]]! \n\t"
    "vld1.16 {q2-q3}, [%[sub]]! \n\t"
    "vshr.u16 q0, q0, #5 \n\t"
    "vsub.u16 q6, q6, q0 \n\t"
    "vshr.u16 q1, q1, #5 \n\t"
    "vsub.u16 q7, q7, q1 \n\t"
    "vst1.16 {q6-q7}, [%[sub]]! \n\t"
    "vshr.u16 q2, q2, #5 \n\t"
    "vsub.u16 q8, q8, q2 \n\t"
    "vshr.u16 q3, q3, #5 \n\t"
    "vsub.u16 q9, q9, q3 \n\t"
    "vst1.16 {q8-q9}, [%[sub]]! \n\t"

    "add %[dst], %[dst], %[inc] \n\t"
    "pldw [%[dst]] \n\t"
    "add %[add], %[add], %[inc] \n\t"
    "add %[sub], %[sub], %[inc] \n\t"
    "cmp %[src], %[end] \n\t"
    "bne loopSlide \n\t"

takes 105 ms for the whole picture, which works out to 25 CPU cycles per instruction!
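
For completeness, the loop sits inside a GCC extended-asm statement roughly like the sketch below. The operand names match the `%[...]` references in the listing, but the variable names, constraints and clobber list are my reconstruction, not the original code:

    /* Reconstruction of the surrounding extended-asm statement; variable
       names, constraints and clobbers are assumptions, not the real code. */
    asm volatile(
        "loopSlide:                   \n\t"
        /* ... loop body exactly as shown above ... */
        "cmp  %[src], %[end]          \n\t"
        "bne  loopSlide               \n\t"
        : [add] "+r" (addPtr), [sub] "+r" (subPtr),
          [dst] "+r" (dstPtr), [src] "+r" (srcPtr)
        : [end] "r" (endPtr), [inc] "r" (lineInc)
        : "q0", "q1", "q2", "q3", "q6", "q7", "q8", "q9",
          "cc", "memory");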

With only the vst instructions removed, the loop runs in 9.5 ms, which matches my expectation based on the memory bandwidth.

Now I tried swapping the input and output buffers, and the same number of loads and stores took less than 17 ms! If anything, I would have expected a difference the other way around, because the input buffer had been written to shortly before, so it might still be in the L2 cache and could be read faster. Instead, it is six times faster to read the uncached data and store to the cached buffer ...

Both buffers are 512-bit aligned and reside in the same memory region, with the same cache policy.
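
For illustration, the allocation looks conceptually like the sketch below (512 bits = 64 bytes); the function name and size parameter are placeholders, not the actual allocation code:

    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative only: both buffers get 512-bit (64-byte) alignment,
       as stated above; names and sizes are placeholders. */
    static uint16_t *alloc_aligned_u16(size_t elements)
    {
        void *p = NULL;
        if (posix_memalign(&p, 64, elements * sizeof(uint16_t)) != 0)
            return NULL;
        return (uint16_t *)p;
    }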

Do you have any idea what could cause this, or what I could try in order to examine it further?

Philippos
  • Firstly, these are not "commands", but are CPU instructions. There are many aspects to NEON optimization that you're not taking into account. As you have observed, the biggest obstacle to performance is memory latency. You need to 'hide' the memory loading/storing latency as much as possible. Things to do: use the pld instruction (properly) to preload your data in cache, unroll your loop to allow time for data to load/store, interleave instructions so that register/addresses don't depend on recent instructions. – BitBank Mar 16 '17 at 11:12
  • @BitBank Thank you! I changed "command" to "instruction"; sorry for my English! And I added a new observation: I swapped the input and output buffers completely: 17 ms!! While your hints are absolutely correct, wouldn't they only speed up the reading? But my problem is writing, not reading. How can pld speed up writing? And why does it depend on which buffer I use for reading and which for writing? – Philippos Mar 16 '17 at 11:28
  • I see there is also a `pldw`; I didn't know that. I'll give it a try, although my understanding of how caches work doesn't suggest this is promising. – Philippos Mar 16 '17 at 11:46
  • I added a `pldw` for the buffer position that gets written in the following loop iteration, but I couldn't see any difference. Further unrolling helps a tiny bit (around 2%). Thank you anyhow. – Philippos Mar 16 '17 at 12:18
  • Need to see your use of pld - show your updated code. You need to preload ahead of where you're reading (e.g. pld [addr].. vld1 xx,[addr] won't have any effect). – BitBank Mar 16 '17 at 12:25
  • @BitBank I edited the question to reflect where I added the prefetch for writing: it's almost the whole loop before the next write. – Philippos Mar 16 '17 at 12:30
  • @BitBank Meanwhile I've done a lot of tests with `pldw` for different cases, and in every case execution is slower than without it: sometimes almost equal, sometimes awfully slower. And I think I know why. According to the documentation, "_Any linefill started by a PLDW instruction causes the data to be invalidated in other processors, so that the line is ready to be written to_". So `pldw` causes additional memory accesses to regions I don't need to read, slowing down the loading of the input buffer. And since I stay on one core, the cache invalidation is useless, too. – Philippos Mar 21 '17 at 08:30
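
To illustrate the preload-ahead pattern suggested in the comments above (prefetching data well before it is loaded, instead of right at the load), here is a minimal sketch in the style of the listing; the loop consumes 64 bytes from %[add] per iteration (plus the line increment), so the #128 offset prefetches roughly two iterations ahead, and that distance is an assumption that would have to be tuned on the actual i.MX6:

    "loopSlide: \n\t"
    /* prefetch the add-line data that will be loaded about two iterations
       from now; the #128 look-ahead distance is a guess and needs tuning */
    "pld [%[add], #128] \n\t"
    "vld1.16 {q0-q1}, [%[add]]! \n\t"
    "vld1.16 {q2-q3}, [%[add]]! \n\t"
    /* ... rest of the loop body as above ... */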

0 Answers