I wrote this very naive NEON implementation to convert from RGBA to RGB. It works but I was wondering if there was anything else I could do to further improve performances.
I tried playing around with the prefetching size and unrolling the loop a bit more but performances didn't change much. By the way, are there any rule of thumbs when it comes to dimension the prefetching? I couldn't find anything useful on the net. Furthermore in the "ARMv8 Instruction Set Overview" I see there's also a prefetch for store, how is that useful?
Currently I'm getting around 1.7ms to convert a 1280x720 image on an iPhone5s.
// unsigned int * rgba2rgb_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
_rgba2rgb_neon:
cmp w2, #0x7
b.gt loop
mov w0, #0
ret
loop:
prfm pldl1strm, [w1, #64]
ld4.8b {v0, v1, v2, v3}, [w1], #32
ld4.8b {v4, v5, v6, v7}, [w1], #32
prfm pldl1strm, [w1, #64]
st3.8b {v0, v1, v2}, [w0], #24
st3.8b {v4, v5, v6}, [w0], #24
subs w2, w2, #16
b.gt loop
done:
ret