Optimize RGBA->RGB arm64 assembly

Question

I wrote this very naive NEON implementation to convert from RGBA to RGB. It works, but I was wondering whether there is anything else I could do to further improve performance.

I tried playing around with the prefetch distance and unrolling the loop a bit more, but performance didn't change much. By the way, are there any rules of thumb for sizing the prefetch? I couldn't find anything useful on the net. Furthermore, in the "ARMv8 Instruction Set Overview" I see there's also a prefetch for store; how is that useful?

Currently it takes around 1.7 ms to convert a 1280x720 image on an iPhone 5s.

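// unsigned int * rgba2rgb_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
// x0 = pDst, x1 = pSrc, w2 = count (pixels). Each iteration converts 16 pixels,
// so count should be a multiple of 16 to avoid running past the buffers.
_rgba2rgb_neon:
    cmp     w2, #0x7
    b.gt    loop

    mov     x0, #0                          // too few pixels: return NULL
    ret

loop:
    prfm    pldl1strm, [x1, #64]            // prefetch-for-load hint, streaming

    ld4.8b  {v0, v1, v2, v3}, [x1], #32     // deinterleave 8 pixels into R, G, B, A
    ld4.8b  {v4, v5, v6, v7}, [x1], #32

    prfm    pldl1strm, [x1, #64]

    st3.8b  {v0, v1, v2}, [x0], #24         // reinterleave R, G, B; alpha is dropped
    st3.8b  {v4, v5, v6}, [x0], #24

    subs    w2, w2, #16
    b.gt    loop

done:
    ret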

Stephen Canon · Accepted Answer · score 6 · 2013-12-18T16:45:48.550

First (since I assume you’re targeting iOS), vImage (part of the Accelerate.framework) provides this conversion for you, as vImageConvert_RGBA8888toRGB888. This has the advantage of being available on all iOS and OS X systems, so you don’t need to write separate implementations for arm64, armv7s, armv7, i386, x86_64.
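
For illustration, a call to that function might look roughly like the sketch below. The wrapper name `rgba_to_rgb_vimage` and the assumption of tightly packed rows (no padding) are mine, not part of the answer; the non-Apple branch is only a scalar stand-in so the sketch builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical wrapper: converts a tightly packed width x height RGBA8888
 * image to RGB888. Returns 0 on success, -1 on failure. */
int rgba_to_rgb_vimage(uint8_t *dst, const uint8_t *src,
                       size_t width, size_t height)
{
    assert(dst != NULL && src != NULL);
#ifdef __APPLE__
    /* Assumption: rows are not padded, so rowBytes is width * bytes-per-pixel. */
    vImage_Buffer in  = { .data = (void *)src, .height = height,
                          .width = width, .rowBytes = width * 4 };
    vImage_Buffer out = { .data = dst, .height = height,
                          .width = width, .rowBytes = width * 3 };
    vImage_Error err = vImageConvert_RGBA8888toRGB888(&in, &out, kvImageNoFlags);
    return err == kvImageNoError ? 0 : -1;
#else
    /* Scalar stand-in for platforms without Accelerate. */
    for (size_t i = 0; i < width * height; i++) {
        dst[3 * i + 0] = src[4 * i + 0];  /* R */
        dst[3 * i + 1] = src[4 * i + 1];  /* G */
        dst[3 * i + 2] = src[4 * i + 2];  /* B; alpha is dropped */
    }
    return 0;
#endif
}
```

On Apple platforms this needs to be linked against the Accelerate framework (for example with `-framework Accelerate`).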

Now, it may be that you’re writing this conversion as an exercise yourself, and not because you simply didn’t know that one was already available. In that case:

- Avoid using ld[34] or st[34]. They are convenient but generally slower than using ld1 and a permute.
- For completely regular data access patterns like this, manual prefetch isn’t necessary.
- Load four 16-byte RGBA vectors with ld1.16b, extract three 16-byte RGB vectors from them with three tbl.16b instructions, and store them with st1.16b.
- Alternatively, try using non-temporal loads and stores (ldnp/stnp), as your image size is too large to fit in the caches.
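
The ld1/tbl/st1 suggestion can be sketched with C NEON intrinsics instead of hand-written assembly. This is one possible reading, not the answer's own code: it assumes each tbl uses a two-register table made of adjacent input vectors, and the names `rgba_to_rgb_tbl` and `kIdx` are hypothetical. The scalar branch applies the same index tables so the sketch also builds on non-arm64 machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Byte indices into a 32-byte table formed by two adjacent RGBA vectors.
 * Each row picks the 16 RGB bytes of one output vector, skipping alpha. */
static const uint8_t kIdx[3][16] = {
    { 0,  1,  2,  4,  5,  6,  8,  9, 10, 12, 13, 14, 16, 17, 18, 20},
    { 5,  6,  8,  9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 22, 24, 25},
    {10, 12, 13, 14, 16, 17, 18, 20, 21, 22, 24, 25, 26, 28, 29, 30},
};

/* Converts count RGBA pixels to RGB, 16 pixels (64 -> 48 bytes) per step. */
void rgba_to_rgb_tbl(uint8_t *dst, const uint8_t *src, size_t count)
{
    assert(count % 16 == 0);  /* assumption: no tail handling in this sketch */
#if defined(__aarch64__)
    const uint8x16_t idx0 = vld1q_u8(kIdx[0]);
    const uint8x16_t idx1 = vld1q_u8(kIdx[1]);
    const uint8x16_t idx2 = vld1q_u8(kIdx[2]);
#endif
    for (size_t i = 0; i < count; i += 16, src += 64, dst += 48) {
#if defined(__aarch64__)
        /* Four plain 16-byte loads (ld1). */
        uint8x16_t in0 = vld1q_u8(src);
        uint8x16_t in1 = vld1q_u8(src + 16);
        uint8x16_t in2 = vld1q_u8(src + 32);
        uint8x16_t in3 = vld1q_u8(src + 48);
        uint8x16x2_t t01 = {{in0, in1}};
        uint8x16x2_t t12 = {{in1, in2}};
        uint8x16x2_t t23 = {{in2, in3}};
        /* Three table lookups (tbl) and three plain stores (st1). */
        vst1q_u8(dst,      vqtbl2q_u8(t01, idx0));
        vst1q_u8(dst + 16, vqtbl2q_u8(t12, idx1));
        vst1q_u8(dst + 32, vqtbl2q_u8(t23, idx2));
#else
        /* Scalar equivalent: output vector v reads from src + 16 * v. */
        for (int v = 0; v < 3; v++)
            for (int b = 0; b < 16; b++)
                dst[16 * v + b] = src[16 * v + kIdx[v][b]];
#endif
    }
}
```

Whether this actually beats the ld4/st3 loop depends on the core, so it is worth timing both on the target device.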

Finally, to answer your question: a prefetch hint for stores is primarily useful because some implementations might have a significant stall for a partial line write that misses cache. Especially simple implementations might have a stall for any write that misses cache.
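
As a rough illustration of what a store prefetch hint looks like from C, the sketch below uses the GCC/Clang builtin `__builtin_prefetch` with its second argument set to 1 (prepare for write). The builtin, the function name `rgba_to_rgb_prefetch`, and the `PREFETCH_AHEAD` distance are my assumptions for illustration, not something the answer prescribes; on arm64 compilers typically lower such a hint to a `prfm` with a store-type operation, but that is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in bytes; tune it by measurement. */
#define PREFETCH_AHEAD 256

/* Scalar RGBA -> RGB conversion that hints upcoming destination lines. */
void rgba_to_rgb_prefetch(uint8_t *dst, const uint8_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        /* Roughly one hint per 48 output bytes, kept inside the buffer. */
        if ((i & 15) == 0 && 3 * i + PREFETCH_AHEAD < 3 * count) {
            /* rw = 1: the line will be written; locality = 0: streaming. */
            __builtin_prefetch(dst + 3 * i + PREFETCH_AHEAD, 1, 0);
        }
        dst[3 * i + 0] = src[4 * i + 0];  /* R */
        dst[3 * i + 1] = src[4 * i + 1];  /* G */
        dst[3 * i + 2] = src[4 * i + 2];  /* B; alpha is dropped */
    }
}
```

The hint never changes the result, only (possibly) the timing, so any benefit has to be confirmed by measurement on the target core.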

edited Dec 18 '13 at 16:45

answered Dec 18 '13 at 16:31

Stephen Canon


Testing on the iPad Mini Retina (Apple A7 processor) seems to indicate that the non-temporal hint for storing has no effect on performance. – BitBank Jan 16 '15 at 14:10
@BitBank: the conditions under which non-temporal stores benefit performance are somewhat tricky to characterize. It's important to keep in mind that one of their biggest benefits is that they avoid allocating into the inner cache, which means that their impact is sometimes only seen in the code that surrounds the loop that was modified to use them. My guidance is really "try them, measure whole program performance, and if they give an improvement, use them". – Stephen Canon Jan 16 '15 at 16:36
I came to this conclusion by testing a function which writes to an image buffer bigger than L2 cache. The data is only written and not referenced again until later. This seemed like the ideal case to try the "streaming" version of the store instruction. I need to test this on the Nvidia K1 Denver to see if the behavior is different from the Apple A7. Update soon... – BitBank Jan 16 '15 at 16:43

score 2 · Answer 2 · answered Jan 13 '14 at 22:59

See also vImageFlatten_RGBA8888toRGB888 if you want something interesting done with the alpha channel besides chucking it over your shoulder.

answered Jan 13 '14 at 22:59

Ian Ollmann
