accelerate rgb planar to rgba interleaved conversion using sse or mmx

Question

I have to pass medical image data retrieved from one proprietary device SDK to an image processing function in another - also proprietary - device SDK from a second vendor.

The first function gives me the image in a planar rgb format:

int mrcpgk_retrieve_frame(uint16_t *r, uint16_t *g, uint16_t *b, int w, int h);

The reason for uint16_t is that the device can be switched to output each color value encoded as 16-bit floating point values. However, I'm operating in "byte mode" and thus the upper 8 bits of each color value are always zero.

The second function from another device SDK is defined like this:

BOOL process_cpgk_image(const PBYTE rgba, DWORD width, DWORD height);

So we get three filled buffers with the following bit layout (16-bit planar RGB):

R: 00000000 rrrrrrrr  00000000 rrrrrrrr ...
G: 00000000 gggggggg  00000000 gggggggg ...
B: 00000000 bbbbbbbb  00000000 bbbbbbbb ...

And the desired output illustrated in bits is:

RGBA: rrrrrrrrggggggggbbbbbbbb00000000 rrrrrrrrggggggggbbbbbbbb00000000 ....

We don't have access to the source code of these functions and cannot change the environment. Currently we have implemented the following basic "bridge" to connect the two devices:

void process_frames(int width, int height)
{
    uint16_t *r = (uint16_t*)malloc(width*height*sizeof(uint16_t));
    uint16_t *g = (uint16_t*)malloc(width*height*sizeof(uint16_t));
    uint16_t *b = (uint16_t*)malloc(width*height*sizeof(uint16_t));
    uint8_t *rgba = (uint8_t*)malloc(width*height*4);
    int i;

    memset(rgba, 0, width*height*4);

    while ( mrcpgk_retrieve_frame(r, g, b, width, height) != 0 )
    {
        for (i=0; i<width*height; i++)
        {
            rgba[4*i+0] = (uint8_t)r[i];
            rgba[4*i+1] = (uint8_t)g[i];
            rgba[4*i+2] = (uint8_t)b[i];
        }

        process_cpgk_image(rgba, width, height);
    }
    free(r);
    free(g);
    free(b);
    free(rgba);
}

This code works perfectly fine but processing takes very long for many thousands of high resolution images. The two functions for processing and retrieving are very fast and our bridge is currently the bottleneck.

I know how to do basic arithmetic, logical and shifting operations with SSE2 intrinsics but I wonder if and how this 16bit planar rgb to packed rgba conversion can be accelerated with MMX, SSE2 or [S]SSE3?

(SSE2 would be preferable because there are still some pre-2005 appliances in use).

Are you really limited to SSE2 ? If you can use SSSE3 (which has been standard on Intel CPUs for at least 7 years now) then the byte shuffling is going to be a lot easier. — Paul R, Oct 21 '13 at 17:54
Not necessarily. Edited the question to also accept [S]SSE3. However SSE2 would be preferred. — Veterinarian, Oct 21 '13 at 19:54

Paul R · Accepted Answer · 2013-10-21T21:06:05.497

3

Here is a simple SSE2 implementation:

#include <assert.h>
#include <emmintrin.h>            // SSE2 intrinsics

assert((width*height)%8 == 0);    // NB: total pixels must be multiple of 8
                                  // NB: _mm_load_si128/_mm_store_si128 need 16-byte aligned buffers

for (i=0; i<width*height; i+=8)
{
    __m128i vr = _mm_load_si128((__m128i *)&r[i]);    // load 8 pixels from r[i]
    __m128i vg = _mm_load_si128((__m128i *)&g[i]);    // load 8 pixels from g[i]
    __m128i vb = _mm_load_si128((__m128i *)&b[i]);    // load 8 pixels from b[i]
    __m128i vrg = _mm_or_si128(vr, _mm_slli_epi16(vg, 8));
                                                      // merge r/g
    __m128i vrgba = _mm_unpacklo_epi16(vrg, vb);      // permute first 4 pixels
    _mm_store_si128((__m128i *)&rgba[4*i], vrgba);    // store first 4 pixels to rgba[4*i]
    vrgba = _mm_unpackhi_epi16(vrg, vb);              // permute second 4 pixels
    _mm_store_si128((__m128i *)&rgba[4*i+16], vrgba); // store second 4 pixels to rgba[4*i+16]
}
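
The loop above assumes that the pixel count is a multiple of 8 and that all four buffers are 16-byte aligned, which plain malloc does not guarantee on every platform. A minimal sketch of a more defensive variant follows. It assumes that unaligned loads and stores are acceptable and that the upper byte of every input value is zero, as stated in the question. The helper name `planar16_to_rgba8_sse2` is made up for illustration and is not part of either SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper (illustrative name): converts n pixels of 16-bit planar
 * R/G/B, upper byte assumed zero, to packed 8-bit RGBA with A = 0. */
static void planar16_to_rgba8_sse2(const uint16_t *r, const uint16_t *g,
                                   const uint16_t *b, uint8_t *rgba, size_t n)
{
    size_t i = 0;
    assert(r && g && b && rgba);

    for (; i + 8 <= n; i += 8) {
        /* Unaligned loads: no alignment requirement on the inputs. */
        __m128i vr = _mm_loadu_si128((const __m128i *)(r + i));
        __m128i vg = _mm_loadu_si128((const __m128i *)(g + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Each 16-bit element becomes (g << 8) | r, i.e. bytes r,g in memory. */
        __m128i vrg = _mm_or_si128(vr, _mm_slli_epi16(vg, 8));
        /* Interleave the rg words with the b words: bytes r,g,b,0 per pixel. */
        _mm_storeu_si128((__m128i *)(rgba + 4 * i),
                         _mm_unpacklo_epi16(vrg, vb));
        _mm_storeu_si128((__m128i *)(rgba + 4 * i + 16),
                         _mm_unpackhi_epi16(vrg, vb));
    }
    /* Scalar tail for the remaining 0..7 pixels. */
    for (; i < n; i++) {
        rgba[4 * i + 0] = (uint8_t)r[i];
        rgba[4 * i + 1] = (uint8_t)g[i];
        rgba[4 * i + 2] = (uint8_t)b[i];
        rgba[4 * i + 3] = 0;
    }
}
```

In the bridge from the question this would replace the inner for loop, called as `planar16_to_rgba8_sse2(r, g, b, rgba, (size_t)width * height)`. Since the vector path writes the alpha byte itself, the initial memset is then no longer needed.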

edited Oct 21 '13 at 21:06

answered Oct 21 '13 at 17:59


That's straight forward. As you wrote that permutation step is probably tricky and that is where I have failed with my limited simd experience. SSE2/MMX would be preferred if it is really doable (and if it makes any sense of course in terms of amount of cycles) as we still have some pre-2005 appliances in use. However, any speedup using a SSSE3 solution is very welcome. – Veterinarian Oct 21 '13 at 20:02

OK - SSE2 was actually easier than I expected - the above code is tested and seems to work OK. – Paul R Oct 21 '13 at 21:00
Very nice indeed! Quick test program confirmed that it works. Will test performance improvement when I have access to the equipment. – Veterinarian Oct 21 '13 at 22:06
I did some reading in order to fully understand your solution and came across possible issues with caching. Is there a specific reason why you used _mm_store_si128 and not _mm_stream_si128 ? And what about using _mm_prefetch on the r, g and b buffers ? – Veterinarian Oct 21 '13 at 22:18
I doubt that `_mm_prefetch` will make any difference. `_mm_stream_si128` *might* help, but since you have a very simple access pattern I would expect it to have no effect, at least on modern CPUs. Go ahead and experiment, but note that the results may be CPU-dependent. – Paul R Oct 22 '13 at 05:42
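
For anyone who wants to run the experiment suggested in these comments, here is a rough sketch of the same loop with non-temporal stores. It assumes the destination is 16-byte aligned (which `_mm_stream_si128` requires) and the pixel count is a multiple of 8. The function name `planar16_to_rgba8_stream` is hypothetical. Whether this is faster than regular stores has to be measured on the target hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics, also provides _mm_sfence */

/* Hypothetical variant (illustrative name) that writes the output with
 * non-temporal stores, hinting that the data should bypass the cache. */
static void planar16_to_rgba8_stream(const uint16_t *r, const uint16_t *g,
                                     const uint16_t *b, uint8_t *rgba, size_t n)
{
    assert(n % 8 == 0);                     /* no tail handling in this sketch */
    assert(((uintptr_t)rgba & 15) == 0);    /* streaming stores need alignment */

    for (size_t i = 0; i < n; i += 8) {
        __m128i vr = _mm_loadu_si128((const __m128i *)(r + i));
        __m128i vg = _mm_loadu_si128((const __m128i *)(g + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vrg = _mm_or_si128(vr, _mm_slli_epi16(vg, 8));
        _mm_stream_si128((__m128i *)(rgba + 4 * i),
                         _mm_unpacklo_epi16(vrg, vb));
        _mm_stream_si128((__m128i *)(rgba + 4 * i + 16),
                         _mm_unpackhi_epi16(vrg, vb));
    }
    /* Make the streamed data globally visible before the consumer reads it. */
    _mm_sfence();
}
```

Timing both variants over a few hundred real frames on each class of appliance is probably the only reliable way to decide between them.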

score 2 · Answer 2 · answered Nov 06 '13 at 07:28

Reference implementation using AVX2 instructions:

#include <assert.h>
#include <stdint.h>               // uintptr_t
#include <immintrin.h>            // AVX2 intrinsics

assert((width*height)%16 == 0);   // total pixel count must be a multiple of 16
// all pointers must be 32-byte aligned
assert((uintptr_t)r%32 == 0 && (uintptr_t)g%32 == 0 && (uintptr_t)b%32 == 0 && (uintptr_t)rgba%32 == 0);

for (i=0; i<width*height; i+=16)
{
    // 0xD8 swaps the two middle 64-bit quarters so that the per-lane unpacks below emit pixels in order
    __m256i vr = _mm256_permute4x64_epi64(_mm256_load_si256((__m256i *)(r + i)), 0xD8);    // load 16 pixels from r[i]
    __m256i vg = _mm256_permute4x64_epi64(_mm256_load_si256((__m256i *)(g + i)), 0xD8);    // load 16 pixels from g[i]
    __m256i vb = _mm256_permute4x64_epi64(_mm256_load_si256((__m256i *)(b + i)), 0xD8);    // load 16 pixels from b[i]
    __m256i vrg = _mm256_or_si256(vr, _mm256_slli_si256(vg, 1)); // merge r/g (byte shift works because the upper bytes are zero)
    __m256i vrgba = _mm256_unpacklo_epi16(vrg, vb);        // permute first 8 pixels
    _mm256_store_si256((__m256i *)(rgba + 4*i), vrgba);    // store first 8 pixels to rgba[4*i]
    vrgba = _mm256_unpackhi_epi16(vrg, vb);                // permute second 8 pixels
    _mm256_store_si256((__m256i *)(rgba + 4*i+32), vrgba); // store second 8 pixels to rgba[4*i + 32]
}
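
The 32-byte alignment that the assertion above demands is not something plain malloc promises. One possible way to obtain suitably aligned buffers is `_mm_malloc` / `_mm_free`, which the common x86 compilers make available through `<xmmintrin.h>`. The struct and function names below are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

/* Hypothetical container for one frame's buffers (illustrative names). */
typedef struct {
    uint16_t *r, *g, *b;
    uint8_t  *rgba;
} frame_buffers;

static void frame_buffers_free(frame_buffers *fb)
{
    if (fb->r)    _mm_free(fb->r);
    if (fb->g)    _mm_free(fb->g);
    if (fb->b)    _mm_free(fb->b);
    if (fb->rgba) _mm_free(fb->rgba);
    fb->r = fb->g = fb->b = NULL;
    fb->rgba = NULL;
}

/* Allocates all four buffers with 32-byte alignment; returns 1 on success. */
static int frame_buffers_alloc(frame_buffers *fb, size_t pixels)
{
    fb->r    = (uint16_t *)_mm_malloc(pixels * sizeof(uint16_t), 32);
    fb->g    = (uint16_t *)_mm_malloc(pixels * sizeof(uint16_t), 32);
    fb->b    = (uint16_t *)_mm_malloc(pixels * sizeof(uint16_t), 32);
    fb->rgba = (uint8_t  *)_mm_malloc(pixels * 4, 32);
    if (!fb->r || !fb->g || !fb->b || !fb->rgba) {
        frame_buffers_free(fb);
        return 0;
    }
    return 1;
}
```

Memory obtained this way must be released with `_mm_free`, not `free`. The same buffers also satisfy the 16-byte requirement of aligned SSE2 loads and stores.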
