De-interleave image channel in SSE 16 bit vectors

Question

I have a 32 bpp image. I need to de-interleave the R, G, B color channels into separate 16-bit vectors. I am using the following code to do that (from "how to deinterleave image channel in SSE"):

  // de-interleave channels R, G, B, A into 16-bit vectors
  {
     __m128i vrgba = _mm_loadu_si128((__m128i *)(pSrc));
     __m128i vr1 = _mm_and_si128(vrgba, _mm_set1_epi32(0xff));
     __m128i vg1 = _mm_and_si128(_mm_srli_epi32(vrgba, 8), _mm_set1_epi32(0xff));
     __m128i vb1 = _mm_and_si128(_mm_srli_epi32(vrgba, 16), _mm_set1_epi32(0xff));
     __m128i va1 = _mm_srli_epi32(vrgba, 24);

     vrgba = _mm_loadu_si128((__m128i *)(pSrc + 4));  // since pSrc is uint32_t type
     __m128i vr2 = _mm_and_si128(vrgba, _mm_set1_epi32(0xff));
     __m128i vg2 = _mm_and_si128(_mm_srli_epi32(vrgba, 8), _mm_set1_epi32(0xff));
     __m128i vb2 = _mm_and_si128(_mm_srli_epi32(vrgba, 16), _mm_set1_epi32(0xff));
     __m128i va2 = _mm_srli_epi32(vrgba, 24);

     vr = _mm_packs_epi32(vr1, vr2);
     vg = _mm_packs_epi32(vg1, vg2);
     vb = _mm_packs_epi32(vb1, vb2);
     va = _mm_packs_epi32(va1, va2);
  }

Can we make this more efficient? Below is the code for the Gaussian filter without de-interleaving the channels. I am finding it terribly inefficient:

    static inline void ConvertTo16Bits(__m128i& v1, __m128i& v2, const __m128i& v0)
    {
        __m128i const zero = _mm_setzero_si128();
        v1 = _mm_unpacklo_epi8(v0, zero);
        v2 = _mm_unpackhi_epi8(v0, zero);
    }

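    // Widening unsigned multiply: 8 x uint16_t * 8 x uint16_t -> two vectors of 4 x uint32_t.
    static inline void mul32bits(__m128i &lo, __m128i &hi,           // output - products 0..3 and 4..7
        const __m128i& v0, const __m128i& v1) // input  - 2x8xuint16_t
    {
        const __m128i vhi = _mm_mulhi_epu16(v0, v1);  // upper 16 bits of each product
        const __m128i vlo = _mm_mullo_epi16(v0, v1);  // lower 16 bits of each product
        lo = _mm_unpacklo_epi16(vlo, vhi);
        hi = _mm_unpackhi_epi16(vlo, vhi);
    }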

    struct Pixel
    {
        unsigned char r;
        unsigned char g;
        unsigned char b;
        unsigned char a;
    };

    void computePixelvalue(unsigned int * pixelArray, int count, unsigned short * gaussArray, Pixel& out)
    {
        __m128i sumRGBA;
        sumRGBA = _mm_set1_epi32(0);
        unsigned int countMod4 = count % 4;
        unsigned int b, g, r, a;
        constexpr int shuffle = _MM_SHUFFLE(3, 1, 0, 0);

        while (count >= 4)
        {
            __m128i vrgba = _mm_loadu_si128((__m128i *)(pixelArray));
            __m128i rgba12, rgba34;

            ConvertTo16Bits(rgba12, rgba34, vrgba);

            unsigned short s1 = *gaussArray++;
            unsigned short s2 = *gaussArray++;

            __m128i shift8 = _mm_set1_epi16(s1);
            __m128i shift16 = _mm_set1_epi16(s2);
            __m128i gaussVector = _mm_shuffle_epi32(_mm_unpacklo_epi32(shift8, shift16), shuffle);

            __m128i multl, multh;
            mul32bits(multl, multh, rgba12, gaussVector);
            sumRGBA = _mm_add_epi32(sumRGBA, multl);
            sumRGBA = _mm_add_epi32(sumRGBA, multh);

            s1 = *gaussArray++;
            s2 = *gaussArray++;
            shift8 = _mm_set1_epi16(s1);
            shift16 = _mm_set1_epi16(s2);
            gaussVector = _mm_shuffle_epi32(_mm_unpacklo_epi32(shift8, shift16), shuffle);

            mul32bits(multl, multh, rgba34, gaussVector);
            sumRGBA = _mm_add_epi32(sumRGBA, multl);
            sumRGBA = _mm_add_epi32(sumRGBA, multh);

            count = count - 4;
            pixelArray = pixelArray + 4;
        }

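        // Portable extraction of the four sums (m128i_u32 is MSVC-specific).
        unsigned int sums[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(sums), sumRGBA);
        r = sums[0];
        g = sums[1];
        b = sums[2];
        a = sums[3];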

        // Scalar cleanup for the remaining (count % 4) pixels. The byte pointer
        // is set up once so that each iteration advances to the next pixel.
        auto pixelArrayByte = reinterpret_cast<unsigned char*>(pixelArray);
        while (countMod4)
        {
            unsigned short k = *gaussArray++;
            r += *pixelArrayByte++ * k;
            g += *pixelArrayByte++ * k;
            b += *pixelArrayByte++ * k;
            a += *pixelArrayByte++ * k;

            countMod4--;
        }

        out.r = static_cast<unsigned char>(r >> 15);
        out.g = static_cast<unsigned char>(g >> 15);
        out.b = static_cast<unsigned char>(b >> 15);
        out.a = static_cast<unsigned char>(a >> 15);
    }

If you can, store your data in planar format in the first place. — Peter Cordes, Mar 09 '16 at 17:21
From your previous questions it seems you just want to apply a Gaussian filter, so there shouldn't really be any need to separate the RGBA components out like this. You can just apply the filters to the data in its original format, no ? — Paul R, Mar 09 '16 at 17:23
Separation makes accumulation easy, doesn't it? Once I de-interleave, I can just use something like Rsum = Rsum + Rvector (16-bit) * gaussVector (16-bit), and finally I can place it in the destination location as Rsum[0] + Rsum[1] + Rsum[2]. Obviously the other way would be to store the data in planar format in the first place, but I guess I can afford a copy here due to limited memory — Bharat Ahuja, Mar 09 '16 at 17:27
The accumulation is much the same whether you de-interleave or not - the only slight advantage with de-interleaved data is that the horizontal shifts are smaller when you're applying the filter in the X axis. Performance-wise I suspect you'd be better off just working with interleaved data, and the code should be much simpler too. — Paul R, Mar 10 '16 at 08:47
Actually I did that initially, but my de-interleaving code significantly outperforms the interleaved code. Maybe you want to take a look at that? — Bharat Ahuja, Mar 10 '16 at 10:56
Try using `_mm_set_epi16` instead of your horrible stuff with arrays and undefined behaviour (`*pGauss, *pGauss++, ...`. The commas aren't sequence points.) I'm not surprised at all that the second block of code you posted is slow. It's not compilable, so I can't see exactly how bad it is. Actually, you might need to load + shuffle yourself for best results, if your compiler doesn't see the pattern in the `_mm_set_epi16`. And why don't you use a vector store for the result? The little `while` loop at the end looks nasty, too. — Peter Cordes, Mar 10 '16 at 12:01
I tried using _mm_set_epi16, but I guess it's still slow: roughly 5 ms for a 300 * 300 image, compared with the de-interleaved version. The reason I was trying to optimize the de-interleaved version is that I have written inline assembly for MMX, and my SSE code has not been able to beat that code even once, so I think there is a lot to be improved. The while loop at the end looks nasty? How? I have updated the code snippet so that it compiles; maybe you can now see how bad it is? — Bharat Ahuja, Mar 10 '16 at 18:26
And thanks for helping out so much. I never did this and I hated assembly in my grad days, but I guess I am learning, very slowly — Bharat Ahuja, Mar 10 '16 at 18:28
I am also using the vector sumRGBA for storing the sum, but not using a vector for the final pixel value, since I want to process pixels separately. That helps for an alpha-only blur, where we can skip pixels. — Bharat Ahuja, Mar 10 '16 at 18:54
The final `while` loop is nasty because it's not vectorized. It's totally dependent on the compiler to auto-vectorize those multiplies and do a vector store. Moving data between integer registers and vector elements is very slow compared to vertical vector ops. — Peter Cordes, Mar 10 '16 at 22:12
oh, that final while loop is the scalar cleanup loop, I think. So nvm. It was hard to follow the code since there was unvectorized stuff all over the place. It looks a little better now. — Peter Cordes, Mar 11 '16 at 01:13
Thanks Peter. Yeah, the final while loop is the scalar cleanup loop. So can you please suggest where more optimization margin is available, since it's still not able to beat my de-interleaved code, which in turn is slower than my MMX code? Maybe I can share the whole de-interleave code as well, if you'd like to take a look? — Bharat Ahuja, Mar 11 '16 at 01:32

score 3 · Answer 1 · answered Mar 09 '16 at 18:05

pshufb vectors of { a b g r ... } into vectors of { a a a a b b b b g g g g r r r r } (one pshufb per source vector).

punpckldq between two shuffled source vectors to get { g2g2g2g2 g1g1g1g1 r2r2r2r2 r1r1r1r1 }. pmovzxbw the low half, and unpack the high half with zero, to get vectors of just g and just r.

Similarly, punpckhdq the same two source vectors to get { a2a2a2a2 a1a1a1a1 b2b2b2b2 b1b1b1b1 }.
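The shuffle/unpack steps described above can be sketched roughly as follows. This is an illustrative sketch only, not tested for performance: the function name `deinterleave8`, the macro `DEINTERLEAVE_TARGET`, and the assumption that pixels are stored in memory byte order r, g, b, a are all hypothetical. It uses `_mm_unpacklo_epi8` with zero in place of pmovzxbw so that only SSSE3 is required.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)

// GCC/Clang: enable SSSE3 for this one function without a global -mssse3 flag.
#if defined(__GNUC__) || defined(__clang__)
#define DEINTERLEAVE_TARGET __attribute__((target("ssse3")))
#else
#define DEINTERLEAVE_TARGET
#endif

// Hypothetical helper: splits 8 RGBA pixels (assumed memory byte order
// r, g, b, a) into four vectors of eight zero-extended 16-bit values.
DEINTERLEAVE_TARGET
inline void deinterleave8(const std::uint32_t* src,
                          __m128i& vr, __m128i& vg, __m128i& vb, __m128i& va)
{
    assert(src != nullptr);

    // pshufb control: gather bytes 0,4,8,12 (r), then 1,5,9,13 (g), and so on.
    const __m128i group = _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13,
                                        2, 6, 10, 14, 3, 7, 11, 15);
    const __m128i zero = _mm_setzero_si128();

    __m128i p0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i p1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 4));
    p0 = _mm_shuffle_epi8(p0, group);  // low to high: r x4, g x4, b x4, a x4
    p1 = _mm_shuffle_epi8(p1, group);

    // punpckldq / punpckhdq: combine the per-channel dwords of both vectors.
    const __m128i rg = _mm_unpacklo_epi32(p0, p1);  // r0..r7 then g0..g7
    const __m128i ba = _mm_unpackhi_epi32(p0, p1);  // b0..b7 then a0..a7

    // Zero-extend each 8-byte half to eight 16-bit lanes.
    vr = _mm_unpacklo_epi8(rg, zero);
    vg = _mm_unpackhi_epi8(rg, zero);
    vb = _mm_unpacklo_epi8(ba, zero);
    va = _mm_unpackhi_epi8(ba, zero);
}
```

With this layout the output lanes stay in pixel order (lane i holds pixel i), so a coefficient vector can be loaded directly from a contiguous array of 16-bit weights.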

So per 4 input vectors (producing 8 output vectors), that's:

4x pshufb (all using the same control mask)
4x punpckh/l dq (one punpckldq and one punpckhdq per pair of source vectors)
8x punpckh/l bw (or replace 4 of these with pmovzxbw)

Total 16 ALU instructions, not including any copying to avoid destroying data that's still needed.

This compares pretty well against the 32 total instructions needed for the mask/shift/pack method. (And without AVX, that will involve quite a bit of copying to mask the same vector 4 different ways.) Only 8 of those 32 instructions are pack shuffles, so the mask/shift/pack method puts less pressure on the shuffle port, in exchange for way more total instructions.

Haswell can only shuffle on one execution port, which is not the same port as bit-shifts. (And _mm_and can run on any of the three vector execution ports.) I'm pretty confident that the 16-shuffle way will win by a fair margin, because so much more computation can overlap with it.

shufps is potentially useful as a shuffle from two source vectors, but it has 32-bit granularity so I don't see a use for it here. On Intel SnB-family and AMD Bulldozer-family, there's no penalty for using it between integer vector instructions.

Another idea:

    __m128i rgba1 = _mm_loadu_si128((__m128i *)(pSrc));   // { a1.4 b1.4 g1.4 r1.4 ... a1.1 b1.1 g1.1 r1.1 }
    __m128i rgba2 = _mm_loadu_si128((__m128i *)(pSrc+4)); // { a2.4 b2.4 ... g2.1 r2.1 }

    __m128i rg1 = _mm_and_si128 (rgba1, _mm_set1_epi32(0xffff));
    __m128i rg2 = _mm_slli_epi32(rgba2, 16);

    __m128i rg_interleaved = _mm_or_si128(rg2, rg1);    // { g2.4 r2.4  g1.4 r1.4 ... g2.1 r2.1  g1.1 r1.1 }

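Separate rg_interleaved into zero-extended 16-bit r and g vectors with another _mm_and_si128 (masking each 16-bit element with 0x00ff to get r) and a _mm_srli_epi16 by 8 (to get g). Note that the resulting elements alternate between pixels from the first and second source vector.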
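As a rough sketch of how this second idea might be completed with SSE2 only: the r/g half follows the snippet above, while the b/a half (shifting the first vector right and keeping the high 16 bits of the second) is an assumed extension by analogy and is not part of the original answer. The function name `deinterleaveMerged8` is hypothetical, and pixels are assumed to be in memory byte order r, g, b, a.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 only

// Hypothetical helper: de-interleaves 8 RGBA pixels using and/shift/or only.
// Output lane order is interleaved between the two loads:
// lane 2k holds pixel k, lane 2k+1 holds pixel k+4.
inline void deinterleaveMerged8(const std::uint32_t* src,
                                __m128i& vr, __m128i& vg, __m128i& vb, __m128i& va)
{
    assert(src != nullptr);

    const __m128i rgba1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    const __m128i rgba2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 4));
    const __m128i lo16 = _mm_set1_epi32(0xffff);
    const __m128i lo8  = _mm_set1_epi16(0x00ff);

    // r/g: keep the low 16 bits of rgba1, move the low 16 bits of rgba2 up.
    const __m128i rg = _mm_or_si128(_mm_and_si128(rgba1, lo16),
                                    _mm_slli_epi32(rgba2, 16));
    // b/a (assumed extension): move the high 16 bits of rgba1 down,
    // keep the high 16 bits of rgba2 in place.
    const __m128i ba = _mm_or_si128(_mm_srli_epi32(rgba1, 16),
                                    _mm_andnot_si128(lo16, rgba2));

    vr = _mm_and_si128(rg, lo8);   // low byte of each 16-bit element
    vg = _mm_srli_epi16(rg, 8);    // high byte of each 16-bit element
    vb = _mm_and_si128(ba, lo8);
    va = _mm_srli_epi16(ba, 8);
}
```

This version uses no shuffle instructions at all, but because the lanes alternate between the two source vectors, any per-pixel coefficients (such as Gaussian weights) would have to be arranged in the same interleaved order.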

Thanks Peter, I got the general idea, but I am very new to this, so it would be very helpful if you could provide code for approach 1 — Bharat Ahuja, Mar 09 '16 at 18:11
@BharatAhuja: No thanks, working out the right `_mm_set_epi8` constant for `_mm_shuffle_epi8` isn't very much fun. Go ahead and post your own answer once you get it implemented and tested, for the benefit of future readers. Also, if Paul R says you should be able to implement a Gaussian filter without unpacking this way, then it's probably true. You should look into that. Intel has published some examples, like https://software.intel.com/en-us/search/site/field_tags/gaussian_blur_filter-22642/language/en — Peter Cordes, Mar 09 '16 at 18:21
