I need an SSE shuffle routine to avoid negative numbers in a parallel subtraction

Question

I'm working on an SSE2 implementation of an RGB565/RGB555 Alpha Blend and I've run into an issue I haven't been able to wrap my head around. This is the Alpha Blend in C++:

#define ALPHA_BLEND_X_W(dst, src, alpha)\
    ts = src; td = dst;\
    td = ((td | (td << 16)) & RGBMask); ts = ((ts | (ts << 16)) & RGBMask);\
    td = (((((ts - td) * alpha + RGBrndX) >> 5) + td) & RGBMask);\
    dst= (td | (td >> 16));

This is for a filter plugin for the VBA-M and Kega Fusion emulators. This is an extremely fast and accurate blend already, but speed is critical if I'm going to implement all the features I plan to implement in my filter plugin. ts and td are 32-bit ints, which lets me move green into the upper 16 bits, blend all three channels in one go, and then fold green back into place.
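For readers who want to step through the scalar blend, here is a minimal sketch of the same macro written as a function. The constants are assumptions, since the question does not show RGBMask or RGBrndX: kRGBMask is the usual RGB565 "spread" mask and kRound puts 16 (half of 32) in each field. The function name and the use of unsigned arithmetic are also choices made for this sketch. Because every field has at least five spare bits above it, a field-wise negative difference simply wraps around and the final mask discards the borrow bits.

```cpp
#include <cassert>
#include <cstdint>

// Assumed RGB565 constants; the question does not give RGBMask or RGBrndX.
constexpr std::uint32_t kRGBMask = 0x07E0F81Fu; // G in bits 21-26, R in 11-15, B in 0-4
constexpr std::uint32_t kRound   = 0x02008010u; // 16 (half of 32) added to each field

// Hypothetical function form of ALPHA_BLEND_X_W; alpha is expected in 0..32.
inline std::uint16_t alpha_blend_565(std::uint16_t dst, std::uint16_t src,
                                     std::uint32_t alpha)
{
    assert(alpha <= 32);
    // Spread each pixel so green sits in the upper half of the 32-bit word.
    std::uint32_t td = (dst | (std::uint32_t(dst) << 16)) & kRGBMask;
    std::uint32_t ts = (src | (std::uint32_t(src) << 16)) & kRGBMask;
    // ts - td may wrap; the wrap only disturbs bits that the mask removes.
    td = ((((ts - td) * alpha + kRound) >> 5) + td) & kRGBMask;
    // Fold green back into its original position.
    return std::uint16_t(td | (td >> 16));
}
```

With these assumed constants, alpha = 0 returns dst unchanged and alpha = 32 returns src, which is a quick sanity check before comparing against a SIMD version.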

This is what I've got so far for my SSE implementation:

#define AlphaBlendX(s, d0, d1, d2, d3, v0, v1, v2, v3)\
    D = _mm_set_epi32(d0, d1, d2, d3);\
    S = _mm_set1_epi32(s);\
    V = _mm_set_epi16(v0, v0, v1, v1, v2, v2, v3, v3);\
    sD = _mm_slli_si128(D, 2);\
    sS = _mm_slli_si128(S, 2);\
    oD = _mm_or_si128(D, sD);\
    oS = _mm_or_si128(S, sS);\
    mD = _mm_and_si128(oD, RGB);\
    mS = _mm_and_si128(oS, RGB);\
    sub = _mm_sub_epi32(mS, mD);\
    hi = _mm_mulhi_epu16(sub, V);\
    lo = _mm_mullo_epi16(sub, V);\
    mul = _mm_or_si128(_mm_slli_si128(hi, 2), lo);\
    rnd = _mm_add_epi64(mul, RND);\
    div = _mm_srli_epi32(rnd, 5);\
    add = _mm_add_epi64(div, mD);\
    D = _mm_and_si128(add, RGB);\
    DD = _mm_srli_si128(D, 2);\
    DDD = _mm_or_si128(D, DD);\
    d0 = _mm_extract_epi16(DDD, 1); d1 = _mm_extract_epi16(DDD, 3); d2 = _mm_extract_epi16(DDD, 5); d3 = _mm_extract_epi16(DDD, 7);

It's a noticeable performance improvement even in the horribly unoptimized state it's in (all the different variables instead of swapping from D to DD and back at each arithmetic operation). However, it's returning incorrect values! I'm pretty confident that the first area it's having trouble with is the subtraction. It's definitely possible to get a negative value out of that subtraction operation.

My planned solution would be to compare the four 32-bit values and then swap them in-place before subtraction to get an absolute value of the subtraction. I'm aware of the _mm_cmpgt/_mm_cmplt intrinsics and how they work, though I have no idea how to use the bitmasks they output to do what I need.

Any possible solution for how I'd get absolute value while keeping the source and destination DWORDS in their places would be greatly appreciated. Tips regarding optimization of this code would also be nice.
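The compare-and-swap idea described above can be sketched with SSE2 alone: the mask from _mm_cmpgt_epi32 is all ones in lanes where the first operand is larger, so AND / ANDNOT / OR can pick the larger and smaller value per lane without moving either input. The function name is hypothetical. Note that this treats each 32-bit lane as a single signed number; it does not by itself make the individual color fields packed inside a lane non-negative.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2

// Hypothetical helper: per-lane |a - b| for four signed 32-bit values,
// built from a compare mask and bitwise selects.
inline __m128i abs_diff_select_epi32(__m128i a, __m128i b)
{
    // mask lane = 0xFFFFFFFF where a > b, else 0.
    const __m128i mask = _mm_cmpgt_epi32(a, b);
    // larger = a where mask is set, b elsewhere.
    const __m128i larger  = _mm_or_si128(_mm_and_si128(mask, a),
                                         _mm_andnot_si128(mask, b));
    // smaller = b where mask is set, a elsewhere.
    const __m128i smaller = _mm_or_si128(_mm_and_si128(mask, b),
                                         _mm_andnot_si128(mask, a));
    return _mm_sub_epi32(larger, smaller);
}
```

The same mask can be kept around if the sign of the original difference is needed later, for example to decide whether to add or subtract the scaled result.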

May I suggest that you write it as a function instead of a macro. The compiler will almost certainly inline it anyway, but a function can be debugged by stepping through it, so you can see what values you have in what SSE register. I have no idea why you use a comma operator on the last line... — Mats Petersson, Aug 02 '13 at 08:13
I probably should have mentioned this, but all of the dn parameters are pointers to an array. If I write it as an inline function I'd have to deal with returning values for the blended colors and then actually setting them separately. I'm pretty confident that the first problem area is subtraction. The second possible problem area is in multiplication, but I already know how I'd deal with that if I'm still getting incorrect results after fixing subtraction. Changed the commas to semicolons. — user2645004, Aug 02 '13 at 09:24
I'm pretty sure you can solve all the problems of a macro -> inline by using references. And the compiler is likely going to produce equivalent code either way. — Mats Petersson, Aug 02 '13 at 09:31
And if you make it a function, you can single-step to the point where the subtraction is, and see if the results are what you expect. — Mats Petersson, Aug 02 '13 at 09:34
It doesn't really matter whether it's a macro or a function, it's fairly trivial to step through assembly code and look at register values in a debugger. Especially when you're dealing with SSE whose intrinsics more or less map to single instructions anyway. — , Aug 02 '13 at 09:42
I've already determined subtraction is a problem; no debugger necessary. Some pixels blend fine, others come out neon green (for example) when blending two blue pixels. If I swap mS with mD and subtract the other way, the output changes dramatically; again, some pixels fine, others not, only the bad pixels from before are good now and the good ones are way off. inline vs. macro isn't an issue right now. — user2645004, Aug 02 '13 at 09:55

score 1 · Answer 1 · edited Dec 01 '13 at 21:57

Here's how to get the absolute value of 16-bit (or 32-bit) values using SSE2:

Two's-complement negation is a one's-complement (bitwise NOT) followed by an increment:

-A == (A ^ -1) + 1;

__m128i xmmOriginal, xmmZero, xmmMask, xmmAbsolute;

// xmmOriginal is assumed to be initialized to positive/negative values

xmmZero = _mm_setzero_si128();
xmmMask = _mm_cmplt_epi16(xmmOriginal, xmmZero); // mask = FFFF where negative values are
xmmAbsolute = _mm_xor_si128(xmmMask, xmmOriginal); // bitwise invert the negative values
xmmMask = _mm_srli_epi16(xmmMask, 15); // convert mask FFFF's into 1's
xmmAbsolute = _mm_add_epi16(xmmAbsolute, xmmMask); // done
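Since the question works on 32-bit lanes, the same steps can be sketched with the epi32 forms of the intrinsics, wrapped in a function so it is self-contained. The function name is hypothetical. As with any two's-complement absolute value, the most negative value (0x80000000) maps to itself.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2

// Hypothetical 32-bit variant of the snippet above.
inline __m128i abs_epi32_sse2(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    // mask lane = 0xFFFFFFFF where v is negative, else 0.
    const __m128i mask = _mm_cmplt_epi32(v, zero);
    // Bitwise-invert only the negative lanes.
    const __m128i inverted = _mm_xor_si128(v, mask);
    // Turn the all-ones mask into 1 so the add completes the negation.
    const __m128i one = _mm_srli_epi32(mask, 31);
    return _mm_add_epi32(inverted, one);
}
```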

answered Aug 05 '13 at 19:27 by BitBank · edited Dec 01 '13 at 21:57 by Ivan Aksamentov - Drop

SSSE3 provides `_mm_abs_epi8` / `_mm_abs_epi16` / `_mm_abs_epi32`, which do this in a single operation, so use those if available. Another option here is to subtract from zero and take `_mm_max_epi16`, which is available in SSE2 (some other sizes and signedness of min/max are only SSE4.1) – Peter Cordes Aug 31 '23 at 20:38
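A minimal sketch of the two options from this comment, for 16-bit lanes. The function name is hypothetical, and the `__SSSE3__` check is an assumption about the compiler: GCC and Clang define it when SSSE3 is enabled, while other compilers may need a different test. Both paths leave -32768 unchanged.

```cpp
#include <cassert>
#include <emmintrin.h>   // SSE2
#ifdef __SSSE3__
#include <tmmintrin.h>   // SSSE3: _mm_abs_epi16
#endif

// Hypothetical helper: per-lane absolute value of eight signed 16-bit values.
inline __m128i abs_epi16_sketch(__m128i v)
{
#ifdef __SSSE3__
    // Single instruction when SSSE3 is enabled at compile time.
    return _mm_abs_epi16(v);
#else
    // SSE2 fallback: max(v, 0 - v) picks the non-negative one of the pair.
    return _mm_max_epi16(v, _mm_sub_epi16(_mm_setzero_si128(), v));
#endif
}
```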
