I'm working on an SSE2 implementation of an RGB565/RGB555 Alpha Blend and I've run into an issue I haven't been able to wrap my head around. This is the Alpha Blend in C++:
#define ALPHA_BLEND_X_W(dst, src, alpha)\
ts = src; td = dst;\
td = ((td | (td << 16)) & RGBMask); ts = ((ts | (ts << 16)) & RGBMask);\
td = (((((ts - td) * alpha + RGBrndX) >> 5) + td) & RGBMask);\
dst= (td | (td >> 16));
This is for a filter plugin for the VBA-M and Kega Fusion emulators. The blend is already extremely fast and accurate, but speed is critical if I'm going to implement all the features I have planned for the plugin. ts and td are 32-bit ints, which lets me move green into the upper 16 bits, blend all three channels in one go, and then shift green back into place.
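To make the packed trick concrete, here is a minimal sketch of the same macro written as a function for RGB565. The names blend565, kMask565 and kRound565 are hypothetical, and the two constant values are assumptions, since RGBMask and RGBrndX are not shown above. Alpha is assumed to run from 0 (keep dst) to 32 (take src).

```cpp
#include <cassert>
#include <cstdint>

// Assumed RGB565 constants (not shown in the post):
// green sits in bits 21-26, red in bits 11-15, blue in bits 0-4.
constexpr std::uint32_t kMask565  = 0x07E0F81Fu;
// Adds 16 (half of the divisor 32) to each channel before the >> 5.
constexpr std::uint32_t kRound565 = 0x02008010u;

// Blend src over dst; alpha is assumed to be in [0, 32].
inline std::uint16_t blend565(std::uint16_t dst, std::uint16_t src,
                              std::uint32_t alpha)
{
    assert(alpha <= 32);
    // Duplicate the pixel into both halves, then keep R/B low and G high.
    std::uint32_t td = (dst | (std::uint32_t(dst) << 16)) & kMask565;
    std::uint32_t ts = (src | (std::uint32_t(src) << 16)) & kMask565;
    // (ts - td) may wrap around; unsigned arithmetic plus the final mask
    // still yields the right per-channel result.
    td = ((((ts - td) * alpha + kRound565) >> 5) + td) & kMask565;
    // Fold green back down next to red and blue.
    return std::uint16_t(td | (td >> 16));
}
```

In this scalar form a source channel smaller than the destination channel is not a problem: the subtraction wraps modulo 2^32, and the mask discards the bits that the wraparound disturbs.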
This is what I've got so far for my SSE implementation:
#define AlphaBlendX(s, d0, d1, d2, d3, v0, v1, v2, v3)\
D = _mm_set_epi32(d0, d1, d2, d3);\
S = _mm_set1_epi32(s);\
V = _mm_set_epi16(v0, v0, v1, v1, v2, v2, v3, v3);\
sD = _mm_slli_si128(D, 2);\
sS = _mm_slli_si128(S, 2);\
oD = _mm_or_si128(D, sD);\
oS = _mm_or_si128(S, sS);\
mD = _mm_and_si128(oD, RGB);\
mS = _mm_and_si128(oS, RGB);\
sub = _mm_sub_epi32(mS, mD);\
hi = _mm_mulhi_epu16(sub, V);\
lo = _mm_mullo_epi16(sub, V);\
mul = _mm_or_si128(_mm_slli_si128(hi, 2), lo);\
rnd = _mm_add_epi64(mul, RND);\
div = _mm_srli_epi32(rnd, 5);\
add = _mm_add_epi64(div, mD);\
D = _mm_and_si128(add, RGB);\
DD = _mm_srli_si128(D, 2);\
DDD = _mm_or_si128(D, DD);\
d0 = _mm_extract_epi16(DDD, 1); d1 = _mm_extract_epi16(DDD, 3); d2 = _mm_extract_epi16(DDD, 5); d3 = _mm_extract_epi16(DDD, 7);
It's a noticeable performance improvement even in its horribly unoptimized state (a separate variable for every step instead of reusing D and DD across the arithmetic operations). However, it's returning incorrect values! I'm fairly confident that the first trouble spot is the subtraction: it's definitely possible to get a negative value out of it.
My planned solution would be to compare the four 32-bit values and then swap them in-place before subtraction to get an absolute value of the subtraction. I'm aware of the _mm_cmpgt/_mm_cmplt intrinsics and how they work, though I have no idea how to use the bitmasks they output to do what I need.
Any solution for getting the absolute value while keeping the source and destination DWORDs in their places would be greatly appreciated. Tips on optimizing this code would also be welcome.
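For reference, one common way to use the all-ones/all-zeros masks from the compare intrinsics is a bitwise select built from AND, ANDNOT and OR. The sketch below uses it to compute a per-lane absolute difference while leaving both inputs untouched. The helper names select_epi32, absdiff_epi32 and demo_absdiff are hypothetical, and it assumes every lane holds a non-negative value that fits in a signed 32-bit integer, because _mm_cmpgt_epi32 is a signed compare. It only illustrates the mask technique and is not a fix for the macro above.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Bitwise select: bits from a where mask is all-ones, bits from b elsewhere.
static inline __m128i select_epi32(__m128i mask, __m128i a, __m128i b)
{
    // _mm_andnot_si128(x, y) computes (~x) & y.
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

// Per-lane |s - d| for four 32-bit lanes. *d_greater receives 0xFFFFFFFF
// in each lane where d > s, so the sign can be reapplied later if needed.
static inline __m128i absdiff_epi32(__m128i s, __m128i d, __m128i* d_greater)
{
    __m128i gt = _mm_cmpgt_epi32(d, s);        // signed compare, per lane
    __m128i larger  = select_epi32(gt, d, s);  // d where d > s, else s
    __m128i smaller = select_epi32(gt, s, d);  // s where d > s, else d
    *d_greater = gt;
    return _mm_sub_epi32(larger, smaller);
}

// Small usage example; _mm_set_epi32 lists lanes from highest to lowest.
static inline void demo_absdiff()
{
    __m128i s = _mm_set_epi32(1, 50, 7, 0);
    __m128i d = _mm_set_epi32(9, 20, 7, 100);
    __m128i gt;
    __m128i r = absdiff_epi32(s, d, &gt);
    std::uint32_t out[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
    assert(out[0] == 100 && out[1] == 0 && out[2] == 30 && out[3] == 8);
}
```

Keeping the mask around matters because the blend still needs to know, per lane, whether the scaled difference should be added to or subtracted from the destination.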