How does this color blending trick that works on color components in parallel work?

Question

I saw this Java code that does a perfect 50% blend between two RGB888 colors extremely efficiently:

public static int blendRGB(int a, int b) {
    return (a + b - ((a ^ b) & 0x00010101)) >> 1;
}

That's apparently equivalent to extracting and averaging the channels individually. Something like this:

public static int blendRGB_(int a, int b) {
    int aR = a >> 16;
    int bR = b >> 16;
    int aG = (a >> 8) & 0xFF;
    int bG = (b >> 8) & 0xFF;
    int aB = a & 0xFF;
    int bB = b & 0xFF;
    int cR = (aR + bR) >> 1;
    int cG = (aG + bG) >> 1;
    int cB = (aB + bB) >> 1;
    return (cR << 16) | (cG << 8) | cB;
}

But the first way is much more efficient. My questions are: How does this magic work? What else can I do with it? And are there more tricks similar to this?

I have to work with lots of older code which I inherited from co-workers... some of them preferred code similar to the first code block and some similar to the second code block - and I must confess I am so grateful when I find those longer code fragments, because they are so much easier to read! So whenever you write code which will (even probably) be read by someone else, use the 'longer' code, please! — Martin Frank, Dec 19 '13 at 12:55
[Quick colour averaging](https://www.compuphase.com/graphic/scale3.htm), [Fast Averaging of High Color (16 bit) Pixels](https://medium.com/@luc.trudeau/fast-averaging-of-high-color-16-bit-pixels-cb4ac7fd1488) — phuclv, Apr 13 '21 at 15:22

harold · Accepted Answer · 2013-12-19T19:16:33.523

(a ^ b) & 0x00010101 is what the least significant bits of the channels would have been in a + b if no carry had come from the right.

Subtracting it from the sum guarantees that the bit that is shifted into the most significant bit of the next channel is just the carry from that channel, untainted by this channel. Of course that also means that this channel is no longer affected by the carry from the next channel.

Another way to look at this (not the way the code does it, but a way that may help you understand it) is that the inputs are effectively changed so that their sum is even for all channels. The carries then go nicely into the least significant bits (which are zero, because the sums are even) without disturbing anything. What the code actually does is the other way around: first it sums the inputs, and only then does it ensure that the sums are even for all channels. The order doesn't matter.
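To see the contamination concretely, here is a minimal sketch that compares a naive packed average with the corrected one. The class name `CarryLeakDemo` and the helper `naiveBlend` are made up for illustration; `blendRGB` is the function from the question.

```java
class CarryLeakDemo {
    // Naive packed average: the low bit of each channel's sum is shifted
    // into the most significant bit of the channel to its right.
    static int naiveBlend(int a, int b) {
        return (a + b) >> 1;
    }

    // The trick from the question: make every channel's sum even before shifting.
    static int blendRGB(int a, int b) {
        return (a + b - ((a ^ b) & 0x00010101)) >> 1;
    }

    public static void main(String[] args) {
        int a = 0x000100; // green = 1, red = blue = 0
        int b = 0x000000; // black
        // Green's odd sum bit leaks into blue in the naive version.
        System.out.printf("naive: %06X%n", naiveBlend(a, b)); // prints "naive: 000080"
        System.out.printf("fixed: %06X%n", blendRGB(a, b));   // prints "fixed: 000000"
    }
}
```

The per-channel average of green 1 and green 0 is 0 (rounding down), so the corrected result is plain black, while the naive version produces a blue value of 0x80.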

More concretely, there are 4 cases (before the carry from the next channel is applied):

1. the lsb of a channel is 0 and there is no carry from the next channel.
2. the lsb of a channel is 0 and there is a carry from the next channel.
3. the lsb of a channel is 1 and there is no carry from the next channel.
4. the lsb of a channel is 1 and there is a carry from the next channel.

The first two cases are trivial. The shift puts the carried bit back in the channel it belongs to; it doesn't even matter whether it was 0 or 1.

Case 3 is more interesting. If the lsb is 1, that means the shift would shift that bit into the most significant bit of the next channel. That's bad. That bit has to be unset somehow - but you can't just mask it away because maybe you're in case 4.

Case 4 is the most interesting. If the lsb is 1 and there is a carry into that bit, it rolls over to a 0 and the carry is propagated. That can't be undone by masking, but it can be undone by reversing the process, i.e. subtracting 1 from the lsb (which puts it back to 1 and undoes any damage done by the propagated carry).

As you can see, in both case 3 and case 4 the cure is subtracting 1 from the lsb. Those are also the cases in which the lsb really wanted to be 1 (though maybe it isn't any more, due to a carry from the next channel). In cases 1 and 2 you don't have to do anything (in other words, subtract 0). That exactly corresponds to subtracting "what the lsb would have been in a + b if no carry had come from the right".
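The case analysis above can be checked mechanically. The following is a sketch, not a proof: it compares `blendRGB` against a per-channel reference over colors built from a handful of edge-case channel values, which is enough to hit all four cases in every channel. The names `BlendCaseCheck`, `blendReference`, `edgeCaseColors`, and `countMismatches` are hypothetical.

```java
class BlendCaseCheck {
    // The packed blend from the question.
    static int blendRGB(int a, int b) {
        return (a + b - ((a ^ b) & 0x00010101)) >> 1;
    }

    // Reference: average each 8-bit channel separately, rounding down.
    static int blendReference(int a, int b) {
        int r = (((a >> 16) & 0xFF) + ((b >> 16) & 0xFF)) >> 1;
        int g = (((a >> 8) & 0xFF) + ((b >> 8) & 0xFF)) >> 1;
        int bl = ((a & 0xFF) + (b & 0xFF)) >> 1;
        return (r << 16) | (g << 8) | bl;
    }

    // All RGB888 colors whose channels are drawn from a few edge-case values:
    // odd and even lsbs, plus values that do and do not produce a carry.
    static int[] edgeCaseColors() {
        int[] v = {0, 1, 2, 127, 128, 254, 255};
        int[] colors = new int[v.length * v.length * v.length];
        int i = 0;
        for (int r : v)
            for (int g : v)
                for (int b : v)
                    colors[i++] = (r << 16) | (g << 8) | b;
        return colors;
    }

    // Counts the pairs on which the packed blend disagrees with the reference.
    static int countMismatches() {
        int[] colors = edgeCaseColors();
        int mismatches = 0;
        for (int a : colors)
            for (int b : colors)
                if (blendRGB(a, b) != blendReference(a, b))
                    mismatches++;
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```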

Also, the blue channel can only fall into cases 1 or 3 (there is no next channel which could carry), and the shift would just discard that bit instead of putting it in the next channel (because there is none). So alternatively, you may write the following (note the mask has lost the least significant 1):

public static int blendRGB(int a, int b) {
    return (a + b - ((a ^ b) & 0x00010100)) >> 1;
}

Doesn't really make any difference, though.
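As a quick sanity check of that claim, here is a sketch that runs both masks on random 24-bit inputs and reports whether they ever disagree. The class and method names are made up for this example, and the seed and sample count are arbitrary choices.

```java
import java.util.Random;

class MaskEquivalenceCheck {
    // Original mask: clears the low sum bit of red, green, and blue.
    static int blendFullMask(int a, int b) {
        return (a + b - ((a ^ b) & 0x00010101)) >> 1;
    }

    // Shorter mask: leaves blue's low sum bit alone, since the shift discards it anyway.
    static int blendShortMask(int a, int b) {
        return (a + b - ((a ^ b) & 0x00010100)) >> 1;
    }

    // Returns true if both variants agree on `samples` random RGB888 pairs.
    static boolean masksAgree(long seed, int samples) {
        Random rng = new Random(seed);
        for (int i = 0; i < samples; i++) {
            int a = rng.nextInt(1 << 24); // random 24-bit color, upper 8 bits zero
            int b = rng.nextInt(1 << 24);
            if (blendFullMask(a, b) != blendShortMask(a, b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("masks agree: " + masksAgree(42L, 1_000_000));
    }
}
```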

To make it work for ARGB8888, you can switch to the good old "SWAR average":

// channel-by-channel average, no alpha blending
public static int blendARGB(int a, int b) {
    return (a & b) + (((a ^ b) & 0xFEFEFEFE) >>> 1);
}

This is a variation on a recursive way to define addition: x + y = (x ^ y) + ((x & y) << 1), which computes the sum without carries, then adds the carries in separately. The base case is when one of the operands is zero.

Both halves are effectively shifted right by 1, in such a way that the carry out of the most significant bit is never lost. The mask ensures that bits don't move to a channel to the right, and simultaneously ensures that a carry won't propagate out of its channel.
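A sketch that ties these two paragraphs together: it checks the addition identity on random inputs and compares the SWAR average against a straightforward per-byte loop. Only `blendARGB` comes from the answer; `SwarAverageCheck`, `averageReference`, `identityHolds`, and `matchesReference` are hypothetical names.

```java
import java.util.Random;

class SwarAverageCheck {
    // The SWAR average from the answer: (x + y) / 2 per byte, with the halving
    // applied to both terms of x + y = (x ^ y) + ((x & y) << 1).
    static int blendARGB(int a, int b) {
        return (a & b) + (((a ^ b) & 0xFEFEFEFE) >>> 1);
    }

    // Reference: average each of the four bytes separately, rounding down.
    static int averageReference(int a, int b) {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            int ca = (a >>> shift) & 0xFF;
            int cb = (b >>> shift) & 0xFF;
            result |= ((ca + cb) >> 1) << shift;
        }
        return result;
    }

    // Checks x + y == (x ^ y) + ((x & y) << 1) in wrapping 32-bit arithmetic.
    static boolean identityHolds(long seed, int samples) {
        Random rng = new Random(seed);
        for (int i = 0; i < samples; i++) {
            int x = rng.nextInt();
            int y = rng.nextInt();
            if (x + y != (x ^ y) + ((x & y) << 1)) return false;
        }
        return true;
    }

    // Checks the SWAR average against the per-byte reference.
    static boolean matchesReference(long seed, int samples) {
        Random rng = new Random(seed);
        for (int i = 0; i < samples; i++) {
            int a = rng.nextInt();
            int b = rng.nextInt();
            if (blendARGB(a, b) != averageReference(a, b)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("identity holds: " + identityHolds(1L, 1_000_000));
        System.out.println("matches reference: " + matchesReference(2L, 1_000_000));
        System.out.printf("%08X%n", blendARGB(0xFFFFFFFF, 0x00000000)); // prints "7F7F7F7F"
    }
}
```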

Thank you I get it now. What threw me was when you said "next channel" I thought you meant the one to the left but you meant the one to the right. Now I see the original code actually had a subtle bug: the best mask is not `0x00010101` or `0x00010100`, but `0x01010100`. With this change, the red channel no longer gets corrupted if the low alpha bits are non-zero. It still won't average the upper 8 bits correctly, but at least it doesn't mind if they're set. — Boann, Dec 19 '13 at 18:32
@Boann well to be fair, the name and description imply that it expects RGB888, not ARGB8888. Blending two ARGB8888's is a little trickier because the carry out of the A channel disappears, but it's doable. — harold, Dec 19 '13 at 18:36
@harold: I guess using `long` for the intermediate result of ARGB is the fastest way, at least on a 64-bit JVM. However, averaging two ARGB values makes no sense, anyway; you'd need a weighted sum or whatever. — maaartinus, Dec 20 '13 at 12:04
@maaartinus yes it's probably not too useful. I don't think there's a nice way to do it properly though. — harold, Dec 20 '13 at 12:15
I only just saw your edit that added the blendARGB code, and **woah that is amazing!** I wouldn't have believed that was possible without increasing the number of operations. But that can average 4 packed bytes in just 5 operations, or the `long` int version on a 64-bit machine could average 8 packed bytes still in 5 operations. I don't actually have a use for it right now, but I damn like it! — Boann, Jan 03 '14 at 19:59
Hi again. Can you give me any pointers to where I can find out more about this "SWAR average" or similar techniques, or was the above code your own invention? I tried searching online but all I found was a lot of crappy waffle. I don't think I know what to search for. — Boann, Feb 08 '14 at 10:11
@Boann I didn't invent it (though I would have liked to), there's some on [chessprogramming:SWAR](http://chessprogramming.wikispaces.com/SIMD+and+SWAR+Techniques) (with potentially useful links), more in TAOCP 4A (bitwise tricks and techniques, tweaking several bytes at once), there's probably something about it in Hacker's Delight but I can't find it right now. — harold, Feb 08 '14 at 10:39
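The two `long` ideas raised in the comments above can be sketched as follows. This is one possible reading of those comments, not code from the thread: `blendARGBViaLong` applies the original subtraction trick with a 64-bit intermediate so the carry out of the alpha channel lands in bit 32 instead of being lost, and `average8Bytes` is the SWAR average widened to eight packed bytes. All names except `blendARGB` are hypothetical.

```java
import java.util.Random;

class LongBlendSketch {
    // The 32-bit SWAR average from the answer, used here as the reference.
    static int blendARGB(int a, int b) {
        return (a & b) + (((a ^ b) & 0xFEFEFEFE) >>> 1);
    }

    // Subtraction trick with a 64-bit intermediate: the sum needs 33 bits,
    // so the alpha channel's carry survives until the shift brings it back.
    static int blendARGBViaLong(int a, int b) {
        long la = a & 0xFFFFFFFFL; // zero-extend to avoid sign extension
        long lb = b & 0xFFFFFFFFL;
        long evened = la + lb - ((la ^ lb) & 0x01010101L);
        return (int) (evened >>> 1);
    }

    // SWAR average of eight packed bytes, same operation count as the int version.
    static long average8Bytes(long a, long b) {
        return (a & b) + (((a ^ b) & 0xFEFEFEFEFEFEFEFEL) >>> 1);
    }

    // Returns true if the long-based blend agrees with the SWAR blend on random inputs.
    static boolean variantsAgree(long seed, int samples) {
        Random rng = new Random(seed);
        for (int i = 0; i < samples; i++) {
            int a = rng.nextInt();
            int b = rng.nextInt();
            if (blendARGBViaLong(a, b) != blendARGB(a, b)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("variants agree: " + variantsAgree(7L, 1_000_000));
        System.out.printf("%016X%n", average8Bytes(-1L, 0L)); // prints "7F7F7F7F7F7F7F7F"
    }
}
```

Which variant is faster on a given JVM is not something this sketch measures; that would need a proper benchmark.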
