
I'm looking for a way to optimize alpha blending, but for two colors that both have an alpha channel (which differs from the question How to alpha blend RGBA unsigned byte color fast?).

Initially I used a solution with floats (RGB ranging from 0.0f to 255.0f and A ranging from 0.0f to 1.0f):

inline void alphaBlend(Color& baseColor, Color targetColor)
{
    float newAlpha = (1 - targetColor.A) * baseColor.A + targetColor.A;
    baseColor.R = ((1 - targetColor.A) * baseColor.A * baseColor.R + targetColor.A * targetColor.R) / newAlpha;
    baseColor.G = ((1 - targetColor.A) * baseColor.A * baseColor.G + targetColor.A * targetColor.G) / newAlpha;
    baseColor.B = ((1 - targetColor.A) * baseColor.A * baseColor.B + targetColor.A * targetColor.B) / newAlpha;
    baseColor.A = newAlpha; // store the blended alpha (assumes newAlpha > 0)
}

I changed the algorithm to work on unsigned int RGBA colors. I replaced every occurrence of alpha with (alpha / 255) and then rearranged the formulas so that all values stay within their proper ranges.

baseColor.R = ((1 - targetColor.A) * baseColor.A * baseColor.R + targetColor.A * targetColor.R) / newAlpha;

Shorthand (targetColor.A -> tA etc.):

R = ((1 - tA) * bA * bR + tA * tR) / newAlpha

(introducing 255-based alpha requires replacing all A instances with A/255)

  = ((1 - (tA / 255)) * (bA / 255) * bR + (tA / 255) * tR) / (newAlpha / 255)

(remove the 255 from the denominator's denominator: dividing by newAlpha/255 is the same as multiplying by 255 and dividing by newAlpha)

  = (((1 - (tA / 255)) * (bA / 255) * bR + (tA / 255) * tR) * 255) / newAlpha

(get rid of the direct divisions of alpha by 255 by multiplying each parenthesis by 255/255)

  = (( ((255 - tA) * bA * bR) / 255^2 + (tA * tR) / 255) * 255) / newAlpha

(multiplying by the last 255 causes denominators to reduce)

  = ( ((255 - tA) * bA * bR) / 255 + (tA * tR * 255) / 255 ) / newAlpha
  
(Pushing numerator's denominator (255) to the denominator)

  = ( ((255 - tA) * bA * bR) + (tA * tR * 255) ) / (255 * newAlpha)

(Expanding first multiplication in numerator)

  = ( 255 * bA * bR - tA * bA * bR + tA * tR * 255) / (255 * newAlpha)
                      ^^^^^^^^^^^^   ^^^^^^^^^^^^^
(reordering not to fall below 0 during calculations)

  = ( 255 * bA * bR + tA * tR * 255 - tA * bA * bR ) / (255 * newAlpha)

(grouping to minimize multiplications)

  = ( (bA * bR + tA * tR) * 255 - tA * bA * bR ) / (255 * newAlpha)

(introducing bit shifting - losing precision, but in an acceptable range)

  ~= ( ((bA * bR + tA * tR) << 8) - tA * bA * bR ) / (newAlpha << 8)

I managed to write the following code:

inline void alphaBlend(IntColor& baseColor, IntColor targetColor)
{
    unsigned int a = (((baseColor.A + targetColor.A) << 8) - targetColor.A * baseColor.A) >> 8;

    if (a > 0)
    {
        unsigned int divisor = a << 8;

        unsigned int baseAR = baseColor.A * baseColor.R;
        baseColor.R = (((targetColor.A * targetColor.R + baseAR) << 8) - (baseAR * targetColor.A)) / divisor;

        unsigned int baseAG = baseColor.A * baseColor.G;
        baseColor.G = (((targetColor.A * targetColor.G + baseAG) << 8) - (baseAG * targetColor.A)) / divisor;

        unsigned int baseAB = baseColor.A * baseColor.B;
        baseColor.B = (((targetColor.A * targetColor.B + baseAB) << 8) - (baseAB * targetColor.A)) / divisor;

        baseColor.A = a;
    }
    else
    {
        baseColor.R = 0;
        baseColor.G = 0;
        baseColor.B = 0;
        baseColor.A = 0;
    }
}

This change reduced the rendering time of my sample data from 27559 ms to 17751 ms. Since alpha blending seems to be the most common operation in the rendering workflow, I'm curious whether there is a way to optimize it even further.

I thought about doing the calculations on R and B at the same time, but unfortunately in some circumstances the intermediate values exceed two bytes (for instance, if bA = bR = tA = tR = 255, the left part of the subtraction equals (255 * 255 + 255 * 255) << 8 = 33292800 = 0x1FC0200, far beyond a 16-bit lane).

Is there any other optimization I could apply to make this code faster?


Edit: responding to comments:

  • Target architecture is x64, target processor may be Intel Core family
  • Input type is guaranteed to be 32-bit RGBA
  • Memory layout is BGRA (8888)
  • Regarding SIMD: my application is a vector animation renderer. Every object is rendered to a separate bitmap and then alpha-blended into the resulting one, because every object may have alpha/mask/transformations/effects applied, or may consist of multiple sub-objects, each of which may also have those applied.
  • Compiler is the one from Microsoft Visual Studio 2022. Application is Windows-only.
    Is SIMD allowed, and which version? – harold Aug 30 '23 at 14:26
  • Many important details are missing in the question. What is the target architecture (and target processor)? An optimal code for a given architecture can be a very inefficient code on another (eg. FP mul on Intel AlderLake VS ARM Cortex M0). The input type seems not well defined. It is generally fixed for a given application and conversions (eg. float/int) can be pretty expensive. The memory layout is also not mentioned. Data reordering due to a bad memory layout can be more expensive than the target computation. Please clarify these points. Do you have multiple colors to compute (see SIMD)? – Jérôme Richard Aug 30 '23 at 21:32
  • Oh, I forgot to also ask for compiler optimizations use so far as well as the target compiler. Besides, do you want to optimize the latency of this code or the throughput? – Jérôme Richard Aug 30 '23 at 21:53
  • 1
    @JérômeRichard I have responded in the question. – Spook Aug 31 '23 at 09:21
  • 2
    I found some answer for [alpha-blending using SSE2](https://stackoverflow.com/a/5438343/555045), SSE2 is baseline for x64 so you can use that unconditionally. There may be better solutions for SSSE3 – harold Aug 31 '23 at 17:33
