How can I speed this routine up?

Question

I have the following code that needs to run at 25fps or better, which it currently manages. Eventually we will be using HD video, so it will need to be optimized further to accommodate that.

Is there any way I can optimize this method?

public unsafe void OverlayImage(Bitmap overlay, Bitmap background, Bitmap output)
    {
        Rectangle lrEntire = new Rectangle(new Point(), background.Size);

        BitmapData bdBack = background.LockBits(lrEntire, ImageLockMode.ReadOnly, background.PixelFormat);
        BitmapData bdOverlay = overlay.LockBits(lrEntire, ImageLockMode.ReadOnly, overlay.PixelFormat);
        BitmapData bdOut = output.LockBits(lrEntire, ImageLockMode.WriteOnly, output.PixelFormat);

        uint* pBack = (uint*) bdBack.Scan0;
        uint* pOverlay = (uint*) bdOverlay.Scan0;
        uint* pOut = (uint*) bdOut.Scan0;

        for (int luiToProcess = (bdBack.Height*bdBack.Stride) >> 2; luiToProcess != 0; luiToProcess--)
        {
            //get each pixel component
            uint red = (*pBack & 0x00ff0000) >> 16; // red color component
            uint green = (*pBack & 0x0000ff00) >> 8; // green color component
            uint blue = *pBack & 0x000000ff; // blue color component

            uint oalpha = (*pOverlay & 0xff000000) >> 24;
            uint ored = (*pOverlay & 0x00ff0000) >> 16; // red color component
            uint ogreen = (*pOverlay & 0x0000ff00) >> 8; // green color component
            uint oblue = *pOverlay & 0x000000ff; // blue color component

            //get each pixel color component
            uint rOut = (red*(255 - oalpha) + (ored*oalpha))/255;
            uint gOut = (green*(255 - oalpha) + (ogreen*oalpha))/255;
            uint bOut = (blue*(255 - oalpha) + (oblue*oalpha))/255;

            *pOut = bOut | gOut << 8 | rOut << 16 | 0x00 << 24;
            //move to the next pixel
            pBack++;
            pOverlay++;
            pOut++;
        }

        overlay.UnlockBits(bdOverlay);
        background.UnlockBits(bdBack);
        output.UnlockBits(bdOut);
    }

Is there a reason for you accessing the bytes in such a convoluted way?? — TaW, Oct 25 '14 at 12:45
Graphics.DrawImage() already does this, it will be a lot faster. — Hans Passant, Oct 25 '14 at 19:32
Can't confirm. Looks identical but is not a lot faster. Anything wrong with my code? — TaW, Oct 25 '14 at 20:38
This question appears to be off-topic because it is about optimizing working code. It is better suited for [codereview.se]. — Ken White, Oct 25 '14 at 23:08

TaW · Accepted Answer · 2014-10-26T10:22:44.357

Warning: Long answer, lots of numbers.

Short version: Depending on your overlays, the code below can almost double your framerate.

Looking at the posted code a couple of things come to mind:

1. As the color channels are bytes, it seems more natural to treat them as such instead of doing all the masking and shifting, cheap as it may be.
2. You do quite a few calculations with oalpha. Unless you expect it to be mostly something other than 0 or 255, extra branches for those two values would save some multiplications (6 per such pixel).
3. Since you don't show how you call the routine, you may already be doing this, but this kind of thing begs for parallel processing. If you get 25fps on one core, HD shouldn't be a problem on a multicore machine; even something as simple as a Parallel.For will multiply your output.
4. Additionally, there is the option of using LockBits & Marshalling instead of unsafe code. I'm not sure if that will be faster, but I guess I will write a benchmark to do some tests.

BTW: There is an error in your code, as far as I can see. I think you need to change this

*pOut = bOut | gOut << 8 | rOut << 16 | 0x00 << 24;

to this, or else the output has an alpha channel = 0

*pOut = (bOut | gOut << 8 | rOut << 16 ) | 0xff000000;

Or you may want to calculate the final alpha..
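If you do want a real output alpha, the usual source-over rule for straight (non-premultiplied) alpha is a_out = a_o + a_b * (1 - a_o). Here is a minimal sketch, assuming 8-bit alpha values; AlphaMath and CombineAlpha are hypothetical names for illustration, not part of the posted code:

```csharp
using System;

public static class AlphaMath
{
    // Source-over alpha for straight (non-premultiplied) 8-bit alpha values.
    // overlayAlpha is the alpha of the top pixel, backAlpha that of the pixel below.
    public static byte CombineAlpha(byte overlayAlpha, byte backAlpha)
    {
        // a_out = a_o + a_b * (1 - a_o), scaled to 0..255 with rounding.
        // The result can never exceed 255, so the cast is safe.
        int a = overlayAlpha + (backAlpha * (255 - overlayAlpha) + 127) / 255;
        return (byte)a;
    }
}
```

With an opaque background (alpha 255) this always yields 255, which is why hard-coding 0xff000000 is enough for that case. If the background can be translucent, the color channels need a matching adjustment as well, which is not shown here.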

Update 1: First tests show your code to be a good deal faster (~2x) than a LockBits & Marshalling version (unless I messed it up), so I'll ignore #4 from now on.

Update 2:

Preliminary numbers:

Running your code on the UI thread (!) of an i7-3770T 2.5GHz, W8.1 64

QVGA_size (320x240) 666,7 fps
NTSC_size (720x480) 161,3 fps
HR_size (1280x720) 64,1 fps
HD_size (1920x1080) 29,2 fps
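
The figures above are average frame rates. The measurement harness itself was not posted, so the following is only a sketch of how such numbers could be obtained, assuming a simple Stopwatch loop; FpsMeter and Measure are hypothetical names:

```csharp
using System;
using System.Diagnostics;

public static class FpsMeter
{
    // Calls renderFrame 'frames' times and returns the average frames per second.
    public static double Measure(Action renderFrame, int frames)
    {
        if (frames <= 0) throw new ArgumentOutOfRangeException(nameof(frames));

        renderFrame();                        // warm-up call so JIT compilation is not timed

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < frames; i++)
            renderFrame();
        sw.Stop();

        return frames / sw.Elapsed.TotalSeconds;
    }
}
```

A call could look like FpsMeter.Measure(() => OverlayImage(overlay, background, output), 100), with the three bitmaps created once outside the timed loop.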

Update 3:

Running DrawImage instead:

QVGA_size (320x240) 641,0 fps
NTSC_size (720x480) 194,2 fps
HR_size (1280x720) 77,2 fps
HD_size (1920x1080) 33,4 fps

using this code:

public void DrawImage(Bitmap overlay, Bitmap background, Bitmap output)
{
    overlay.SetResolution(96, 96);
    background.SetResolution(96, 96);
    output.SetResolution(96, 96);

    using (Graphics G = Graphics.FromImage(output) )
    {
        G.DrawImage(background, 0, 0);
        G.CompositingMode = CompositingMode.SourceOver;
        G.DrawImage(overlay, 0, 0);
    }
}

Update 4:

I have now tried a few more things and can say:

Using bytes instead of int32 makes the code cleaner imo, but doesn't change its speed, so point #1 isn't important.
If all your pixels have alpha-blending and you will always do this kind of blending, using DrawImage will be only fractionally faster.
As for #2: optimizing for alpha=0 and alpha=255 can make a huge difference, depending on the percentage of pixels with alpha-blending (i.e. pixels where 0 < alpha < 255). Unless most of your pixels are alpha-blended, this kind of optimization can almost double the framerate:

 public unsafe void OverlayImage3(Bitmap overlay, Bitmap background, Bitmap output)
 {
    Rectangle lrEntire = new Rectangle(new Point(), background.Size);

    BitmapData bdBack = background.LockBits(lrEntire, 
               ImageLockMode.ReadOnly, background.PixelFormat);
    BitmapData bdOverlay = overlay.LockBits(lrEntire, 
               ImageLockMode.ReadOnly, overlay.PixelFormat);
    BitmapData bdOut = output.LockBits(lrEntire, 
               ImageLockMode.WriteOnly, output.PixelFormat);

    byte* pBack    = (byte*)bdBack.Scan0;
    byte* pOverlay = (byte*)bdOverlay.Scan0;
    byte* pOut     = (byte*)bdOut.Scan0;

    for (int luiToProcess = (bdBack.Height * bdBack.Stride) >> 2; 
                             luiToProcess > 0; luiToProcess--)
    {
        //get each pixel component
        byte red   = *(pBack + 2); 
        byte green = *(pBack + 1); 
        byte blue  = *(pBack + 0); 

        byte oalpha = *(pOverlay + 3);
        byte ored   = *(pOverlay + 2); 
        byte ogreen = *(pOverlay + 1); 
        byte oblue  = *(pOverlay + 0);

        //get each pixel color component

        byte rOut, gOut, bOut;
        if (oalpha == 255) 
        {   rOut = ored;  gOut = ogreen;    bOut = oblue;   }
        else if (oalpha == 0)
        {   rOut = red;   gOut = green;     bOut = blue;    }
        else
        {
            rOut = (byte)((red * (255 - oalpha) + (ored * oalpha)) / 255);
            gOut = (byte)((green * (255 - oalpha) + (ogreen * oalpha)) / 255);
            bOut = (byte)((blue * (255 - oalpha) + (oblue * oalpha)) / 255);
        }

        *(pOut + 3) = 0xff;
        *(pOut + 2) = rOut;
        *(pOut + 1) = gOut;
        *(pOut + 0) = bOut;

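        //move to the next pixel
        pBack += 4;   pOverlay += 4;  pOut += 4;
    }

    // release the locks taken at the top of the method
    overlay.UnlockBits(bdOverlay);
    background.UnlockBits(bdBack);
    output.UnlockBits(bdOut);
 }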

A few more numbers:

OverlayImage3 with 5% of all pixels having alpha blending:
QVGA_size (320x240) 1.282,1 fps
NTSC_size (720x480) 320,5 fps
HR_size (1280x720) 114,3 fps
HD_size (1920x1080) 52,1 fps

OverlayImage3 with 60% of all pixels having alpha blending:
QVGA_size (320x240) 917,4 fps
NTSC_size (720x480) 256,4 fps
HR_size (1280x720) 98,5 fps
HD_size (1920x1080) 46,7 fps

OverlayImage3 with 95% of all pixels having alpha blending:
QVGA_size (320x240) 714,3 fps
NTSC_size (720x480) 220,8 fps
HR_size (1280x720) 84,2 fps
HD_size (1920x1080) 36,6 fps

DrawImage does profit from lack of alpha-blending, too:

DrawImage with 5% of all pixels having alpha blending:
QVGA_size (320x240) 584,8 fps
NTSC_size (720x480) 220,8 fps
HR_size (1280x720) 100,0 fps
HD_size (1920x1080) 41,8 fps

DrawImage with 95% of all pixels having alpha blending:
QVGA_size (320x240) 534,8 fps
NTSC_size (720x480) 200,4 fps
HR_size (1280x720) 73,3 fps
HD_size (1920x1080) 33,6 fps

Point #3, parallel processing, will obviously help additionally, depending on your hardware.

Conclusion: I don't know your current resolution, but going from SD to HD takes 5-6x longer across all tests, so if you can only just do 25fps now, you will need more than the code above. You'll need parallel processing, I'd say.
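
Since every scan line can be blended independently, one way to parallelize is to hand each row to Parallel.For. This is a sketch under stated assumptions, not a benchmarked implementation: ParallelOverlay and Blend are hypothetical names, all three buffers are assumed to be 32bpp BGRA with identical size and stride, and the IntPtr arguments stand in for the Scan0 values of the locked bitmaps.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelOverlay
{
    // Blends a 32bpp BGRA overlay onto a background, one scan line per work item.
    // back, overlay and output are assumed to be the Scan0 addresses of three
    // locked bitmaps of identical size, pixel format and stride.
    public static unsafe void Blend(IntPtr back, IntPtr overlay, IntPtr output,
                                    int width, int height, int stride)
    {
        Parallel.For(0, height, y =>
        {
            // Each row starts 'stride' bytes after the previous one.
            long offset = (long)y * stride;
            byte* pBack    = (byte*)back + offset;
            byte* pOverlay = (byte*)overlay + offset;
            byte* pOut     = (byte*)output + offset;

            for (int x = 0; x < width; x++)
            {
                byte oalpha = pOverlay[3];
                if (oalpha == 255)
                {   // fully opaque overlay pixel: copy it unchanged
                    pOut[0] = pOverlay[0]; pOut[1] = pOverlay[1]; pOut[2] = pOverlay[2];
                }
                else if (oalpha == 0)
                {   // fully transparent overlay pixel: keep the background
                    pOut[0] = pBack[0]; pOut[1] = pBack[1]; pOut[2] = pBack[2];
                }
                else
                {   // partial alpha: same blend formula as OverlayImage3
                    int inv = 255 - oalpha;
                    pOut[0] = (byte)((pBack[0] * inv + pOverlay[0] * oalpha) / 255);
                    pOut[1] = (byte)((pBack[1] * inv + pOverlay[1] * oalpha) / 255);
                    pOut[2] = (byte)((pBack[2] * inv + pOverlay[2] * oalpha) / 255);
                }
                pOut[3] = 0xff;   // output is always opaque

                // move to the next pixel
                pBack += 4; pOverlay += 4; pOut += 4;
            }
        });
    }
}
```

It would be called between LockBits and UnlockBits, for example ParallelOverlay.Blend(bdBack.Scan0, bdOverlay.Scan0, bdOut.Scan0, bdBack.Width, bdBack.Height, bdBack.Stride). No locking is needed because no two rows write to the same memory. How much it gains depends on the core count and on memory bandwidth.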

brilliant, appreciate your thorough analysis and testing. I am running multiple videos at the same time, so speed is really important. Graphics.DrawImage() unfortunately just doesn't cut it here, and I get strange behavior sometimes using it. — Simon, Oct 26 '14 at 01:27
