Unsafe Pointer iteration and Bitmap - why is UInt64 faster?

Question

I have been doing some unsafe bitmap operations and have found out that increasing the pointer less times can lead to some big performance improvements. I am not sure why is that so, even though you do lot more bitwise operations in the loop, it still is better to do less iterations on the pointer.

So for example instead of iterating over 32 bit pixels with a UInt32 iterate over two pixels with UInt64 and do twice the operations in one cycle.

The following does it by reading two pixels and modifying them (of course it will fail with images with odd width, but its just for testing).

    private void removeBlueWithTwoPixelIteration()
    {
        // think of a big image with data
        Bitmap bmp = new Bitmap(15000, 15000, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
        TimeSpan startTime, endTime;

        unsafe {

            UInt64 doublePixel;
            UInt32 pixel1;
            UInt32 pixel2;

            const int readSize = sizeof(UInt64);
            const UInt64 rightHalf = UInt32.MaxValue;

            PerformanceCounter pf = new PerformanceCounter("System", "System Up Time"); pf.NextValue();

            BitmapData bd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), System.Drawing.Imaging.ImageLockMode.ReadWrite, bmp.PixelFormat);
            byte* image = (byte*)bd.Scan0.ToPointer();

            startTime = TimeSpan.FromSeconds(pf.NextValue());

            for (byte* line = image; line < image + bd.Stride * bd.Height; line += bd.Stride)
            {
                for (var pointer = line; pointer < line + bd.Stride; pointer += readSize)
                {
                    doublePixel = *((UInt64*)pointer);
                    pixel1 = (UInt32)(doublePixel >> (readSize * 8 / 2)) >> 8; // loose last 8 bits (Blue color)
                    pixel2 = (UInt32)(doublePixel & rightHalf) >> 8; // loose last 8 bits (Blue color)
                    *((UInt32*)pointer) = pixel1 << 8; // putback but shift so A R G get back to original positions
                    *((UInt32*)pointer + 1) = pixel2 << 8; // putback but shift so A R G get back to original positions
                }
            }

            endTime = TimeSpan.FromSeconds(pf.NextValue());

            bmp.UnlockBits(bd);
            bmp.Dispose();

        }

        MessageBox.Show((endTime - startTime).TotalMilliseconds.ToString());

    }

The following code does it pixel by pixel and is around 70% slower than the previous:

    private void removeBlueWithSinglePixelIteration()
    {
        // think of a big image with data
        Bitmap bmp = new Bitmap(15000, 15000, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
        TimeSpan startTime, endTime;

        unsafe
        {

            UInt32 singlePixel;

            const int readSize = sizeof(UInt32);

            PerformanceCounter pf = new PerformanceCounter("System", "System Up Time"); pf.NextValue();

            BitmapData bd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), System.Drawing.Imaging.ImageLockMode.ReadWrite, bmp.PixelFormat);
            byte* image = (byte*)bd.Scan0.ToPointer();

            startTime = TimeSpan.FromSeconds(pf.NextValue());

            for (byte* line = image; line < image + bd.Stride * bd.Height; line += bd.Stride)
            {
                for (var pointer = line; pointer < line + bd.Stride; pointer += readSize)
                {
                    singlePixel = *((UInt32*)pointer) >> 8; // loose B
                    *((UInt32*)pointer) = singlePixel << 8; // adjust A R G back
                }
            }

            endTime = TimeSpan.FromSeconds(pf.NextValue());

            bmp.UnlockBits(bd);
            bmp.Dispose();

        }

        MessageBox.Show((endTime - startTime).TotalMilliseconds.ToString());
    }

Could someone clarify why is incrementing the pointer a more costly operation than doing a few bitwise operations?

I am using .NET 4 framework.

Could something like this be true for C++?

NB. 32 bit vs 64 bit the ratio of the two methods is equal, however both ways are like 20% slower on 64 vs 32 bit?

EDIT: As suggested by Porges and arul this could be because of decreased number of memory reads and branching overhead.

EDIT2:

After some testing it seems that reading from memory less time is the answer:

With this code assuming the image width is divisible by 5 you get 400% faster:

[StructLayout(LayoutKind.Sequential,Pack = 1)]
struct PixelContainer {
    public UInt32 pixel1;
    public UInt32 pixel2;
    public UInt32 pixel3;
    public UInt32 pixel4;
    public UInt32 pixel5;
}

Then use this:

            int readSize = sizeof(PixelContainer);

            // .....

            for (var pointer = line; pointer < line + bd.Stride; pointer += readSize)
            {
                multiPixel = *((PixelContainer*)pointer);
                multiPixel.pixel1 &= 0xFFFFFF00u;
                multiPixel.pixel2 &= 0xFFFFFF00u;
                multiPixel.pixel3 &= 0xFFFFFF00u;
                multiPixel.pixel4 &= 0xFFFFFF00u;
                multiPixel.pixel5 &= 0xFFFFFF00u;
                *((PixelContainer*)pointer) = multiPixel;
            }

This is on a 64 bit Windows 7, but I tried with 32 bit binary and 64 bit, the performance increase is the same with the first method... Actually doing it with a 64 bit process is slower in both cases. — Marino Šimić, Apr 28 '11 at 02:53

arul · Accepted Answer · 2011-04-28T03:18:38.953

5

This is a technique known as loop unrolling. The main performance benefit should come from reducing the branching overhead.

As a side note, you could speed it up a bit by using a bitmask:

*((UInt64 *)pointer) &= 0xFFFFFF00FFFFFF00ul;

edited Apr 28 '11 at 03:18

answered Apr 28 '11 at 03:11

arul

13,998
1
57
77

thanks for the link and the suggestion of bitmasking, it would probably be * ((UInt64 * )pointer) &= 0xFFFFFF00FFFFFF00ul; because the pointer is a byte* – Marino Šimić Apr 28 '11 at 03:16
I'll accept your answer because loop unrolling is the key here. I tested with a struct containing 5 pixels and it is much faster + your technique for bitmasking. – Marino Šimić Apr 28 '11 at 03:38

score 2 · Answer 2 · answered Apr 28 '11 at 03:02

2

It's not the incrementing the pointer that is slower, but reading from memory. With 32-bit units, you're doing twice as many reads.

You should find it faster again if you write once instead of twice in the 64-bit version.

answered Apr 28 '11 at 03:02

porges

30,133
4
83
114

You mean the pointer dereferencing operation is slow? So I should do as little as possible of them? – Marino Šimić Apr 28 '11 at 03:07
1

Reading from a pointer isn't really slow, it's just the slowest part of that loop, so doing less reads makes it faster :) – porges Apr 28 '11 at 04:04

Unsafe Pointer iteration and Bitmap - why is UInt64 faster?

2 Answers2

Linked