7

Win32 bitmaps are (a lot) faster to draw than calling SetPixelV or another similar per-pixel function. How does this work, if in the end the computer still has to draw every pixel of the bitmap?

  • 2
    `SetPixelV` imposes a lot of function call overhead to draw a single pixel. – Jerry Coffin Jan 28 '16 at 22:43
  • 1
    [BitBlt](https://msdn.microsoft.com/en-us/library/dd183370.aspx) is usually hardware accelerated, where the CPU merely has to issue a few commands to the hardware, that performs the block transfer of memory. – IInspectable Jan 28 '16 at 22:45
  • 1
    Generally fast graphics updating happens from copying large blocks of pixels at once; frequently by using hardware specifically designed to do that. – mah Jan 28 '16 at 22:45

2 Answers

5

Suppose you have a pixel. This pixel has color components A, B and C. The surface you are drawing to has color components X, Y and Z.

So first you need to check if the two formats match. If they don't, costs go up (every pixel needs a conversion). Assume they match.

Next, you need to do bounds checking -- did the caller give you something stupid? Some comparisons, additions and multiplications.

Next, you need to find where the pixel is in memory. This takes a few more multiplications and additions.

Now, you have to access the source data and the destination data and write it.


If you are working a scanline at a time, almost all of that overhead above can be done once. You can calculate what part of the scanline falls in bounds or not with only a bit more overhead than doing one pixel. You can find where the scanline writes in the destination with again only a bit more overhead than one pixel. You can check color space conversions with the same overhead as one pixel.

The big difference is that instead of copying one pixel, you copy in a block.

As it happens, computers are really good at copying blocks of things. There are built-in block-copy instructions on some CPUs, and some memory systems can do it without the CPU being involved (the CPU says "copy X to Y", then is free to do other things; and memory-to-memory bandwidth might be higher than memory-to-CPU-to-memory bandwidth). Even if you are round-tripping through the CPU, there are SIMD instructions that let you work on 2, 4, 8, 16 or even more units of data at the same time, so long as you work on them in the same way using a limited instruction set.

In some cases, you can even offload work to the GPU -- if both source and destination scanline are on the GPU, you can say "yo GPU, you handle it", and the GPU is even more specialized for doing that kind of task.

The very first bit of optimization -- only having to do checks once per scanline instead of once per pixel -- can easily give you a 2x to ~10x speedup. The second -- more efficient blitting -- another 4x to ~20x faster. Doing everything on the GPU can be ~2x to 100x faster.

The final thing is the overhead of actually calling the function. Usually this is minor; but when calling SetPixel 1 million times (a 1000 x 1000 image, or a modest-sized screen) it adds up.

For an HD display with 2 million pixels, 60 times per second is 120 million pixels manipulated per second. A single threaded program on a 3 GHz machine only has room to run ~25 instructions per pixel if you want to keep up with the screen, and that assumes nothing else happens (which is unlikely). On a 4k monitor you are down to 6 instructions per pixel.

With that many pixels being played with, shaving off every instruction you can makes a big difference.


Multipliers pulled out of nowhere. I have, however, written conversions of per-pixel operations to per-scanline operations that showed impressive speedups, ditto for moving work from the CPU to the GPU, and I have seen SIMD give impressive speedups as well.

Yakk - Adam Nevraumont
  • 262,606
  • 27
  • 330
  • 524
2

Repeated calls to a function like SetPixelV are slow because it must translate a co-ordinate into a memory offset each time, and is also potentially doing some colour translation on the fly.

A simple "set pixel" function might look like this (without bounds-tests, colour translation or anything fancy):

size_t offset = y * bytes_per_scanline + x * bytes_per_pixel;
for(size_t i = offset; i < offset + bytes_per_pixel; i++) 
    target[i] = source[i];

Bitmaps, on the other hand, are generally drawn via a process known as blitting. This is essentially a direct copy from one memory location to another. To achieve this in Windows, you create a device context for your bitmap that is compatible with the target context. That ensures the memory can be copied without translation. It may also provide for hardware-accelerated copies which are even faster.

A simple "copy" blit might look like this:

size_t nbytes = bytes_per_scanline * height;
for(size_t i = 0; i < nbytes; i++)
    target[i] = source[i];

This involves no co-ordinate lookups, and will be very efficient in terms of memory cache accesses. There are much faster ways to copy chunks of memory, and the above example is simply to illustrate.

lost_in_the_source
  • 10,998
  • 9
  • 46
  • 75
paddy
  • 60,864
  • 6
  • 61
  • 103
  • 1
    sorry if it is a bit late and if this is off topic, but what do you mean by "There are much faster ways to copy chunks of memory, and the above example is simply to illustrate."? Can you give me one of them? I will be moving to other platforms soon and therefore might need to implement them myself. –  Feb 21 '16 at 19:06
  • That copy loop example is copying a single byte at a time. If the compiler cannot determine any guarantees on size at compile time, or decide on alternative code-paths at runtime, then it will be a crappy loop. A better optimisation is to copy in chunks. Then there is less loop counting and more actual copying. Vector technologies like MMX, SSE, and AVX provide larger register sizes (64-bit, 128-bit, 256-bit, 512-bit) and make for faster copying. Combined with loop unrolling, you can get higher performance. All this is down to stuff you actually know (or can guarantee) about your data. – paddy Feb 21 '16 at 22:48