
I am trying to recreate very simple GDI+ functions, such as scaling and rotating an image. The reason is that some GDI functions can't be run on multiple threads (I found a workaround using separate processes but didn't want to get into that), and processing thousands of images on one thread wasn't nearly cutting it. Also, my images are grayscale, so a custom function would only have to worry about one value per pixel instead of 4.

No matter what kind of function I try to recreate, even when highly optimized, it is always SEVERAL times slower than GDI, despite being greatly simplified compared to what GDI is doing (I am operating on a 1D array of bytes, one byte per pixel).

I thought maybe the way I was rotating each point could be the difference, so I took it out completely, and basically had a function that goes through each pixel and just sets it to what it already is, and that was only roughly tied with the speed of GDI, even though GDI was doing an actual rotation and changing 4 different values per pixel.

What makes this possible? Is there a way to match it using your own function?

Frobot
  • How are you going through each pixel? – James Dec 14 '15 at 11:23
  • [Graphics Device Interface](https://en.wikipedia.org/wiki/Graphics_Device_Interface) is supposed to be fast. It's written in native C/C++ and it may even use hardware functions of the graphics adapter to draw e.g. a line. That would be way faster than your per-pixel iteration in C#. You can try to achieve nearly the same performance if you learn how to use those functions as well (e.g. managed DirectX). – Sinatr Dec 14 '15 at 11:23
  • It is just a for loop that runs through an array of bytes. Each byte in the array represents the intensity at a pixel. I thought that GDI doesn't touch the GPU at all. If it does, then that would definitely explain it, but I have read that it doesn't. – Frobot Dec 14 '15 at 11:26
  • Great question Frobot. Hope someone who knows answers. – Chui Tey Dec 14 '15 at 11:29
  • Bitmap manipulations have fundamental O(n^2) complexity. That goes up pretty fast, a modest n=1000 is already a million operations. Only brute force can help to keep the Oh small enough. GDI+ certainly cuts corners to get there, it is for example not pixel-perfect. And is likely to use hand-tuned SIMD code. Only Microsoft knows, source is not public. – Hans Passant Dec 14 '15 at 11:46
  • @Frobot: I doubt that writing your own "GDI+" in C# is the way to go. Since it is unclear what answer you want to get, I won't close the question as a dupe, but here's a link that can help you with parallel processing: http://stackoverflow.com/questions/3719748/parallelizing-gdi-image-resizing-net – Dennis Dec 14 '15 at 11:49
  • Unclear is one of the options to close. – Matthew Whited Dec 14 '15 at 11:51
  • I think I made it quite clear what I am asking in the last line - "what makes GDI so fast, and is there a way to match its speed writing your own function". I also stated that I know about the workaround using processes to multithread GDI functions, but in theory multithreading my own function should have been faster, and since it wasn't, I'm asking this question. – Frobot Dec 14 '15 at 20:28
  • I think you should show us some code. For example, you're saying that you're using an array. Arrays are slow if the bounds checking cannot be optimized away. You'll get way faster code with pointers. – Dan Byström Dec 14 '15 at 21:56
  • This is one thing I was thinking about. I think one of the biggest differences is that I can't really use pointers with C#, or at least not in the way GDI can. And having to make a check at every pixel to make sure it is in bounds might add up to be one piece of the puzzle. I have made many changes to my code and it no longer reflects what I posted. It shouldn't be hard to revert it so I can show exactly how I am doing it. I'll post the code shortly. – Frobot Dec 14 '15 at 22:47
  • `I thought that GDI doesn't touch the GPU at all` - [not true](https://msdn.microsoft.com/en-us/library/windows/hardware/ff566559(v=vs.85).aspx). – 500 - Internal Server Error Dec 15 '15 at 00:46
  • @Frobot: you can use pointers pretty well in C#. Here's a blog post I wrote once when I made some changes to code like yours run 25 times faster: http://danbystrom.se/2008/12/14/improving-performance/ – Dan Byström Dec 15 '15 at 07:34
  • I posted my code using pointers and some other optimizations I found. If anyone knows any ways to further improve, please share. Also, the GDI function for rotating an image done in a loop has my GPU at 0% usage and maxes out my CPU, so I think blaming it on the GPU might be out of the question. – Frobot Dec 20 '15 at 00:28
  • This question is a little bit akin to saying, "I know Ferrari has been producing sports cars for decades and has spent many millions of dollars on R&D, but I've made a better and simpler one in my garage and it's nowhere near as fast. What could I have done wrong?" – Enigmativity Dec 20 '15 at 01:06
  • @Enigmativity thank you for your help (sarcasm). I'm just asking what methods they used to get their car to go so fast. And thanks to some people giving ideas I have gotten my garage project car nearly 10 times faster. What's wrong with that? – Frobot Dec 20 '15 at 01:10
  • @Frobot - You're absolutely right. I was being a tad sarcastic. I do think that you need to parallelize your computations to have any hope that they'll run at GDI+ speeds. Whether that is something like using DirectX or some sort of third-party GPU library I can't say, but that would be the direction I look. – Enigmativity Dec 20 '15 at 01:22

2 Answers

3

The GDI+ code is written in C/C++, or possibly even partially in assembly. Some GDI+ calls may use GDI, an old and well-optimized API. You will find it difficult to match the performance, even if you know all the pixel manipulation tricks.

Frank Hileman
  • Yup. Basically it "cheats" by calling out to less managed, more efficient APIs and OS ops under the hood. `List.Sort()` does the same thing. ;) – Haney Dec 14 '15 at 23:03
  • Yes I think it boils down to being a highly optimized API built by professionals using some tricks, along with not being managed. I will still be able to beat its speed using a custom function across multiple threads, just not by as much as expected. Thanks for everyone's input – Frobot Dec 14 '15 at 23:29
  • @Haney: Out of curiosity, what does `List.Sort()` call out to? - I am not aware that the native Windows API provides a sorting function. – 500 - Internal Server Error Dec 15 '15 at 00:10
  • @500-InternalServerError jogging my memory... It's been a few years since I dug into it, but if I recall correctly it calls a C lib that does the QuickSort w/ pointers – Haney Dec 15 '15 at 04:39
  • I added an answer with my code. If anyone knows a way to further improve, please share – Frobot Dec 20 '15 at 00:26
  • It should also be noted that most of the GDI+ calls run on "Ring 0" on the CPU, along with the OS, so there are fewer security context jumps. We can't write code at this level - I think we're on "Ring 3". My memory might be a bit fuzzy on all that, but I think I'm close. – Enigmativity Dec 20 '15 at 01:05
2

I am adding my own answer along with my code to help anyone else who may be looking to do this.

From a combination of pointers and using an approximation of Sine and Cosine instead of calling an outside function for the rotation, I have come pretty darn close to reaching GDI speeds. No outside functions are called at all.
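To make the trade-off of the sine/cosine approximation concrete, here is a minimal sketch that isolates the same parabola (the constants 1.27323954 and 0.405284735 are 4/π and 4/π²) and measures how far it drifts from `Math.Sin`. The class and method names (`ParabolicTrig`, `MaxSinError`) are hypothetical and exist only for this illustration. The parabola is exact at 0, ±π/2 and ±π, is only valid for inputs between -π and π, and in between its absolute error is on the order of 0.05, so the rotation it produces is slightly distorted compared to one built from exact values.

```csharp
using System;

public static class ParabolicTrig
{
    // Parabolic approximation of sin(x); only valid for x in [-pi, pi].
    public static double Sin(double x)
    {
        const double B = 4.0 / Math.PI;              // ~1.27323954
        const double C = 4.0 / (Math.PI * Math.PI);  // ~0.405284735
        return x < 0 ? B * x + C * x * x
                     : B * x - C * x * x;
    }

    // cos(x) = sin(x + pi/2), wrapped back into [-pi, pi].
    public static double Cos(double x)
    {
        x += Math.PI / 2;
        if (x > Math.PI)
            x -= 2 * Math.PI;
        return Sin(x);
    }

    // Largest absolute difference from Math.Sin over evenly spaced samples in [-pi, pi].
    public static double MaxSinError(int samples)
    {
        double worst = 0;
        for (int i = 0; i <= samples; i++)
        {
            double x = -Math.PI + 2 * Math.PI * i / samples;
            worst = Math.Max(worst, Math.Abs(Sin(x) - Math.Sin(x)));
        }
        return worst;
    }
}
```

Since sine and cosine are computed only once per image, calling `Math.Sin` and `Math.Cos` instead would probably cost very little and remove this error; that is worth measuring before settling on the approximation.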

It still takes about 50% more time than GDI, but my earlier implementation took over 10 times longer than GDI. And when you consider multithreading, this method can be 10 times faster than GDI. This function can rotate a 300x400 picture in 3 milliseconds on my machine.
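The multithreading gain comes from the fact that a function working on plain byte arrays shares no state between images, so a batch can be spread across cores. A minimal sketch of that idea, assuming one independent `byte[]` per image: `BatchRotate`, `ProcessAll` and `Rotate180` are hypothetical names, and `Rotate180` is only a trivial stand-in for the real per-image work.

```csharp
using System;
using System.Threading.Tasks;

public static class BatchRotate
{
    // Stand-in transform for illustration: rotating by 180 degrees
    // is the same as reversing the pixel array.
    public static byte[] Rotate180(byte[] input)
    {
        var result = new byte[input.Length];
        for (int i = 0; i < input.Length; i++)
            result[input.Length - 1 - i] = input[i];
        return result;
    }

    // Applies the transform to every image in parallel. Each iteration
    // reads one input and writes one distinct slot of the results array,
    // so no locking is needed.
    public static byte[][] ProcessAll(byte[][] images, Func<byte[], byte[]> transform)
    {
        var results = new byte[images.Length][];
        Parallel.For(0, images.Length, i =>
        {
            results[i] = transform(images[i]);
        });
        return results;
    }
}
```

With the rotation function from this answer made accessible, the call might look like `BatchRotate.ProcessAll(images, img => rotate(img, 300, 400, 150, 200, angle))`. The actual speedup depends on core count and memory bandwidth, so the factor of 10 should be read as a measurement on one machine.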

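Keep in mind that this is for grayscale images and each byte in the input array represents one pixel. The angle is in radians and must lie between -π and π, because the sine approximation is only valid on that range. If you have any ideas to make it faster, please share!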

private unsafe byte[] rotate(byte[] input, int inputWidth, int inputHeight, int cx, int cy, double angle)
    {
        byte[] result = new byte[input.Length];

        int
            tx, ty, ix, iy, x1, y1;
        double
            px, py, fx, fy, sin, cos, v;
        byte a, b;

        //Approximate sine and cosine of the angle with a parabola (radians, -pi <= angle <= pi)
        if (angle < 0)
            sin = 1.27323954 * angle + 0.405284735 * angle * angle;
        else
            sin = 1.27323954 * angle - 0.405284735 * angle * angle;
        angle += 1.57079632;
        if (angle > 3.14159265)
            angle -= 6.28318531;
        if (angle < 0)
            cos = 1.27323954 * angle + 0.405284735 * angle * angle;
        else
            cos = 1.27323954 * angle - 0.405284735 * angle * angle;
        angle -= 1.57079632;


        fixed (byte* pInput = input, pResult = result)
        {
            byte* pr = pResult;

            for (int x = 0; x < inputWidth; x++)
                for (int y = 0; y < inputHeight; y++)
                {
                    tx = x - cx;
                    ty = y - cy;
                    px = tx * cos - ty * sin + cx;
                    py = tx * sin + ty * cos + cy;
                    ix = (int)px;
                    iy = (int)py;
                    fx = px - ix;
                    fy = py - iy;

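                    //(int) truncates toward zero, so test px/py directly to reject sources just left of or above the image
                    if (px >= 0 && py >= 0 && ix < inputWidth && iy < inputHeight)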
                    {
                        //keep in array bounds
                        x1 = ix + 1;
                        y1 = iy + 1;
                        if (x1 >= inputWidth)
                            x1 = ix;
                        if (y1 >= inputHeight)
                            y1 = iy;

                        //bilinear interpolation using pointers
                        a = *(pInput + (iy * inputWidth + ix));
                        b = *(pInput + (y1 * inputWidth + ix));
                        v = a + ((*(pInput + (iy * inputWidth + x1)) - a) * fx);
                        pr = (pResult + (y * inputWidth + x));
                        *pr = (byte)(v + (((b + ((*(pInput + (y1 * inputWidth + x1)) - b) * fx)) - v) * fy));
                    }
                }
        }

        return result;
    }
Frobot
  • I see a couple of very minor tweaks you could do, like moving `tx = x - cx;` and the two related terms `tx * cos` and `tx * sin` out of the inner loop nesting, but the latter would require more temps, so you would have to test if something like that is worthwhile. I also wonder if switching to non-short-circuit Boolean evaluation in your `if` may give a slight boost. But overall I think you are close to what can be achieved at this level (unless there's a completely different approach that I'm unaware of). – 500 - Internal Server Error Dec 20 '15 at 01:18
  • One other thing you could try, use a [transformation matrix](https://en.wikipedia.org/wiki/Transformation_matrix) to apply the rotation to the image. You can also use the NuGet package [System.Numerics.Vectors](https://www.nuget.org/packages/System.Numerics.Vectors) to get hardware accelerated versions of some of the Matrix methods to make it even faster. – Scott Chamberlain Dec 20 '15 at 03:30
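
The hoisting suggested in the first comment above can be sketched as follows. This is an illustration rather than a measured improvement: `HoistedMapping`, `Direct` and `MapAll` are hypothetical names, and the sketch only computes the source coordinates, leaving out the interpolation. With x as the outer loop, as in the answer, the terms `tx * cos + cx` and `tx * sin + cy` depend only on x, so they can be computed once per column.

```csharp
using System;

public static class HoistedMapping
{
    // Direct form, as written in the answer's inner loop.
    public static void Direct(int x, int y, int cx, int cy,
                              double sin, double cos,
                              out double px, out double py)
    {
        int tx = x - cx, ty = y - cy;
        px = tx * cos - ty * sin + cx;
        py = tx * sin + ty * cos + cy;
    }

    // Same mapping with the x-only terms hoisted out of the inner loop.
    // Returns interleaved (px, py) pairs indexed by (y * width + x) * 2.
    public static double[] MapAll(int width, int height, int cx, int cy,
                                  double sin, double cos)
    {
        var coords = new double[width * height * 2];
        for (int x = 0; x < width; x++)
        {
            int tx = x - cx;
            double colX = tx * cos + cx; // part of px that depends only on x
            double colY = tx * sin + cy; // part of py that depends only on x
            for (int y = 0; y < height; y++)
            {
                int ty = y - cy;
                int o = (y * width + x) * 2;
                coords[o] = colX - ty * sin;
                coords[o + 1] = colY + ty * cos;
            }
        }
        return coords;
    }
}
```

The same idea works with y as the outer loop and the `ty` terms hoisted instead, which also walks the output array in memory order; whether either variant is faster in practice would need to be timed.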