Is there a chance to make the bilinear interpolation faster?

Question

First I want to provide you with some context.

I have two kind of images I need to merge. The first image is the background image with the format 8BppGrey and a resolution of 320x240. The second image is the forground image with the format 32BppRGBA and a resolution of 64x48.

Update The github repo with an MVP is at the bottom of the question.

To do it I resize the second image with bilinear interpolation to the same size as the first one and then use blending to merge both to one image. Blending only happens when the alpha value of the second image is greater then 0.

I need to do it as fast as possible so my idea was to combine the resize and merge / blend process.

To achieve this I used the resize function from the writeablebitmapex repository and added merging / blending.

Everything works as expected but I want to decrease the execution time.

This are the current debug timings:

// CPU: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz

MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 5 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 6 ms
MediaServer: Resizing took 6 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.

Do I have any chance to increase the performance and lower the execution time of the resize / merge / blend process?

Are there some parts I maybe can parallelize?

Do I maybe have a chance to use some processor features?

A huge performance hit is the nested loop but I have no idea how I could write it better.

I would like to reach 1 or 2 ms for the whole process. Is this even possible?

Here's the modified visual c++ function I use.

pd is the backbuffer of the writeable bitmap I use to display the result in wpf. The format I use is the default 32BppRGBA.
pixels is the int[] array of the 64x48 32BppRGBA image
widthSource and heightSource is the size of the pixels image
width and height is the target size of the output image
baseImage is the int[] array of the 320x240 8BppGray image

VC++ code:

unsigned int Resize(int* pd, int* pixels, int widthSource, int heightSource, int width, int height, byte* baseImage)
{
    unsigned int start = clock();

    float xs = (float)widthSource / width;
    float ys = (float)heightSource / height;

    float fracx, fracy, ifracx, ifracy, sx, sy, l0, l1, rf, gf, bf;
    int c, x0, x1, y0, y1;
    byte c1a, c1r, c1g, c1b, c2a, c2r, c2g, c2b, c3a, c3r, c3g, c3b, c4a, c4r, c4g, c4b;
    byte a, r, g, b;

    // Bilinear
    int srcIdx = 0;

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            sx = x * xs;
            sy = y * ys;
            x0 = (int)sx;
            y0 = (int)sy;

            // Calculate coordinates of the 4 interpolation points
            fracx = sx - x0;
            fracy = sy - y0;
            ifracx = 1.0f - fracx;
            ifracy = 1.0f - fracy;
            x1 = x0 + 1;
            if (x1 >= widthSource)
            {
                x1 = x0;
            }
            y1 = y0 + 1;
            if (y1 >= heightSource)
            {
                y1 = y0;
            }

            // Read source color
            c = pixels[y0 * widthSource + x0];
            c1a = (byte)(c >> 24);
            c1r = (byte)(c >> 16);
            c1g = (byte)(c >> 8);
            c1b = (byte)(c);

            c = pixels[y0 * widthSource + x1];
            c2a = (byte)(c >> 24);
            c2r = (byte)(c >> 16);
            c2g = (byte)(c >> 8);
            c2b = (byte)(c);

            c = pixels[y1 * widthSource + x0];
            c3a = (byte)(c >> 24);
            c3r = (byte)(c >> 16);
            c3g = (byte)(c >> 8);
            c3b = (byte)(c);

            c = pixels[y1 * widthSource + x1];
            c4a = (byte)(c >> 24);
            c4r = (byte)(c >> 16);
            c4g = (byte)(c >> 8);
            c4b = (byte)(c);

            // Calculate colors
            // Alpha
            l0 = ifracx * c1a + fracx * c2a;
            l1 = ifracx * c3a + fracx * c4a;
            a = (byte)(ifracy * l0 + fracy * l1);

            // Write destination
            if (a > 0)
            {
                // Red
                l0 = ifracx * c1r + fracx * c2r;
                l1 = ifracx * c3r + fracx * c4r;
                rf = ifracy * l0 + fracy * l1;

                // Green
                l0 = ifracx * c1g + fracx * c2g;
                l1 = ifracx * c3g + fracx * c4g;
                gf = ifracy * l0 + fracy * l1;

                // Blue
                l0 = ifracx * c1b + fracx * c2b;
                l1 = ifracx * c3b + fracx * c4b;
                bf = ifracy * l0 + fracy * l1;

                // Cast to byte
                float alpha = a / 255.0f;
                r = (byte)((rf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                g = (byte)((gf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                b = (byte)((bf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));

                pd[srcIdx++] = (255 << 24) | (r << 16) | (g << 8) | b;
            }
            else
            {
                // Alpha, Red, Green, Blue                          
                pd[srcIdx++] = (255 << 24) | (baseImage[srcIdx] << 16) | (baseImage[srcIdx] << 8) | baseImage[srcIdx];
            }
        }
    }

    unsigned int end = clock() - start;
    return end;
}

Github repo

C or C++? You're writing C++, but on a first glance, the code pretty much looks like C. Decide for one language. — , Nov 21 '17 at 15:07
To be honest. I am pretty new to c/c++. Where did I mix things up? Is there somewhere a performance hit with my function? — datoml, Nov 21 '17 at 15:11
Your CPU supports SSE4 and AVX2 - if you're prepared to learn a little about SIMD programming then you should be able to vectorize this code and get a reasonable performance improvement. (Two things to check first though: (a) are you compiling with optimisation enabled ? (b) is your compiler auto-vectorizing any of the above code already ?) — Paul R, Nov 21 '17 at 15:13
You mix up things by tagging **both** C and C++. They are different languages. Microsoft of course adds to that confusion by only providing "Visual C++" which happens to have a (poor and somewhat hidden) C compiler as well, so people often end up writing "C code in C++" .... — , Nov 21 '17 at 15:15
One suggestion is calculate 255 << 24 outside of the loops and store in a variable — Grantly, Nov 21 '17 at 15:16
@Grantly: Surely the compiler will fold such constant expressions. — M Oehm, Nov 21 '17 at 15:17
@Grantly it's a constant expression. It ends up as a single integer constant in the compiled code. — , Nov 21 '17 at 15:20
@PaulR The optimization is disabled in my debug configuration. Where can I check if auto-vectorizing is enabled? — datoml, Nov 21 '17 at 15:21
@datoml: in that case you probably don't need to worry about vectorization (manual or automatic) - just try running with the release build with optimisation enabled - you should see a significant speed improvement. — Paul R, Nov 21 '17 at 15:22
Multithreading can speed up image processing greatly as well. But you have to consider things like cache locality and avoiding race conditions. It's a bit architecture dependent but worth doing. As a separate matter you should consider converting RGB values, which, for 8 bit values, implies a gamma >1, to linear space first, for best resampling. — doug, Nov 21 '17 at 16:32
Marom's answer is good, but in your case, you could do away with the interpolation altoogether. Your compression ratio is 1:25 -- one pixel in the small image represents 25 pixels of the big image. Either consider all 25 pixels and average them or just pick one that is representative. You try to pick one and interpolate its neighbours, but that will hardly make a difference, especially since `fracx` will always be 0.0 in your case. You can achieve wither with just integer arithmetic, thus avoiding costly floating-point operations and type conversions. — M Oehm, Nov 21 '17 at 17:21
@MOehm Thanks for this tip. Would you be so kind to provide an example for it? — datoml, Nov 22 '17 at 07:27
@PaulR Hey :). I used the release version and the execution time increased to 10ms :(. — datoml, Nov 22 '17 at 08:39
@datoml: I can do that, but I've seen a mistake in my reasoning: I first thought that you wanted to scale down the big image, but I now see that it's the other way round. (In that case, you could probably save some time by not extracting the source pixels in every loop, because they will often be the same. Have to look into it, but have no time right now.) — M Oehm, Nov 22 '17 at 08:39
@datoml: Was it the same version of the code ? And the same test conditions ? Is optimisation enabled in the release build ? — Paul R, Nov 22 '17 at 08:41
@PaulR yes. the same code in debug is executed in 4ms , in release it takes 10ms. — datoml, Nov 22 '17 at 08:42
@datoml: hard to comment further without seeing more detail as to how you're building, running and timing this. Maybe you could put together a [mcve] ? — Paul R, Nov 22 '17 at 08:43
@PaulR Sure. Was thinking about that too. I created a github repo with an example and provide you some link to it. — datoml, Nov 22 '17 at 08:44
@PaulR Here's the github repo. https://github.com/datoml/bilinear-interpolation-wpf-cpp — datoml, Nov 22 '17 at 10:08
@datoml: too many Windows dependencies for me. Ideally you need a test harness just for the function you want to optimise, and ideally this should be portable (since there is nothing inherently Windows-specific in the function). — Paul R, Nov 22 '17 at 10:13
@PaulR Oh snap... I hate microsoft. I came a cross a post that says that the compiler of the managed c++ is totally bad with optimiziation etc. So I transfered this function to my native c++ library and used the managed part only as a wrapper for my c# code. Now the whole process is done in 1ms :O. AWESOME. — datoml, Nov 22 '17 at 10:53
A bit off-topic, but the `srcIdx++` is undefined behavior. You need to increment after the assignment. — Pedro, Dec 04 '17 at 13:12

score 3 · Answer 1 · edited Dec 04 '17 at 10:59

3

One action that may speed up your code is to avoid type conversions from integer to float and vice versa. This can be achieved by having an int value in the suitable range instead of floats on range 0..1

Something like this:

for (int y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        int sx1 = x * widthSource ;
        int x0 = sx1 / width;
        int fracx = (sx1 % width) ; // range 0..width - 1

which turns into something like

        l0 = (fracx * c2a + (width - fracx) * c1a) / width ;

And so on. A bit tricky but doable

edited Dec 04 '17 at 10:59

Pedro

842
6
16

answered Nov 21 '17 at 15:41

marom

5,064
10
14

Thank you for this tip. I will try to refactor my code to this to avoid this kind of type conversions. I will report back. – datoml Nov 22 '17 at 07:28
Is this intended that this line `l0 = (fracx * c1a + (width - fracx) * c2a) / width ;` results in the warning `warning C4244: '=': conversion from 'int' to 'float', possible loss of data`? – datoml Nov 22 '17 at 07:35
I tried to replace my code parts with yours but the output image is broken then. The colors of the scaled image are wrong and kind of deformed. – datoml Nov 22 '17 at 08:43
@datoml: the intention is that `l0` is of type `int` as well. – Pedro Dec 01 '17 at 12:09
@marom: Good idea, but that's not how one usually applies integer arithmetics optimization, as you always want to get rid of the integer division, which is a really slow instruction. Therefore it is not surprising that your version is even slower than the version using floating point arithmetics. There is also rounding missing. – Pedro Dec 01 '17 at 12:17
@datoml: The reason that you experienced the result to be wrong is that compared to your code, `c1a` and `c2a` were switched in the code presented by @marom. Also `sx` should be `x0` to match your code. I edited this answer, and I hope it will get accepted. – Pedro Dec 01 '17 at 12:47
@PedramAzad thank you! I'll have a look at it on Monday and will report back. – datoml Dec 01 '17 at 12:57

score 0 · Answer 2 · answered Nov 22 '17 at 11:16

Thank you for all the help but the problem was the managed c++ project. I transfered the function now to my native c++ library and used the managed c++ part only as a wrapper for the c# application.

After the compiler optimization the function is now finished in 1ms.

Edit:

I will mark my own answer for now as the solution because the optimization from @marom leads to a broken image.

Pedro · Answer 3 · 2017-12-04T10:29:38.997

The common way to speedup a resize operation with bilinear interpolation is to:

Exploit the fact that x0 and fracx are independent from the row and that y0and fracy are independent from the column. Even though you haven't pulled out the computation of y0 and fracy out of the x-loop, compiler optimization should take care of that. However, for x0 and fracx, one needs to pre-compute the values for all columns and store them in an array. Complexity for computing x0 and fracx becomes O(width) compared to O(width*height) without pre-computation.
Do the whole processing with integers by replacing floating point arithmetics by integer arithmetics, thereby using shift operations instead of integer divisions.

For better readability, I did not implement the pre-computation of x0 and fracx in the following code. Pre-computation is straight-forward anyways.

Note that FACTOR = 2048 is the max you can do with 32-bit signed integers here (2048 * 2048 * 255 is just fine). For higher precision, you should switch to int64_t and then increase FACTOR and SHIFT, respectively.

I placed the border check into the inner loop for better readability. For an optimized implementation one should remove it by iterating in both loops just before this case happens and add special handling for the border pixels.

In case someone is wondering what the + (FACTOR * FACTOR / 2) is for, it is for rounding in conjunction with the subsequent division.

Finally note that (FACTOR * FACTOR / 2) and 2 * SHIFT are evaluated at compile time.

#define FACTOR      2048
#define SHIFT       11

const int xs = (int) ((double) FACTOR * widthSource / width + 0.5);
const int ys = (int) ((double) FACTOR * heightSource / height + 0.5);

for (int y = 0; y < height; y++)
{
    const int sy = y * ys;
    const int y0 = sy >> SHIFT;
    const int fracy = sy - (y0 << SHIFT);

    for (int x = 0; x < width; x++)
    {
        const int sx = x * xs;
        const int x0 = sx >> SHIFT;
        const int fracx = sx - (x0 << SHIFT);

        if (x0 >= widthSource - 1 || y0 >= heightSource - 1)
        {
            // insert special handling here
            continue;
        }

        const int offset = y0 * widthSource + x0;

        target[y * width + x] = (unsigned char)
            ((source[offset] * (FACTOR - fracx) * (FACTOR - fracy) +
            source[offset + 1] * fracx * (FACTOR - fracy) +
            source[offset + widthSource] * (FACTOR - fracx) * fracy +
            source[offset + widthSource + 1] * fracx * fracy +
            (FACTOR * FACTOR / 2)) >> (2 * SHIFT));
    }
}

For clarification, to match the variables used by the OP, for instance, in the case of the alpha channel it is:

a = (unsigned char)
    ((c1a * (FACTOR - fracx) * (FACTOR - fracy) +
    c2a * fracx * (FACTOR - fracy) +
    c3a * (FACTOR - fracx) * fracy +
    c4a * fracx * fracy +
    (FACTOR * FACTOR / 2)) >> (2 * SHIFT));

Thank you for this answer. Is this code the skeleton for the optimized code? Do I need to add the coloring in the special handling block? — datoml, Dec 04 '17 at 09:45
It's code for bilinear interpolation for an 8-bit image/channel of any kind (grayscale, one of RGB, etc.). Adding more channels is straight forward. Of course, `sy,y0,fracy` and `sx,x0,fracx` are valid for all channels. — Pedro, Dec 04 '17 at 09:49
More precisely, it is code for a resize with bilinear interpolation, optimized with integer arithmetics. — Pedro, Dec 04 '17 at 10:05
Ok. So when I use this code with int* target, int* source the target image should be the source image with the new size? — datoml, Dec 04 '17 at 10:07
Oh ok. My output and input image is 32BppRGBA. So I need to extend the code to support this format? — datoml, Dec 04 '17 at 10:17
No. Consider your alpha channel. It is: `c1a = source[offset]`, `c2a = source[offset + 1]`, `c3a = source[offset + widthSource]`, `c4a = source[offset + widthSource + 1]`, `a = target[y * width + x]`. Compare to your code, and you will see. — Pedro, Dec 04 '17 at 10:18
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/160416/discussion-between-datoml-and-pedram-azad). — datoml, Dec 04 '17 at 10:51

Is there a chance to make the bilinear interpolation faster?

3 Answers3