Calculating sum of 2D array is slow

Question

I'm building an Android app using RenderScript. I want to create a 2D array in which each element (x, y) holds the sum of all the ints in the rectangle from the origin up to (x, y).

As stated here, RenderScript is a C99-derived language. The framework calls any function marked with the RS_KERNEL keyword once for every pixel, with no guarantee about the order of those calls. From Java/Kotlin, generateGridPixelCount is invoked by passing it a piece of allocated memory; each item in that memory is set to the value the kernel returns for it.

So, for example:

   Input                Output
                  -----------------
                  | 0 | 0 | 0 | 0 |
-------------     -----------------
| 1 | 1 | 1 |     | 0 | 1 | 2 | 3 | 
-------------     -----------------
| 0 | 1 | 1 |     | 0 | 1 | 3 | 5 |
-------------     -----------------
| 1 | 1 | 1 |     | 0 | 2 | 5 | 8 |
-------------     -----------------

For small arrays this is doable, but when the array is as big as the screen (1080x2280), the function I've written takes ages.

static bool shouldCheckPixel(uint x, uint y) {
    return (x + y) % (width / stepSize) == 0 || (x - y) % (width / stepSize) == 0;
}

int RS_KERNEL generateGridPixelCount(uint x, uint y) {
    int pixelCount = 0;

    for (int dx = 0; dx < x; dx++) {
        for (int dy = 0; dy < y; dy++) {
            if (shouldCheckPixel(dx, dy)) {
                pixelCount++;
            }
        }
    }

    return pixelCount;
}

However, if either pixelCount++; or the if statement is removed, it finishes almost instantly. It only gets slow when the two are combined.

The previous RS_KERNEL function loops over the output. This can be reversed to loop over the input instead:

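void RS_KERNEL generateGridPixelCount(uchar4 input, uint x, uint y) {
   if (shouldCheckPixel(x, y)) {
      // Increment every output cell whose rectangle contains input pixel (x, y).
      // Assumes pixelArray is a global 2D int32_t array with one extra row and column.
      for (int dx = x + 1; dx <= width; dx++) {
         for (int dy = y + 1; dy <= height; dy++) {
            rsAtomicInc(&pixelArray[dx][dy]);
         }
      }
   }
}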

What's the best way to calculate this array?

Small point: if you move the one-liner in the function into the body of the loop that calls it, you stand a better chance of the compiler optimizing `width / stepSize` which is performed for every pixel. Or you precalculate a look-up array for the (not very many) values of `x + y` and `x - y` (with an offset?) needed. — Weather Vane, Sep 12 '19 at 19:29
@EugeneSh. `shouldCheckPixel` was kept to show the input array is not just an array filled with ones. `shouldCheckPixel` could also change in the future. — Michiel, Sep 12 '19 at 19:51
@Michiel: Using a function like `shouldCheckPixel` is a poor proxy for testing an array value, especially since it contains a division, which is a notorious performance issue. A better model would be to write random data to the array and measure that. After that is working, SIMD code might speed it up. — Eric Postpischil, Sep 12 '19 at 20:01
@WeatherVane moving the one-liner doesn't seem to have a huge effect unfortunately. The result of this function is actually the look-up table already, preferably run once in the lifetime of the app. Still, I'll look into it! — Michiel, Sep 12 '19 at 20:02
Your description of the output arrays shows cumulative sums, but your sample code shows just a *de novo* calculation of one sum. Cumulative sums are much better computed by building on previous results. You should show code that better models the desired result. Show full code that produces the desired results, including an actual `shouldCheckPixel` function that will be used and/or representative sample data. — Eric Postpischil, Sep 12 '19 at 20:03
Using a technique that calls a routine independently for each pixel is a huge mistake. The cumulative sums are serially dependent. The mathematics required is only O(n), where n is the number of pixels, but processing each independently as the sum of input pixels is O(n*n), a million times more work than necessary. — Eric Postpischil, Sep 12 '19 at 21:52
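
To put rough numbers on the O(n) versus O(n*n) point, here is a small plain-C sketch that counts inner-loop iterations for the per-pixel approach from the question and for a single cumulative pass. The function names are hypothetical, and it assumes the kernel runs once for each (x, y) with 0 <= x < width and 0 <= y < height.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: inner-loop iterations when every output cell (x, y)
 * independently re-scans its whole rectangle, i.e. x * y iterations per cell. */
uint64_t per_pixel_iterations(uint64_t width, uint64_t height) {
    assert(width > 0 && height > 0);
    /* The sum of x * y over 0 <= x < width, 0 <= y < height factors into
     * (0 + 1 + ... + (width - 1)) * (0 + 1 + ... + (height - 1)). */
    uint64_t sum_x = width * (width - 1) / 2;
    uint64_t sum_y = height * (height - 1) / 2;
    return sum_x * sum_y;
}

/* Hypothetical helper: a cumulative pass visits each cell exactly once. */
uint64_t cumulative_iterations(uint64_t width, uint64_t height) {
    assert(width > 0 && height > 0);
    return width * height;
}
```

For a 1080x2280 grid this gives roughly 1.5 * 10^12 iterations for the per-pixel approach against roughly 2.5 * 10^6 for a cumulative pass, a ratio of about 600,000, which is the same order of magnitude as the estimate in the last comment.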

score 0 · Answer 1 · answered Sep 13 '19 at 07:55

Eric Postpischil pushed me in the right direction; the following snippet is what I use now.

Kotlin:

script.forEach_generateShouldCheckPixelLookup(shouldCheckPixelLookup)
script.invoke_generateShouldCheckPixelSum()

RenderScript:

int RS_KERNEL generateShouldCheckPixelLookup(uint x, uint y) {
    if (shouldCheckPixel(x, y)) {
        return 1;
    } else {
        return 0;
    }
}

void generateShouldCheckPixelSum() {
    for (int y = 1; y < height + 1; y++) {
        for (int x = 1; x < width + 1; x++) {
            int top     = rsGetElementAt_int(shouldCheckPixelSum, x,     y - 1);
            int left    = rsGetElementAt_int(shouldCheckPixelSum, x - 1, y);
            int topLeft = rsGetElementAt_int(shouldCheckPixelSum, x - 1, y - 1);

            int current = rsGetElementAt_int(shouldCheckPixelLookup, x, y);
            int value = top + left - topLeft + current;

            rsSetElementAt_int(shouldCheckPixelSum, value, x, y);
        }
    }
}

generateShouldCheckPixelLookup is called once for every pixel and stores its result in shouldCheckPixelLookup. Then generateShouldCheckPixelSum is invoked once and fills shouldCheckPixelSum in a single serial pass.
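
The same recurrence can be tried outside RenderScript. The following plain-C sketch is an illustration only: the function name is hypothetical, and it assumes the input is a row-major width x height array while the sum table is (width + 1) x (height + 1) with a zero row and column, as in the example from the question. Under that assumption the input is read at (x - 1, y - 1); whether the RenderScript version needs the same offset depends on how its allocations are sized, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Fills sum, a (width + 1) x (height + 1) row-major table, so that the entry
 * at column x, row y is the total of all input cells (i, j) with i < x and
 * j < y. Row 0 and column 0 stay zero, matching the padded example output. */
void build_sum_table(const int *input, int *sum, size_t width, size_t height) {
    assert(input != NULL && sum != NULL);
    size_t stride = width + 1;

    for (size_t x = 0; x <= width; x++) {
        sum[x] = 0;                       /* top padding row */
    }
    for (size_t y = 1; y <= height; y++) {
        sum[y * stride] = 0;              /* left padding column */
        for (size_t x = 1; x <= width; x++) {
            int top      = sum[(y - 1) * stride + x];
            int left     = sum[y * stride + (x - 1)];
            int top_left = sum[(y - 1) * stride + (x - 1)];
            /* The table is shifted by one, so cell (x, y) adds input (x-1, y-1). */
            int current  = input[(y - 1) * width + (x - 1)];
            sum[y * stride + x] = top + left - top_left + current;
        }
    }
}
```

Each cell is computed from three already-finished neighbours plus one input value, so the whole table costs one pass over the pixels.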

My assumption had been that, since RenderScript tries to run the per-pixel calls in parallel, a kernel would always be faster than a single serial function call. However, generateShouldCheckPixelSum is as quick as the slowest generateGridPixelCount call from my original question.
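
One way to get some parallelism back, offered as a sketch and not as part of the solution above: a 2D cumulative sum can be split into a running sum along each row followed by a running sum down each column. Rows are independent of each other in the first pass, and columns are independent in the second, so in principle each could be handled by its own kernel invocation. The plain-C version below works in place on an unpadded row-major width x height table; the function names are hypothetical and nothing here has been benchmarked in RenderScript.

```c
#include <assert.h>
#include <stddef.h>

/* Pass 1: running sum along each row. Rows do not depend on each other. */
void prefix_rows(int *table, size_t width, size_t height) {
    assert(table != NULL);
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 1; x < width; x++) {
            table[y * width + x] += table[y * width + (x - 1)];
        }
    }
}

/* Pass 2: running sum down each column. Columns do not depend on each other. */
void prefix_columns(int *table, size_t width, size_t height) {
    assert(table != NULL);
    for (size_t x = 0; x < width; x++) {
        for (size_t y = 1; y < height; y++) {
            table[y * width + x] += table[(y - 1) * width + x];
        }
    }
}
```

Calling prefix_rows and then prefix_columns on the 3x3 input from the question yields the unpadded part of the example output.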
