Optimizing renderscript summation of row and column cells

Question

Any advise in optimizing the following code? The code first grayscales, inverts and then thresholds the image (code not included, because it is trivial). It then sums the elements of each row and column (all elements are either 1 or 0). It then finds the row and column index of the row and column with the highest value.

The code is supposed to find the centroid of the image and it works, but I want to make it faster

I'm developing for API 23, so a reduction kernel can not be used.

Java snippet:

private int[] sumValueY = new int[640];
private int[] sumValueX = new int[480];

rows_indices_alloc = Allocation.createSized( rs, Element.I32(rs), height, Allocation.USAGE_SCRIPT);
col_indices_alloc = Allocation.createSized( rs, Element.I32(rs), width, Allocation.USAGE_SCRIPT);

public RenderscriptProcessor(RenderScript rs, int width, int height)
{
   mScript.set_gIn(mIntermAllocation);

   mScript.forEach_detectX(rows_indices_alloc);
   mScript.forEach_detectY(col_indices_alloc);

   rows_indices_alloc.copyTo(sumValueX);
   col_indices_alloc.copyTo(sumValueY);
 }

Renderscript.rs snippet:

#pragma version(1)
#pragma rs java_package_name(org.gearvrf.renderscript)
#include "rs_debug.rsh"
#pragma rs_fp_relaxed

const int mImageWidth=640;
const int mImageHeight=480;

int32_t maxsX=-1;
int32_t maxIndexX;

int32_t maxsY=-1;
int32_t maxIndexY;

rs_allocation gIn;

void detectX(int32_t v_in, int32_t x, int32_t y) {

    int32_t sum=0;

    for ( int i = 0; i < (mImageWidth); i++) {

       float4 f4 = rsUnpackColor8888(rsGetElementAt_uchar4(gIn, i, x));
       sum+=(int)f4.r;
    }

    if((sum>maxsX)){

        maxsX=sum;
        maxIndexX = x;
    }
}

void detectY(int32_t v_in, int32_t x, int32_t y) {

     int32_t sum=0;

     for ( int i = 0; i < (mImageHeight); i++) {

        float4 f4 = rsUnpackColor8888(rsGetElementAt_uchar4(gIn, x, i));
        sum+=(int)f4.r;
     }

     if((sum>maxsY)){
         maxsY=sum;
         maxIndexY = x;
     }

}

Any help would be appreciated

score 0 · Answer 1 · answered Apr 02 '17 at 17:59

float4 f4 = rsUnpackColor8888(rsGetElementAt_uchar4(gIn, x, i));
sum+=(int)f4.r;

This converts from int to float and then back to int again. I think you can simplify by just doing this:

sum += rsGetElementAt_uchar4(gIn, x, i).r;

I don't know exactly how your previous stages work because you haven't posted them, but you should try generating packed values to read here. So either put your grayscale channels in .rgba or use a single channel format and then use rsAllocationVLoad_uchar4 to fetch 4 values at once.

Also, try combining previous stages with this one, if you don't need the intermediate results of those calculations it may be cheaper to do the memory load once and then do those transformations in registers.

You might also play with how many values your threads operate on. You could try having each kernel processing width/2, width/4, width/8 elements and see how they perform. This will give GPUs more threads to play with especially on lower-resolution images but with the trade off of having more reduction steps.

You also have a multiple-writers race condition on the maxsX/maxsY and maxIndexX/maxIndexY variables. All those writes need to use atomics if you care about the exact right answer. I think maybe you posted the wrong code because you don't store to the *_indices_alloc but you copy from them at the end. So, actually you should store all the sums to those and then use either a single threaded function or a kernel with atomics to get the absolute max and max index.

Optimizing renderscript summation of row and column cells

1 Answers1