Why is ScriptIntrinsicBlur faster than my method?

Question

I use RenderScript to apply a Gaussian blur to an image, but no matter what I do, ScriptIntrinsicBlur is much faster. Why does this happen? Does ScriptIntrinsicBlur use a different method? This is my RS code:

#pragma version(1)
#pragma rs java_package_name(top.deepcolor.rsimage.utils)

//gaussian blur algorithm.

//the max radius of gaussian blur
static const int MAX_BLUR_RADIUS = 1024;

//the gaussian weight of each pixel in the blur window
float blurRatio[(MAX_BLUR_RADIUS << 2) + 1];

//the default blur radius
int blurRadius = 0;

//the width and height of bitmap
uint32_t width;
uint32_t height;

//bind to the input bitmap
rs_allocation input;
//the temp allocation
rs_allocation temp;

//set the radius
void setBlurRadius(int radius)
{
    if(1 > radius)
        radius = 1;
    else if(MAX_BLUR_RADIUS < radius)
        radius = MAX_BLUR_RADIUS;

    blurRadius = radius;


    /**
    derive the gaussian weights from blurRadius.
    pixels far away from the center contribute almost nothing to it,
    so sigma is taken as blurRadius / 2.57
    */
    float sigma = 1.0f * blurRadius / 2.57f;
    float deno  = 1.0f / (sigma * sqrt(2.0f * M_PI));
    float nume  = -1.0 / (2.0f * sigma * sigma);

    //calculate the gaussian function
    float sum = 0.0f;
    for(int i = 0, r = -blurRadius; r <= blurRadius; ++i, ++r)
    {
        blurRatio[i] = deno * exp(nume * r * r);
        sum += blurRatio[i];
    }

    //normalization to 1
    int len = radius + radius + 1;
    for(int i = 0; i < len; ++i)
    {
        blurRatio[i] /= sum;
    }

}

/**
the gaussian blur is decomposed into two steps:
1. blur in the horizontal direction
2. blur in the vertical direction
*/
uchar4 RS_KERNEL horizontal(uint32_t x, uint32_t y)
{
    //accumulators must start at zero; uninitialized locals hold garbage
    float a = 0.0f, r = 0.0f, g = 0.0f, b = 0.0f;

    for(int k = -blurRadius; k <= blurRadius; ++k)
    {
        int horizontalIndex = x + k;

        if(0 > horizontalIndex) horizontalIndex = 0;
        if(width <= horizontalIndex) horizontalIndex = width - 1;

        uchar4 inputPixel = rsGetElementAt_uchar4(input, horizontalIndex, y);

        int blurRatioIndex = k + blurRadius;
        a += inputPixel.a * blurRatio[blurRatioIndex];
        r += inputPixel.r * blurRatio[blurRatioIndex];
        g += inputPixel.g * blurRatio[blurRatioIndex];
        b += inputPixel.b * blurRatio[blurRatioIndex];
    }

    uchar4 out;

    out.a = (uchar) a;
    out.r = (uchar) r;
    out.g = (uchar) g;
    out.b = (uchar) b;

    return out;
}

uchar4 RS_KERNEL vertical(uint32_t x, uint32_t y)
{
    //accumulators must start at zero; uninitialized locals hold garbage
    float a = 0.0f, r = 0.0f, g = 0.0f, b = 0.0f;

    for(int k = -blurRadius; k <= blurRadius; ++k)
    {
        int verticalIndex = y + k;

        if(0 > verticalIndex) verticalIndex = 0;
        if(height <= verticalIndex) verticalIndex = height - 1;

        uchar4 inputPixel = rsGetElementAt_uchar4(temp, x, verticalIndex);

        int blurRatioIndex = k + blurRadius;
        a += inputPixel.a * blurRatio[blurRatioIndex];
        r += inputPixel.r * blurRatio[blurRatioIndex];
        g += inputPixel.g * blurRatio[blurRatioIndex];
        b += inputPixel.b * blurRatio[blurRatioIndex];
    }

    uchar4 out;

    out.a = (uchar) a;
    out.r = (uchar) r;
    out.g = (uchar) g;
    out.b = (uchar) b;

    return out;
}

1. How are you doing your testing? 2. What hardware/emulator are you testing on? 3. If on a device, consider that the ODM may implement ScriptIntrinsics with additional hardware resources not available to app developers. — Morrison Chang, Sep 05 '16 at 02:40
I tested on a real phone with a 293x220 image. My method takes about 120 ms. — 惊奇漫画, Sep 05 '16 at 09:16
What does ODM mean? I tested on a real phone with a 293x220 image and a blur radius of 20. My method takes about 120 ms; ScriptIntrinsicBlur takes about 25 ms. I found that the copyTo() call takes too much time in my version (ScriptIntrinsicBlur uses it too, but there it takes very little). By the way, where can I find the RS source code for ScriptIntrinsicBlur? — 惊奇漫画, Sep 05 '16 at 09:28
120ms for such a small image? That seems impossibly slow, are you sure you are measuring the time correctly? In any case I would always expect the framework implementation to be faster since it can be optimized for the exact hardware of the phone. — Xaver Kapeller, Sep 05 '16 at 10:01
On macOS the image shows as 293*220, but when the app runs from Android Studio it is 586*440. What's going on? — 惊奇漫画, Sep 05 '16 at 12:37
Do you have the image stored in one of the "drawable-xxxx" folders directly inside the app? Images in these folders are automatically resized to fit the screen dpi. To avoid this resizing, store them in "drawable-nodpi" folder instead. More info on: http://stackoverflow.com/questions/27280119/images-in-the-drawable-folder-are-resized-automatically — monoeci, Sep 05 '16 at 20:37
Thanks so much. I stored it directly in the drawable folder. I think the image's real size is 586*440, but it is shown at a different size on the Mac's Retina display. — 惊奇漫画, Sep 06 '16 at 01:45

monoeci · Accepted Answer · 2016-09-05T09:56:25.863

Renderscript intrinsics are implemented very differently from what you can achieve with a script of your own. There are several reasons for this, but the main one is that they are built by the RS driver developers for individual devices in a way that makes the best possible use of that particular hardware/SoC configuration, and they most likely make low-level calls to the hardware that are simply not available at the RS programming layer.

Android does provide a generic implementation of these intrinsics, though, as a fallback in case no lower-level hardware implementation is available. Seeing how these generic ones are written will give you a better idea of how the intrinsics work. For example, you can see the source code of the generic implementation of the 3x3 convolution intrinsic in rsCpuIntrinsicConvolve3x3.cpp.

Take a very close look at the code starting from line 98 of that source file, and notice how it uses no for loops whatsoever to do the convolution. This is known as loop unrolling: the code explicitly multiplies and adds the 9 corresponding memory locations, thereby avoiding the need for a for loop. This is the first rule to take into account when optimizing parallel code: you need to get rid of the branching in your kernel. Looking at your code, you have a lot of ifs and fors that cause branching, which means the control flow of the program does not run straight through from beginning to end.
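To make the contrast concrete, here is a minimal sketch in plain C (not RenderScript, and not the code from the AOSP file) of a 3-tap blur written both ways. The function names and the single-channel float row are assumptions made for illustration only.

```c
#include <assert.h>

/* Generic version: a loop plus a clamp on every tap, similar in shape
   to the kernels in the question. */
static float blur3_loop(const float *row, int width, int x, const float w[3]) {
    float sum = 0.0f;
    for (int k = -1; k <= 1; ++k) {
        int i = x + k;
        if (i < 0) i = 0;
        if (i >= width) i = width - 1;
        sum += row[i] * w[k + 1];
    }
    return sum;
}

/* Unrolled version: three explicit multiply-adds, no loop and no clamp.
   Only valid for interior pixels, so edges need separate handling. */
static float blur3_unrolled(const float *row, int width, int x, const float w[3]) {
    assert(x >= 1 && x <= width - 2);
    return row[x - 1] * w[0] + row[x] * w[1] + row[x + 1] * w[2];
}
```

Both functions return the same value for interior pixels; the unrolled one simply trades generality (a fixed tap count, no edge handling) for straight-line code.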

If you unroll your for loops, you will immediately see a boost in performance. Note that by removing your for structures you will no longer be able to generalize your kernel for all possible radius amounts. In that case, you would have to create fixed kernels for different radii, and this is exactly why you see separate 3x3 and 5x5 convolution intrinsics, because this is just what they do. (See line 99 of the 5x5 intrinsic at rsCpuIntrinsicConvolve5x5.cpp).

Furthermore, the fact that you have two separate kernels doesn't help. If you're doing a gaussian blur, the convolutional kernel is indeed separable and you can do 1xN + Nx1 convolutions as you've done there, but I would recommend putting both passes together in the same kernel.
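The separability mentioned above can be checked with a small plain C sketch (again not RenderScript; the helper names and the single-channel float image layout are assumptions for illustration). It runs a 1xN pass followed by an Nx1 pass and compares the result with a direct NxN convolution using the same clamp-to-edge rule.

```c
#include <assert.h>

/* Clamp index i into [0, n - 1] (clamp-to-edge border handling). */
static int clamp_index(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Horizontal 1xN pass: each output mixes 2 * radius + 1 pixels of its row. */
static void blur_rows(const float *src, float *dst, int width, int height,
                      const float *w, int radius) {
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k)
                sum += w[k + radius] * src[y * width + clamp_index(x + k, width)];
            dst[y * width + x] = sum;
        }
}

/* Vertical Nx1 pass: same weights, applied down each column. */
static void blur_cols(const float *src, float *dst, int width, int height,
                      const float *w, int radius) {
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k)
                sum += w[k + radius] * src[clamp_index(y + k, height) * width + x];
            dst[y * width + x] = sum;
        }
}

/* Reference: direct 2D convolution with the outer product of w with itself. */
static void blur_2d(const float *src, float *dst, int width, int height,
                    const float *w, int radius) {
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int i = -radius; i <= radius; ++i)
                for (int j = -radius; j <= radius; ++j)
                    sum += w[i + radius] * w[j + radius] *
                           src[clamp_index(y + i, height) * width +
                               clamp_index(x + j, width)];
            dst[y * width + x] = sum;
        }
}

/* Largest absolute difference between two buffers of n floats. */
static float max_abs_diff(const float *a, const float *b, int n) {
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Per pixel, the two-pass form reads 2 * (2 * radius + 1) samples while the direct form reads (2 * radius + 1)^2, which is why the separable decomposition is worth keeping however the passes are scheduled.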

Keep in mind though, that even doing these tricks will probably still not give you as fast results as the actual intrinsics, because those have probably been highly optimized for your specific device(s).

Thank you so much, your answer helped me a lot. By the way, if I unroll the loop I can no longer blur with whatever radius I want; a radius of 1024 can't be unrolled. Or could I run a parallel computation inside a parallel one? Is there any way to do that? — 惊奇漫画, Sep 05 '16 at 12:46
Right! Doing unrolled loops for a large radius is not practical. However, for a large radius, there is another trick you can try: Multiple blurs of a smaller radius are equivalent to a single blur of large radius. See for example: http://computergraphics.stackexchange.com/questions/256/is-doing-multiple-gaussian-blurs-the-same-as-doing-one-larger-blur . So an image blurred once with radius R = blurring 4 times with radius R/2. You might be able to use that property to compose large blurs from smaller more efficient blurs. You will have to do some tests to see if it's actually faster though... — monoeci, Sep 05 '16 at 20:30
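The "4 times with radius R/2" figure follows from the fact that convolving two Gaussians gives a Gaussian whose variance is the sum of the two. A short derivation, assuming sigma is proportional to the radius (as in the question's code, where sigma = blurRadius / 2.57) and ignoring discretization and edge effects:

```latex
G_{\sigma_1} * G_{\sigma_2} = G_{\sqrt{\sigma_1^2 + \sigma_2^2}}
\quad\Longrightarrow\quad
\underbrace{G_{\sigma_s} * \cdots * G_{\sigma_s}}_{n\ \text{passes}} = G_{\sigma_s \sqrt{n}}

\sigma_s = \frac{\sigma}{2},\; n = 4:
\qquad \sigma_s \sqrt{n} = \frac{\sigma}{2} \cdot 2 = \sigma
```

Note that four passes of radius R/2 read 4(R + 1) samples per pixel in each direction, compared with 2R + 1 for one pass of radius R, so any gain has to come from each small pass being cheaper per sample (for example, because it can be unrolled), not from doing fewer reads.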
That is a great idea. I have used this method before with JNI on Android. I found an article showing that a Gaussian blur is approximately equal to three box blurs (http://blog.ivank.net/fastest-gaussian-blur.html). It worked in sequential code; I will try it in a parallel computation. — 惊奇漫画, Sep 06 '16 at 01:51
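The box-blur idea from the last comment can be sketched in plain C as follows. This is an illustration, not the linked article's code: the function names, the single-channel float row, the clamp-to-edge border, and using the same radius for all three passes are all assumptions (the article derives the box sizes from the target sigma).

```c
#include <assert.h>

/* Clamp index i into [0, n - 1] (clamp-to-edge border handling). */
static int clamp_index(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* One box-blur pass over a row using a sliding window sum.
   After priming, each output costs one add and one subtract,
   independent of the radius. src and dst must not overlap. */
static void box_blur_row(const float *src, float *dst, int width, int radius) {
    assert(width > 0 && radius >= 0);
    const float inv = 1.0f / (float)(2 * radius + 1);
    float sum = 0.0f;

    /* Prime the window centered on x = 0. */
    for (int k = -radius; k <= radius; ++k)
        sum += src[clamp_index(k, width)];

    for (int x = 0; x < width; ++x) {
        dst[x] = sum * inv;
        /* Slide right: add the sample entering the window,
           remove the one leaving it. */
        sum += src[clamp_index(x + radius + 1, width)];
        sum -= src[clamp_index(x - radius, width)];
    }
}

/* Three box passes as a rough Gaussian approximation.
   scratch must hold width floats and not overlap src or dst. */
static void approx_gaussian_row(const float *src, float *dst, float *scratch,
                                int width, int radius) {
    box_blur_row(src, dst, width, radius);
    box_blur_row(dst, scratch, width, radius);
    box_blur_row(scratch, dst, width, radius);
}
```

The sliding sum makes each pass sequential along a row, so in a parallel setting the natural unit of work is one row (or one column for the vertical pass) rather than one pixel.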
