
I'm trying to implement a 2D outline shader in OpenGL ES 2.0 for iOS. It is insanely slow, as in 5 fps slow. I've tracked the problem down to the texture2D() calls, but without those a convolution shader isn't possible at all. I've tried using lowp instead of mediump; that gains another 5 fps or so, but everything renders black, so it's still unusable.

Here is my fragment shader.

    varying mediump vec4 colorVarying;
    varying mediump vec2 texCoord;

    uniform bool enableTexture;
    uniform sampler2D texture;

    uniform mediump float k;

    void main() {

        const mediump float step_w = 3.0/128.0;
        const mediump float step_h = 3.0/128.0;
        const mediump vec4 b = vec4(0.0, 0.0, 0.0, 1.0);
        const mediump vec4 one = vec4(1.0, 1.0, 1.0, 1.0);

        mediump vec2 offset[9];
        mediump float kernel[9];
        offset[0] = vec2(-step_w, step_h);
        offset[1] = vec2(-step_w, 0.0);
        offset[2] = vec2(-step_w, -step_h);
        offset[3] = vec2(0.0, step_h);
        offset[4] = vec2(0.0, 0.0);
        offset[5] = vec2(0.0, -step_h);
        offset[6] = vec2(step_w, step_h);
        offset[7] = vec2(step_w, 0.0);
        offset[8] = vec2(step_w, -step_h);

        kernel[0] = kernel[2] = kernel[6] = kernel[8] = 1.0/k;
        kernel[1] = kernel[3] = kernel[5] = kernel[7] = 2.0/k;
        kernel[4] = -16.0/k;  

        if (enableTexture) {
            mediump vec4 sum = vec4(0.0);
            for (int i=0;i<9;i++) {
                mediump vec4 tmp = texture2D(texture, texCoord + offset[i]);
                sum += tmp * kernel[i];
            }

            gl_FragColor = (sum * b) + ((one-sum) * texture2D(texture, texCoord));
        } else {
            gl_FragColor = colorVarying;
        }
    }

This is unoptimized, and not finalized, but I need to bring up performance before continuing on. I've tried replacing the texture2D() call in the loop with just a solid vec4 and it runs no problem, despite everything else going on.

How can I optimize this? I know it's possible because I've seen way more involved effects in 3D running no problem. I can't see why this is causing any trouble at all.

user1137704
  • "*I've tried replacing the texture2D() call in the loop with just a solid vec4 and it runs no problem*" What does that mean? Did it get faster? Did it not change performance? What happened? – Nicol Bolas Sep 18 '12 at 03:38
  • "*I can't see why this is causing any trouble at all.*" You're doing *ten texture accesses* per shader invocation, and you don't see what could be causing a problem? Also, you accessing the center texel twice. – Nicol Bolas Sep 18 '12 at 03:39
  • I get a solid 60fps without the texture lookups (excluding the final one). As I said, it's not optimized, but there's no way to avoid those texture calls. The filter couldn't work otherwise. But I've seen plenty of games, mobile and not, that use effects based on convolution filters, and they don't seem to be having any issue. Unless there's some trick to avoid them? – user1137704 Sep 18 '12 at 03:45

2 Answers


I've done this exact thing myself, and I see several things that could be optimized here.

First off, I'd remove the enableTexture conditional and instead split your shader into two programs, one for when the texture is enabled and one for when it isn't. Conditionals are very expensive in iOS fragment shaders, particularly ones that have texture reads within them.
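As a rough sketch of the split (reusing the question's colorVarying), the untextured program's fragment shader is just the else branch, with no sampler and no branch at all; the textured program keeps only the convolution path, and you pick between the two with glUseProgram() on the CPU:

    // Fragment shader for the untextured program: nothing but the flat color path.
    varying mediump vec4 colorVarying;

    void main()
    {
        gl_FragColor = colorVarying;
    }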

Second, you have nine dependent texture reads here. These are texture reads where the texture coordinates are calculated within the fragment shader. Dependent texture reads are very expensive on the PowerVR GPUs within iOS devices, because they prevent that hardware from optimizing texture reads using caching, etc. Because you are sampling from a fixed offset for the 8 surrounding pixels and one central one, these calculations should be moved up into the vertex shader. This also means that these calculations won't have to be performed for each pixel, just once for each vertex and then hardware interpolation will handle the rest.

Third, for() loops haven't been handled all that well by the iOS shader compiler to date, so I tend to avoid those where I can.

As I mentioned, I've done convolution shaders like this in my open source iOS GPUImage framework. For a generic convolution filter, I use the following vertex shader:

 attribute vec4 position;
 attribute vec4 inputTextureCoordinate;

 uniform highp float texelWidth; 
 uniform highp float texelHeight; 

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     gl_Position = position;

     vec2 widthStep = vec2(texelWidth, 0.0);
     vec2 heightStep = vec2(0.0, texelHeight);
     vec2 widthHeightStep = vec2(texelWidth, texelHeight);
     vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight);

     textureCoordinate = inputTextureCoordinate.xy;
     leftTextureCoordinate = inputTextureCoordinate.xy - widthStep;
     rightTextureCoordinate = inputTextureCoordinate.xy + widthStep;

     topTextureCoordinate = inputTextureCoordinate.xy - heightStep;
     topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep;
     topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep;

     bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep;
     bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep;
     bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep;
 }

and the following fragment shader:

 precision highp float;

 uniform sampler2D inputImageTexture;

 uniform mediump mat3 convolutionMatrix;

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     mediump vec4 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate);
     mediump vec4 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate);
     mediump vec4 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate);
     mediump vec4 centerColor = texture2D(inputImageTexture, textureCoordinate);
     mediump vec4 leftColor = texture2D(inputImageTexture, leftTextureCoordinate);
     mediump vec4 rightColor = texture2D(inputImageTexture, rightTextureCoordinate);
     mediump vec4 topColor = texture2D(inputImageTexture, topTextureCoordinate);
     mediump vec4 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate);
     mediump vec4 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate);

     mediump vec4 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2];
     resultColor += leftColor * convolutionMatrix[1][0] + centerColor * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2];
     resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2];

     gl_FragColor = resultColor;
 }

The texelWidth and texelHeight uniforms are the inverse of the width and height of the input image, and the convolutionMatrix uniform specifies the weights for the various samples in your convolution.
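Purely as an illustration of that mapping, here is the question's outline kernel written out as the mat3 you would supply in convolutionMatrix (the value of k is a placeholder for the question's normalization uniform; in a real program the matrix and the 1.0/width, 1.0/height texel sizes are uploaded from the CPU with glUniformMatrix3fv() and glUniform1f()):

    // Hypothetical illustration only: the question's kernel laid out as a mat3.
    // The kernel is symmetric, so row/column ordering does not matter here.
    const mediump float k = 23.0;   // placeholder for the question's normalization uniform
    const mediump mat3 questionKernel = mat3( 1.0/k,   2.0/k,  1.0/k,
                                              2.0/k, -16.0/k,  2.0/k,
                                              1.0/k,   2.0/k,  1.0/k);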

On an iPhone 4, this runs in 4-8 ms for a 640x480 frame of camera video, which is good enough for 60 FPS rendering at that image size. If you just need to do something like edge detection, you can simplify the above, convert the image to luminance in a pre-pass, then only sample from one color channel. That's even faster, at about 2 ms per frame on the same device.
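That luminance pre-pass is itself just a one-sample shader; a minimal sketch (the varying and sampler names match the convolution shaders above, and the Rec. 709 luma weights are one common choice):

    precision mediump float;

    varying vec2 textureCoordinate;

    uniform sampler2D inputImageTexture;

    // Rec. 709 luma weights; any standard luminance coefficients would do.
    const vec3 LUMA = vec3(0.2125, 0.7154, 0.0721);

    void main()
    {
        float luminance = dot(texture2D(inputImageTexture, textureCoordinate).rgb, LUMA);
        gl_FragColor = vec4(vec3(luminance), 1.0);
    }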

Brad Larson
  • Great example. tl;dr: **avoid dependent texture reads**. Also try testing separable convolutions rendered in two passes to reduce the number of fetches (though for a 9-sample kernel like this it wouldn't cut the count to less than half, so a two-pass approach might be a bad idea here). – Steven Lu Sep 14 '13 at 05:25
  • 2
    @StevenLu - There's a surprisingly sharp falloff in performance once you get beyond 9 texture reads or so in a single pass on many of these GPUs. Splitting this into two passes can have a nonlinear impact on performance, compared to the number of samples in a single pass. I've tested, and running this in a single pass is much, much slower than separating the kernel, even for this small a number of samples. – Brad Larson Sep 15 '13 at 19:44
  • Awesome, thanks for weighing in. So, the pixel fillrate for "light" fragment programs can handle the extra load due to extra passes? I read somewhere the iPhone4 can fill its screen 7 times over to maintain 60fps. This works out to a bit under 2ms per full screen pass. Does that sound about right? – Steven Lu Sep 15 '13 at 22:56
  • 1
    Is there some way to simultaneously fetch a region of a texture, instead of a single pixel? – Alex Gonçalves Apr 28 '15 at 18:36
  • 1
    @AlexGonçalves - In a fragment shader? No, texture2D() only samples a single pixel at a time. – Brad Larson Apr 28 '15 at 18:45
  • I was hoping to apply an unsharp mask filter to my image using a 5x5 kernel. Would 25 texture reads be too expensive? (apart from also increasing the number of lines dramatically). – Crearo Rotar May 24 '18 at 17:44
  • 1
    @CrearoRotar - On modern devices, it probably won't make much of a difference performance-wise between a 3x3 and 5x5 convolution. You won't be able to use varyings like I do above, because you'll exceed the maximum number of varyings supported most iOS hardware. For an unsharp mask, I might recommend using a separable Gaussian blur, followed by a custom shader to mix pixels, like I do [here](https://github.com/BradLarson/GPUImage/blob/master/framework/Source/GPUImageUnsharpMaskFilter.m). That can help reduce the number of samples over large areas and speed up the process. – Brad Larson May 24 '18 at 18:42
  • Hold on, isn't this wrong if this is being used for image processing on a quad? By moving the neighbor offsets into the vertex shader you are no longer sampling the neighbor of each pixel. That is, instead of sample_pos_x = (texture_width * normalized_pos_between_0to1) + neighbor_offset, the above code computes sample_pos_x = (texture_width + neighbor_offset) * normalized_pos_between_0to1. – Sushanth Rajasankar Jul 10 '21 at 17:40

The only way I know of to reduce the time taken in this shader is to reduce the number of texture fetches. Since your shader samples textures from equally spaced points about the center pixel and linearly combines them, you could reduce the number of fetches by making use of the GL_LINEAR mode available for texture sampling.

Basically instead of sampling at every texel, sample in between a pair of texels to directly get a linearly weighted sum.

Let us call the samples at offsets (-step_w, -step_h) and (-step_w, 0.0) x0 and x1 respectively. Their contribution to your sum is

sum = x0*k0 + x1*k1

Now if you instead sample in between these two texels, at a distance of k1/(k0+k1) from x0 (and therefore k0/(k0+k1) from x1), the GPU will perform the linear weighting during the fetch and give you

y = x0*k0/(k0+k1) + x1*k1/(k0+k1)

Thus that part of the sum can be calculated as

sum = y*(k0 + k1), from just one fetch!

If you repeat this for the other adjacent pairs, you end up doing 4 texture fetches to cover all 8 neighboring offsets, plus one fetch for the center pixel: 5 fetches instead of 9.
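In terms of the question's shader, one such pair collapses to a single fetch like this (a sketch only, assuming step_w and step_h are exactly one texel and the texture's min/mag filter is GL_LINEAR, since the hardware blending only happens between adjacent texels):

    // Combine the samples at offsets (-step_w, -step_h) and (-step_w, 0.0) into one fetch.
    mediump float k0 = 1.0 / k;     // corner weight from the question's kernel
    mediump float k1 = 2.0 / k;     // edge weight from the question's kernel
    mediump float w  = k0 + k1;     // combined weight, applied after the fetch
    // The sample point sits k1/(k0+k1) of the way from the corner texel toward the edge texel.
    mediump vec2 pairCoord = texCoord + vec2(-step_w, -step_h * k0 / w);
    sum += texture2D(texture, pairCoord) * w;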

The link explains this much better

Slartibartfast