How to make performant AOT blur with variable kernel size?

Question

What would be an effective single threaded scheduling for this type of code? I'm trying to define blur but with a variable kernel size in AOT. I tried https://github.com/halide/Halide/issues/180 solution but I can't figure a good way to schedule it that would get me the same performance as making kernel size a GeneratorParam and pre compiling with different values.

Here is a snippet with the GeneratorParam:

// GeneratorParam<int32_t> kernelSize{"kernelOffset", 1};
int32_t kernelSize = 2*kernelOffset + 1;
{
   Halide::Expr sum = input(x, y);
   for (int i=1;i<kernelSize;i++) {
      sum = sum + Halide::cast<uint16_t>(input(x, y+i));
   }
   blur_y(x, y) = sum/kernelSize;
}
{
   Halide::Expr sum = blur_y(x, y);
   for (int i=1;i<kernelSize;i++) {
      sum = sum + blur_y(x+i, y);
   }
   blur_x(x, y) = sum/kernelSize;
}

...

// And the schedule


blur_x.compute_root();
blur_y.compute_at(blur_x, y);
output.vectorize(x, 16);

And using https://github.com/halide/Halide/issues/180 solution

Halide::RDom box (0, kernelSize, "box");
blur_y(x, y) = Halide::undef<uint16_t>();
{
    Halide::RDom ry (yMin+1, yMax-yMin, "ry");
    blur_y(x, yMin) = Halide::cast<uint16_t>(0);
    blur_y(x, yMin) += Halide::cast<uint16_t>(input(x, yMin+box))/kernelSize;
    blur_y(x, ry) = blur_y(x, ry-1) + input_uint16(x, ry+kernelOffset-1)/kernelSize - input_uint16(x, ry-1-kernelOffset)/kernelSize;
}

blur_x(x, y) = Halide::undef<uint16_t>();
{
    Halide::RDom rx (xMin+1, xMax-xMin, "rx");
    blur_x(xMin, y) = Halide::cast<uint16_t>(0);
    blur_x(xMin, y) += blur_y(xMin+box, y)/kernelSize;
    blur_x(rx, y) = blur_x(rx-1, y) + blur_y(rx+kernelOffset, y)/kernelSize - blur_y(rx-1-kernelOffset, y)/kernelSize;
}

can you write what schedule you tried and what was its runtime speed as compared to no explicit scheduling? — Ruppesh Nalwaya, Jun 20 '17 at 17:38
The version with fixed kernelSize runs at 6.56348e-05s with no schedule and 1.97556e-05s with the specified schedule. The variable kernel version runs at 1.10882s with no schedule and 0.0026933s with the same schedule as the other. — Fran6co, Jun 21 '17 at 15:40

score 3 · Accepted Answer · answered Jun 28 '17 at 19:02

The only way to get the same speed between fixed and variable radius is to use the specialize scheduling directive to generate fixed code for specific radii. If you can JIT and are blurring lots of pixels at the same radii, it may be profitable to JIT a specific filter for a given radius.

Generally really fast, arbitrary radius, blurs use adaptive approaches in which large radii are handled by something like iterative box filtering, intermediate levels use separable convolution and very small radii may use non-separable convolution. The blur is often done in multiple passes combining multiple approaches.

How to make performant AOT blur with variable kernel size?

1 Answers1