
I'm trying to write a simple regional min filter, and my results match, accuracy-wise, a non-Halide reference implementation that I'm using. BUT... this Halide code runs extremely slowly on the CPU (about 1 minute), and more than 60x faster than that on the GPU. Even the GPU version is still slower than the reference implementation, which uses OpenCV erode as its building block.
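
For context, the reference path is basically a square-kernel erode. A minimal sketch of what I'm comparing against (assuming a single-channel 32-bit float cv::Mat named src and an integer radius; these names are illustrative) would be:

// Reference sketch: regional min via OpenCV erode with a (2*radius+1) square kernel
// (assumes #include <opencv2/imgproc.hpp>)
cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT,
                                           cv::Size(2 * radius + 1, 2 * radius + 1));
cv::Mat dst;
cv::erode(src, dst, kernel);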

Auto-scheduling the CPU code doesn't seem to help. Actually it seems to hurt: auto-scheduling the output function results in the 1-minute run time, while just doing some barely cogent tiling by hand gets it down to a better, yet still unacceptable, 10-second run time (roughly the schedule sketched below).
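
The hand tiling is nothing clever; a sketch of the kind of schedule I mean (tile sizes and vector width are just placeholder guesses) is:

// Rough hand schedule (sketch only -- the split factors are arbitrary)
Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
outputFun.tile(x, y, xo, yo, xi, yi, 64, 64)
         .parallel(yo)
         .vectorize(xi, 8);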

Very weird because normally the auto-scheduler is awesome.

This is all not counting the time for the JIT compilation to complete.
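
(Concretely, I compile first and only time the realize call, roughly like this sketch -- the buffer sizes, channel count, and float element type are illustrative:)

// Timing sketch: exclude JIT cost by compiling before the timed realize
// (assumes #include <chrono> and a float pipeline)
outputFun.compile_jit();
Halide::Buffer<float> out(4096, 3072, 3);  // 4k x 3k, channel count illustrative
auto t0 = std::chrono::steady_clock::now();
outputFun.realize(out);
auto t1 = std::chrono::steady_clock::now();
double seconds = std::chrono::duration<double>(t1 - t0).count();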

Here's a sample of the type of stuff I'm trying to do; nothing exotic, right? I'm running this on 4k-by-3k, 32-bit images.

//since this is a reducing process, we need to specify what happens on the boundaries

Halide::Func clampedInput("clamped");
clampedInput = Halide::BoundaryConditions::repeat_edge(
    inputA,
    std::vector<std::pair<Halide::Expr, Halide::Expr>>{
        std::make_pair<Halide::Expr>(0, Halide::cast<int>(imageWidth) - 1),
        std::make_pair<Halide::Expr>(0, Halide::cast<int>(imageHeight) - 1) });

//setup the reduction domain

Halide::Expr diameter = 2 * Halide::cast<int>(radius) + 1;
Halide::RDom r(-Halide::cast<int>(radius), diameter, -Halide::cast<int>(radius), diameter);

//perform the reduction

outputFun(x, y, c) = Halide::minimum(clampedInput(x + r.x, y + r.y, c));

My hardware is not junk either -- it's actually some fairly expensive stuff. Any wisdom on why this is so slow for me, and how it could be sped up, would be appreciated.
