I've been trying to recreate a hand tuned c function via halide. It is a a series of histograms done on vertical scanlines of the source image. As such I'm using an 1 dimension RDom to iterate the source image.
RDom reductionY(0, input.height());
parade(x,y,c) = Halide::cast<uint16_t>(0);
parade(x, input(x, reductionY, c), c) += Halide::cast<uint16_t>(1);
To increase locality, I'm wrapping the rdom in another func so I can schedule it with compute_at.
wrapper(x,y,c) = parade(x, y, c);
parade.update(0).reorder(c, reductionY, x);
parade.update(0).split(x, x_outer, x_inner, THREADWIDTH);
parade.compute_at(wrapper, x_outer);
This (plus some vectorization/parallelization I've stripped out for this question) closely matches my hand tuned original. One thing the original benefits from that I can't figure out how to schedule, is to prefetch the first read of each vertical line from input in the update(0) stage. If I schedule
parade.update(0).prefetch(inputParam, x_inner, 3);
it seems to prefetch every pixel to be read? My hope is to issue a single prefetch to the first pixel to be read.