I've been trying to recreate a hand-tuned C function in Halide. It is a series of histograms computed on vertical scanlines of the source image, so I'm using a one-dimensional RDom to iterate over the source image.

   RDom reductionY(0, input.height());

   parade(x,y,c) = Halide::cast<uint16_t>(0);
   parade(x, input(x, reductionY, c), c) += Halide::cast<uint16_t>(1);

To increase locality, I'm wrapping the reduction in another Func so I can schedule it with compute_at.

   wrapper(x,y,c) = parade(x, y, c);

   parade.update(0).reorder(c, reductionY, x);
   parade.update(0).split(x, x_outer, x_inner, THREADWIDTH);

   parade.compute_at(wrapper, x_outer);

This (plus some vectorization/parallelization I've stripped out for this question) closely matches my hand-tuned original. One thing the original benefits from, and that I can't figure out how to schedule, is a prefetch of the first read of each vertical line from input in the update(0) stage. If I schedule

   parade.update(0).prefetch(inputParam, x_inner, 3);

it seems to prefetch every pixel that will be read. My hope is to issue a single prefetch for the first pixel to be read.

1 Answer

At first glance, the code you posted doesn't seem to be complete: parade is computed at the x_outer dimension of wrapper, but wrapper has never been split to create such a dimension. Seeing the exact code would help, and you may also find both print_loop_nest and compiling to a lowered statement file useful for seeing the exact loop structure and figuring out where you want the prefetch to be executed.

Quickly, though, I don't believe prefetches can be issued for only a subset of the used data—logically, they apply to the whole block of the data to be used at a given granularity. Do you observe poor performance due to prefetching the whole column rather than a single pixel? Explicitly prefetching a single pixel seems likely to help only insofar as it may trigger the hardware prefetcher to speculatively fetch the whole column.
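To make that comparison concrete, here is a minimal plain C++ sketch of the single-prefetch pattern the question describes, useful as a baseline to measure against. It is not the asker's hand-tuned code: the function name column_histograms, the single-channel row-major layout, the stride parameter, and the prefetch distance kAhead are all assumptions for illustration, and __builtin_prefetch is a GCC/Clang builtin rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one 256-bin histogram per image column.
// Assumed layout: single-channel 8-bit image, row-major, `stride` bytes per row.
// Result: hist[x * 256 + v] = number of pixels in column x with value v.
// Note: uint16_t bins overflow if a value occurs more than 65535 times in a column.
std::vector<uint16_t> column_histograms(const uint8_t* src, int width,
                                        int height, std::size_t stride) {
    assert(src != nullptr && width > 0 && height > 0 &&
           stride >= static_cast<std::size_t>(width));

    std::vector<uint16_t> hist(static_cast<std::size_t>(width) * 256, 0);

    // Assumed prefetch distance in columns, mirroring the 3 in the
    // question's prefetch(inputParam, x_inner, 3) call.
    const int kAhead = 3;

    for (int x = 0; x < width; ++x) {
        // Issue exactly one prefetch: the first (top) pixel of a later column.
        // Arguments: address, 0 = read access, 3 = high temporal locality.
        if (x + kAhead < width) {
            __builtin_prefetch(src + x + kAhead, 0, 3);
        }

        uint16_t* bins = &hist[static_cast<std::size_t>(x) * 256];
        for (int y = 0; y < height; ++y) {
            // Walk down the column and count each pixel value.
            ++bins[src[y * stride + x]];
        }
    }
    return hist;
}
```

Timing this with and without the __builtin_prefetch line on the target machine would show whether the single prefetch matters by itself or only by nudging the hardware prefetcher.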

If this is a case where a known-better approach is not representable in the current Halide model, however, you should share it with the halide-dev list or as an issue on GitHub with a simple reproducer for your target platform (x86?).

jrk