I'm implementing an algorithm in Halide and comparing it against a hand-tuned CUDA version of the same algorithm. Optimizing the Halide implementation mostly went well, but it is still a bit slower than the hand-tuned version. So I measured the exact execution time of each Func using nvvp (NVIDIA Visual Profiler). From that, I found that the hand-tuned implementation overlaps the execution of multiple similar functions, each of which is implemented as a Func in the Halide implementation. It uses CUDA streams to do this.
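To illustrate the kind of overlap I mean, here is a minimal sketch and not the actual hand-tuned code. The kernel `scale`, the stream count, and the buffer sizes are hypothetical stand-ins. Each launch goes into its own CUDA stream, so the GPU is allowed to run the kernels concurrently when resources permit.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in for one of the similar functions.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4;   // assumed number of independent functions
    const int n = 1 << 16;    // assumed elements per buffer
    cudaStream_t streams[kStreams];
    float *buffers[kStreams];

    // One stream and one device buffer per independent piece of work.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemsetAsync(buffers[s], 0, n * sizeof(float), streams[s]);
    }

    // Launches in different non-default streams have no implicit ordering
    // between them, so the kernels may overlap on the device.
    for (int s = 0; s < kStreams; ++s) {
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n, 2.0f);
    }

    // Wait for all streams to finish and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

In the profiler timeline, the kernels from the different streams show up side by side instead of one after another, whereas the Halide Funcs appear to run strictly in sequence.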
I would like to know whether I can exploit the GPU in a similar way in Halide.
Thank you for reading.