The loop nest printed for Halide::sum does not match the optimal one described in the tutorial.
The code below produces separate loops for zero initialization and summation:
Halide::Func f("f");
Halide::Var x("x");
Halide::RDom r(0, 3);
f(x) = Halide::sum(r + x);
f.print_loop_nest();
f.realize(10);
Output:
produce f:
  for x:
    produce sum:
      for x:
        sum(...) = ...
      for x:
        for r4:
          sum(...) = ...
    consume sum:
      f(...) = ...
Can these loops be fused, or does it not affect performance? Thanks!
Update: I mean fusing them like this:
produce f:
  for x:
    produce sum:
      for x:
        sum(...) = ...
        for r4:
          sum(...) = ...
    consume sum:
      f(...) = ...
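To make the difference between the two nests concrete, here is a plain C++ sketch (not Halide code) of what each one computes for f(x) = sum over r in [0, 3) of (r + x). The function names `separate_loops` and `fused_loops` are hypothetical. The sketch assumes the inner `for x` loops cover only the single x currently being produced, which is my reading of the printed nest.

```cpp
#include <cassert>
#include <vector>

// Extent of the reduction domain, matching RDom r(0, 3) in the question.
constexpr int kReductionExtent = 3;

// Mirrors the printed loop nest: the intermediate "sum" is zero-initialized
// in one loop over x and accumulated in a second, separate loop over x.
std::vector<int> separate_loops(int width) {
    assert(width >= 0);
    std::vector<int> f(width);
    for (int x = 0; x < width; ++x) {                     // for x (over f)
        int sum = 0;                                      // storage for "sum"
        for (int xi = x; xi <= x; ++xi) {                 // for x: initialization
            sum = 0;
        }
        for (int xi = x; xi <= x; ++xi) {                 // for x: update
            for (int r = 0; r < kReductionExtent; ++r) {  // for r4
                sum += r + xi;
            }
        }
        f[x] = sum;                                       // consume sum
    }
    return f;
}

// Mirrors the nest from the update: initialization and accumulation
// share a single inner loop over x.
std::vector<int> fused_loops(int width) {
    assert(width >= 0);
    std::vector<int> f(width);
    for (int x = 0; x < width; ++x) {                     // for x (over f)
        int sum = 0;                                      // storage for "sum"
        for (int xi = x; xi <= x; ++xi) {                 // for x: init + update
            sum = 0;
            for (int r = 0; r < kReductionExtent; ++r) {  // for r4
                sum += r + xi;
            }
        }
        f[x] = sum;                                       // consume sum
    }
    return f;
}
```

Both functions return the same values; under the assumption above, the two inner `for x` loops each run exactly once per outer iteration, so the two nests do the same amount of arithmetic. The sketch does not measure how Halide's generated code actually performs in either form.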