I have a generator that uses .vectorize(x, 8) in its schedule. The issue I face is that if my output buffer width is not a power of 8, I get accesses outside the buffer. I can certainly clamp the input x, y to the size of the image, but I'm wondering whether there is a way to do this with the Output<Func> in my generator. Perhaps I'm not looking at the problem the right way?

class BasicGenerator : public Generator<BasicGenerator>
{
public:
    Var x, y;

    Input<Func> input { "input", UInt(8), 2 };
    Output<Func> output { "output", UInt(8), 2 };

    void generate()
    {
        output(x, y) = input(x, y);
    }

    void schedule()
    {
        output.vectorize(x, 8).parallel(y);
    }
};
Philippe Paré
  • 1) Show what you've tried. 2) Why is this tagged C? It doesn't look like C... – Dmitri Apr 28 '17 at 15:22
  • 1) It's not really a question of trying something; if you know Halide I don't think you need anything else. 2) Sorry, I meant to tag C++ – Philippe Paré Apr 28 '17 at 15:23
  • Show your actual code, not half of a line of code where you assume the problem is. – paddy Apr 28 '17 at 15:30
  • What arguments are you passing to the generator when you run it? Set `HL_DEBUG_CODEGEN=1` in the environment and when you run the generator, you will get a (rather verbose) representation of what it is trying to execute. – Khouri Giordano Apr 28 '17 at 15:51

1 Answer

The vectorize directive takes a TailStrategy argument. This controls how the end of the vectorized extent is handled. The behavior you are describing seems to be RoundUp, which is the default for reductions. The default for non-reductions is ShiftInwards. RoundUp imposes a constraint that the width is a multiple of the vectorization width. (Note, "multiple of 8" is not the same as "power of 8" as written above.) ShiftInwards imposes a constraint that the width is at least the vectorization size. ShiftInwards results in a small amount of redundant computation at the end of a loop and thus cannot be used for reductions as they are not idempotent. (I.e. repeating part of the computation can change the result.)

There is also a GuardWithIf tail strategy. This is safe in all situations but tends to result in the code being scalarized and thus loses performance. We have plans to use vector predication to make this work better, though it is not clear this will pan out on all architectures.
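To make the difference between the three tail strategies concrete, here is a minimal sketch in plain C++. It is not Halide code and not how Halide generates loops; it only models which x coordinates a loop vectorized by `vec` would touch for an output of `width` pixels. The function names `round_up`, `shift_inwards` and `guard_with_if` are hypothetical, chosen for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Plain C++ model (NOT Halide code) of the x coordinates touched by a loop
// vectorized by `vec` over an output that is `width` pixels wide.
// All function names here are hypothetical, for illustration only.

// RoundUp: the extent is rounded up to a multiple of `vec`, so the last
// vector can run past the end of the buffer.
std::vector<int> round_up(int width, int vec) {
    std::vector<int> xs;
    int rounded = (width + vec - 1) / vec * vec;
    for (int base = 0; base < rounded; base += vec)
        for (int lane = 0; lane < vec; ++lane)
            xs.push_back(base + lane);
    return xs;
}

// ShiftInwards: the last vector is moved back so it ends exactly at the
// edge. Some pixels are computed twice, and width must be at least `vec`.
std::vector<int> shift_inwards(int width, int vec) {
    assert(width >= vec);
    std::vector<int> xs;
    for (int base = 0; base < width; base += vec) {
        int start = std::min(base, width - vec);
        for (int lane = 0; lane < vec; ++lane)
            xs.push_back(start + lane);
    }
    return xs;
}

// GuardWithIf: every lane is checked against the real extent, so nothing
// is out of bounds and nothing is recomputed.
std::vector<int> guard_with_if(int width, int vec) {
    std::vector<int> xs;
    for (int base = 0; base < width; base += vec)
        for (int lane = 0; lane < vec; ++lane)
            if (base + lane < width)
                xs.push_back(base + lane);
    return xs;
}
```

For a width of 10 and a vector width of 8, the RoundUp model touches x = 0..15 (out of bounds from 10 onward), the ShiftInwards model touches 0..7 and then 2..9 (in bounds, with 2..7 repeated), and the GuardWithIf model touches exactly 0..9.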

There are two other mechanisms to know about. The first is BoundaryConditions. This is what you are thinking of in mentioning the clamp. (At their core, the BoundaryConditions functions are based on clamp, but they do some other things to help the compiler out and should make the code a lot clearer.) Think of BoundaryConditions as a correctness issue, not a performance one. What do you want your algorithm to do when there is not enough input for a given output? Once you've decided on the right thing, it can be implemented via BoundaryConditions, or in some cases simply ignored as it is not allowed to happen.

BoundaryConditions generally come at some cost in performance. It is hopefully fairly minimal in common use, but it turns out to be hard to make them free on a lot of hardware.
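As a rough illustration of the clamp-based idea behind a boundary condition, the sketch below reads a pixel from a row-major image and clamps out-of-range coordinates to the nearest edge. It is plain C++, not the Halide BoundaryConditions implementation, and the helper name `read_clamped` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (NOT Halide code): read pixel (x, y) from a row-major
// 8-bit image, clamping out-of-range coordinates to the nearest edge.
uint8_t read_clamped(const std::vector<uint8_t> &img, int width, int height,
                     int x, int y) {
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<size_t>(width) * height);
    int cx = std::max(0, std::min(x, width - 1));   // clamp x to [0, width-1]
    int cy = std::max(0, std::min(y, height - 1));  // clamp y to [0, height-1]
    return img[static_cast<size_t>(cy) * width + cx];
}
```

With this in place, a read that would fall outside the image repeats the edge pixel instead. That is a decision about what the algorithm should produce, and the two extra comparisons per read are a small example of the cost mentioned above.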

The second mechanism is using specialize in the schedule. This allows handling cases that are the right size fast while dropping back to slower but correct code for the cases that are not. Generally you'd write something like:

f.specialize(input.width() % 8 == 0).vectorize(x, 8);
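Conceptually, a specialization is a runtime branch that picks a fast code path when the condition holds and a general path otherwise. The sketch below models that in plain C++ for a row copy. It is not what Halide emits, and `copy_row` and `Path` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Which code path a call took (hypothetical, for illustration only).
enum class Path { Vectorized, Scalar };

// Plain C++ model (NOT Halide output) of a schedule specialized on
// width % 8 == 0: a tail-free fast path plus a slower general fallback.
Path copy_row(const uint8_t *src, uint8_t *dst, int width) {
    assert(width >= 0);
    if (width % 8 == 0) {
        // Fast path: the row splits into whole chunks of 8, so no tail
        // handling is needed.
        for (int x = 0; x < width; x += 8)
            std::memcpy(dst + x, src + x, 8);
        return Path::Vectorized;
    }
    // Fallback: correct for any width, one element at a time.
    for (int x = 0; x < width; ++x)
        dst[x] = src[x];
    return Path::Scalar;
}
```

Both branches are compiled in; the condition only decides which one runs for a given call. A width of 10 takes the scalar branch entirely in this model.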
Zalman Stern
  • That specialize isn't quite right, but in the right direction. Keep checking the lowered statement with `HL_DEBUG_CODEGEN=1` to be sure Halide is doing what you think you're telling it to do. – Khouri Giordano Apr 28 '17 at 16:29
  • @KhouriGiordano can you explain why the specialize is not quite right? – Philippe Paré Apr 28 '17 at 17:31
  • @zalman-stern Unfortunately I'm getting an error when using `specialize(input.width() % 8 == 0).vectorize(x, 8)`, it's telling me: `Argument passed to specialize must be of type bool`... – Philippe Paré Apr 28 '17 at 18:27
  • Wouldn't you want it to vectorize 2^20 of 2^20+1 values? I would think vectorize is good for `input.width() >= 8`. If you're doing this work on a CPU, `TailStrategy::GuardWithIf` will work well for any width. – Khouri Giordano May 01 '17 at 17:35
  • I'm not sure why that is not compiling for you. I believe the specialize call above is correct in the types. The example I gave was intended to illustrate targeting vectorize only at exactly sized cases. Generally one will also need to guarantee that the min is aligned, as well as the extent, for this to result in the minimal vectorized code. One can combine specialize and scheduling in many ways, including cascading specialize calls. – Zalman Stern May 04 '17 at 16:25