The vectorize directive takes a TailStrategy argument, which controls how the end of the vectorized extent is handled. The behavior you are describing seems to be RoundUp, which is the default for reductions. The default for non-reductions is ShiftInwards. RoundUp imposes a constraint that the width is a multiple of the vectorization width. (Note that "multiple of 8" is not the same as "power of 8" as written above.) ShiftInwards imposes a constraint that the width is at least the vectorization width. ShiftInwards results in a small amount of redundant computation at the end of a loop, and thus cannot be used for reductions, as they are not idempotent (i.e. repeating part of the computation can change the result).
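To make the difference concrete, here is a rough sketch in plain C++ (no Halide involved) of where each strategy would start its vectors. The function names round_up_starts, shift_inwards_starts and sum_with_starts are made up for this illustration, and the code is only a model of the idea, not what Halide actually generates.

```cpp
#include <cassert>
#include <vector>

// Plain C++ model of two tail strategies. This is an illustration only,
// not Halide's implementation; all names here are hypothetical.
// Each returned value is the start index of one vector of `width` lanes.

// RoundUp model: step through the extent in whole vectors. If `extent` is not
// a multiple of `width`, the last vector runs past the end, which is why the
// strategy needs the "multiple of the vector width" constraint.
std::vector<int> round_up_starts(int extent, int width) {
    std::vector<int> starts;
    for (int x = 0; x < extent; x += width) {
        starts.push_back(x);
    }
    return starts;
}

// ShiftInwards model: same number of vectors, but a vector that would run
// past the end is shifted back so that it ends exactly at `extent`.
std::vector<int> shift_inwards_starts(int extent, int width) {
    assert(extent >= width);  // the "at least one vector wide" constraint
    std::vector<int> starts;
    for (int x = 0; x < extent; x += width) {
        starts.push_back(x + width <= extent ? x : extent - width);
    }
    return starts;
}

// A reduction (sum) that visits every lane of every vector. With overlapping
// vectors some elements are added twice, so the result is wrong.
int sum_with_starts(const std::vector<int>& data,
                    const std::vector<int>& starts, int width) {
    int total = 0;
    for (int s : starts) {
        for (int lane = 0; lane < width; ++lane) {
            total += data[s + lane];
        }
    }
    return total;
}
```

With an extent of 10 and a vector width of 8, the shifted vectors start at 0 and 2, so elements 2 through 7 are visited twice. That is harmless when each output is simply recomputed and stored again, but a sum over ten ones comes out as 16 instead of 10.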
There is also a GuardWithIf tail strategy. This is safe in all situations but tends to result in the code being scalarized and thus loses performance. We have plans to use vector predication to make this work better, though it is not clear this will pan out on all architectures.
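As a rough picture of why the guarded version is safe but tends to end up scalar, here is a plain C++ model (again not Halide output; double_guarded is a hypothetical name). Every lane is checked against the extent before it is touched, so nothing is read or written out of bounds and nothing is computed twice.

```cpp
#include <vector>

// Plain C++ model of a guarded tail (illustration only, hypothetical name).
// The per-lane bounds check is what makes this safe for any extent, and also
// what makes it hard for a compiler to keep the inner loop as one vector op.
std::vector<int> double_guarded(const std::vector<int>& in, int width) {
    const int extent = static_cast<int>(in.size());
    std::vector<int> out(in.size());
    for (int x = 0; x < extent; x += width) {
        for (int lane = 0; lane < width; ++lane) {
            if (x + lane < extent) {  // the guard
                out[x + lane] = 2 * in[x + lane];
            }
        }
    }
    return out;
}
```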
There are two other mechanisms to know about. The first is BoundaryConditions. This is what you are thinking of when you mention clamp. (At their core, the BoundaryConditions functions are based on clamp, but they do some other things to help the compiler out and should make the code a lot clearer.) Think of BoundaryConditions as a correctness issue, not a performance one. What do you want your algorithm to do when there is not enough input for a given output? Once you've decided on the right thing, it can be implemented via BoundaryConditions, or in some cases simply ignored because the situation is not allowed to happen.
BoundaryConditions generally come at some cost in performance. It is hopefully fairly minimal in common use, but it turns out to be hard to make them free on a lot of hardware.
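For a feel of what the clamp-based approach does, here is a small plain C++ model of a 3-tap sum that reads one element to either side of each output. The helpers read_clamped and box3 are hypothetical names for this sketch. Clamping the coordinate repeats the edge value, which is similar in spirit to Halide's BoundaryConditions::repeat_edge, though the real helpers do more than this.

```cpp
#include <algorithm>
#include <vector>

// Plain C++ model of a clamp-style boundary condition (illustration only).
// Out-of-range coordinates are clamped to the nearest valid index, so the
// edge value is repeated. Assumes `in` is non-empty.
int read_clamped(const std::vector<int>& in, int x) {
    const int last = static_cast<int>(in.size()) - 1;
    return in[std::max(0, std::min(x, last))];
}

// 3-tap sum: each output needs in[x - 1], in[x] and in[x + 1]. At both ends
// one of those reads falls outside the input, and the clamp decides what
// value is used instead.
std::vector<int> box3(const std::vector<int>& in) {
    const int extent = static_cast<int>(in.size());
    std::vector<int> out(in.size());
    for (int x = 0; x < extent; ++x) {
        out[x] = read_clamped(in, x - 1) + read_clamped(in, x) +
                 read_clamped(in, x + 1);
    }
    return out;
}
```

Note that the clamp runs on every read, including the interior ones that never need it. That is roughly where the performance cost comes from.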
The second mechanism is using specialize in the schedule. This allows handling cases that are the right size fast while dropping back to slower but correct code for the cases that are not. Generally you'd write something like:
f.specialize(input.width() % 8 == 0).vectorize(x, 8);
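Conceptually, that schedule gives you two versions of the loop and a runtime check that picks between them. A hand-written plain C++ analogue might look like the sketch below. The Path enum and the scale function are hypothetical names, and this is only a model of the dispatch, not what Halide emits.

```cpp
#include <vector>

// Plain C++ model of a specialized schedule (illustration only).
enum class Path { Vectorized, Fallback };

// Multiplies every element of `in` by `k`, writing to `out`, and reports
// which code path handled the call.
Path scale(const std::vector<float>& in, std::vector<float>& out, float k) {
    const int width = static_cast<int>(in.size());
    out.resize(in.size());
    if (width % 8 == 0) {
        // Fast path: the width is a multiple of 8, so the loop is made of
        // whole 8-lane chunks and needs no tail handling at all.
        for (int x = 0; x < width; x += 8) {
            for (int lane = 0; lane < 8; ++lane) {
                out[x + lane] = k * in[x + lane];
            }
        }
        return Path::Vectorized;
    }
    // Fallback: slower, but correct for any width.
    for (int x = 0; x < width; ++x) {
        out[x] = k * in[x];
    }
    return Path::Fallback;
}
```

Both paths produce the same values. The only difference is that the first can be compiled under the assumption that no partial vector exists.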