What do we know about the unrolling capabilities of nvcc when it encounters a #pragma unroll directive? How sophisticated is it? Has anyone experimented with progressively more complex loop structures to see where it gives up?
For example,
#pragma unroll
for(int i = 0; i < constexpr_value; i++) { foo(i); }
will surely unroll (up to a rather large trip count, see this answer). What about:
#pragma unroll
for(int i = 0; i < runtime_variable_value and i < constexpr_value; i++) {
    foo(i);
}
The trip count is not known at compile time here, but it has a constant upper bound, so the loop could still be completely unrolled, with a conditional jump out after each copy of the body.
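To be concrete about the transformation I have in mind, here is a minimal sketch in plain host C++. It only illustrates what "complete unrolling with conditional jumps" would mean at the source level; it is not a claim about what nvcc actually emits. The names kMaxTrips, calls, rolled, unrolled and same_calls are hypothetical stand-ins, and foo here just records its argument.

```cpp
#include <vector>

// Hypothetical stand-ins, not taken from any real code base.
constexpr int kMaxTrips = 4;      // plays the role of constexpr_value
static std::vector<int> calls;    // records every foo(i) call for inspection
inline void foo(int i) { calls.push_back(i); }

// Rolled form: the trip count is min(n, kMaxTrips), unknown until run time.
void rolled(int n) {
    for (int i = 0; i < n && i < kMaxTrips; i++) { foo(i); }
}

// Hand-unrolled form: kMaxTrips copies of the body, each preceded by an
// early exit that replaces the run-time part of the loop condition.
void unrolled(int n) {
    if (0 >= n) return;
    foo(0);
    if (1 >= n) return;
    foo(1);
    if (2 >= n) return;
    foo(2);
    if (3 >= n) return;
    foo(3);
}

// Returns true when both forms make the same sequence of foo() calls for n.
bool same_calls(int n) {
    calls.clear();
    rolled(n);
    std::vector<int> expected = calls;
    calls.clear();
    unrolled(n);
    return calls == expected;
}
```

The two forms make the same calls for any n, including values below zero or above kMaxTrips. The question is whether the compiler performs this rewrite on its own when it sees the pragma.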
And then, what about:
template <typename T>
constexpr T simple_min(const T& x, const T& y) { return x < y ? x : y; }
#pragma unroll
for(int i = 0; i < simple_min(runtime_variable_value, constexpr_value); i++) {
    foo(i);
}
which should compile to the same thing as the above?
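As a sanity check on the "same thing" claim, here is a small host-side C++ sketch showing that the two loop conditions give the same trip count. It says nothing about whether nvcc treats them identically for unrolling purposes, which is exactly what I am asking. kBound, trips_and and trips_min are hypothetical names introduced only for this illustration.

```cpp
template <typename T>
constexpr T simple_min(const T& x, const T& y) { return x < y ? x : y; }

constexpr int kBound = 8;  // hypothetical stand-in for constexpr_value

// Trip count of the loop whose condition joins the two bounds with &&.
int trips_and(int n) {
    int count = 0;
    for (int i = 0; i < n && i < kBound; i++) { count++; }
    return count;
}

// Trip count of the loop whose bound is simple_min(n, kBound).
int trips_min(int n) {
    int count = 0;
    for (int i = 0; i < simple_min(n, kBound); i++) { count++; }
    return count;
}

// simple_min is usable in constant expressions when both arguments are constant.
static_assert(simple_min(3, kBound) == 3, "min of two constants folds at compile time");
```

Since i < min(a, b) holds exactly when i < a and i < b both hold, the two functions return the same count for every n.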
Note: If you intend to answer "conduct your own experiments": I do intend to do that, at least for my examples, and to look at the PTX if nobody already knows the general answer, in which case I'll partially answer this question myself. But I would prefer something more authoritative and based on wider experience.