2

Consider the following code:

for (int i = 0; i < 8; ++i) {
    // ... some code
}

I want this loop to be unrolled in MSVC. In Clang I can add #pragma unroll before the loop. How can I do the same in MSVC?

I understand that compilers will often unroll this loop for me even without any pragmas. But I want to be sure about this; I want it to be unrolled always.

Of course, one way to force unrolling is to recursively call a templated unrolling function with a passed-in functor, as in the following code:

Try it online!

template <int N, int I = 0, typename F>
inline void Unroll(F const & f) {
    if constexpr(I < N) {
        f.template operator() <I> ();
        Unroll<N, I + 1>(f);
    }
}

void f_maybe_not_unrolled() {
    int volatile x = 0;
    for (int i = 0; i < 8; ++i)
        x = x + i;
}

void f_forced_unrolled() {
    int volatile x = 0;
    Unroll<8>([&]<int I>{ x = x + I; });
}

But is it possible to force unrolling in MSVC without more complicated code like the above?

Also, is it possible to really force unrolling in Clang? I think #pragma unroll just gives Clang a hint (or am I wrong?). Is there something like #pragma force_unroll?

Also, I want to unroll just this single loop; I don't need a solution like passing command-line arguments that force unrolling of ALL possible loops.

Note: It is not crucial for me that the code is forcibly unrolled in 100% of cases. I just need it to happen in most cases. Basically, I want to find the MSVC equivalent of Clang's #pragma unroll, which on average makes the compiler more likely to unroll a loop than it would be without the pragma.

Arty
  • 1
    Note that even the `f_forced_unrolled` isn't actually forced. An optimizing compiler may still say "hey, that linear code looks like a loop, let's make it one". If you want assembly, write assembly. – MSalters May 19 '21 at 09:17
  • @MSalters Yes, that's true. But at least right now, if you look at the assembly in the Try-it-online godbolt link above, you'll see that the unforced function was unrolled by Clang but not by MSVC, while the force-unrolled function was unrolled by both Clang and MSVC. So at least on average my Unroll function gives more unrolling than doing without it. Also, I think it is possible to really force unrolling by using the `I` index of my Unroll function in some templated constexpr context; the compiler would then be unable to make a loop out of it, because `I` is used as a constexpr. – Arty May 19 '21 at 09:25
  • 1
    True, for this specific bit of code the optimizer will typically predict that the unrolled version is faster on common CPUs. But this depends on code size and thus cache impact. With respect to the `constexpr`, that does not matter at all for my argument. The loop creation could happen in the code generation phase, when the original C++ tokens are long forgotten. – MSalters May 19 '21 at 09:29
  • @MSalters If it is not possible to really force it, then I at least want unrolling to happen with higher probability. Right now the regular loop is not unrolled by MSVC, while the Unroll-ed loop is. Actually, I just want to find out whether MSVC has an equivalent of Clang's `#pragma unroll`; on average that will be enough for me for now. Do you know if MSVC has any such pragma? – Arty May 19 '21 at 09:33
  • 1
    This feels a little bit like an XY problem. IME, the VC++ optimizer is *very good* at unrolling. If it decides not to unroll something, it's because it's determined that it would be a bad idea perf-wise (e.g. it feels it might blow the I-cache unnecessarily), and it's probably right. Are you interested in unrolling purely as a perf thing, or is there some deeper hack underlying your desire for it? – Sneftel May 19 '21 at 09:42
  • @Sneftel Right now, optimization only. But I wanted to know whether there is a mechanism for forcing this, just for the future, not for the current program. – Arty May 19 '21 at 09:44

2 Answers

4

You can't directly. The closest #pragma is #pragma loop(...), and that doesn't have an unroll option. The big hammer here is Profile Guided Optimization - profile your program, and MSVC will know how often this loop runs.
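
A minimal sketch of what a per-loop hint could look like in portable code, under some assumptions: the macro name `UNROLL_HINT` and the function `sum_below_eight` are hypothetical, the Clang branch uses `#pragma clang loop unroll(full)`, the GCC branch uses `#pragma GCC unroll` (which needs GCC 8 or later), and the fallback branch expands to nothing because, as noted above, MSVC has no unroll pragma. As far as I know these are still requests to the optimizer, not guarantees.

```cpp
// Hypothetical macro: expands to a per-loop unroll hint where the compiler has one.
#if defined(__clang__)
#  define UNROLL_HINT _Pragma("clang loop unroll(full)")
#elif defined(__GNUC__)
#  define UNROLL_HINT _Pragma("GCC unroll 8")
#else
#  define UNROLL_HINT  // e.g. MSVC: no per-loop unroll pragma, so nothing is emitted
#endif

// Hypothetical example function: the hint must sit directly before the loop.
inline int sum_below_eight() {
    int x = 0;
    UNROLL_HINT
    for (int i = 0; i < 8; ++i)
        x += i;
    return x;
}
```

The hint only affects the loop that immediately follows it, so other loops in the program are left to the compiler's default heuristics.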

MSalters
  • In MSVC, if I do PGO (Profile Guided Optimization), will the compiler definitely unroll if it thinks it should be done? I just wonder whether MSVC does automatic unrolling at all, because Clang usually does a lot of unrolling, judging by the generated assembly code. – Arty May 31 '22 at 14:26
  • 1
    @Arty: As Sneftel pointed out last year, VC++ is generally very good at it. This is really an old optimization, the more interesting option is vectorization. Why do 4 instructions sequentially when you can do 4 in parallel? Again, PGO is a lot of work but gives the compiler maximum insight. – MSalters May 31 '22 at 14:30
1

This is much simpler with fold expressions:

#include <concepts>
#include <cstddef>
#include <utility>

template<size_t N, typename Fn>
#if defined(__cpp_concepts)
    requires (N >= 1) && requires( Fn fn ) { { fn.template operator ()<(size_t)N - 1>() } -> std::convertible_to<bool>; }
#endif
inline bool unroll( Fn fn )
{
    auto unroll_n = [&]<size_t ... Indices>( std::index_sequence<Indices ...> ) -> bool
    {
        return (fn.template operator ()<Indices>() && ...);
    };
    return unroll_n( std::make_index_sequence<N>() );
}

This becomes really powerful when you use it to unroll a loop over a range:

#include <iterator>

template<std::size_t N, typename RandomIt, typename UnaryFunction>
#if defined(__cpp_concepts)
    requires std::random_access_iterator<RandomIt>
    && requires( UnaryFunction fn, std::iter_value_t<RandomIt> elem ) { { fn( elem ) } -> std::same_as<bool>; }
#endif
inline RandomIt unroll_for_each( RandomIt begin, RandomIt end, UnaryFunction fn )
{
    // it aliases begin, so advancing it also advances begin
    RandomIt &it = begin;
    if constexpr( N > 1 )
        // process N elements per iteration; stop early if fn returns false
        for( ; it + N <= end && unroll<N>( [&]<size_t I>() { return fn( it[I] ); } ); it += N );
    // handle the remaining elements one at a time
    for( ; it < end; ++it )
        fn( *it );
    return it;
}

The peculiarity is that the it + N <= end check is done once per N elements rather than once per element. The checks of the unroll return value may get eliminated if the lambda always returns true for each element.
I optimized Fletcher's hash with this and got a speedup of 60%, resulting in about 18 GB/s, with an unrolling factor of five on my Zen 1 CPU.
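
For callables that return nothing, the same fold can be written with the comma operator instead of `&&`. The following is a minimal sketch of that variant, not the code above: it assumes C++17 only, passes the index as a `std::integral_constant` argument instead of using a templated `operator()`, and the names `unroll_void`, `unroll_void_impl` and `sum_first_eight` are hypothetical.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical helper: expands to fn(0), fn(1), ..., fn(N-1) via a comma fold,
// where each index is passed as a std::integral_constant value.
template<std::size_t... Indices, typename Fn>
inline void unroll_void_impl( Fn &fn, std::index_sequence<Indices...> )
{
    (fn( std::integral_constant<std::size_t, Indices>{} ), ...);
}

// Hypothetical entry point: calls fn once per index in [0, N), with no loop in the source.
template<std::size_t N, typename Fn>
inline void unroll_void( Fn fn )
{
    unroll_void_impl( fn, std::make_index_sequence<N>{} );
}

// Usage: adds 0 + 1 + ... + 7 without writing a loop.
inline int sum_first_eight()
{
    int x = 0;
    unroll_void<8>( [&]( auto index )
    {
        // the index is also available as a compile-time constant
        constexpr std::size_t I = decltype(index)::value;
        x += static_cast<int>( I );
    } );
    return x;
}
```

Because there is no return value to test, this variant cannot stop early the way the `&&` fold does; every index is always visited.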

  • Thanks! Great answer. In your line `return (fn.template operator ()<Indices>() && ...);` I see you expect the lambda to return `bool`. What if my lambda doesn't return anything (i.e. void)? How should I rewrite this line then? Probably with a comma instead of `&&`, like `(fn.template operator ()<Indices>(), ...);`. Is that the right way to rewrite it for void? – Arty Oct 09 '22 at 19:44
  • Yes, that's also possible with a comma. But for me it would be best to always return true which is optimized away. –  Nov 01 '22 at 17:42