Pros/cons of different methods of loop unrolling using template metaprogramming

Question

I'm interested in general solutions for loop unrolling at compile time. I'm using this in a SIMD setting where each function call takes a specific number of clock cycles and multiple calls can be performed in parallel, so I need to tune the number of accumulators to minimise wasted cycles. Adding accumulators and unrolling manually yields significant improvements, but is laborious.
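
To make the "multiple accumulators" idea concrete, here is a minimal scalar sketch of a manually unrolled reduction. The function name `sum4` and the choice of four accumulators are assumptions made for illustration only; the real code in question uses SIMD operations and a tuned accumulator count.

```cpp
#include <cstddef>

// Illustrative only: sum n floats using four independent accumulators, so
// that consecutive additions do not depend on each other's results.
// sum4 is a hypothetical name, not part of the original code.
float sum4(const float* data, std::size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    // Main loop: four elements per iteration, one per accumulator.
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i) {
        a0 += data[i];
    }
    // Combine the partial sums once at the end.
    return (a0 + a1) + (a2 + a3);
}
```

Changing the number of accumulators means rewriting the loop body by hand, which is the laborious part a compile-time `unroll<N>` is meant to remove. Note that splitting a floating-point sum this way changes the order of the additions, so results can differ slightly from a single-accumulator loop.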

Ideally I'd like to be able to write things like

unroll<N>(f,args...); // with f a pre-defined function
unroll<N>([](...) { ... },args...); // using a lambda

and generate the following:

f(1,args...);
f(2,args...);
...
f(N,args...);

So far I have three different template metaprogramming solutions, and I am wondering what the advantages and disadvantages of each approach are, especially regarding how the compiler will inline the function calls.

Approach 1 (recursive function)

template <int N> struct _int{ };

template <int N, typename F, typename ...Args>
inline void unroll_f(_int<N>, F&& f, Args&&... args) {      
    unroll_f(_int<N-1>(),std::forward<F>(f),std::forward<Args>(args)...);
    f(N,args...);
}
template <typename F, typename ...Args>
inline void unroll_f(_int<1>, F&& f, Args&&... args) {
    f(1,args...);
}

Call syntax example:

int x = 2;
auto mult = [](int n,int x) { std::cout << n*x << " "; };

unroll_f(_int<10>(),mult,x); // also works with anonymous lambda
unroll_f(_int<10>(),mult,2); // same syntax when argument is temporary

Approach 2 (recursive constructor)

template <int N, typename F, typename ...Args>
struct unroll_c {
    unroll_c(F&& f, Args&&... args) {            
        unroll_c<N-1,F,Args...>(std::forward<F>(f),std::forward<Args>(args)...);
        f(N,args...);
    }
};
template <typename F, typename ...Args>
struct unroll_c<1,F,Args...> {
    unroll_c(F&& f, Args&&... args) {
        f(1,args...);
    }
};

Call syntax is pretty ugly:

unroll_c<10,decltype(mult)&,int&>(mult,x); 
unroll_c<10,decltype(mult)&,int&>(mult,2); // doesn't compile

and the type of the function must be specified explicitly if using an anonymous lambda, which is awkward.

Approach 3 (recursive static member function)

template <int N>
struct unroll_s {
    template <typename F, typename ...Args>
    static inline void apply(F&& f, Args&&... args) {
        unroll_s<N-1>::apply(std::forward<F>(f),std::forward<Args>(args)...);        
        f(N,args...);
    }
    // can't use static operator() instead of 'apply'
};
template <>
struct unroll_s<1> {
    template <typename F, typename ...Args>
    static inline void apply(F&& f, Args&&... args) {
        f(1,std::forward<Args>(args)...);
    }
};

Call syntax example:

unroll_s<10>::apply(mult,x);
unroll_s<10>::apply(mult,2);

In terms of syntax this third approach seems the cleanest and clearest, but I'm wondering if there may be differences in how the three approaches are treated by the compiler.

Most modern day compilers can and will implement loop unrolling where they find it helpful. Don't try to outsmart the compiler, just write correct, readable code and let the compiler optimizations do their job. — Cory Kramer, Aug 11 '15 at 23:22
Pros and cons will be mostly opinion based, thus the question is off-topic. — πάντα ῥεῖ, Aug 11 '15 at 23:27
@CoryKramer: At least not before profiling shows the necessity and effectiveness of trying to apply such an "optimization". Just to cover the far less than 1% of special cases too. — Deduplicator, Aug 11 '15 at 23:30
@CoryKramer Fully efficient loop unrolling cannot be performed (in general) by the compiler unless it knows the value of N at compile time. — Nir Friedman, Aug 11 '15 at 23:38
If you know the value of N at compile time, you can just use a for loop over a regular std::array. The size of the array is known to the compiler and it will do appropriate unrolling. Full unrolling is usually not a good idea: once you pass a certain unroll factor, the code bloat outweighs the saved branches. — Nir Friedman, Aug 11 '15 at 23:42
Actually most compilers are much more likely to fail to unroll the second inner loop. I came across a case where the compiler generates shitty code for a 1x(6N+3)*128*512 loop, and by trial and error I found the optimal unrolling to be 12x(2N+1)x32x512: unrolling 128 by 4, and it surprisingly works for all N=1:7. — user3528438, Aug 11 '15 at 23:51
@CoryKramer As I mentioned in the question, I have manually unrolled various loops in my particular application setting and found significant performance benefits (with gcc at -O2), hence there is a practical purpose to this. — j_h, Aug 12 '15 at 01:33
@πάνταῥεῖ I'm mainly looking for pros and cons from the perspective of the efficiency of the resulting code (especially with respect to whether the function calls can be inlined), so was hoping for a technical answer rather than just opinions. Suggestions of alternative techniques are also welcome. — j_h, Aug 12 '15 at 01:36
Technical answers would be dependent on your actual Compiler implementation IMHO. — πάντα ῥεῖ, Aug 12 '15 at 01:42

Dietmar Kühl · Answer 1 · 2015-08-12T06:35:48.473

First off, compilers tend to know quite well when it is opportune to unroll loops, so I'm not suggesting that you unroll loops explicitly. On the other hand, the index can be used as an index into a type map, in which case unrolling is more or less necessary in order to generate the versions for the different types.
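
As one illustration of an index selecting a type, here is a minimal sketch that visits every element of a `std::tuple`. Each index `I` picks an element of a different type through `std::get<I>`, so an ordinary runtime loop cannot express it. The names `for_each_in_tuple` and `for_each_in_tuple_impl` are hypothetical helpers for this sketch, and it assumes C++14.

```cpp
#include <cstddef>
#include <initializer_list>
#include <tuple>
#include <utility>

// Hypothetical helper: expands I... into one call of f per tuple element.
template <typename Tuple, typename F, std::size_t... I>
void for_each_in_tuple_impl(Tuple& t, F&& f, std::index_sequence<I...>) {
    // The elements of a braced list are evaluated left to right, so f sees
    // the tuple elements in order. Each call has a different argument type.
    (void)std::initializer_list<int>{(f(std::get<I>(t)), 0)...};
}

// Hypothetical entry point: builds the index sequence 0 .. sizeof...(Ts)-1.
template <typename... Ts, typename F>
void for_each_in_tuple(std::tuple<Ts...>& t, F&& f) {
    for_each_in_tuple_impl(t, std::forward<F>(f),
                           std::index_sequence_for<Ts...>{});
}
```

The callable has to accept every element type, for example a generic lambda such as `[](auto& x) { ... }`.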

My personal approach would be to avoid the recursion, though, and instead have the unrolling handled by an index expansion. Here is a simple demo of a version which is nicely called and used. The same technique to pass the number of arguments could be used with a recursive approach as in your example. I think the notation is preferable:

#include <iostream>
#include <utility>
#include <initializer_list>

template <typename T> struct unroll_helper;
template <std::size_t... I>
struct unroll_helper<std::integer_sequence<std::size_t, I...> > {
    template <typename F, typename... Args>
    static void call(F&& fun, Args&&... args) {
        std::initializer_list<int>{(fun(I, args...), 0)...};
    }
};

template <int N, typename F, typename... Args>
void unroll(F&& fun, Args&&... args)
{
    unroll_helper<std::make_index_sequence<N> >::call(std::forward<F>(fun), std::forward<Args>(args)...);
}

void print(int index, int arg) {
    std::cout << "print(" << index << ", " << arg << ")\n";
}

int main()
{
    unroll<3>(&print, 17);
}
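
Note that the indices in the demo above run from 0 to N-1, whereas the question asks for `f(1,args...)` through `f(N,args...)`. A minimal sketch of a 1-based variant follows, assuming C++17 is available so that a comma fold can replace the `std::initializer_list` trick. The names `unroll_from_one` and `unroll_from_one_impl` are hypothetical and not part of the code above.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: calls fun(1, args...), ..., fun(N, args...) in order.
template <std::size_t... I, typename F, typename... Args>
void unroll_from_one_impl(std::index_sequence<I...>, F&& fun, Args&&... args) {
    // Comma fold (C++17): the operands are evaluated left to right.
    // args are deliberately passed as lvalues because every call reuses them.
    (static_cast<void>(fun(static_cast<int>(I + 1), args...)), ...);
}

// Hypothetical entry point matching the unroll<N>(f, args...) call syntax.
template <std::size_t N, typename F, typename... Args>
void unroll_from_one(F&& fun, Args&&... args) {
    unroll_from_one_impl(std::make_index_sequence<N>{},
                         std::forward<F>(fun), args...);
}
```

A call such as `unroll_from_one<10>(mult, x)` then works with a named function, a named lambda, or an anonymous lambda, and with temporaries as arguments. Whether the individual calls get inlined is still up to the compiler.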

@m.s.: it's a leftover from building the solution. I'll remove it. — Dietmar Kühl, Aug 12 '15 at 06:35
