I'm interested in general solutions for loop unrolling at compile time. I'm using this in a SIMD setting where each function call takes a specific number of clock cycles and multiple calls can be performed in parallel, so I need to tune the number of accumulators to minimise wasted cycles. Adding accumulators and unrolling by hand yields significant improvements, but is laborious.
Ideally I'd like to be able to write things like:
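As a rough scalar illustration of the kind of manual unrolling described above (the name sum4 and the choice of four accumulators are hypothetical, and this is not the actual SIMD kernel), each accumulator forms its own dependency chain, so the four additions in one iteration can overlap:

```cpp
#include <cstddef>

// Hypothetical example: sum an array using four independent accumulators.
// The four additions per iteration do not depend on one another.
inline float sum4(const float* data, std::size_t n) {
    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }
    for (; i < n; ++i) acc0 += data[i];  // leftover elements when n % 4 != 0
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Changing the number of accumulators means rewriting the loop body by hand, which is the laborious part a compile-time unroll would remove.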
unroll<N>(f,args...); // with f a pre-defined function
unroll<N>([](...) { ... },args...); // using a lambda
and generate the following:
f(1,args...);
f(2,args...);
...
f(N,args...);
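To make the intended expansion concrete, here is a minimal hand-written sketch for N = 3. The function record and the wrapper unrolled_by_hand are hypothetical stand-ins for f and for the generated code; they only serve to show the call order.

```cpp
#include <vector>

// Hypothetical stand-in for f: stores n * x so the call order is visible.
inline void record(int n, int x, std::vector<int>& out) { out.push_back(n * x); }

// What unroll<3>(record, x, out) is meant to expand to: three plain calls,
// in ascending order of the index, with no loop left at run time.
inline std::vector<int> unrolled_by_hand(int x) {
    std::vector<int> out;
    record(1, x, out);
    record(2, x, out);
    record(3, x, out);
    return out;
}
```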
So far I have three different template metaprogramming solutions, and I'm wondering what the advantages and disadvantages of each approach are, especially with regard to how the compiler will inline the function calls.
Approach 1 (recursive function)
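#include <utility> // std::forward
template <int N> struct _int{ }; // tag type carrying the remaining count
template <int N, typename F, typename ...Args>
inline void unroll_f(_int<N>, F&& f, Args&&... args) {
    // Recurse first so that the calls run in the order 1, 2, ..., N.
    unroll_f(_int<N-1>(),std::forward<F>(f),std::forward<Args>(args)...);
    f(N,args...);
}
// Base case: ends the recursion at N == 1.
template <typename F, typename ...Args>
inline void unroll_f(_int<1>, F&& f, Args&&... args) {
    f(1,args...);
}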
Call syntax example:
int x = 2;
auto mult = [](int n,int x) { std::cout << n*x << " "; };
unroll_f(_int<10>(),mult,x); // also works with anonymous lambda
unroll_f(_int<10>(),mult,2); // same syntax when argument is temporary
Approach 2 (recursive constructor)
template <int N, typename F, typename ...Args>
struct unroll_c {
    unroll_c(F&& f, Args&&... args) {
        unroll_c<N-1,F,Args...>(std::forward<F>(f),std::forward<Args>(args)...);
        f(N,args...);
    }
};
template <typename F, typename ...Args>
struct unroll_c<1,F,Args...> {
    unroll_c(F&& f, Args&&... args) {
        f(1,args...);
    }
};
Call syntax is pretty ugly:
unroll_c<10,decltype(mult)&,int&>(mult,x);
unroll_c<10,decltype(mult)&,int&>(mult,2); // doesn't compile: int& cannot bind to the temporary 2
and the type of the function must be specified explicitly if using an anonymous lambda, which is awkward.
Approach 3 (recursive static member function)
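template <int N>
struct unroll_s {
    template <typename F, typename ...Args>
    static inline void apply(F&& f, Args&&... args) {
        // Recurse first so that the calls run in the order 1, 2, ..., N.
        unroll_s<N-1>::apply(std::forward<F>(f),std::forward<Args>(args)...);
        f(N,args...);
    }
    // can't use static operator() instead of 'apply' (not allowed before C++23)
};
template <>
struct unroll_s<1> {
    template <typename F, typename ...Args>
    static inline void apply(F&& f, Args&&... args) {
        // Pass args as lvalues, as in the other approaches: this call runs
        // first, so forwarding here could move from arguments that the
        // calls for 2..N still need.
        f(1,args...);
    }
};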
Call syntax example:
unroll_s<10>::apply(mult,x);
unroll_s<10>::apply(mult,2);
In terms of syntax this third approach seems the cleanest and clearest, but I'm wondering if there may be differences in how the three approaches are treated by the compiler.