Problem statement
I have been experimenting with strategies to resolve loops at compile time in C++. The algorithmic problem I aim to solve is as follows:
given an integer 4-tuple I = (i1,i2,i3,i4), with each component between 0 and 3(d-1) and i1+i2+i3+i4 = 3(d-1), find all 4-tuples J = (j1,j2,j3,j4), K = (k1,k2,k3,k4) and L = (l1,l2,l3,l4), each with components between 0 and d-1 and summing to d-1, such that I = J + K + L.
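For instance, with d = 2: the components of I lie in {0,...,3} and sum to 3, while J, K and L must each have exactly one component equal to 1 and the rest 0. So for I = (1,1,1,0), the solutions are the 3! = 6 ordered assignments of (1,0,0,0), (0,1,0,0) and (0,0,1,0) to J, K and L.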
Once I have found one such J,K,L, I have a mathematical expression to compute. So the flow of the program is:
```
input I
for J <= I
    for K <= I - J s.t. I - J - K >= 0
        L = I - J - K
        compute<J,K,L>(data)
    endfor
endfor
```
I have no trouble walking the set of multi-indices of interest and can do it with six simple nested for loops, free of breaks or continues. I have previously observed that runtime looping of this sort is too slow for my requirements; the benchmark below illustrates this. At the opposite end, I have computed by hand the indices at play for d = 2, and a routine carrying out the corresponding sequence of compute<J,K,L> calls with hard-coded J, K, L is very fast. I am now looking for a solution using compile-time logic to retain speed comparable to the hand-coded version while allowing a more generic implementation (for any d, though tuples are always 4-tuples).
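For reference, here is a minimal sketch of what the runtime version looks like, assuming 4-tuples are stored as std::array<int,4>; compute here is a stand-in for the kernel, taking the multi-indices as runtime arguments. (My actual loops choose their bounds so that no continue checks are needed; the checked form is just shorter to present.)

```cpp
#include <array>

using Tuple4 = std::array<int, 4>;

template <int d>
void walk(const Tuple4& I, double* data,
          void (*compute)(const Tuple4&, const Tuple4&, const Tuple4&, double*)) {
    for (int j1 = 0; j1 < d; ++j1)
    for (int j2 = 0; j2 < d; ++j2)
    for (int j3 = 0; j3 < d; ++j3) {
        const int j4 = (d - 1) - j1 - j2 - j3;      // J must sum to d-1
        if (j4 < 0) continue;                       // (j4 <= d-1 holds automatically)
        const Tuple4 J{j1, j2, j3, j4};
        for (int k1 = 0; k1 < d; ++k1)
        for (int k2 = 0; k2 < d; ++k2)
        for (int k3 = 0; k3 < d; ++k3) {
            const int k4 = (d - 1) - k1 - k2 - k3;  // K must sum to d-1
            if (k4 < 0) continue;
            const Tuple4 K{k1, k2, k3, k4};
            Tuple4 L{};                             // L = I - J - K, componentwise
            bool valid = true;
            for (int c = 0; c < 4; ++c) {
                L[c] = I[c] - J[c] - K[c];
                valid = valid && 0 <= L[c] && L[c] <= d - 1;
            }
            if (valid) compute(J, K, L, data);      // sum(L) = d-1 follows from the sums above
        }
    }
}
```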
Attempted solution
In a previous question, I asked about compile-time for loops. The solution was as follows:
```cpp
[&]<int... jj1>(std::integer_sequence<int, jj1...>) {
    ([&]<int j1>() {
        // loop body with j1 as the constexpr index
    }.template operator()<jj1>(), ...);
}(std::make_integer_sequence<int, N>{});
```
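To make the pattern easier to reuse, it can be wrapped in a small helper; here is a self-contained sketch (the name for_constexpr is mine, not from that answer):

```cpp
#include <cstdio>
#include <utility>

// Invokes body.template operator()<i>() for i = 0, ..., N-1,
// with i available as a constant expression inside the body.
template <int N, class Body>
void for_constexpr(Body&& body) {
    [&]<int... ii>(std::integer_sequence<int, ii...>) {
        (body.template operator()<ii>(), ...);
    }(std::make_integer_sequence<int, N>{});
}

int main() {
    for_constexpr<4>([]<int i>() {
        std::printf("iteration %d\n", i);  // i is a template parameter here
    });
}
```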
with N the bound. Since then, I have found an alternative method using Boost.Hana, namely hana::while_:
```cpp
hana::while_(hana::less_equal.than(hana::int_c<N>), 0_c, [&](auto jj1) {
    constexpr int j1 = jj1;
    // loop body
    return jj1 + 1_c;
});
```
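For completeness, the same loop as a self-contained program (this assumes Boost.Hana is available; the printf body is only for illustration):

```cpp
#include <boost/hana.hpp>
#include <cstdio>

namespace hana = boost::hana;
using namespace hana::literals;  // for 0_c, 1_c

int main() {
    constexpr int N = 3;
    // Runs the body for jj1 = 0, 1, ..., N; jj1 is an integral_constant,
    // so it can initialize a constexpr int inside the body.
    hana::while_(hana::less_equal.than(hana::int_c<N>), 0_c, [&](auto jj1) {
        constexpr int j1 = jj1;
        std::printf("j1 = %d\n", j1);
        return jj1 + 1_c;
    });
}
```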
From reading the documentation, my understanding was that Hana is a template meta-programming library with exclusively compile-time logic. Using such hana::while_ loops, my program now consists of 6 of these loops with a single runtime instruction at the deepest level:
```
hana::while_(auto j1_c)
    hana::while_(auto j2_c)
        hana::while_(auto j3_c)
            hana::while_(auto k1_c)
                hana::while_(auto k2_c)
                    hana::while_(auto k3_c)
                        constexpr this;
                        constexpr that;
                        not_constexpr_function<this,that>(dynamicdata);
// terminations of the 6 hana::while_ loops
```
Each of those hana::while_ loops is like the one shown above; I wrote them more succinctly here for clarity. There is only one line in the entire program which is not constexpr: the call using dynamic data.
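To illustrate how the indices reach the kernel as template arguments, here is a two-level version of that nesting (the real program has six levels; compute_kernel, N and the body are illustrative stand-ins):

```cpp
#include <boost/hana.hpp>

namespace hana = boost::hana;
using namespace hana::literals;

constexpr int N = 4;  // illustrative bound

template <int J1, int J2>
void compute_kernel(double* data) {
    data[0] += J1 * 10 + J2;  // stand-in for the real floating-point kernel
}

void run(double* data) {
    hana::while_(hana::less.than(hana::int_c<N>), 0_c, [&](auto j1_c) {
        hana::while_(hana::less.than(hana::int_c<N>), 0_c, [&](auto j2_c) {
            constexpr int j1 = j1_c;       // indices become constexpr...
            constexpr int j2 = j2_c;
            compute_kernel<j1, j2>(data);  // ...and then template arguments
            return j2_c + 1_c;
        });
        return j1_c + 1_c;
    });
}
```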
I implemented this, an alternate version using the other constexpr for-loop construction (labelled integer_seq), a version with exclusively runtime logic (labelled runtime), and a hand-written routine where I copy-pasted printf output from the hana version (or any other) and hard-coded the indices in the calls to compute. To be clear, this means I simply unrolled the loops by hand, and the computations are all done in the same order. Observed speeds are as follows:
| | gcc -Ofast -flto ; Linux | clang -Ofast -flto ; MacOS |
|---|---|---|
| Runtime | 673,963 /s | 357,125 /s |
| Hana::while_ | 3,567,962 /s | 1,048,172 /s |
| integer_seq | 1,373,362 /s | compile fails |
| Hand-written | 12,743,705 /s | 3,485,922 /s |
EDIT: I had originally posted hand-coded times for a routine ported from Fortran with a different order of operations. I rewrote it with the exact order of operations of this C++ Boost.Hana implementation, and it happens to be faster by about 1.5x (probably fewer cache misses).
hana::while_ is faster than runtime logic by a factor of 3~5 but still slower than hand-written code by a factor of ~3. This is a lot of overhead for a supposedly compile-time-only construct. The fact that the inner function is templated on the indices makes me think the logic really is operating at compile time. But then, why the 3x slowdown?
EDIT2: I was asked about optimization flags. The initial table was generated using either gcc on Linux or clang on MacOS (on a different machine), with -Ofast and -flto. I hadn't enabled -march=native though, so here goes (only the relevant rows):
| | gcc -Ofast -flto -march=native ; Linux | clang -Ofast -flto -march=native ; MacOS |
|---|---|---|
| Hana::while_ | 4,224,602 /s | 1,980,352 /s |
| Hand-written | 14,318,259 /s | 4,385,473 /s |
This improves times slightly more in favour of hana::while_, but it still lags behind hand-written by 2~3x (the hand-written routine now actually being generated code).
EDIT3: @Jesper_Juhl mentions the dangers of -Ofast. It also seems to benefit intensive floating-point computation the most, and this option won't be used in production. So let's try the more realistic -O3 and see if Hana catches up:
| | gcc -O3 -flto -march=native ; Linux | clang -O3 -flto -march=native ; MacOS |
|---|---|---|
| Hana::while_ | 4,160,842 /s | 2,003,093 /s |
| Hand-written | 9,221,753 /s | 5,218,938 /s |
The hand-written routine is really line after line of floating-point computations, so it's only natural that it suffered the most from going from -Ofast to -O3. Curiously, hana::while_ was barely slowed down. This suggests that the -Ofast optimizations are not being applied to the same degree: either they are not applied to the computational kernel at all (at least some of the heuristics), or there are inter-call optimizations that cannot be done in this context. Clang also, puzzlingly, made the hand-written version about 20% faster with -O3 than with -Ofast. I recompiled everything from scratch and reran with -Ofast, and I do obtain about 4.3M/s, versus 5.2M/s with -O3.
Question
How is hana::while_ introducing runtime overhead when all the logic happens at compile time? I thought code within these constructs would be equivalent to unrolled loops, with the indices as good as hard-coded.

Is it that calling a templated function with many different index combinations creates so many functions in the program that it is somehow slowed down? Or does the compiler have trouble optimizing these function calls?

Is there a better way to go about this problem? I have since written a code-generating routine; the resulting code is very fast, but it will be more troublesome to maintain. I would prefer a purely-C++, within-program solution to this performance problem.

In summary: how can I use C++ meta-programming concepts (compile-time computation) to achieve the same performance as hand-written code?