
Problem statement

I have been experimenting with strategies to resolve loops at compile time in C++. The algorithmic problem I aim to solve is as follows:

given an integer 4-tuple I = (i1,i2,i3,i4) with components between 0 and 3(d-1) and such that i1+i2+i3+i4 = 3(d-1), find all J = (j1,j2,j3,j4), K = (k1,k2,k3,k4) and L = (l1,l2,l3,l4) with components between 0 and d-1, each summing to d-1, such that I = J + K + L.

Once I have found one such J,K,L, I have a mathematical expression to compute. So the flow of the program is:

input I
for J <= I (componentwise)
    for K <= I - J such that I - J - K >= 0 (componentwise)
        L = I - J - K
        compute<J,K,L>(data)
    endfor
endfor

I have no trouble walking the set of multi-indices of interest, and I can do it with simple for loops (6 nested), free of breaks or continues. I have previously observed runtime looping of this sort to be too slow for my requirements; the benchmark below illustrates this. At the opposite end, I have computed by hand the indices at play for d = 2, and a routine carrying out the corresponding sequence of compute<J,K,L> calls with hard-coded J, K, L is very fast. I am now looking for a solution using compile-time logic to preserve speed comparable to the hand-coded version while allowing a more generic implementation (for any d, though the tuples are always 4-tuples).
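
For reference, here is a minimal sketch of the runtime variant ("runtime" in the benchmarks below). The kernel signature is a stand-in, and where this sketch uses a final admissibility check, my actual loops tighten the bounds using I so they run without breaks or continues:

#include <array>

// Hypothetical non-template kernel of the runtime variant.
void compute(const std::array<int, 4>& J, const std::array<int, 4>& K,
             const std::array<int, 4>& L, double* data) {
    // ... determinant computation on data, indexed through J, K, L ...
    (void)J; (void)K; (void)L; (void)data;
}

void walk(const std::array<int, 4>& I, int d, double* data) {
    const int s = d - 1;  // each of J, K, L must sum to s
    for (int j1 = 0; j1 <= s; ++j1)
    for (int j2 = 0; j2 <= s - j1; ++j2)
    for (int j3 = 0; j3 <= s - j1 - j2; ++j3) {
        const std::array<int, 4> J{j1, j2, j3, s - j1 - j2 - j3};
        for (int k1 = 0; k1 <= s; ++k1)
        for (int k2 = 0; k2 <= s - k1; ++k2)
        for (int k3 = 0; k3 <= s - k1 - k2; ++k3) {
            const std::array<int, 4> K{k1, k2, k3, s - k1 - k2 - k3};
            std::array<int, 4> L{};
            bool ok = true;
            for (int n = 0; n < 4; ++n) {
                L[n] = I[n] - J[n] - K[n];
                ok = ok && 0 <= L[n] && L[n] <= s;
            }
            // |L|_1 = s follows automatically when every component is in
            // range, since |I|_1 = 3s and |J|_1 = |K|_1 = s.
            if (ok) compute(J, K, L, data);
        }
    }
}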

Attempted solution

In a previous question, I asked about compile-time for loops. The solution was as follows:

[&]<int... jj1>(std::integer_sequence<int, jj1...>) {
    ([&]<int j1>() {
        // loop body with j1 as the constexpr index
    }.template operator()<jj1>(), ...);
}(std::make_integer_sequence<int, N>{});

with N the bound.
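
For completeness, here is a self-contained usage sketch of that pattern (C++20, since it relies on explicit template parameter lists on lambdas); the printf is a stand-in for the real loop body:

#include <cstdio>
#include <utility>

// Unrolls a "loop" over 0..N-1 at compile time; inside the inner
// lambda, i is usable as a constexpr/template argument.
template <int N>
void unrolled() {
    [&]<int... is>(std::integer_sequence<int, is...>) {
        ([&]<int i>() {
            std::printf("%d\n", i);  // stand-in for the loop body
        }.template operator()<is>(), ...);
    }(std::make_integer_sequence<int, N>{});
}

int main() { unrolled<4>(); }  // prints 0, 1, 2, 3, one per line

Since then, I have found an alternative method using Boost.Hana, namely hana::while_: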

hana::while_(hana::less_equal.than(hana::int_c<N>), 0_c, [&](auto jj1) {
    constexpr int j1 = jj1;
    // loop body with j1 as the constexpr index
    return jj1 + 1_c;
});

From reading the documentation, my understanding was that Hana is a template meta-programming library with exclusively compile-time logic. Using such hana::while_ loops, my program now looks like 6 of these loops with the single runtime instruction at the deepest level:

hana::while_(auto j1_c)
    hana::while_(auto j2_c)
        hana::while_(auto j3_c)
            hana::while_(auto k1_c)
                hana::while_(auto k2_c)
                    hana::while_(auto k3_c)
                        constexpr auto this_idx = /* ... */;
                        constexpr auto that_idx = /* ... */;
                        not_constexpr_function<this_idx, that_idx>(dynamicdata);
// termination of the 6 hana::while_ loops

Each of those hana::while_ loops is like the one above; I wrote them more succinctly here for clarity. There is only one line in the entire program that is not constexpr: the call using dynamic data.
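
Fleshed out, two of the six levels look as follows. This is a sketch rather than my production code: compute, walk and the bound N are stand-ins, and the remaining indices are derived from the loop variables with constexpr arithmetic in the same way.

#include <boost/hana.hpp>
namespace hana = boost::hana;
using namespace hana::literals;

template <int J1, int J2>
void compute(double* data) { data[0] += J1 + J2; }  // placeholder kernel

template <int N>
void walk(double* data) {
    hana::while_(hana::less_equal.than(hana::int_c<N>), 0_c, [&](auto j1_c) {
        hana::while_(hana::less_equal.than(hana::int_c<N>), 0_c, [&](auto j2_c) {
            // Read the indices from the types, which is safe even for
            // variables captured by reference.
            constexpr int j1 = decltype(j1_c)::value;
            constexpr int j2 = decltype(j2_c)::value;
            compute<j1, j2>(data);  // the single runtime statement
            return j2_c + 1_c;
        });
        return j1_c + 1_c;
    });
}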

I implemented this, an alternate version using the other constexpr for-loop construction (labelled integer_seq), a version with exclusively runtime logic (labelled runtime), and a hand-written routine where I copy-pasted printf results from the hana version (or any other) and hard-coded the indices in the calls to compute. To be clear, this means I simply unrolled the loops by hand, so the computations are done in exactly the same order. Observed speeds are as follows:

              gcc -Ofast -flto (Linux)   clang -Ofast -flto (macOS)
Runtime       673,963 /s                 357,125 /s
hana::while_  3,567,962 /s               1,048,172 /s
integer_seq   1,373,362 /s               compile fails
Hand-written  12,743,705 /s              3,485,922 /s

EDIT: I had originally posted hand-coded times for a routine ported from Fortran with a different order of operations. I rewrote it with the exact same order of operations as this C++ Boost.Hana implementation, and it happens to be faster by about 1.5x (probably fewer cache misses).

hana::while_ is faster than the runtime logic by a factor of 3~5, but still slower than the hand-written version by a factor of ~3. That is a lot of overhead for a supposedly compile-time-only construct. The fact that the inner function is templated on the indices makes me think the logic really is operating at compile time. But then, where does the 3x slowdown come from?

EDIT2: I was asked about optimization flags. The initial table was generated using either gcc on Linux or clang on macOS (on a different machine), with -Ofast and -flto. I hadn't enabled -march=native, though, so here are the updated numbers (only the relevant rows):

              gcc -Ofast -flto -march=native (Linux)   clang -Ofast -flto -march=native (macOS)
hana::while_  4,224,602 /s                             1,980,352 /s
Hand-written  14,318,259 /s                            4,385,473 /s

This shifts the times slightly further in favour of hana::while_, but it still lags behind the hand-written version (which is now actually generated code) by 2~3x.

EDIT3: @Jesper_Juhl mentions the dangers of -Ofast. It also seems to benefit intensive floating-point code the most, and the option won't be used in production. So let's try the more realistic -O3 and see whether Hana catches up:

              gcc -O3 -flto -march=native (Linux)   clang -O3 -flto -march=native (macOS)
hana::while_  4,160,842 /s                          2,003,093 /s
Hand-written  9,221,753 /s                          5,218,938 /s

The hand-written routine is really line after line of floating-point computations, so it is only natural that it suffered the most from going from -Ofast to -O3. Curiously, hana::while_ was barely slowed down. This suggests the -Ofast optimizations are not applied to the same degree: either they are not applied to the computational kernel at all (at least some of the heuristics), or there are inter-call optimizations that cannot be performed in this context. Clang also decided, for no obvious reason, to make the hand-written version about 20% faster with -O3 than with -Ofast, which is also puzzling. I recompiled everything from scratch and reran with -Ofast, and I do obtain about 4.3M/s versus 5.2M/s with -O3.

Question

How is hana::while_ introducing runtime overhead when all the logic is happening at compile time? I thought code within these constructs would be equivalent to unrolled loops with the indices as good as hard-coded.

Is it that instantiating a templated function with many different index combinations creates so many functions in the program that it is somehow slowed down? Or is it that the compiler has trouble optimizing these function calls?

Is there a better way to go about this problem? I have since written a code-generating routine; the resulting code is very fast, but it will be more troublesome to maintain. I would prefer a pure-C++, within-program solution to this performance problem.

In summary: how can I use C++ meta-programming concepts (compile time computations) to achieve the same performance as hand-written code?

Sardine
  • It's more or less impossible to tell without seeing the real code and generated assembly. You will need to reduce your code to a minimal amount necessary to reproduce your observed behaviour. – Passer By May 20 '23 at 00:57
  • The real code really is 6 nested hana::while_ loops and a 3x3 double-precision matrix determinant computation, with data fetched from a large array using the indices. I printed out the indices computed in the hana::while_ version, hand-copied them into a new function, and manually entered the indices into the large array passed to the det33 routine. That is the hand-written routine in the table above, and it is about 2.5x faster than the hana::while_ looping method. As the order of operations is the same, the issue lies in the looping itself. I'll post the entire excerpt, but I'm not sure it'll help much. – Sardine May 20 '23 at 01:27
  • Step one (actually step zero) is to make sure that you compile your code with compiler optimizations *enabled*. Step 1 is then to enable LTO (Link Time Optimization). After that, if your code is still not fast enough - then we can get into looking at algorithms and compile time vs run time code. – Jesper Juhl May 20 '23 at 01:59
  • I might have edited it away, but I mentioned compiling with -Ofast, and I use -flto as well. I'll add it back to the question. EDIT: Actually, it's in the table, I didn't mention -flto. – Sardine May 20 '23 at 02:01
  • Side note: You mention "gcc -Ofast" in your question. Be aware that `-Ofast` is *dangerous*. It ignores some rules of the standard to make the code faster - but ignoring those rules means that the code may have incorrect results in some cases. Don't use `-Ofast` unless you know what it does - the regular optimization levels are just fine in 99% of cases and don't invoke undefined results. – Jesper Juhl May 20 '23 at 02:03
  • That is true; I mostly use it because the floating-point re-arranging optimizations are useful for telling whether slowness is due to the particular order of operations in large math computations or to the overall flow of the program (cache misses, branching, etc.). I'll try with -O3 for a fairer comparison. – Sardine May 20 '23 at 02:11

1 Answer


This became too long for a comment and doesn't answer the original question.


I would suggest that you first need a better algorithm, and only then a better implementation. In your case, for a single component of your 4-tuple, you are looking for the integer partitions into exactly three parts:

i1 = j1 + k1 + l1

This problem is solved in general in the book Combinatorial Algorithms by Kreher and Stinson. They also provide an algorithm and code that solve it; in particular, see Algorithm 3.4 in the book (both the book and the code are freely available online).

Your task is then to transform this algorithm into compile-time meta-programming. Assuming you have it available as PART(N,k), you can apply it as follows:

  • First, restrict yourself to ordered tuples i1<=i2<=i3<=i4, and likewise for J, K and L. You can generate all of those by iterating through PART(3*(d-1),4). The general tuples are then obtained by permuting the 4 components.
  • For any component i_n of the tuple, generate the partitions PART(i_n,3) and thus obtain j_n, k_n and l_n.
  • The ordered solution is then obtained by combining the components into 4-tuples.

This might be more complex, but is nearly independent of the tuple length, whereas your solution scales polynomially in the tuple length.
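
To illustrate, here is a hypothetical runtime sketch of PART(N,3) with parts allowed to be zero (as your components range from 0 upward); lifting this enumeration to compile time is then the meta-programming task described above:

#include <algorithm>
#include <array>
#include <vector>

// Hypothetical runtime sketch of PART(N, 3): all non-increasing triples
// (p1 >= p2 >= p3 >= 0) with p1 + p2 + p3 == N. The loop bounds make
// every visited triple valid, so no filtering is needed.
std::vector<std::array<int, 3>> part3(int N) {
    std::vector<std::array<int, 3>> out;
    for (int p1 = (N + 2) / 3; p1 <= N; ++p1)                        // p1 >= ceil(N/3)
        for (int p2 = (N - p1 + 1) / 2; p2 <= std::min(p1, N - p1); ++p2)
            out.push_back({p1, p2, N - p1 - p2});                    // p3 is determined
    return out;
}
// part3(3) yields {1,1,1}, {2,1,0}, {3,0,0}.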

davidhigh
  • There is that, plus the additional constraint that each 4-tuple, say J, also satisfies |J|_1 = d-1. So in combinatorial terms, each (j_n,k_n,l_n) is a partition of i_n, but each (x_1,x_2,x_3,x_4) with x = j,k,l is also a partition of d-1. Generating the tuples up to permutation would be cheaper, but I actually need all ordered combinations, so no savings there. The tuple length does not vary. My algorithm walks the set without repetition or interruption (no continues or breaks), so as far as the sequence of operations goes, it is optimal. I think this is a purely implementation-side problem. – Sardine May 19 '23 at 20:43