25

I have a C++ snippet below with a run-time for loop,

for(int i = 0; i < I; i++)
  for (int j = 0; j < J; j++)
    A( row(i,j), column(i,j) ) = f(i,j);

The snippet is called repeatedly. The loop bounds 'I' and 'J' are known at compile time (I and J are on the order of 2 to 10). I would like to unroll the loops somehow using templates. The main bottleneck is the row(), column() and f() functions. I would like to replace them with equivalent metaprograms that are evaluated at compile time, using row<i,j>::value-style enum tricks.

What I'd really love is something that eventually resolves the loop into a sequence of statements like:

A(12,37) = 0.5;
A(15,23) = 0.25;
A(14,45) = 0.25;

But I'd like to do so without wrecking the for-for structure too much. Something in the spirit of:

TEMPLATE_FOR<i,0,I>
  TEMPLATE_FOR<j,0,J>
     A( row<i,j>::value, column<i,j>::value ) = f<i,j>::value

Can boost::lambda (or something else) help me create this?
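To illustrate, the kind of metafunction I have in mind for row/column looks roughly like this (the arithmetic here is just a placeholder for the real recursive mapping):

template <int i, int j>
struct row    { enum { value = 2 * i + j }; };  // placeholder formula

template <int i, int j>
struct column { enum { value = 3 * j + i }; };  // placeholder formula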

RAC

9 Answers

12

A good compiler should do the unrolling for you. For instance, GCC will completely unroll small loops with compile-time-constant bounds at higher optimization levels, and -funroll-loops asks it to unroll more aggressively.

If you try to do it yourself manually, unless you measure things carefully and really know what you are doing, you are liable to end up with slower code. For example, manual unrolling in your case is liable to prevent the compiler from performing a loop-interchange or strip-mine optimization (look for -floop-interchange and -floop-strip-mine in the GCC docs).

T.E.D.
  • [OP] It's not the overhead of the loops I am trying to optimize away, it's more the evaluation of row(), column() and f(). They are currently regular, free functions. I want to guarantee they are evaluated at compile time instead of run time. – RAC Jun 23 '09 at 13:51
  • 2
    If you make those routines "inline", that might give the optimizer enough extra info to do that itself. – T.E.D. Jun 23 '09 at 13:59
  • 2
    There's no reason why it shouldn't be able to inline those, as long as the full function definitions are visible at the call site. – jalf Jun 23 '09 at 14:04
  • I will try, but row() and column() are actually kinda complicated (and recursive) functions. If the compiler resolves it all the way down to a constant, I'd be very happy; OTOH if it just drops in the function body I don't think I'll see much performance increase. The legacy code that I am replacing with these loops literally is a snippet that reads like "A(12,37) = 0.5; A(15,23) = 0.25; A(14,45) = 0.25;", where someone had done all the row(), column() mapping by hand. I'd like the replacement to automate the process but still hit that performance level. – RAC Jun 23 '09 at 14:07
  • 1
    why are the functions so complicated? That seems like a better place to try to optimize. – jalf Jun 23 '09 at 14:09
  • The row() and column() functions are "nomenclature"/"numbering" routines for finite element basis functions over simplices (tetrahedra/triangles) of arbitrary order. There's a lot of recursion, both w.r.t. order and w.r.t. dimension. I couldn't figure out a smarter way to do it (it was tough enough just to get it correct), but maybe there is. – RAC Jun 23 '09 at 14:19
  • Not to double post too much, but it's really the recursive nature of the row(), column() functions that is making me seek this solution to begin with! TMP is so good at recursion, that I know I can do TMP versions of row<>::value and column<>::value and eliminate their runtime calculation altogether. But once you have these TMP row things inside the loop, your i,j must be compile time constants too, so I'd have to unroll the for loops using a TMP mechanism too. – RAC Jun 23 '09 at 14:25
  • Is there any chance you'll be able to cache/reuse the indices computed by the row/column functions? If the same indices are calculated multiple times, that could give you a big saving too. But yeah, sounds like it gets a bit more complex than I'd thought initially. :) – jalf Jun 23 '09 at 15:01
  • 5
    If the row and column functions are that complicated, you're going to struggle to implement them in templates. You can't just run C++ code at compile time; you have to write it all again, using cryptic compiler error messages as a debugger. You'd be better off running another program as part of your build that generates all the values and spits out a .h file with an array of them in it. – Matthew Monaghan Jun 23 '09 at 15:29
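For what it's worth, here is a minimal sketch of that generator approach (the function bodies and the bounds below are placeholders): a standalone program, run as a build step, that prints the unrolled assignments into a header.

#include <cstdio>

// Placeholder versions of the real (recursive) functions.
int row(int i, int j)    { return 2 * i + j; }
int column(int i, int j) { return 3 * j + i; }
double f(int i, int j)   { return 1.0 / (1 + i + j); }

int main()
{
    const int I = 3, J = 5;  // assumed bounds
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            std::printf("A(%d,%d) = %g;\n", row(i, j), column(i, j), f(i, j));
}

Redirect its output into a header (e.g. unrolled.h) and #include that file where the loop used to be.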
8

This is the way to do it directly:

template <int i, int j>
struct inner
{
  static void value()
  {
    A(row<i,j>::value, column<i,j>::value) = f<i,j>::value;
    inner<i, j+1>::value();
  }
};

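// Terminates the inner recursion when j reaches J (I and J are assumed to be
// integral compile-time constants visible here).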
template <int i> struct inner<i, J> { static void value() {} };

template <int i>
struct outer
{
  static void value()
  {
    inner<i, 0>::value();
    outer<i+1>::value();
  }
};

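// Terminates the outer recursion when i reaches I.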
template <> struct outer<I> { static void value() {} };

void test()
{
  outer<0>::value();
}

You can pass A through as a parameter to each of the values if necessary.

Here's a way with variadic templates that doesn't require hard coded I and J:

#include <utility>

template <int j, class Columns>
struct Inner;

template <class Columns, class Rows>
struct Outer;

template <int j, int... i>
struct Inner<j, std::index_sequence<i...>>
{
  static void value() { ((A(row<i, j>::value, column<i, j>::value) = f<i, j>::value), ...); }
};


template <int... j, class Columns>
struct Outer<std::index_sequence<j...>, Columns>
{
  static void value() { (Inner<j, Columns>::value(), ...); }
};

template <int I, int J>
void expand()
{
  Outer<std::make_index_sequence<I>, std::make_index_sequence<J>>::value();
}

void test()
{
  expand<3, 5>();
}

(snippet with generated assembly: https://godbolt.org/g/DlgmEl)

James Hopkin
  • Thanks for the example - the code doesn't quite compile. I think the two specialisations are wrong, presumably `template <> struct outer { static void value() {} };` should be `template <> struct outer<0> { static void value() {} };`. I'm not sure about the other one, perhaps `template struct inner { static void value() {} };`? (I'm guessing they're meant to terminate the recursion, it's quite fun to remove them entirely and watch the compile taking forever!) – pjcard Apr 20 '16 at 09:36
  • Just checked - it compiles and works as expected (depending on some values implied by the question). Here's a full working snippet: https://godbolt.org/g/bvngt9 – James Hopkin Apr 20 '16 at 11:06
  • These days it would be much easier with variadics. – James Hopkin Apr 20 '16 at 11:07
  • "Just checked - it compiles and works as expected" Haha, you added a line :) `const int I = 3, J = 5;` Removing it gives the errors I saw, but I guess it's terminating at an upper value, rather than 0 as I assumed. In any case, thank you for both examples, they were very useful. – pjcard Apr 21 '16 at 09:03
6

You could use Boost MPL.

An example of loop unrolling is on this mpl::for_each page.

for_each< range_c<int,0,10> >( value_printer() );

It doesn't seem that it's all evaluated at compile time, but it may be a good starting point.
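For reference, a minimal sketch of that pattern (value_printer here is a placeholder functor, not part of MPL). Each element of the range is passed as an integral constant wrapper, so N::value is a compile-time constant inside operator():

#include <boost/mpl/for_each.hpp>
#include <boost/mpl/range_c.hpp>
#include <iostream>

// Placeholder functor: operator() is instantiated once per index in the range.
struct value_printer
{
    template <typename N>
    void operator()(N) const
    {
        std::cout << N::value << '\n';  // N::value is a compile-time constant
    }
};

int main()
{
    boost::mpl::for_each< boost::mpl::range_c<int, 0, 10> >(value_printer());
}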

CiscoIPPhone
4

Check out Template Metaprograms and the bubble sort implementations.

Nick Dandoulakis
4

I would say it is a false good idea.

In C++, row<i,j>::value means you will have as many different row<>() functions as you have i * j combinations. You don't want this, because it will increase the size of the code and cause a lot of instruction cache misses.

I observed this when I was writing template functions to avoid a single boolean check.

If it is a short function, just inline it.

Ben
  • The idea was to let the compiler do the math, not the processor. For a 10 by 10 matrix, that's 200 functions being evaluated at compile time and replaced by addresses in the final binary. That should still fit in the cache, shouldn't it? – xtofl Jun 23 '09 at 14:16
  • Yes it will, but you'll have to load all those functions into the cache for a single use, instead of using the same one 200 times. And after the run the i-cache will be filled with useless code. @RAC However, I may be wrong; cache issues are always tricky. Let us know if you can get an improvement. – Ben Jun 23 '09 at 14:23
  • 7
    with row being a class type, and value being a static const int, I don't see how it would bloat the executable. If debug info is stripped, there should be no code generated for row::value, I suspect – Johannes Schaub - litb Jun 23 '09 at 14:36
  • If you unroll the loop, i.e. if you compute the addresses at compile time, the code WILL be bigger no matter how you do it. – Ben Jun 23 '09 at 14:51
  • 1
    True, Ben, which is why you usually don't unroll the entire loop, but only a few iterations. But if you want to compute all these indices at compile time, I think the entire loop has to be unrolled. – jalf Jun 23 '09 at 15:02
3

You could use Boost.MPL to implement the whole thing at compile time, but I'm not sure it'll be faster. (MPL essentially re-implements the STL algorithms as compile-time metaprogramming templates.)

The problem with that approach is that you end up unrolling and inlining a lot of code, which may thrash the instruction cache and eat up memory bandwidth that could have been saved. That may produce huge, bloated and slow code.

I would probably rather trust the compiler to inline the functions that make sense. As long as the row and column function definitions are visible from the loop, the compiler can trivially inline the calls and unroll as many iterations as it deems beneficial.
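For example, a sketch with placeholder bodies (the real row/column/f definitions would go in a header so they are visible at every call site):

// row_column.h (hypothetical header) -- the bodies here are placeholders.
inline int row(int i, int j)    { return 2 * i + j; }
inline int column(int i, int j) { return 3 * j + i; }
inline double f(int i, int j)   { return 1.0 / (1 + i + j); }

With the definitions visible and marked inline, the optimizer is free to inline and constant-fold each call once i and j are known after unrolling.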

jalf
1

f would need to return a double, and that can't be done at compile time (template metaprogramming can only produce integral constants).

James Hopkin
  • 1
    Admittedly this is a part that is still a little murky to me. However, f() is always a fraction, and I can do TMPs for its numerator and denominator independently, then static_cast to doubles and do the ratio in floating point (I hope). But very good point. – RAC Jun 23 '09 at 14:37
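A sketch of the fraction idea described in the comment above (the num/den formula is just a placeholder): keep the numerator and denominator as compile-time integers and only form the double at run time.

template <int i, int j>
struct f
{
  enum { num = 1, den = 1 + i + j };  // placeholder fraction

  static double value() { return static_cast<double>(num) / den; }
};

// usage: A(row<i,j>::value, column<i,j>::value) = f<i,j>::value();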
1

If you are willing to modify the syntax a bit you can do something like this:

template <int i, int ubound>
struct OuterFor {
    void operator()() {
        InnerFor<i, 0, J>()();
        OuterFor<i + 1, ubound>()();
    }
};

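// Terminates the recursion when i reaches ubound.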
template <int ubound>
struct OuterFor <ubound, ubound> {
    void operator()() {
    }
};

In InnerFor, i is the outer loop's counter (a compile-time constant) and j is the inner loop's counter (initially 0, also a compile-time constant), so you can evaluate row<i, j> as a compile-time template.

It's a bit more complicated, but as you say, row(), column(), and f() are your complicated parts anyway. At least try it and see if the performance is worth it. It may also be worth investigating other options for simplifying your row(), etc. functions.
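For completeness, here is a sketch of the missing InnerFor plus an invocation. It assumes compile-time constants I and J and row<>, column<>, f<> metafunctions in scope; in real code InnerFor has to be declared before OuterFor.

template <int i, int j, int ubound>
struct InnerFor {
    void operator()() {
        A(row<i, j>::value, column<i, j>::value) = f<i, j>::value;
        InnerFor<i, j + 1, ubound>()();
    }
};

// Terminates the inner recursion when j reaches ubound.
template <int i, int ubound>
struct InnerFor <i, ubound, ubound> {
    void operator()() {
    }
};

// Invocation: unrolls i = 0..I-1, j = 0..J-1.
void fill() {
    OuterFor<0, I>()();
}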

Greg Rogers
  • I really like where this solution could be headed - could you provide an example that actually calls/invokes/instantiates the loops? Also, how easily can this be extended to deeper nested loops? A four-deep nested for loop is not out of the question. (loops over polarization state, x, y and z). – RAC Jun 23 '09 at 14:40
1

I've never tried to do this so take this idea with a grain of salt...

It seems like you could use Boost.Preprocessor to do the loop unrolling (specifically the BOOST_PP_FOR and BOOST_PP_FOR_r macros) and then use templates to generate the actual constant expression.
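A rough sketch of that direction, using BOOST_PP_REPEAT rather than BOOST_PP_FOR for brevity; the row/column/f metafunctions and the dummy matrix A below are placeholders. Note the caveat raised in the comment below: the loop bounds have to be literal numbers visible to the preprocessor, not template parameters.

#include <boost/preprocessor/repetition/repeat.hpp>

// Placeholder metafunctions and matrix, standing in for the real ones.
template <int i, int j> struct row    { enum { value = 2 * i + j }; };
template <int i, int j> struct column { enum { value = 3 * j + i }; };
template <int i, int j> struct f      { enum { value = i * j };     };
struct Matrix { int& operator()(int, int) { static int x; return x; } } A;

// One unrolled statement: j is the repetition index, i arrives as the data argument.
#define BODY(z, j, i) A(row<i, j>::value, column<i, j>::value) = f<i, j>::value;

// Inner loop over j for a fixed i, then outer loop over i.
#define INNER(z, i, J_) BOOST_PP_REPEAT(J_, BODY, i)
#define UNROLL(I_, J_)  BOOST_PP_REPEAT(I_, INNER, J_)

void fill()
{
    UNROLL(3, 5)  // expands to 15 "A(...) = ...;" statements
}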

Ferruccio
  • 1
    After some thought, I don't think this will work. The catch is that the I and J used to terminate the loop are not really constants; instead, they are deduced from (not shown) template arguments. The snippet I posted earlier is really wrapped up inside a template { ... } block, and I and J are computed from ORDER using other TMPs. IIRC the preprocessor is invoked before template expansion takes place, which is going to mess things up, I think. I could be wrong though. – RAC Jun 23 '09 at 17:13