
Consider a function object F taking a constexpr size_t argument I

struct F
{
    template <size_t I>
    constexpr size_t operator()(size <I>) const { return I; }
};

wrapped within a type size <I>, where (for brevity)

template <size_t N>
using size = std::integral_constant <size_t, N>;

Of course, we could pass I directly, but I want to emphasize that it is constexpr by using it as a template argument. Function F is a dummy here, but in reality it could do a variety of useful things, like retrieving information from the Ith element of a tuple. F is assumed to have the same return type regardless of I. I could be of any integral type but is assumed non-negative.
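
For illustration only, the kind of F I have in mind might look like the following hypothetical function object, which returns the Ith element of a tuple converted to a common type (this relies on the size alias above and needs #include <tuple>):

struct get_ith
{
    std::tuple<int, long, double> t{1, 2L, 3.5};

    template <size_t I>
    double operator()(size <I>) const { return std::get<I>(t); }
};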

Problem

Given a constexpr size_t value I, we can call F by

F()(size <I>());

Now, what if we want to call F with a non-constexpr size_t value i? Consider the following:

constexpr size_t L = 10;
idx <F, L> f;
for (size_t i = 0; i < L; ++i)
    cout << f(i) << " ";

(Why would I need this? To give some context, I am in fact trying to build a composite iterator into a container view that represents a sequence of "joined" (concatenated) heterogeneous containers. This would give the ability to say something like join(a, b) = c; where arrays join(a, b) and c are of equal length. However, i is iterator state so cannot be constexpr, yet sub-iterators are stored in a tuple and need to be accessed by a constexpr index. Individual value_type's are roughly consistent so that the joined view can take on their common_type type, but sub-containers and consequently sub-iterators are of different types.)

Solution

Here, I have come up with struct idx <F, L>, which adapts function F for this purpose, assuming the input argument is less than L. This actually compiles fine, giving the output

0 1 2 3 4 5 6 7 8 9 

and here is a live example.

idx works by recursively decomposing input i into a binary representation and reconstructing a constexpr counterpart N:

template <typename F, size_t L, size_t N = 0, bool = (N < L)>
struct idx
{
    template <size_t R = 1>
    inline constexpr decltype(F()(size <0>()))
    operator()(size_t I, size <R> = size <R>()) const
    {
        return I%2 ? idx <F, L, N+R>()(--I/2, size <2*R>()) :
               I   ? idx <F, L, N  >()(  I/2, size <2*R>()) :
               F()(size <N>());
    }
};

where R represents a power of 2 at the current iteration. To avoid infinite template instantiation, a specialization is given for N >= L, returning F()(size <0>()) as a dummy value:

template <typename F, size_t L, size_t N>
struct idx <F, L, N, false>
{
    template <size_t R>
    inline constexpr decltype(F()(size <0>()))
    operator()(size_t I, size <R>) const { return F()(size <0>()); }
};
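
To see how the recursion unfolds, here is a hand trace for i = 5 (binary 101) with L = 10; this is just an annotation of the successive calls, not compiler output:

idx <F, 10, 0>()(5, size <1>())   // 5 is odd:  N becomes 0+1, I becomes 2, R becomes 2
idx <F, 10, 1>()(2, size <2>())   // 2 is even: N stays 1,     I becomes 1, R becomes 4
idx <F, 10, 1>()(1, size <4>())   // 1 is odd:  N becomes 1+4, I becomes 0, R becomes 8
idx <F, 10, 5>()(0, size <8>())   // I == 0:    returns F()(size <5>()), i.e. 5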

In fact, this method is a generalization of the more common idiom with a Boolean argument:

bool b = true;
b ? f <true>() : f <false>();

where f is a function taking a bool as a template argument. In this case it is evident that both possible versions of f are instantiated.
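
For completeness, a minimal f in this idiom might be (illustrative only):

template <bool B>
int f() { return B ? 1 : 0; }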

Question

Although this works and its run-time complexity is presumably logarithmic in i, I am concerned with the compile-time implications, like:

  • how many combinations of idx and its template operator() are instantiated for this to work at run time for any input i that is not known at compile time? (I understand "all possible" again, but how many?)

  • is it really possible to inline operator()?

  • is there any alternative strategy or variant that is "easier" to compile?

  • should I forget about this idea as an instance of pure code bloat?

Notes

Here are the compile times (in seconds) and executable sizes (in KB) I have measured for different values of L:

 L      Clang(s)    GCC(s)    Clang(KB)    GCC(KB)
 10       1.3       1.7          33         36
 20       2.1       3.0          48         65
 40       3.7       5.8          87        120
 80       7.0      11.2         160        222
160      13.9      23.4         306        430
320      28.1      47.8         578        850
640      56.5     103.0        1126       1753

So, although both compile time and executable size appear roughly linear in L, compilation is quite long and the binaries frustratingly large.

Attempting to force operator() inline fails: the request is apparently ignored by Clang (the executable gets even larger), while GCC reports recursive inlining.

Running nm -C on the executable, e.g. for L = 160, shows 511/1253 different versions of operator() (with Clang/GCC respectively). These are all for N < L, so it appears the terminating specialization for N >= L does get inlined.

PS. I would add tag code-bloat but the system won't let me.

iavr
  • Do you expect `L` to ever be greater than 5? – ecatmur Feb 21 '14 at 19:06
  • In my application (in the context I give), there's no reason why one wouldn't want to concatenate e.g. 10^5 arrays, especially if we consider multi-dimensional ones eventually... – iavr Feb 21 '14 at 19:13

4 Answers


I call this technique the magic switch.

The most efficient way I know of to do this is to build your own jump table.

// first, index list boilerplate.  Does log-depth creation as well,
// needed for >1000 magic switches:
template<unsigned...Is> struct indexes {typedef indexes<Is...> type;};
template<class lhs, class rhs> struct concat_indexes;
template<unsigned...Is, unsigned...Ys> struct concat_indexes<indexes<Is...>, indexes<Ys...>>{
    typedef indexes<Is...,Ys...> type;
};
template<class lhs, class rhs>
using ConcatIndexes = typename concat_indexes<lhs, rhs>::type;

template<unsigned min, unsigned count> struct make_indexes:
    ConcatIndexes<
        typename make_indexes<min, count/2>::type,
        typename make_indexes<min+count/2, (count+1)/2>::type
    >
{};
template<unsigned min> struct make_indexes<min, 0>:
    indexes<>
{};
template<unsigned min> struct make_indexes<min, 1>:
    indexes<min>
{};
template<unsigned max, unsigned min=0>
using MakeIndexes = typename make_indexes<min, max-min>::type;
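// e.g. MakeIndexes<5> is indexes<0,1,2,3,4>, built with logarithmic recursion depth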

// This class exists simply because [](blah){code}... `...` expansion
// support is lacking in many compilers:
template< typename L, typename R, unsigned I >
struct helper_helper {
    static R func( L&& l ) { return std::forward<L>(l)(size<I>()); }
};
// the workhorse.  Creates a "manual" jump table, then invokes it:
template<typename L, unsigned... Is>
auto
dispatch_helper(indexes<Is...>, L&& l, unsigned i)
-> decltype( std::declval<L>()(size<0>()) )
{
  // R is return type:
  typedef decltype( std::declval<L>()(size<0>()) ) R;
  // F is the function pointer type for our jump table:
  typedef R(*F)(L&&);
  // the jump table:
  static const F table[] = {
    helper_helper<L, R, Is>::func...
  };
  // invoke at the jump spot:
  return table[i]( std::forward<L>(l) );
};
// create an index pack for each index, and invoke the helper to
// create the jump table:
template<unsigned Max, typename L>
auto
dispatch(L&& l, unsigned i)
-> decltype( std::declval<L>()(size<0>()) )
{
  return dispatch_helper( MakeIndexes<Max>(), std::forward<L>(l), i );
};

which requires some static setup, but is pretty fast when run.

An assert that i is in bounds may also be useful.
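
For example, assuming the question's F and the size<> alias are in scope, a minimal usage sketch looks like:

#include <iostream>

int main()
{
    // dispatches the runtime index i through the generated jump table
    for (unsigned i = 0; i < 10; ++i)
        std::cout << dispatch<10>(F(), i) << " ";   // prints 0 1 2 ... 9
}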

live example

Yakk - Adam Nevraumont
  • Thanks, this is indeed a very different alternative. This would definitely be faster at run time, at the cost of some storage space. GCC indeed has a problem with the line containing the lambda, I'd have to adjust. And Clang compiles in 4.7 seconds for `L=1<<10` (1024) but above that it crashes during compilation. Any idea? (in the approach of my own answer, I have managed even `L=1<<14` (16K) in 44 seconds). – iavr Feb 21 '14 at 19:56
  • @iavr rewritten. Recursion depth reduced to reasonable and should compile faster. `helper_helper` added for the fact that lambda`...` seems rarely supported. An array of X function pointers with X functions backing them seems much less storage than X lg X actual functions. :) – Yakk - Adam Nevraumont Feb 21 '14 at 19:58
  • Thanks again, this tree-like log-depth implementation of `make_indexes` is something I had in mind doing at some point anyway (but I always left it for later). GCC works now. For this last version, Clang/GCC give 13.7/40.9 seconds for `L=1<<14`, so it beats my solution in compilation time :-) However, the binary is 1.9 MB vs 839 KB for my solution. – iavr Feb 21 '14 at 20:22
  • @iavr stripped symbols? The arguments to one of my functions gets pretty ridiculous. Making `dispatch_helper` `static` and avoiding embedding that symbol might help? Also changing `make_indexes` to always use `typedef blah type` instead of sometimes inheritance might also reduce symbol bloat... not sure. – Yakk - Adam Nevraumont Feb 21 '14 at 20:25
  • You're right, I forgot about that. With stripping, I get 746/694 KB with your/my solution. – iavr Feb 21 '14 at 20:28
  • Are there chances the lambdas get inlined? A while ago, I used a similar jump table generator technique (for a discriminated union, aka "variant"), which *effectively* created a class with a set of overloaded member functions for each type of the variant, and in this case, the compiler didn't inline. – CouchDeveloper Feb 21 '14 at 20:59
  • @CouchDeveloper there are compilers that can `inline` a call over a function pointer -- I've seen gcc do it. I don't know if it would be able to handle the above, and almost certainly only if the array index was known at compile time. – Yakk - Adam Nevraumont Feb 21 '14 at 21:02
  • @Yakk I've got the very first version integrated into my "join-iterator" and it works, so thanks again! In my case, the array index depends on iterator state so is not known at compile time, and I don't expect the call to be inlined. But my solution was also not inlined (as recursive) plus it took logarithmic-time computation to get the index, so the lookup+call is still better I guess. The most annoying thing is that functions to be called are member functions so I had to adapt for extra arguments, pass `*this` everywhere as `t`, and use `t.f()` instead of just `f()`. – iavr Feb 22 '14 at 01:19

If your solution has a cap on the maximum possible value (say 256), you can use macro magic and a switch statement to simplify it:

#define POS(i) case (i): return F<(i)>(); break;
#define POS_4(i) POS(i + 0) POS(i + 1) POS(i + 2) POS(i + 3)
#define POS_16(i) POS_4(i + 0) POS_4(i + 4) POS_4(i + 8) POS_4(i + 12)

int func(int i)
{
    switch(i)
    {
        POS_16(0)
    }
}
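
For comparison, here is roughly what the expanded switch looks like when written directly against the question's function-object F and size<> (a hand-written sketch without the macros):

size_t call(size_t i)
{
    switch (i)
    {
        case 0: return F()(size <0>());
        case 1: return F()(size <1>());
        case 2: return F()(size <2>());
        case 3: return F()(size <3>());
        // ... one case per index, up to the cap
        default: return F()(size <0>());   // out of range: dummy value
    }
}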

Another possible solution (with C++11) is to use variadic templates:

template<int I>
struct FF
{
    static int f() { return I; }
};


template<typename... I>
int func(int i)
{
    constexpr int (*Func[])() = { I::f... };
    return Func[i]();
}

int main(int argc, char** argv)
{
    func<FF<0>,FF<1>>(1);
}
Yankes
  • I'd rather stay as far away as possible from the macro magic :-) The 2nd solution is interesting and looks similar (or the same) to @Yakk's solution, right? – iavr Feb 21 '14 at 20:38
  • @iavr yes, my `constexpr int (*Func[])() = { I::f... };` is the same trick that @Yakk uses: `static const F table[] = { helper_helper::func... }`. The rest is noise. – Yankes Feb 21 '14 at 21:10

I'll take the obvious position here and ask if "I want to emphasize that it is constexpr by using it as a template argument" is worth this cost and if:

struct F
{
    constexpr size_t operator()(size_t i) const { return i; }
    template <size_t I>
    constexpr size_t operator()(size <I>) const { return (*this)(I); }
};

would not be a much simpler solution.

Casey
  • This is simple but not a solution. I still need to adapt a non-constexpr input to the constexpr call. – iavr Feb 21 '14 at 20:35

This is not exactly an answer, and my question still stands, but I have found a workaround that gives an impressive boost in compilation. It is a minor tweak of the solution given in the question, where parameter R is moved from operator() out to struct idx itself, and the terminating condition now involves both R and N:

template <
    typename F, size_t L,
    size_t R = 1, size_t N = 0, bool = (R < 2 * L && N < L)
>
struct idx //...

The entire code is given in this new live example.
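
For reference, the full tweaked version looks roughly like this (a sketch along the lines described above; the exact code is in the linked live example):

template <
    typename F, size_t L,
    size_t R = 1, size_t N = 0, bool = (R < 2 * L && N < L)
>
struct idx
{
    inline constexpr decltype(F()(size <0>()))
    operator()(size_t I) const
    {
        return I%2 ? idx <F, L, 2*R, N+R>()(--I/2) :
               I   ? idx <F, L, 2*R, N  >()(  I/2) :
               F()(size <N>());
    }
};

template <typename F, size_t L, size_t R, size_t N>
struct idx <F, L, R, N, false>
{
    inline constexpr decltype(F()(size <0>()))
    operator()(size_t) const { return F()(size <0>()); }
};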

This approach apparently cuts down a huge number of unnecessary specialization combinations for R. Compile times and executable sizes drop dramatically. For instance, I have been able to compile in 10.7/18.7 seconds with Clang/GCC for L = 1<<12 (4096), yielding an executable of 220/239 KB. In this case, nm -C shows 546/250 versions of operator().

iavr