Most cpu-efficient way to use std:: algorithms with arguments to a variadic function template?

Question

Say you have a variadic function template which takes a functor and a sequence of homogeneous types, and you want to use std::accumulate to fold the sequence, like so:

template<typename BinaryFuncType, typename... ArgTypes>
void do_something(const BinaryFuncType& f, const ArgTypes&... objects)
{
    // ...
    // use std::accumulate to fold 'objects' using 'f'
    // ...
}

Is it possible to pass the variadic arguments (objects) to the range algorithm (std::accumulate) directly, i.e., without incurring the costs of copying the objects (or references) to an iterable container?

If this were really "high-performance code" then I would have to assume that there are hundreds of thousands of numbers, which, imo, makes it rather absurd to pass them as separate parameters to a single function (why would they not be in a container already?) — Brian Bi, Feb 18 '14 at 02:28
Why do you have strange limitations on high performance code? Do we also have to avoid using the letter z? — Yakk - Adam Nevraumont, Feb 18 '14 at 02:28
@BrianBi: This question is academic, in general the most efficient way to use variadic function arguments in a range algorithm. I added the "assume high performance" so people wouldn't respond with "are you sure this is necessary? is this premature optimization?" etc. — etherice, Feb 18 '14 at 02:31
Passing arguments by hand means you will have at most a dozen numbers to work with. Performance is a non-issue in that case. If you need to optimize a computation, you should design a suitable data structure and process data by the tens of thousands. — kuroi neko, Feb 18 '14 at 02:31
@Yakk: This question is academic. In general, is there a way to use variadic function arguments in a range algorithm without copying the arguments into an iterable container. — etherice, Feb 18 '14 at 02:32
@kuroineko: This question is academic. Please see previous comments. — etherice, Feb 18 '14 at 02:32
@kuroineko: that's not a sound argument anyway... some tools and domain-specific languages spit out C++ source code. *pat* — Tony Delroy, Feb 18 '14 at 02:38
@TonyD A general purpose code generator and its efficiency are soon parted :) — kuroi neko, Feb 18 '14 at 02:42
Are the numbers cheap to copy? Move? Are they large or small? Roughly how many? What optimization capabilities does the compiler have working with arrays, references, member function pointers, static const data? Do we have stack layout abi information? How branchy is the function? Does indirection to contiguous data cause performance problems? If this was a practical question, it would have **meat** behind it to sink teeth into. As an academic one, there is nothing much here to talk about that is useful. — Yakk - Adam Nevraumont, Feb 18 '14 at 02:47
@Yakk: Assume the objects cannot be moved and would be expensive to copy. Assume that `do_something` is called 10 billion times in a tight loop from another function which we have no control over, and that around a dozen of the objects are passed in each time. — etherice, Feb 18 '14 at 02:56
If you care about performance and the copying, you have only to make the function `inline` and enable optimisation and you can expect arguments for `accumulate` will be directly assembled into any `double d[] = { numbers... };` local variable anyway. Have you written some code that shows that that's not the case? Otherwise the question's a waste of time.... — Tony Delroy, Feb 18 '14 at 02:59
@TonyD: `d[] = { numbers... };` would take care of allocation, but the objects would still have to be copied. Assume that copying is an expensive operation for these objects. See my previous comment. — etherice, Feb 18 '14 at 03:01
If you have no control over a stupid design, then your time would be better spent convincing the inept designers to change their minds about it. You could probably come up with a faster version in PHP with an algorithm overhaul. — kuroi neko, Feb 18 '14 at 03:03
@etherice: "...still...copied" - maybe not: that's the point of inlining - they may be able to be constructed in-place, depending on where the values originate from. If you're passing in references to variables at disjoint memory addresses, then of course you'll need to have them copied or explicitly arrange a "gathering" by-reference iterable for accumulate per Brian's answer, but if they're constants or temporaries in the calling context then it should optimise nicely. Could even write code that detected and used copy vs reference_wrapper, using traits. — Tony Delroy, Feb 18 '14 at 03:23
@kuroineko: I asked the hypothetical developers of the hypothetical code I referred to and they said they can't change it. — etherice, Feb 18 '14 at 03:24
Ali - thank you for defending the question. I was truly surprised (and disappointed) that several users chose to attack what I believed to be an interesting question worthy of discussion. — etherice, Feb 18 '14 at 23:15
@etherice Don't take those downvotes / comments personally; I believe they just misinterpreted the question. It also took me some time to understand the real question. If you look at my edit, only changing a few words was necessary. Anyway, up voted the question, and hope it gets even more upvotes! — Ali, Feb 18 '14 at 23:23
I have rolled back the meta commentary in your question @etherice. If you feel Ali's rephrasing was accurate, just replace your last two sentences with what he wrote. But the rest most certainly does not belong in a question. — Bart, Apr 18 '14 at 18:24
@Bart: The clarification Ali suggested was useful so I added it back. For the record, I also think his "meta commentary" was a useful *footnote* as it effectively discouraged people from downvoting a question they did not understand or appreciate from an academic (or high-performance computing) perspective. Personally, I think the footnote is appropriate for that reason, but I'll respect your SO reputation and leave it out. — etherice, Apr 23 '14 at 00:21
No problem @etherice. Glad you saw it and corrected. It's just that such meta-commentary doesn't belong in questions. Thanks for the update. — Bart, Apr 23 '14 at 07:44

score 5 · Answer 1 · answered Feb 18 '14 at 03:00

You can create a container of std::reference_wrapper objects and accumulate over that. Here's an example:

#include <iostream>
#include <functional>
#include <numeric>   // std::accumulate lives here, not in <algorithm>
#include <array>
using namespace std;
template<typename... ArgTypes>
int sum(const ArgTypes&... numbers) {
    array<reference_wrapper<const int>, sizeof...(numbers)> A = {ref(numbers)...};
    return accumulate(A.begin(), A.end(), 0);
}
int main() {
    int x = 1, y = 2, z = 3;
    cout << sum(x, y, z) << endl; // prints 6
}

I'll leave it to you to figure out how to adapt this to a more general setting like in your question.
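One way the adaptation might look, as a minimal sketch rather than a definitive implementation. The explicit `init` parameter and the `do_something_example` helper are assumptions added for illustration; the question's signature has neither. The sketch assumes every argument is an lvalue of the same type `T`.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <numeric>
#include <string>
#include <utility>

// Hypothetical generalization of sum(): folds same-typed arguments with a
// caller-supplied binary functor, starting from an explicit initial value.
template<typename BinaryFuncType, typename T, typename... Rest>
T do_something(const BinaryFuncType& f, T init, const T& first, const Rest&... rest)
{
    // Only reference wrappers (pointers under the hood) are stored here;
    // the objects themselves are never copied into the array.
    std::array<std::reference_wrapper<const T>, 1 + sizeof...(Rest)> refs = {{
        std::cref(first), std::cref(rest)...
    }};
    return std::accumulate(refs.begin(), refs.end(), std::move(init), f);
}

// Usage sketch: the same template folds ints and strings.
void do_something_example()
{
    int x = 1, y = 2, z = 3;
    assert(do_something(std::plus<int>(), 0, x, y, z) == 6);

    std::string a = "ab", b = "cd";
    assert(do_something(std::plus<std::string>(), std::string(), a, b) == "abcd");
}
```

The functor has to accept `const T&` parameters (as `std::plus<T>` does) so that each `reference_wrapper` converts implicitly; a generic functor would need to call `.get()` on its arguments.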

answered Feb 18 '14 at 03:00

Brian Bi

One of the innumerable ways C++ has to dump the trashcan on the code optimizer's desk... – kuroi neko Feb 18 '14 at 03:11
Well, etherice comments "*Assume that copying is an expensive operation*" - so they *must* be high- or arbitrary-precision types of considerable size, where `reference_wrapper`s would actually help. So +1! (If the template is used with inbuilt `ArgTypes` then `reference_wrapper`s are worse than useless.) – Tony Delroy Feb 18 '14 at 03:18
@TonyD: I should have stated "objects" instead of "numbers" to keep the question more abstract. I was only interested in whether there's a way to pass variadic arguments to a range algorithm without copying the data (or references to the data), and it was purely an academic curiosity. If the answer is "no, you must copy the arguments/references to an iterable container", that's fine. I was just looking for an authoritative answer. – etherice Feb 18 '14 at 03:39
Brian: I upvoted your answer, but it didn't *exactly* answer my question. I asked if there's a way to pass the variadic arguments **without copying data** (objects or references) into a new container. If the answer is "**no**, the objects or references **must** be copied to an iterable container", then that's fine and your solution is acceptable as the most cpu-efficient. The important thing is that I'm looking for an authoritative answer. I updated the question to clarify these points. Thanks. – etherice Feb 18 '14 at 04:02
Well, the Standard-library provided iterators only go over contiguous elements and Standard Containers, the latter's irrelevant and the former needs copying of data or references (latter per this answer). For custom iterators to access the arguments, I can only think of wrapping `varargs` which will only work for homogenous types. You can, however, use something like `int x[sizeof...(ArgTypes)] = { (result += numbers, 0)... };` to have some code (here `result +=`) evaluated with each argument: that could call a functor parameter, but won't help with iterator-expecting fns like `accumulate`. – Tony Delroy Feb 18 '14 at 04:05
@etherice: I think it's fairly easy to see that what you're asking is impossible. `std::accumulate` only takes a pair of iterators. There has to be some way of incrementing the start iterator so that it keeps going to the next argument. However, since there are no guarantees about the arrangement of arguments in memory, the iterator would have to "know" the locations of *all* arguments... which would turn the iterator itself into a container of pointers/references. – Brian Bi Feb 18 '14 at 04:17
@BrianBi: Well, if the function wasn't inlined you could declare the call-convention and create a custom iterator that accesses the arguments directly on the stack. Or I thought maybe there's some template meta-programming trick or library/language feature or idiom I wasn't aware of. Anyway, I accepted your answer as I think your reasoning is sound. – etherice Feb 18 '14 at 04:34
@etherice no references are copied: pointers are created and stored in the reference wrapper. While you can dictate that copying your numbers is expensive (infinite precision types), copying 12 pointers cannot be. – Yakk - Adam Nevraumont Feb 18 '14 at 12:26
@BrianBi: Please see the comment in Ali's answer for an explanation of why I switched the accepted status to his. I still appreciate your answer and of course gave you a +1 upvote. – etherice Feb 19 '14 at 01:05
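The pack-expansion trick mentioned in the comments can be sketched as follows. It applies the functor to each argument in order without building an iterable range, so it sidesteps `std::accumulate` rather than feeding it. The names `fold_args` and `fold_args_example` are hypothetical, and the explicit `init` parameter is an assumption added for the sketch.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: folds the arguments left to right with f, starting
// from init, without copying them into a container.
template<typename BinaryFuncType, typename T, typename... ArgTypes>
T fold_args(const BinaryFuncType& f, T init, const ArgTypes&... objects)
{
    // Elements of a braced initializer list are evaluated in order, so f is
    // applied to the arguments left to right. The array itself only holds
    // zeros; the leading 0 keeps it non-empty when the pack is empty.
    int expand[] = { 0, (init = f(init, objects), 0)... };
    (void)expand;  // silence unused-variable warnings
    return init;
}

// Usage sketch: a non-commutative functor shows the left-to-right order.
void fold_args_example()
{
    assert(fold_args(std::plus<int>(), 0, 1, 2, 3) == 6);
    assert(fold_args(std::minus<int>(), 10, 1, 2, 3) == 4);  // ((10-1)-2)-3
    assert(fold_args(std::plus<int>(), 7) == 7);             // empty pack
}
```

The dummy array is trivially removable by an optimizer, but whether a given compiler actually removes it would have to be checked in the generated assembly.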

score 4 · Accepted Answer · answered Feb 18 '14 at 13:43

Apparently yes but in a twisted way. Consider the following code:

#include <array>
#include <cstdio>
#include <cstdlib>   // std::atoi
#include <iterator>
#include <numeric>   // std::accumulate

template<typename... Ts>
int sum(Ts... numbers) {
    std::array<int,sizeof...(numbers)> list{{numbers...}};
    return std::accumulate(std::begin(list), std::end(list), 0);
}

__attribute__((noinline))
void f(int x, int y, int z) {
  std::printf("sum = %d\n", sum(x, y, z));
}

int main(int argc, char* argv[]) {
  if (argc < 4) return 1;  // expects three integer arguments
  int x = std::atoi(argv[1]);
  int y = std::atoi(argv[2]);
  int z = std::atoi(argv[3]);
  f(x, y, z);
}

I looked at the generated assembly code. Here is what clang optimizes sum() into; I rewrote the assembly as C for clarity:

int sum(int x, int y, int z) {
  int tmp = x;
  tmp += y;
  tmp += z;
  return tmp;
}

I can say that the generated assembly code is optimal! It got rid of the temporary std::array and unrolled the loop in std::accumulate().

So the answer to your question: even if you create a temporary iterable container, it can be optimized away if the compiler is smart enough and your numeric types are simple enough (built-in types or PODs). You won't pay for the creation of a temporary container or for copying the elements into the temporary container if it can be optimized away.

Sadly, gcc 4.7.2 wasn't that dexterous:

int sum(int x, int y, int z) {
  int a[3];
  a[0] = x;
  a[1] = y;
  a[2] = z;
  int tmp = x;
  tmp += y;
  tmp += z;
  return tmp;
}

Unfortunately, it did not recognize that it can get rid of the temporary array. I will check that with the latest gcc from trunk and if the problem still exists, file a bug report; it seems like a bug in the optimizer.

The approach (in terms of C++ code) is similar or identical to the other answers, but the assembly code analysis (particularly, the clang output which is certainly optimal and answers the original question) makes this the best answer overall, so I'm switching this to the accepted answer. — etherice, Feb 19 '14 at 00:59

Answer 3 · answered Feb 18 '14 at 06:46 · kuroi neko

Once your template is instantiated, the compiler sees a number of distinct parameters. Each argument can even be of a different type.

It's exactly as if you wanted to iterate over the arguments of fun(a, b, c, d) and expected the code optimizer to cope with layers upon layers of obfuscation.

You could go for a recursive template, but that would be as cryptic as it would be inefficient.
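For reference, a recursive fold might look roughly like this. It is only a sketch: `fold_recursive` and its example function are hypothetical names, and the explicit accumulator parameter is an assumption. How well it performs depends on how much of the recursion the compiler inlines.

```cpp
#include <cassert>
#include <functional>

// Base case: no arguments left, the accumulator is the result.
template<typename F, typename T>
T fold_recursive(const F&, T acc)
{
    return acc;
}

// Recursive case: combine the accumulator with the first argument, then
// recurse on the remaining ones. Each step instantiates a new function.
template<typename F, typename T, typename First, typename... Rest>
T fold_recursive(const F& f, T acc, const First& first, const Rest&... rest)
{
    return fold_recursive(f, f(acc, first), rest...);
}

// Usage sketch: folds left to right, like std::accumulate would.
void fold_recursive_example()
{
    assert(fold_recursive(std::plus<int>(), 0, 1, 2, 3) == 6);
    assert(fold_recursive(std::minus<int>(), 10, 1, 2, 3) == 4);  // ((10-1)-2)-3
}
```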

You could design a template-less variadic function, but then you would have to use the <cstdarg> interface and could kiss std::accumulate goodbye.
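A template-less version could be sketched like this. `sum_varargs` is a hypothetical name, and passing the argument count first is an assumption, since `<cstdarg>` gives the callee no way to discover how many arguments were passed or what their types are.

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical C-style variadic sum. 'count' must equal the number of int
// arguments that follow; nothing checks this at compile time or run time.
int sum_varargs(int count, ...)
{
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += va_arg(args, int);  // each argument must really be an int
    va_end(args);
    return total;
}

// Usage sketch.
void sum_varargs_example()
{
    assert(sum_varargs(3, 1, 2, 3) == 6);
    assert(sum_varargs(0) == 0);
}
```

The loop walks the argument list by hand, so there is no iterator pair to hand to `std::accumulate`, and all type safety is lost.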

Possibly you could use the variadic arguments as an initializer for a plain old array and use std::accumulate on it, provided you restrict the use of your shiny new toy to possibly inlineable parameters, namely a list of objects that can be converted to a single base type at compile time.

If you have big and costly objects, this method can still be used with const references to the said objects. I suppose you will spend quite a bit of time optimizing the operators involved in the accumulation computation if you want to squeeze performance out of it, but well, anything is doable with enough blood and sweat.

#include <array>
#include <numeric>
#include <type_traits>

using namespace std;


// Since we need to get back the base type, might as well check that the
// "optimized" code is not fed with junk that would require countless implicit
// conversions and prevent the compiler from inlining the stupid dummy function
// that should just act as a wrapper for the underlying array initialization.
template<class T, class...>
struct same_type
{
    static const bool value = true;
    typedef T type;
};

template<class Ta, class Tb, class... Types>
struct same_type<Ta, Tb, Types...>
{
    static const bool value = is_same<Ta,Tb>::value && same_type<Tb, Types...>::value;
    typedef Ta type;
};

// --------------------------------------------------------
// dummy function just here to make a copy of its arguments
// and pass it to std::accumulate
// --------------------------------------------------------
template<typename F, typename...Args> 
typename same_type<Args...>::type do_something(F fun, Args...args)
{
    // just a slight bit less of obfuscation
    using base_type = typename same_type<Args...>::type;

    // make sure all arguments have the same type
    static_assert(same_type<Args...>::value, "all do_something arguments must have the same type");

    // arguments as array
    array<base_type, sizeof...(Args)> values = { args... };
    return accumulate (values.begin(), values.end(), (base_type)0, fun);
}

// --------------------------------------------------------
// yet another glorious functor
// --------------------------------------------------------
struct accumulator {
    template<class T>
    T operator() (T res, T val)
    {
        return res + val;
    }
};

// --------------------------------------------------------
// C++11 in its full glory
// --------------------------------------------------------
int main(void)
{
    int    some_junk = do_something(accumulator(),1,2,3,4,6,6,7,8,9,10,11,12,13,14,15,16);
    double more_junk = do_something(accumulator(),1.0,2.0,3.0,4.0,6.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0);
    return some_junk+(int)more_junk;
}

I had a look at the muck generated by the latest Microsoft compiler. It did inline the double version entirely. Most of the code is busy initializing the array; the rest is a loop of half a dozen instructions. Note that the loop was not unrolled either.

It did not inline the int version completely. It removed the dummy function call but generated an instance of the accumulate template.

Not surprisingly, the compiler won't bother to optimize a function if it reckons the number and size of parameters passed don't justify it, since it has no way to know this piece of code will be called a few gazillion times per second by an idiotic design.

You could certainly have a lot of fun sprinkling the code with register and inline directives and pragmas and tweaking the compiler optimization options, but it's a dead end, IMHO.

A perfect example of bad design using supposedly cutting edge technology to bonk rocks together, if you ask me.

Thanks. This approach is similar to the one posted by Brian Bi and identical to the one posted by Ali. However, Ali's answer is more concise and provides a thorough analysis of and reasoning about the assembly output (from gcc & clang), so it seemed the most appropriate to accept. I'm still giving you an upvote +1 for your help. Thank you. — etherice, Feb 19 '14 at 00:43
My pleasure. I always find it funny to see how C++ is a language easier understood by compilers than by its human worshipers :). I still advise you to enforce some kind of "all parameters of the same type" policy, or else your compiler will be unable to optimize the copy away. — kuroi neko, Feb 19 '14 at 08:01
I sometimes do enforce those kind of constraints to prevent "misuse" of a function (in terms of logic or optimization). In addition to using `boost::mpl` I have a separate TMP utils library that would allow me to write `all_::value` in the example you provided instead of all the boilerplate code. Very convenient. — etherice, Feb 19 '14 at 18:58
Ah well, I just tried to patch a not too bulky working example together, but certainly you can generalize arguments checking with a bit more work. — kuroi neko, Feb 19 '14 at 19:09
Agree - I thought your example was clear/concise, and I realize on stack overflow it is typical to only use standard library facilities (as opposed to in-house or 3rd party libraries) so the solution can apply to everyone. Thanks again. — etherice, Feb 19 '14 at 19:15
