8

I'm working on a complex framework that uses std::function<> as an argument to many functions. While profiling, I found the following performance problem.

Can somebody explain why Loop3a is so slow? I expected the call to be inlined, so that the timings (and the generated assembly) would be the same. Is there a way to improve the performance, or a different approach? Does C++17 change anything here?

#include <iostream>
#include <functional>
#include <chrono>
#include <cmath>

static const unsigned N = 300;

struct Loop3a
{
    void impl()
    {
        sum = 0.0;
        for (unsigned i = 1; i <= N; ++i) {
            for (unsigned j = 1; j <= N; ++j) {
                for (unsigned k = 1; k <= N; ++k) {
                    sum +=  fn(i, j, k);
                }
            }
        }
    }

    std::function<double(double, double, double)> fn = [](double a, double b, double c) {
        const auto subFn = [](double x, double y) { return x / (y+1); };
        return sin(a) + log(subFn(b, c));
    };
    double sum;
};


struct Loop3b
{
    void impl()
    {
        sum = 0.0;
        for (unsigned i = 1; i <= N; ++i) {
            for (unsigned j = 1; j <= N; ++j) {
                for (unsigned k = 1; k <= N; ++k) {
                    sum += sin((double)i) + log((double)j / (k+1));
                }
            }
        }
    }

    double sum;
};


int main()
{
    using Clock = std::chrono::high_resolution_clock;
    using TimePoint = std::chrono::time_point<Clock>;

    TimePoint start, stop;
    Loop3a a;
    Loop3b b;

    start = Clock::now();
    a.impl();
    stop = Clock::now();
    std::cout << "A: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::cout << "ms\n";

    start = Clock::now();
    b.impl();
    stop = Clock::now();
    std::cout << "B: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::cout << "ms\n";

    return a.sum == b.sum;
}

Sample output using g++5.4 with "-O2 -std=c++14":

A: 1794ms
B: 906ms

In the profiler I can see many of these internals:

double&& std::forward<double>(std::remove_reference<double>::type&)
std::_Function_handler<double (double, double, double), Loop3a::fn::{lambda(double, double, double)#1}>::_M_invoke(std::_Any_data const&, double, double, double)
Loop3a::fn::{lambda(double, double, double)#1}* const& std::_Any_data::_M_access<Loop3a::fn::{lambda(double, double, double)#1}*>() const
Yakk - Adam Nevraumont
Radek
  • std::function does type erasure and is never inlined. – Tatsuyuki Ishi Mar 17 '17 at 11:46
  • 1
    IMHO, In cases like this you are better served with a functor instead of a lambda. Then you can have a named function type instead of a `std::function` and the compiler is good at inlining functors. – NathanOliver Mar 17 '17 at 11:52
  • 1
    Does it work faster with lambda instead of `std::function`? I.e. `auto fn = [](double a, double b, double c) ...`. – ilotXXI Mar 17 '17 at 12:02
  • In addition to what has already been said: this can never be inlined because the member is not const. You could change the pointer at runtime, so the compiler would have to prove that it is never modified in order to inline it, which might not even be possible in some cases. – Christoph Diegelmann Mar 17 '17 at 13:22

2 Answers

14

std::function is not a zero-runtime-cost abstraction. It is a type-erased wrapper that has a virtual-call-like cost when invoking operator() and could also potentially heap-allocate (which could mean a cache miss per call).

The compiler will most likely not be able to inline it.

If you want to store your function objects in a way that does not introduce additional overhead and that allows the compiler to inline them, you should use a template parameter. This is not always possible, but might fit your use case.
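As a minimal sketch of that idea (the names `kernel`, `sum_cube` and `sum_cube_erased` are made up for illustration and are not from the question), the same triple loop can take its callable either as a template parameter or as a std::function. In the first form the compiler knows the concrete type of `f` and is free to inline it; in the second every call goes through the type-erased dispatch.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// The kernel from the question, written as a free function (hypothetical name).
inline double kernel(double a, double b, double c) {
    assert(b > 0 && c > -1);  // keeps the argument of log positive
    return std::sin(a) + std::log(b / (c + 1));
}

// Callable passed as a template parameter: its concrete type is visible
// here, so the optimizer may inline the call into the loop body.
template <class F>
double sum_cube(unsigned n, F&& f) {
    double sum = 0.0;
    for (unsigned i = 1; i <= n; ++i)
        for (unsigned j = 1; j <= n; ++j)
            for (unsigned k = 1; k <= n; ++k)
                sum += f(i, j, k);
    return sum;
}

// Same loop, but the callable is type-erased: each call is dispatched
// indirectly through std::function.
double sum_cube_erased(unsigned n,
                       const std::function<double(double, double, double)>& f) {
    double sum = 0.0;
    for (unsigned i = 1; i <= n; ++i)
        for (unsigned j = 1; j <= n; ++j)
            for (unsigned k = 1; k <= n; ++k)
                sum += f(i, j, k);
    return sum;
}
```

Both functions compute the same sum for the same callable, e.g. `sum_cube(N, [](double a, double b, double c) { return kernel(a, b, c); })`. Whether the template version is actually faster depends on the compiler and flags, so it is worth measuring.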


I wrote an article that's related to the subject:
"Passing functions to functions"

It contains some benchmarks that show how much assembly is generated for std::function compared to a template parameter and other solutions.

Vittorio Romeo
  • 4
    Fun fact: it's the third time in an hour that I answer a question that's *unnecessarily* using `std::function`. I wish it was clearer that it's not a zero-cost abstraction and that it was named something like `std::type_erased_function`... – Vittorio Romeo Mar 17 '17 at 11:50
  • @Vittorrio: I think you can blame the cpp reference sites for that, when I was researching how to store my lambda that was what the reference pages mentioned. It turned out a function pointer was much faster, but the pages about lambdas didn't mention that you could pass a lambda to one of those. – Jason Lang Mar 17 '17 at 12:12
  • 2
    @JasonLang: That's because you can't. You can only convert a lambda to a function pointer if it is a *captureless* lambda. And the reference page notes that. – Nicol Bolas Mar 17 '17 at 15:28
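To illustrate the point about function pointers, here is a small sketch (the alias `BinaryFn` and the functions `apply` and `demo_function_pointer` are hypothetical names chosen for this example). A lambda with an empty capture list converts implicitly to a plain function pointer; a capturing lambda does not.

```cpp
#include <cassert>

// Plain function pointer type taking and returning doubles.
using BinaryFn = double (*)(double, double);

// Calls f through the function pointer; no type erasure is involved.
inline double apply(BinaryFn f, double x, double y) {
    assert(f != nullptr);
    return f(x, y);
}

double demo_function_pointer() {
    // Captureless lambda: implicitly convertible to BinaryFn.
    BinaryFn divide = [](double x, double y) { return x / (y + 1); };

    // A capturing lambda would not compile here:
    //   double offset = 1.0;
    //   BinaryFn bad = [offset](double x, double y) { return x / (y + offset); };

    return apply(divide, 6.0, 2.0);  // 6 / (2 + 1)
}
```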
6

std::function has roughly the overhead of a virtual call. This is small in absolute terms, but if the operation you call is even smaller, the overhead can be large in relative terms.

In your case, you are looping heavily over the std::function, calling it with a set of predictable values, and probably doing next to nothing within it.

We can fix this.

template<class F>
std::function<double(double, double, double, unsigned)>
repeated_sum( F&& f ) {
  return
    [f=std::forward<F>(f)]
    (double a, double b, double c, unsigned count)
    {
      double sum = 0.0;
      for (unsigned i = 0; i < count; ++i)
        sum += f(a,b,c+i);
      return sum;
    };
}

then

std::function<double(double, double, double, unsigned)> fn =
  repeated_sum
  (
    [](double a, double b, double c) {
      const auto subFn = [](double x, double y) { return x / (y+1); };
      return sin(a) + log(subFn(b, c));
    }
  );

Now repeated_sum takes a (double, double, double) function and returns a (double, double, double, unsigned) function. This new function calls the previous one repeatedly, each time with the last coordinate increased by 1, and returns the sum of the results.

We then replace impl as follows:

void impl()
{
    sum = 0.0;
    for (unsigned i = 1; i <= N; ++i) {
        for (unsigned j = 1; j <= N; ++j) {
            sum += fn(i, j, 1, N); // accumulate; c starts at 1 so c+i covers k = 1..N
        }
    }
}

where we replace the "lowest level loop" with a single call to our repeating function.

This will reduce the virtual call overhead by a factor of 300, which basically makes it disappear. Roughly, 50% of the time / 300 ≈ 0.17% of the time (actually about 0.33%, as we reduce the total time by a factor of 2, which doubles the relative contribution, but who is counting tenths of a percent?)
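The factor of 300 can be checked by counting how often a std::function is invoked. The following is only a sketch under my own assumptions: `CallStats`, `kernel`, `per_element` and `per_row` are hypothetical names, the batched call accumulates its return value, and the third argument starts at 1 so that both variants cover k = 1..n. For n = 300 the per-element version makes 27,000,000 type-erased calls and the per-row version makes 90,000.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct CallStats {
    double sum;                  // accumulated result
    unsigned long erased_calls;  // number of calls made through std::function
};

inline double kernel(double a, double b, double c) {
    assert(b > 0 && c > -1);  // keeps the argument of log positive
    return std::sin(a) + std::log(b / (c + 1));
}

// One type-erased call per (i, j, k): n * n * n calls in total.
CallStats per_element(unsigned n) {
    std::function<double(double, double, double)> fn = kernel;
    CallStats s{0.0, 0};
    for (unsigned i = 1; i <= n; ++i)
        for (unsigned j = 1; j <= n; ++j)
            for (unsigned k = 1; k <= n; ++k) {
                s.sum += fn(i, j, k);
                ++s.erased_calls;
            }
    return s;
}

// One type-erased call per (i, j): the innermost loop lives inside the
// erased callable, so only n * n calls cross the std::function boundary.
CallStats per_row(unsigned n) {
    std::function<double(double, double, double, unsigned)> fn =
        [](double a, double b, double c, unsigned count) {
            double sum = 0.0;
            for (unsigned i = 0; i < count; ++i)
                sum += kernel(a, b, c + i);
            return sum;
        };
    CallStats s{0.0, 0};
    for (unsigned i = 1; i <= n; ++i)
        for (unsigned j = 1; j <= n; ++j) {
            s.sum += fn(i, j, 1, n);
            ++s.erased_calls;
        }
    return s;
}
```

The two sums agree up to floating-point rounding, since the per-row version adds row subtotals instead of individual terms.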

Now in the real situation you may not be calling it with 300 adjacent values. But usually there is some pattern.

What we did above was move some of the logic controlling how fn was called inside fn. If you can do this enough, you can remove the virtual call overhead from consideration.

std::function overhead is mostly ignorable unless you want to call it on the order of billions of times per second, what I call "per-pixel" operations. Replace such operations with "per-scanline" -- per line of adjacent pixels -- and the overhead stops being a concern.

This can require exposing some of the logic on how the function object is used "in a header". Careful choice of what logic you expose can make it relatively generic in my experience.

Finally, note that it is possible to inline std::function and compilers are getting better at it. But it is hard, and fragile. Relying on it at this point is not wise.


There is another approach.

template<class F>
struct looper_t {
  F fn;
  double operator()( unsigned a, unsigned b, unsigned c ) const {
    double sum = 0;
    // Bounds run 1..a, 1..b, 1..c to match the original loops (and avoid log(0) at j = 0).
    for (unsigned i = 1; i <= a; ++i)
      for (unsigned j = 1; j <= b; ++j)
        for (unsigned k = 1; k <= c; ++k)
          sum += fn(i,j,k);
    return sum;
  }
};
template<class F>
looper_t<F> looper( F f ) {
  return {std::move(f)};
}

now we write our looper:

struct Loop3c {
  std::function<double(unsigned, unsigned, unsigned)> fn = looper(
    [](double a, double b, double c) {
      const auto subFn = [](double x, double y) { return x / (y+1); };
      return sin(a) + log(subFn(b, c));
    }
  );
  double sum = 0;
  void impl() {
    sum=fn(N,N,N);
  }
};

which erases the entire operation of 3 dimensional looping, instead of just the trailing dimension.
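For completeness, here is a self-contained sketch of this approach that can be compared against the hand-written loop. It assumes loop bounds starting at 1, as in the question's Loop3b, and the names `kernel`, `direct_sum` and `erased_sum` are hypothetical helpers added for the comparison. Only a single call crosses the std::function boundary; everything inside `looper_t::operator()` sees the concrete lambda type.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <utility>

inline double kernel(double a, double b, double c) {
    assert(b > 0 && c > -1);  // keeps the argument of log positive
    return std::sin(a) + std::log(b / (c + 1));
}

// Holds the callable by its concrete type F and runs the whole 3D loop.
template <class F>
struct looper_t {
    F fn;
    double operator()(unsigned a, unsigned b, unsigned c) const {
        double sum = 0.0;
        for (unsigned i = 1; i <= a; ++i)
            for (unsigned j = 1; j <= b; ++j)
                for (unsigned k = 1; k <= c; ++k)
                    sum += fn(i, j, k);
        return sum;
    }
};

template <class F>
looper_t<F> looper(F f) {
    return {std::move(f)};
}

// Reference: the plain triple loop without any wrapper.
double direct_sum(unsigned n) {
    double sum = 0.0;
    for (unsigned i = 1; i <= n; ++i)
        for (unsigned j = 1; j <= n; ++j)
            for (unsigned k = 1; k <= n; ++k)
                sum += kernel(i, j, k);
    return sum;
}

// Type erasure happens once, around the whole loop.
double erased_sum(unsigned n) {
    std::function<double(unsigned, unsigned, unsigned)> fn = looper(
        [](double a, double b, double c) { return kernel(a, b, c); });
    return fn(n, n, n);
}
```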

Yakk - Adam Nevraumont
  • The second approach, with template parameters, seems to me like a solution before refactoring, even though I don't like it, because I have less control over what is passed as the template parameter. I wish there were no implicit conversions and other such things. – Radek Mar 17 '17 at 16:56
  • @Radek I don't understand. You mean the `looper_t` one? The template parameters in that case is a lambda, which you have control over because you call `looper`. `looper_t` is then converted into a `std::function`, just like a lambda in your code is converted to a `std::function`. That is what `std::function` is *for*. – Yakk - Adam Nevraumont Mar 17 '17 at 17:14
  • Thank you, I got it. Months ago I only wanted to avoid writing `template` everywhere. You know, std::function is more readable. – Radek Mar 17 '17 at 17:58
  • @Radek Yes, quite often. `std::function` is type erasure. Type erasure "hides" exactly what type is inside, at modest runtime cost. Choosing the *point of type erasure*, the point where we forget what types we are actually working with and treat them generically, can get that modest runtime cost down to nearly zero. `looper` returns something that *isn't type erased*, but can be immediately. I could have had `looper` return `std::function`, but I'm defensive about premature type erasure. – Yakk - Adam Nevraumont Mar 17 '17 at 18:36