3

This is my naive implementation of a dot product:

float simple_dot(int N, float *A, float *B) {
    float dot = 0;
    for(int i = 0; i < N; ++i) {
        dot += A[i] * B[i];
    }

    return dot;
}

And this is using the C++ library:

float library_dot(int N, float *A, float *B) {
    return std::inner_product(A, A+N, B, 0);
}

I ran some benchmarks (code is here: https://github.com/ijklr/sse), and the library version is a lot slower. My compiler flags are -Ofast -march=native.

ijklr
    Use `0.0f` as the initial value. – Kerrek SB Mar 28 '17 at 20:44
  • What happens if you change 0 to 0f in the call to inner_product? – NathanOliver Mar 28 '17 at 20:44
  • You should be able to look at the library implementation as it's a template. – Richard Critten Mar 28 '17 at 20:44
  • I will try it now. – ijklr Mar 28 '17 at 20:45
  • 1
    @NathanOliver: A compiler error because `f` is not a valid octal digit. – Kerrek SB Mar 28 '17 at 20:45
  • Did you even compare if your two algorithms produce the same result for non-trivial inputs? – Kerrek SB Mar 28 '17 at 20:46
  • wow. I re-ran my test and that 0.0f changed everything. ➜ sse git:(master) ✗ ./dot-product Generating 33554432 element vectors. simple_dot 0.0186736 library_dot 0.018313 simple_prefetch_dot 0.0499649 unroll_dot 0.0223053 sse_dot 0.0189242 avx_dot 0.0181958 avx_unroll_dot 0.0183683 avx_unroll_prefetch_dot 0.0184743 – ijklr Mar 28 '17 at 20:48
  • @KerrekSB is it because it had to do the int conversion everytime? – ijklr Mar 28 '17 at 20:49
  • this is the benchmark before: Generating 33554432 element vectors. simple_dot 0.0185088 library_dot 0.12566 simple_prefetch_dot 0.0496267 unroll_dot 0.0227776 sse_dot 0.0191732 avx_dot 0.0184244 avx_unroll_dot 0.018839 avx_unroll_prefetch_dot 0.0190001 – ijklr Mar 28 '17 at 20:50
  • @ijklr: Yes, indeed. See the machine code I posted in my link. – Kerrek SB Mar 28 '17 at 20:51
  • You should know though, that neither the naive version nor the updated use of `inner_product` actually result in good code (with GCC and `-O3 -march=native -ffast-math`). GCC gives it a decent try and manages to use `vfmadd231ps`, but it does it with only one accumulator which means the loop is still limited by FMA *latency* instead of throughput. Even ICC does not unroll enough. Clang gets the `inner_product` right, but doesn't unroll the naive loop enough. Using intrinsics you can fix it on all compilers. – harold Mar 29 '17 at 09:33
  • Did you try using `std::transform_reduce` instead? You could even using the nonseq execution policy to vectorize it. – wcochran Jul 11 '23 at 04:07

1 Answer

8

Your two functions don't do the same thing. The algorithm uses an accumulator whose type is deduced from the initial value, which in your case (`0`) is `int`. Accumulating floats into an `int` is not just slower than accumulating into a `float`; it also produces a different result.

The equivalent of your raw loop is to use the initial value `0.0f`, or equivalently `float{}`.

(Note that `std::accumulate` is very similar in this regard.)
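Concretely, the fixed version of the question's `library_dot` would pass a `float` as the initial value so the deduced accumulator type matches the raw loop:

```cpp
#include <numeric>

float library_dot(int N, float *A, float *B) {
    // 0.0f (not 0) makes the deduced accumulator type float,
    // avoiding a float-to-int conversion on every element.
    return std::inner_product(A, A + N, B, 0.0f);
}
```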

Kerrek SB
  • Thanks. this helps! – ijklr Mar 28 '17 at 20:59
  • GCC does not unroll the loop and `-funroll-loops` does not break the dependency chain. Clang unrolls 4 times which is great. So at least with GCC `std::inner_product` is not optimal i.e. you have to optimize by hand anyway with reductions. – Z boson Mar 29 '17 at 09:33
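The multi-accumulator reduction that the comments describe can be sketched in plain C++. This is my own illustration (the name `unrolled_dot` and the unroll factor of 4 are choices, not code from the question); independent accumulators break the single FMA dependency chain, though intrinsics may still be needed for peak throughput on some compilers:

```cpp
float unrolled_dot(int N, const float *A, const float *B) {
    // Four independent accumulators so the additions do not
    // form one long serial dependency chain limited by FMA latency.
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= N; i += 4) {
        s0 += A[i]     * B[i];
        s1 += A[i + 1] * B[i + 1];
        s2 += A[i + 2] * B[i + 2];
        s3 += A[i + 3] * B[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    // Scalar tail for N not divisible by 4.
    for (; i < N; ++i) s += A[i] * B[i];
    return s;
}
```

Note that reassociating the sums this way changes the rounding order, which is why compilers only do it under `-ffast-math`/`-Ofast`.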