Same function, different performance results by using Google benchmark

Question

I was trying to familiarize myself with the Google Benchmark framework and decided to run a test with the famous pre/post increments. However, I found that two functions containing literally the same code give different time measurements.

My test consists of three functions:

incrementA, just a for-loop with nothing special;
incrementB, which is a verbatim copy of incrementA;
increment, which simply calls incrementA.

With these three functions, I wrote a fixture and then registered the tests.

#include <assert.h>
#include <stdint.h>

#include <benchmark/benchmark.h>

//---------------------------------------------------------------------

void incrementA(int COUNT) {
    volatile int a[COUNT+1];
    int i = 0;
    for (int j = 0; j < 1000; j++) {
        i = 0;
        for (int k = 0; k < COUNT; k++) {
            a[i++] = k + j;
        }
    }
}

void incrementB(int COUNT) {
    volatile int a[COUNT+1];
    int i = 0;
    for (int j = 0; j < 1000; j++) {
        i = 0;
        for (int k = 0; k < COUNT; k++) {
            a[i++] = k + j;
        }
    }
}

void increment(int COUNT) {
    incrementA(COUNT);
}

//---------------------------------------------------------------------

class PrePostIncrement : public ::benchmark::Fixture
{
public:
    void SetUp(const ::benchmark::State& st)
    {
        size = st.range(0);
    }

    void TearDown(const ::benchmark::State&)
    {
    }

    static void CustomArguments(benchmark::internal::Benchmark* b)
    {
        size_t minSize = 8;
        for (int i = 0; (1 << (i + minSize)) < (1 << 20); ++i)
            b->Arg(1 << (i + minSize));
    }
    int size;
};


//---------------------------------------------------------------------


#define REGISTER_TEST(IncrementFunction)                                                \
    using IncrementFunction##_Test = PrePostIncrement;                                  \
    BENCHMARK_DEFINE_F(IncrementFunction##_Test, Obj)(benchmark::State& state)          \
    {                                                                                   \
        while (state.KeepRunning())                                                     \
        {                                                                               \
            IncrementFunction(size);                                                    \
        }                                                                               \
    }                                                                                   \
    BENCHMARK_REGISTER_F(IncrementFunction##_Test, Obj)->Apply(IncrementFunction##_Test::CustomArguments)->Unit(benchmark::kMillisecond);


REGISTER_TEST(incrementA);
REGISTER_TEST(incrementB);
REGISTER_TEST(increment);

BENCHMARK_MAIN();

Compiled with:

$ g++ increment_benchmark.cpp -std=gnu++14 -march=native -pthread -O3 -I/home/user/software/benchmark/include -L/home/user/software/benchmark/build/src -Wl,-rpath=/home/user/software/benchmark/build/src -lbenchmark

The results are inconsistent: incrementB is consistently faster than incrementA even though the code is identical, and swapping the order of the tests changes the results.

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
incrementA_Test/Obj/256         0.125 ms        0.125 ms         5499
incrementA_Test/Obj/512         0.244 ms        0.244 ms         2868
incrementA_Test/Obj/1024        0.482 ms        0.482 ms         1439
incrementA_Test/Obj/2048        0.971 ms        0.971 ms          715
incrementA_Test/Obj/4096         1.91 ms         1.91 ms          361
incrementA_Test/Obj/8192         3.82 ms         3.82 ms          180
incrementA_Test/Obj/16384        7.77 ms         7.77 ms           90
incrementA_Test/Obj/32768        15.6 ms         15.6 ms           45
incrementA_Test/Obj/65536        30.5 ms         30.5 ms           23
incrementA_Test/Obj/131072       61.7 ms         61.7 ms           11
incrementA_Test/Obj/262144        122 ms          122 ms            6
incrementA_Test/Obj/524288        245 ms          245 ms            3
incrementB_Test/Obj/256         0.084 ms        0.084 ms         8246
incrementB_Test/Obj/512         0.166 ms        0.166 ms         4212
incrementB_Test/Obj/1024        0.321 ms        0.321 ms         2175
incrementB_Test/Obj/2048        0.629 ms        0.629 ms         1109
incrementB_Test/Obj/4096         1.23 ms         1.23 ms          564
incrementB_Test/Obj/8192         2.42 ms         2.42 ms          288
incrementB_Test/Obj/16384        4.84 ms         4.84 ms          142
incrementB_Test/Obj/32768        9.63 ms         9.63 ms           72
incrementB_Test/Obj/65536        20.3 ms         20.3 ms           34
incrementB_Test/Obj/131072       40.8 ms         40.8 ms           17
incrementB_Test/Obj/262144       81.7 ms         81.7 ms            8
incrementB_Test/Obj/524288        164 ms          164 ms            4
increment_Test/Obj/256          0.126 ms        0.126 ms         5551
increment_Test/Obj/512          0.244 ms        0.244 ms         2861
increment_Test/Obj/1024         0.482 ms        0.482 ms         1453
increment_Test/Obj/2048         0.958 ms        0.958 ms          721
increment_Test/Obj/4096          1.91 ms         1.91 ms          364
increment_Test/Obj/8192          3.82 ms         3.82 ms          183
increment_Test/Obj/16384         7.63 ms         7.63 ms           91
increment_Test/Obj/32768         15.2 ms         15.2 ms           46
increment_Test/Obj/65536         30.5 ms         30.5 ms           23
increment_Test/Obj/131072        61.0 ms         61.0 ms           11
increment_Test/Obj/262144         122 ms          122 ms            6
increment_Test/Obj/524288         244 ms          244 ms            3
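
As a sanity check on the table above, the ratio between the incrementA and incrementB timings can be computed per size. The sketch below is only an illustration: the helper names `ratios` and `ratio_range` are made up here, and the numbers are copied by hand from the Time column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Times in ms, copied from the Time column above (sizes 256 ... 524288).
static const std::vector<double> kIncrementA = {
    0.125, 0.244, 0.482, 0.971, 1.91, 3.82, 7.77, 15.6, 30.5, 61.7, 122, 245};
static const std::vector<double> kIncrementB = {
    0.084, 0.166, 0.321, 0.629, 1.23, 2.42, 4.84, 9.63, 20.3, 40.8, 81.7, 164};

// Element-wise ratio a[i] / b[i] over the common length of both vectors.
std::vector<double> ratios(const std::vector<double>& a,
                           const std::vector<double>& b) {
    std::vector<double> r;
    const std::size_t n = std::min(a.size(), b.size());
    r.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        r.push_back(a[i] / b[i]);
    }
    return r;
}

// Smallest and largest value of a non-empty vector, as {min, max}.
std::pair<double, double> ratio_range(const std::vector<double>& r) {
    assert(!r.empty());
    const auto mm = std::minmax_element(r.begin(), r.end());
    return {*mm.first, *mm.second};
}
```

For these numbers the ratio stays between roughly 1.47 and 1.62 at every size, so the gap does not seem to depend on the array size.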

Initially I thought that the CPU frequency scaling governor (powersave) was influencing the results, but after changing it to performance, the results were the same.
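
One way to confirm that the governor change took effect on a given core is to read it back from sysfs. This is a minimal sketch that assumes the usual Linux cpufreq layout under `/sys/devices/system/cpu`; the helper names `read_first_line`, `governor_path` and `scaling_governor` are hypothetical, and on systems without cpufreq the result is simply an empty string.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns the first line of the file at `path`, or an empty string if the
// file cannot be opened.
std::string read_first_line(const std::string& path) {
    assert(!path.empty());
    std::ifstream in(path);
    std::string line;
    if (in) {
        std::getline(in, line);
    }
    return line;
}

// Assumption: standard Linux cpufreq sysfs layout.
std::string governor_path(unsigned cpu) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/scaling_governor";
}

// Governor name for one core, e.g. "performance" or "powersave";
// empty if the sysfs entry is missing or unreadable.
std::string scaling_governor(unsigned cpu) {
    return read_first_line(governor_path(cpu));
}
```

Calling `scaling_governor(n)` for each core the benchmark may run on shows whether any core is still left on powersave.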

Just for reference, I compiled Google Benchmark myself (bf585a2 [v1.5.2]), and my library and compiler versions are:

$ ldd --version
ldd (Ubuntu GLIBC 2.27-3ubuntu1.2) 2.27
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
$ g++ --version
g++ (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I am pretty sure there are other ways of writing this test, and I welcome any suggestions, but my main interest is to know what is wrong with my code and why I get different results.

Have you tried aligning both methods with `__attribute__((aligned(4096)))`? — mrks, Dec 23 '20 at 18:28
I bet you should use benchmark::DoNotOptimize. I do not think volatile has the same meaning — PiotrNycz, Dec 23 '20 at 18:37
It would be great if you'd post a link to a reduced test case on quick-bench.com that shows the results (your code times out there) — xaxxon, Dec 23 '20 at 19:36
*"by swapping the order of the tests, I get different results."* -- this is not uncommon. That's why, when comparing two approaches in the same execution, it can be useful to start with an untimed run to prime the caches. You are asking why this happens? — JaMiT, Dec 23 '20 at 19:52
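
Two of the suggestions above (aligning both functions and starting with an untimed run) can be tried outside Google Benchmark with a sketch like the following. It is not the original benchmark: `fillA`, `fillB`, `best_time_ms` and `entry_offset` are hypothetical names, the loop writes into a caller-provided buffer instead of a stack array, the 64-byte alignment is an assumed value rather than the 4096 from the comment, and the `__attribute__` syntax assumes GCC or Clang.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Same loop body as incrementA/incrementB. Both copies are forced to start
// on a 64-byte boundary and are never inlined.
__attribute__((noinline, aligned(64)))
void fillA(volatile int* a, int count) {
    for (int j = 0; j < 1000; j++) {
        int i = 0;
        for (int k = 0; k < count; k++) a[i++] = k + j;
    }
}

__attribute__((noinline, aligned(64)))
void fillB(volatile int* a, int count) {
    for (int j = 0; j < 1000; j++) {
        int i = 0;
        for (int k = 0; k < count; k++) a[i++] = k + j;
    }
}

using FillFn = void (*)(volatile int*, int);

// Best wall-clock time in ms over `repeats` timed calls, after one untimed
// warm-up call that touches the buffer and the code first.
double best_time_ms(FillFn fn, int count, int repeats) {
    assert(count > 0 && repeats > 0);
    std::vector<int> buf(static_cast<std::size_t>(count) + 1);
    volatile int* a = buf.data();
    fn(a, count);  // untimed warm-up run
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn(a, count);
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(
            best, std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return best;
}

// Offset of a function's entry point within a 64-byte block (0 if aligned).
std::uintptr_t entry_offset(FillFn fn) {
    return reinterpret_cast<std::uintptr_t>(fn) % 64;
}
```

Comparing `best_time_ms(fillA, n, r)` with `best_time_ms(fillB, n, r)` in both call orders, with and without the `aligned` attribute, would show whether code placement or warm-up explains any of the difference. It does not cover the `benchmark::DoNotOptimize` suggestion.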
