
I'm using Google benchmark to time a function, but I need to see how it performs when working with a "cold" cache. I know the benchmark library will run a function until the timing is steady, but I would like this steady timing to incorporate the fact that the cache is cold. This is roughly what my benchmarking looks like:

#include <benchmark/benchmark.h>
#include <cstdlib>   // rand
#include <tuple>

int my_func(int);    // the function under test, defined elsewhere

// Stand-ins for a buffer big enough to evict every cache level.
constexpr int N = 1 << 24;
int huge_array[N];

template <class ...Args>
void BM_MyFunc(benchmark::State& state, Args&&... args) {
    auto args_tuple = std::make_tuple(std::move(args)...);
    const int arg = std::get<0>(args_tuple);
    for (auto _ : state) {
        // flush the cache by writing through a huge array
        for (int i = 0; i < N; i++) {
            huge_array[i] = rand();
        }
        my_func(arg);
    }
}

BENCHMARK_CAPTURE(BM_MyFunc, my_func_with_42, 42);

The problem is that putting the cache flush inside for (auto _ : state) means the flushing itself is included in the timing results. If I move the flush outside that loop, the cache is only flushed once: the library's repeated warm-up iterations then re-warm the cache, so my_func is no longer measured cold.

Is there some way to have a "per function-call" setup that doesn't contribute to the timing of said function? The documentation doesn't seem to cover this particular use case.
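For reference, Google Benchmark does document two escape hatches for per-iteration setup: wrapping the setup in state.PauseTiming()/state.ResumeTiming(), or manual timing via ->UseManualTime() on the registration plus state.SetIterationTime() in the loop. As a library-free illustration of the manual-timing idea, here is a minimal sketch (my_func, the iteration count, and the eviction-buffer size are all assumptions) that evicts the cache before each call but keeps only the call itself inside the timed region:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

static int my_func(int x) { return x * 2; }  // hypothetical function under test

// Times my_func over `iters` iterations, evicting the cache before each call
// while keeping the eviction outside the timed region. Returns average ns/call.
inline std::int64_t time_cold(int iters) {
    constexpr std::size_t N = 1 << 22;        // 16 MiB of ints, assumed larger than the LLC
    static std::vector<int> huge_array(N, 1);
    volatile int sink = 0;                    // volatile keeps the reads from being optimized away
    std::int64_t total_ns = 0;
    for (int i = 0; i < iters; ++i) {
        // Setup: touch one int per 64-byte line to evict other data. Not timed.
        for (std::size_t j = 0; j < N; j += 16)
            sink = sink + huge_array[j];

        auto t0 = std::chrono::steady_clock::now();
        sink = sink + my_func(42);            // only this call is timed
        auto t1 = std::chrono::steady_clock::now();
        total_ns +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }
    return iters > 0 ? total_ns / iters : 0;
}
```

With the Google Benchmark manual-timing API, the same shape would compute the elapsed seconds per iteration and pass them to state.SetIterationTime(); note the caveat in the comments below that timing a single short call is dominated by clock overhead and serialization.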

  • You can stop/restart the clock each iteration, but that introduces huge timing overhead if `my_func` is short. (Like a few hundred clock cycles, about the same order of magnitude as a cache miss). Actually even worse than that, like 300 *nanoseconds*, probably 1k clocks. [Google benchmark state.PauseTiming() and state.ResumeTiming() take a long time](//stackoverflow.com/q/56660845). Fine-grained timing is hard because it has to serialize out-of-order exec, so doesn't reflect true costs as part of surrounding code. ([Idiomatic way of performance evaluation?](//stackoverflow.com/q/60291987)) – Peter Cordes Nov 26 '22 at 09:54
  • There are vastly less expensive ways to flush specific parts of the cache, like x86 `_mm_clflushopt( &something )` from `<immintrin.h>` if you only need to flush a few known lines. (Portably, even just `memset` on a huge array would be less expensive than calling `rand()` 8 times per cache line if you used uint64_t or something. Except memset might use NT stores to avoid cache pollution, so maybe use a `(volatile int*)` access to one int every 64 bytes.) – Peter Cordes Nov 26 '22 at 09:57
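The targeted flush suggested in the comments can be sketched as a small helper. This is an assumption-laden sketch: it assumes x86-64 with 64-byte cache lines, and uses the baseline SSE2 `_mm_clflush` (which compiles without extra flags) rather than the comment's `_mm_clflushopt` (which needs CLFLUSHOPT support and `-mclflushopt`); on other architectures it degrades to a no-op:

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2)
#endif

// Flush every cache line covering [p, p+bytes). On non-x86 targets this is
// a no-op, so treat it as an x86-specific sketch, not a portable utility.
inline void flush_range(const void* p, std::size_t bytes) {
#if defined(__x86_64__) || defined(_M_X64)
    auto addr = reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t{63};
    auto end  = reinterpret_cast<std::uintptr_t>(p) + bytes;
    for (; addr < end; addr += 64)              // one flush per 64-byte line
        _mm_clflush(reinterpret_cast<const void*>(addr));
    _mm_mfence();  // make sure the flushes retire before the timed region starts
#else
    (void)p;
    (void)bytes;
#endif
}
```

Called as `flush_range(&my_working_set, sizeof my_working_set);` at the top of the benchmark loop, this evicts only the data `my_func` actually touches, which is far cheaper than streaming through a huge array.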

0 Answers