In short, my question is: should Sysbench eliminate the effect of the CPU cache when measuring memory read/write performance, similar to how the effect of memory caching is eliminated when measuring disk performance?
If the answer is no, does that mean Sysbench only cares about the end-to-end performance, cache effects included?
If the answer is yes, does Sysbench mitigate the cache somewhere (and I missed it), or does it simply not do so?
P.S. By "effect of cache", I mean: when the user-defined memory_block_size is smaller than the cache size, the whole memory block (or a large part of it) stays resident in the CPU cache, so the measured "memory" performance is really cache performance.
===
And here's some background information:
I am trying to run the memory benchmark in Sysbench, and this is how Sysbench implements its random memory read test:
int event_rnd_read(sb_event_t *req, int tid)
{
  (void) req; /* unused */

  for (ssize_t i = 0; i <= max_offset; i++)
  {
    size_t offset = (size_t) sb_rand_default(0, max_offset);
    size_t val = SIZE_T_LOAD(buffers[tid] + offset);
    (void) val; /* unused */
  }

  return 0;
}
The macro SIZE_T_LOAD expands to:
# define SIZE_T_LOAD(ptr) ck_pr_load_32((uint32_t *)(ptr))
when sizeof(size_t) is 4 bytes. ck_pr_load_32 is an atomic memory load that can hardly be optimized away by the compiler, according to this link. And max_offset is set to memory_block_size / SIZEOF_SIZE_T - 1, where memory_block_size is in most cases somewhere near 4KB.
All the code above is copied from https://github.com/akopytov/sysbench/blob/master/src/tests/memory/sb_memory.c
So, as far as I can see, Sysbench does nothing special to eliminate the effect of the cache in its random memory read test. Is that true? And if so, is that reasonable?