
I'm currently reading Ulrich Drepper's "What every programmer should know about memory". The relevant chapter is available as HTML here; PDFs of the entire text are also easy to find.

To explain the effects of the CPU caches on performance, he goes through a couple of variations of walking a singly linked list. The two main scenarios he compares are sequential (each item links to its right-hand neighbor) and random.
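
For concreteness, here is a minimal sketch of that kind of pointer-chasing benchmark (my own illustration, not Drepper's actual test harness; the node layout follows his `struct l` with `NPAD=0`):

```c
#include <stdlib.h>

/* With NPAD=0 each element is a single pointer, i.e. 8 bytes in 64-bit mode. */
struct l {
    struct l *next;
    /* long pad[NPAD];  -- Drepper pads the element out with NPAD words */
};

/* Link arr[0..n-1] into one cycle.
 * sequential != 0: each node points to its right-hand neighbor.
 * sequential == 0: the nodes are chained in a random order, so every
 *                  load lands on an unpredictable cache line. */
static void build_list(struct l *arr, size_t n, int sequential)
{
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; ++i)
        order[i] = i;

    if (!sequential) {
        /* Fisher-Yates shuffle of the visiting order. */
        for (size_t i = n - 1; i > 0; --i) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }

    for (size_t i = 0; i < n; ++i)
        arr[order[i]].next = &arr[order[(i + 1) % n]];

    free(order);
}

/* Walk the list. Each load's address comes from the previous load, so the
 * loop is bound by memory (and TLB) latency, not bandwidth. */
static struct l *walk(struct l *head, size_t steps)
{
    struct l *p = head;
    while (steps--)
        p = p->next;
    return p;
}

int main(void)
{
    size_t n = (size_t)1 << 20;        /* working set = n * sizeof(struct l) bytes */
    struct l *arr = malloc(n * sizeof *arr);

    build_list(arr, n, 0);             /* 0 = random chaining, 1 = sequential */
    struct l * volatile sink = walk(arr, 10 * n);   /* volatile: keep the walk */
    (void)sink;

    free(arr);
    return 0;
}
```

Drepper's measurements then plot cycles per element (and, in figure 3.16, the L2 miss ratio) while the working set `n * sizeof(struct l)` is swept over powers of two.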

The bit I find difficult to understand is figure 3.16, where he plots the ratio of L2 cache misses against the size of the list. For a randomly linked list the ratio is zero as long as the list fits into L2, and beyond that (at 2^19 bytes) it rises quickly. So far, so plausible. But then it doesn't keep rising; it falls, if slowly, between 2^22 and 2^26 bytes. After that it rises steeply again.

[Figure 3.16, taken from Ulrich Drepper's "What every programmer should know about memory"]

The author describes the phenomenon but doesn't really seem to explain it.

I myself can't think of any reason for this counterintuitive behavior.

Anyone able to enlighten me?

Paul Panzer
  • IIRC, this was tested on a P4, or at the latest a Core 2. So L2 was the last-level cache; there wasn't a larger shared cache outside that. – Peter Cordes Aug 08 '19 at 21:07
  • You do remember correctly, except one of the P4s he was working on apparently did have an L3. I'm not 100% certain which P4 he used here. He says somewhere _"Unless otherwise specified, all measurements are made on a Pentium 4 machine in 64-bit mode which means the structure l with NPAD=0 is eight bytes in size."_ But later, though not in the context of figure 3.16, he says – Paul Panzer Aug 08 '19 at 21:37
  • _"This time we see the measurement from three different machines. The first two machines are P4s, the last one a Core2 processor. The first two differentiate themselves by having different cache sizes. The first processor has a 32k L1d and an 1M L2. The second one has 16k L1d, 512k L2, and 2M L3. The Core2 processor has 32k L1d and 4M L2."_ Why is it relevant? – Paul Panzer Aug 08 '19 at 21:38
  • It's most relevant in that it's different from modern CPUs ("core i7" series: Nehalem and later) where the main HW prefetch is in L2. And that IvB and later have an adaptive replacement policy in L3 that might help get some hits for working sets larger than the cache size (http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/). Although that would only affect L2 miss latency, not rate, because it's in a cache farther out than L2. – Peter Cordes Aug 08 '19 at 23:26
  • @PeterCordes The article says `Unless otherwise specified, all measurements are made on a Pentium 4 machine in 64-bit mode...` and later it says it has a 16kB L1d and 1MB L2. The `NPAD=0` curve in Fig 3.11 is identical to the one in Fig 3.15, which is used in the same section as the figure from the question. Also, the `NPAD=15` curve in Fig 3.11 is not the same as any of the curves in Fig 3.14, so the 3 machines from 3.14 are all different from the one used in the other figures. The only P4 processors with a 16kB L1d and 1MB L2 are the 90 nm Mobile Pentium 4 HT, so it must be one of them. – Hadi Brais Aug 09 '19 at 00:36
  • We don't know whether the random walk curve would be the same in every run. That ~5% fall is not really that big and may depend on the exact access pattern produced from the random number generator in that particular run. – Hadi Brais Aug 09 '19 at 00:40
  • 1
    @HadiBrais I'm not a statistician, but having 4 consecutive falls does look significant to me. For example, if you do a one-sided Spearman rank on the data points 2^22 to 2^26 you get a p<1%. I suppose one has to correct for the fact that this fall might have been observed at 6 different positions, but this still leaves p<5%. – Paul Panzer Aug 09 '19 at 01:11
  • There could be a microarchitectural explanation for that, but it's hard to tell what's happening without repeating the experiments on the same processor and checking whether that curve is reproducible. – Hadi Brais Aug 09 '19 at 01:19
  • @HadiBrais Fair enough, though I wonder whether the author would have described the feature (_"The curve has a similar form to the one in Figure 3.15: it rises quickly, declines slightly, and starts to rise again."_) had he suspected it was a one-off. By the way, impressive forensics on the identity of that P4 ;-) – Paul Panzer Aug 09 '19 at 01:26
  • It's not clear to me what the author meant by that sentence. The curve in 3.15 never declines; it just keeps going up. This suggests that the slight reduction in L2 misses shown in 3.16 does *not* result in an observable reduction in cycles per element as shown in 3.15. So that sentence doesn't make much sense to me. – Hadi Brais Aug 09 '19 at 01:38
  • @HadiBrais maybe he means that the shoulder of weaker growth in 3.15 corresponds to the fall in 3.16, if one thinks of 3.15 as the result of two factors: 3.16 and a more steadily growing cost (TLB misses?). – Paul Panzer Aug 09 '19 at 02:21
  • I'm thinking that the linked list walking benchmark is latency-bound (i.e., the current element needs to be accessed before the address of the next element is determined), which means that if the number of L2 misses is reduced, cycles per element should also be reduced, even if TLB misses continue to occur. The data TLB in that processor is fully associative and can hold up to 64 4KB entries = 256KB, which is smaller than the L2 size. In a random walk, the probability of missing in the TLB is larger than that of missing in the L2 cache. – Hadi Brais Aug 09 '19 at 12:32
  • This _could_ be due to a different memory layout / alignment (heap algorithm) for different working set or object sizes, resulting in the need to overwrite more cache locations (there's a name for that, but I forgot..). See [this](http://danluu.com/3c-conflict/) for example. – Danny_ds Aug 09 '19 at 13:56
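
The effect the last comment is reaching for is usually called conflict misses (that also appears to be the topic of the linked danluu.com post). Whether it explains the dip in figure 3.16 is pure speculation, but the mechanism itself is easy to illustrate: in a set-associative cache, addresses that are a multiple of (number of sets × line size) apart compete for the same set, so an allocator that hands out large power-of-2-aligned blocks can force evictions long before capacity runs out. A minimal sketch; the 512 KiB / 8-way / 64-byte-line geometry is an assumption for illustration, not necessarily the cache of Drepper's P4:

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed cache geometry, for illustration only: 512 KiB, 8-way, 64-byte lines.
 * Number of sets = 512 KiB / (8 * 64 B) = 1024. */
#define LINE_SIZE   64u
#define NUM_WAYS    8u
#define CACHE_SIZE  (512u * 1024u)
#define NUM_SETS    (CACHE_SIZE / (NUM_WAYS * LINE_SIZE))

/* The set an address maps to: line number modulo the number of sets. */
static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    /* Hypothetical allocations aligned to 1 MiB boundaries, e.g. because the
     * allocator rounds large objects up to big power-of-2 blocks. */
    uintptr_t base = 0x40000000u;
    for (int i = 0; i < 16; ++i) {
        uintptr_t a = base + (uintptr_t)i * (1u << 20);
        /* All sixteen addresses land in the same set: only NUM_WAYS of them
         * can be cached at once, regardless of how much capacity is free. */
        printf("block %2d at %#lx -> set %u\n", i, (unsigned long)a, set_index(a));
    }
    return 0;
}
```

All sixteen 1 MiB-apart addresses print the same set index, so at most 8 of them (the associativity) can be resident at once, even though the cache as a whole holds 8192 lines.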

0 Answers