
There were a few questions on SO, such as this one, about performance degradation when arrays or matrices happen to align with cache sizes. The idea of how to solve this in hardware has been around for decades. Why, then, don't modern computers interleave caches to reduce the consequences of super-alignment?
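For concreteness, a minimal sketch of the kind of access pattern those questions describe (the sizes and the helper name are illustrative, not taken from any particular question):

```
/* Illustrative only: a strided walk whose stride is a large power of two.
 * With 2048 doubles (16 KiB) per row, consecutive iterations touch
 * addresses a multiple of the cache way size apart, so they all map to
 * the same set of a typical set-associative L1 and evict each other. */
#include <stddef.h>

#define N 2048

static double m[N][N];

double sum_column(size_t col)        /* hypothetical helper */
{
    double s = 0.0;
    for (size_t row = 0; row < N; ++row)
        s += m[row][col];            /* 16 KiB stride between accesses */
    return s;
}
```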

Michael
  • This question appears to be off-topic because it is about CPU architecture and design, and is not programming related according to the [help] guidelines. – Ken White Nov 27 '13 at 00:06

2 Answers


Most modern caches are already banked, but banking (like the memory banking your link describes) is meant to improve access timing and sequential-access bandwidth, not to solve conflict problems like the one you describe.

The question you link was solved as bad coding (traversing row-wise instead of column-wise), but in general - if you want to solve issues arising from bad alignment in caches - you're looking for cache skewed associativity (example paper). With this method, the set mapping is not based on the plain set-index bits alone, but instead involves a shuffle based on the tag bits; this spreads the data better in cases where it would otherwise conflict over the same sets. Note that this wouldn't really help if you're using up your entire cache - it only helps in corner cases where a few "hot" sets are overused while others are left mostly untouched.
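To make the idea concrete, here is a toy sketch of the indexing difference (my own illustration, not the scheme from the linked paper or any shipped design): a conventional cache indexes every way with the same low address bits, while a skewed cache derives a per-way index by mixing in tag bits, so addresses that conflict in one way need not conflict in the others.

```
#include <stdint.h>

#define OFFSET_BITS 6u                      /* 64-byte lines (illustrative) */
#define SET_BITS    6u                      /* 64 sets (illustrative)       */
#define SET_MASK    ((1u << SET_BITS) - 1u)

/* Conventional set-associative indexing: every way uses the same set. */
static uint32_t index_conventional(uint64_t addr)
{
    return (uint32_t)(addr >> OFFSET_BITS) & SET_MASK;
}

/* Skewed indexing: each way XORs in a different slice of the tag bits,
 * so addresses that collide in one way are spread out in the others. */
static uint32_t index_skewed(uint64_t addr, unsigned way)   /* way: 0..assoc-1 */
{
    uint32_t set = (uint32_t)(addr >> OFFSET_BITS) & SET_MASK;
    uint64_t tag = addr >> (OFFSET_BITS + SET_BITS);
    /* Pick a different tag slice per way; fine for a handful of ways. */
    return set ^ (uint32_t)((tag >> (way * SET_BITS)) & SET_MASK);
}
```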

However, this is not common practice as far as I know, because it's a very specific problem that can easily be solved in code (or by a compiler), and is therefore probably not worth a HW solution.
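For completeness, a sketch of the usual code-level fix: break the power-of-two stride by padding each row (the padding amount below is a guess and would be tuned for the target cache geometry):

```
/* Sketch of the code-level fix: pad each row so the stride between
 * vertically adjacent elements is no longer a multiple of the cache
 * way size, which spreads a column walk across many sets. */
#define N   2048
#define PAD 8                 /* illustrative: (N + PAD) * sizeof(double)
                                 is 16 KiB + 64 B, no longer set-aligned */

static double padded[N][N + PAD];   /* only columns 0..N-1 are used */
```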


Edit:
Did a few more searches following Paul's question - it seems that the closer, latency-critical caches don't use this (or at least it isn't published, but I guess that if it were done it would appear in optimization guides, since it's important for performance tuning and easily detectable). That would probably include the L1 caches and the TLBs, which have to be queried on every memory access.

However, according to this link, it is done in at least the L3 cache of some Intel chips: http://www.realworldtech.com/sandy-bridge/8/

> There is one slice of the L3 cache for each core, and each slice can provide half a cache line (32B) to the data ring per cycle. All physical addresses are distributed across the cache slices with a single hash function. Partitioning data between the cache slices simplifies coherency, increases the available bandwidth and reduces hot spots and contention for cache addresses.

So it is used at least for larger-scale, less latency-critical caches.
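As a rough illustration of the principle only (the bit mixing below is invented; Intel's actual slice-hash function isn't given in the article), selecting a slice by hashing the physical address can be as simple as an XOR-fold:

```
#include <stdint.h>

#define NUM_SLICES 4u         /* assumed: one L3 slice per core, 4 cores */

/* Toy address-to-slice hash in the spirit of the description above.
 * The bit mixing is invented for illustration; it is NOT Intel's hash. */
unsigned slice_of(uint64_t paddr)
{
    uint64_t x = paddr >> 6;  /* drop the 64-byte line offset */
    x ^= x >> 32;             /* XOR-fold the upper address bits down, so   */
    x ^= x >> 16;             /* power-of-two strides still spread across   */
    x ^= x >> 8;              /* slices instead of piling onto one          */
    x ^= x >> 4;
    x ^= x >> 2;
    return (unsigned)(x & (NUM_SLICES - 1u));
}
```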

Leeor
  • Yes, skewed associativity is what I meant. So why hasn't this solution been utilized? Since the problem has hit many people, and the software solution may require some care on the programmers' side, IMO it would make sense to implement it at the HW level. – Michael Nov 27 '13 at 01:02
  • You'll have to ask the CPU companies about that; as I said, I can only assume that the potential gain wasn't worth it. It does stand to reason, though, that they would use solid benchmarks to evaluate these things, and not broken code. – Leeor Nov 27 '13 at 01:18
  • Prime modulo indexing has also been proposed. @Michael For L1 access skewed associativity can add latency (folding such into the AGU or the cache array indexing could reduce this issue) and introduces aliasing issues. Why such isn't used for L2 or TLBs is a question at the [Computer Architecture](http://area51.stackexchange.com/proposals/50430/computer-architecture/) Area51 proposal. You might find [this page](https://semipublic.comp-arch.net/wiki/Skewed_associativity) (and other parts of that wiki) interesting. –  Nov 27 '13 at 02:54
  • @PaulA.Clayton, I'm not sure adding a simple xor on the address bits would cost so much, but if it does, then it's just as critical in the TLB lookup path. Agree about the L2, though I see it's done by Intel in the L3 at least (see edit). – Leeor Nov 27 '13 at 16:53
  • @Leeor It is not clear that SandyBridge's L3 hash function is particularly complex; modulo a power of two, e.g., is still a hash function. POWER4 (and 5) had 3 L2 slices and used modulo 3 of "many" address bits to select the slice (and had reduced bandwidth to the farthest slice relative to a core), but that is more or less ordinary indexing. TLBs and post-translation L2 do not have aliasing issues and tend to benefit more from conflict reduction (the fewer indexing bits for an L1 TLB relative to L1 cache might also impact design choices). –  Nov 27 '13 at 23:50
  • [Quoting a semi-retired computer architect](https://groups.google.com/forum/#!original/comp.arch/JT0MyFqJ1OM/4XHP8Q_CckIJ): "In the designs I am familiar with, adding the skew would harm the cycle time, greatly or entirely ameliorating any performance advantage from the reduced conflicts." (Mitch Alsup worked on x86 at AMD, SPARC at Ross, M88K at Motorola, etc.) –  Nov 27 '13 at 23:54
  • @PaulA.Clayton, to wrap things up - I believe we're in agreement that the OP's problem requires skewing, but that it's hardly ever done (at least with a sufficiently decent distribution) due to complexity/timing. My claim is that it's probably not worth it just to handle a few corner cases of bad alignment that can be solved by simply rewriting the code. Oh, and computer architects never retire, they just wrap around :) – Leeor Nov 28 '13 at 00:04

Interleaving solves a different problem (memory access delays). Since caches are fast, interleaving doesn't really help. For cache alignment issues, the traditional solution is to increase the associativity.

Eric Brown