
Let's assume cache lines are 64 bytes wide and that I have two arrays, a and b, each of which fills a cache line and is aligned to a cache line. Let's also assume that both arrays are in the L1 cache, so when I read from them I don't get a cache miss.

float a[16];  //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16];  //64 byte aligned

I read a[0]. My question is: is it now faster to read a[1] than to read b[0]? In other words, is it faster to read from the most recently used cache line?

Does the set matter? Let's now assume that I have a 32 KB L1 data cache which is 4-way set associative. So if a and b are 8192 bytes apart, they end up in the same set. Will this change the answer to my question?
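
To make the arithmetic explicit, this is the set calculation I'm assuming (just a sketch of the hypothetical 32 KB, 4-way cache above, not any particular CPU):

#include <stdint.h>

//Hypothetical 32 KB, 4-way L1 with 64-byte lines:
//32768 / (4 * 64) = 128 sets, so the critical stride is 128 * 64 = 8192 bytes.
static inline unsigned l1_set_index(uintptr_t addr) {
    return (addr / 64) % 128;  //line number modulo number of sets
}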

Another way to ask my question (which is what I really care about) is with regard to reading a matrix.

In other words, which of these two code options will be more efficient, assuming the matrix M fits in the L1 cache, is 64-byte aligned, and is already in the L1 cache?

float M[16][16]; //64 byte aligned

Version 1:

for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[i][j];
    }
}

Version 2:

for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[j][i];
    }
}

Edit: To make this clear, because of SSE/AVX let's assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache, so there will not be a cache miss)?
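
To make the access pattern concrete, here is a minimal sketch of what I mean (using GCC/Clang AVX intrinsics; the function is just for illustration):

#include <immintrin.h>

float a[16] __attribute__((aligned (64)));  //as declared above
float b[16] __attribute__((aligned (64)));

__m256 read_pattern(void) {
    __m256 va0 = _mm256_load_ps(&a[0]);  //first eight floats of a's cache line
    __m256 va1 = _mm256_load_ps(&a[8]);  //same cache line as the previous load
    __m256 vb0 = _mm256_load_ps(&b[0]);  //different cache line, also already in L1
    return _mm256_add_ps(_mm256_add_ps(va0, va1), vb0);
}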

Edit: I'm interested in all processors since Intel Core 2 and Nehalem, but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.

Z boson

4 Answers


With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given float a[16], b[16]; with a[0] recently loaded, a[1] in the same cache line as a[0], and b[0] not recently loaded but still in L1 cache, there will be no performance difference between loading a[1] and b[0] in the absence of some other factor.

One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to a[1] or b[0].
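
As a contrived sketch of that situation (assuming, hypothetically, that a and b happen to land a multiple of 4 KiB apart so that their low address bits match; the exact bits compared vary by processor):

float a[16] __attribute__((aligned (64)));
float b[16] __attribute__((aligned (64)));  //pretend a and b are a multiple of 4096 bytes apart

float false_dependence(void) {
    b[0] = 1.0f;  //store still in flight...
    return a[0];  //...this load's low address bits match the store's, so it may be
                  //delayed until the full addresses are compared, even though
                  //a[0] and b[0] do not overlap
}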

It is also theoretically possible that a compiler that sees your code is loading both a[0] and a[1] in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.

With the two-dimensional array scenario, there should still be no difference as long as the entire array M is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as eight. Here is a problem scenario (a sketch of the resulting access pattern follows the list):

  • An array M has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array float M[1024][1024];, M[0][0] and M[1][0] are 4096 bytes apart and map to the same cache set.
  • As you traverse a column of the array, you access M[0][0], M[1][0], M[2][0], M[3][0], and so on. The cache line for each of these elements is loaded into cache.
  • As you continue along the column, you access M[8][0], M[9][0], and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only eight lines, the earlier lines containing M[0][0] and so on are evicted from cache.
  • When you complete the column and start the next column by reading M[0][1], the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).
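
A minimal sketch of that access pattern (the 1024x1024 size is just the example from the list; it assumes the 4096-byte critical stride and eight-way L1 described above):

float sum_column_major(float (*M)[1024]) {  //hypothetical 1024x1024 matrix, each row is 4096 bytes
    float x = 0.0f;
    for (int j = 0; j < 1024; j++)
        for (int i = 0; i < 1024; i++)
            x += M[i][j];  //inner loop strides 4096 bytes, hitting the same L1 set every time
    return x;
}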
Eric Postpischil
  • This is a good answer. I'm aware of the problems with critical strides. The only piece missing to me (and I should have made it clearer in my question) is the effect in L2. The tile sizes I am using for my GEMM code fit in L2, not L1 (I could make them any size, but I find making them fit in L2 gives the best result). So I guess I'm still a bit confused about what happens when the matrix fits in L2 but not L1. Multi-level cache is complicated. – Z boson Dec 11 '13 at 12:36
  • Most modern CPUs also have memory disambiguation, so even a partial store/load address match doesn't have to bottleneck you; the CPU can speculate that it's safe to reorder. – Leeor Dec 11 '13 at 13:30

Fetching a[0] and then either a[1] or b[0] amounts to two cache accesses that hit the L1 in either case. You didn't say which uarch you're using, but I'm not familiar with any mechanism that does further "caching" of the full cache line above the L1 (anywhere in the memory unit), and I don't think such a mechanism would be feasible (at least not at any reasonable price).

Assume you read a[0] and then a[1], and would like to save the effort of accessing the L1 again for that line. Your HW would have to not only keep the full cache line somewhere in the memory unit in case it's accessed again (I'm not sure how common that case is, so this feature is probably not worth the effort), but also keep it snoopable as a logical extension of your cache in case some other core tries to modify a[1] between these two reads (which x86 permits for WB memory). In fact, it could even be a store in the same thread context, and you'd have to guard against that as well (since most common x86 CPUs today perform loads out of order). If you don't maintain both of these (and probably other safeguards too), you break coherency; if you do, you've created monstrous logic that does the same thing your L1 already does, just to save a meager 1-2 cycles of access.

However, even though both options require the same number of cache accesses, there may be other considerations affecting their efficiency, such as L1 banking, same-set access restrictions, lazy LRU updating, etc., all of which depend on your exact machine implementation.

If you don't focus only on memory/cache access efficiency, your compiler should be able to vectorize accesses to consecutive memory locations, which would still incur the same cache accesses but would be lighter on execution bandwidth. I think any decent compiler should be able to unroll your loops at this size and combine the consecutive accesses into a single vector load, but you may be able to help it by using option 1 (especially if there are also writes or other problematic instructions in the middle that would complicate the job for the compiler).

Edit

Since you're also asking about fitting the matrix in the L2: that simplifies the question. In that case, reusing the same line(s) multiple times as in option 1 is better because it lets you hit in the L1, while the alternative is to constantly fetch from the L2, which has higher latency and lower bandwidth. This is the basic principle behind loop tiling / blocking.
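
As a rough sketch of what blocking looks like for a simple read-only traversal (the tile sizes TI and TJ are made-up placeholders that would be tuned so one tile fits in the target cache level; in a real GEMM the blocking is applied to the three matrix loops instead):

enum { N = 1024, TI = 64, TJ = 64 };  //placeholder sizes; a 64x64 float tile is 16 KB

float sum_blocked(float (*M)[N]) {
    float x = 0.0f;
    for (int ii = 0; ii < N; ii += TI)
        for (int jj = 0; jj < N; jj += TJ)
            for (int i = ii; i < ii + TI; i++)      //work on one TI x TJ tile at a time
                for (int j = jj; j < jj + TJ; j++)  //row-major inside the tile
                    x += M[i][j];                   //each fetched line is reused while hot
    return x;
}

With TI = TJ = 64 a tile of floats is 16 KB, which would fit in a 32 KB L1; the same structure with larger tiles targets the L2.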

Leeor
  • Good point about why tiling is useful. That's why I'm using it. I'm wondering if I should use two levels of tiling: one for L1 and one for L2. Right now I only use tiles that fit in L2 and then I rearrange the matrix ([something like the transpose](http://stackoverflow.com/questions/20435621/calculating-matrix-product-is-much-slower-with-sse-than-with-straight-forward-al/20440362#20440362)) so that I can use option 1. But if I used two levels of tiling I would not have to rearrange. – Z boson Dec 12 '13 at 07:32
  • I don't see much practical point in two-level tiling here; what's important is the data your code is currently working on, and the rest can be kept as far away as needed as long as you fetch it effectively (ahead of time if possible). If the access pattern were more complicated and you had to load tiles several times, maybe there would be a point to it. – Leeor Dec 12 '13 at 08:43

Spatial locality is king, so version 1 is faster. A good compiler can even vectorize the reads using SSE/AVX.

The CPU reorders reads, so it doesn't matter which one comes first. In an out-of-order CPU it should matter very little whether the two cache lines map to the same set.

For large matrices, it is even more important to maintain locality so the L1 cache stays hot (fewer cache misses).

egur
  • Maybe I should not have given the matrix example. I'm writing my own GEMM code, so I do the SSE/AVX myself. I use tiling, so I'm interested in the answer for a tile which fits in the cache. I do e.g. eight dot products at once by reading eight values from each row and moving down the rows. So I want to know if reading the next eight values from the next row makes a difference, or if I should reorder the matrix so the next eight values are in the same cache line. Maybe I should explain this in the question. – Z boson Dec 11 '13 at 10:43
  • It would be faster if you did the dot product across a cache line. That way the CPU will prefetch the data without any hints from the code. – egur Dec 11 '13 at 11:40
  • But if the data is already in the cache there is nothing to prefetch. That's why I stated the matrix fits in the cache and is already loaded into the cache. Maybe I don't understand what prefetch means. The only thing I could see making a difference is if accessing the most recent cache line is faster than accessing a new one. – Z boson Dec 11 '13 at 11:52
  • Yes, but you also talked about large matrices that fit in the L2 cache. Personally I would stay away from prefetching via hints. If you want to get the most out of a specific CPU model, you should experiment with different tile sizes. Note that a different CPU model may behave differently. – egur Dec 11 '13 at 12:00
  • Is there some difference in behavior between L2 and L1 (besides their size)? I read that the L2 cache cannot prefetch more than one line at a time. I'm not sure what that means, but it seems to imply that L1 can prefetch more than one line at a time, and I don't see why that matters. Could you add some information about the differences between L1 and L2 to your answer? – Z boson Dec 11 '13 at 12:05

Although I don't know the answer to your question(s) directly (someone else may have more knowledge about processor architecture), have you tried, or is it possible, to find the answer yourself through some form of benchmarking?

You can get a high-resolution timer from a function such as QueryPerformanceCounter (assuming you're on Windows) or its OS equivalent, read it, repeat the reads you want to test some number of times, then read the timer again to compute the average time a read took.

Perform this process again for different reads and you should be able to compare average read times for the different types of reads, which should answer your question. That's not to say that the answer will remain the same on different processors, though.
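
For example, a rough skeleton along those lines (this one uses __rdtsc() from GCC/Clang's <x86intrin.h> instead of QueryPerformanceCounter; a real measurement would also need warm-up, serialization, and care to keep the compiler from optimizing the loads away):

#include <stdio.h>
#include <x86intrin.h>  //__rdtsc() with GCC/Clang; MSVC has it in <intrin.h>

float a[16] __attribute__((aligned (64)));
float b[16] __attribute__((aligned (64)));

int main(void) {
    enum { REPS = 10000000 };
    volatile float sink;

    unsigned long long t0 = __rdtsc();
    for (int r = 0; r < REPS; r++)
        sink = a[0] + a[1];  //two reads from the same cache line
    unsigned long long t1 = __rdtsc();
    for (int r = 0; r < REPS; r++)
        sink = a[0] + b[0];  //two reads from different cache lines
    unsigned long long t2 = __rdtsc();

    printf("same line: %.2f cycles/iter, two lines: %.2f cycles/iter\n",
           (double)(t1 - t0) / REPS, (double)(t2 - t1) / REPS);
    (void)sink;
    return 0;
}

Compile with optimizations but check the generated assembly, since the compiler is free to hoist or combine the loads inside the loops.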

parrowdice