Let's assume cache lines are 64 bytes wide and I have two arrays a and b which each fill a cache line and are also aligned to a cache line. Let's also assume that both arrays are in the L1 cache so when I read from them I don't get a cache miss.
float a[16]; //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16]; //64 byte aligned
I read a[0]. My question is: is it now faster to read a[1] than to read b[0]? In other words, is it faster to read from the most recently used cache line?
Does the set matter? Let's now assume that I have a 32 KB L1 data cache which is 4-way set associative. So if a and b are 8192 bytes apart they end up in the same set. Will this change the answer to my question?
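To make the set arithmetic concrete, here is a minimal sketch of where the 8192 comes from. It assumes the geometry stated above (32 KB, 4-way, 64-byte lines) and a plain modulo mapping from address to set. The names set_index and same_set are hypothetical helpers for illustration, and real hardware may use a different geometry or indexing scheme.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry from the question: 32 KB, 4-way, 64-byte lines. */
#define LINE_BYTES  64u
#define CACHE_BYTES (32u * 1024u)
#define WAYS        4u
#define NUM_SETS    (CACHE_BYTES / (LINE_BYTES * WAYS)) /* 128 sets */
#define SET_STRIDE  (NUM_SETS * LINE_BYTES)             /* 8192 bytes */

/* Hypothetical helper: set index under a simple modulo mapping.
 * Dividing by the line size drops the offset bits; the remainder
 * modulo the number of sets keeps the index bits. */
static unsigned set_index(uintptr_t addr)
{
    assert(NUM_SETS > 0);
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Hypothetical helper: nonzero if both addresses map to the same set. */
static int same_set(uintptr_t x, uintptr_t y)
{
    return set_index(x) == set_index(y);
}
```

Under these assumptions, two addresses that differ by a multiple of SET_STRIDE map to the same set, while addresses 64 bytes apart map to neighbouring sets.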
Another way to ask my question (which is what I really care about) is with regard to reading a matrix.
In other words, which of these two code options will be more efficient, assuming matrix M fits in the L1 cache, is 64 byte aligned, and is already in the L1 cache?
float M[16][16]; //64 byte aligned
Version 1:
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[i][j];
    }
}
Version 2:
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[j][i];
    }
}
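For reference, here is a self-contained sketch of the two versions as functions, plus a helper that shows which cache line each element falls in. It assumes 64-byte lines and a 64-byte-aligned M, so each row of 16 floats occupies exactly one line. The function names (sum_rows, sum_cols, line_index, fill_sequential) are hypothetical and only serve to illustrate the access patterns. This does not measure anything; it only makes the difference in access order explicit.

```c
#include <assert.h>
#include <stddef.h>

#define N 16
#define LINE_BYTES 64u /* assumed cache line size */

/* Version 1: inner loop walks along a row, so consecutive reads are
 * 4 bytes apart and stay within one cache line for 16 reads. */
static float sum_rows(float M[N][N])
{
    float x = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x += M[i][j];
    return x;
}

/* Version 2: inner loop walks down a column, so consecutive reads are
 * 64 bytes apart and each one touches a different cache line. */
static float sum_cols(float M[N][N])
{
    float x = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x += M[j][i];
    return x;
}

/* Cache line (relative to the start of M) that holds M[i][j],
 * assuming M itself starts on a line boundary. */
static size_t line_index(size_t i, size_t j)
{
    assert(i < N && j < N);
    return ((i * N + j) * sizeof(float)) / LINE_BYTES;
}

/* Fill M with 0, 1, 2, ... in row-major order (small integers are
 * exactly representable in float, so both sums agree exactly). */
static void fill_sequential(float M[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            M[i][j] = (float)(i * N + j);
}
```

Both functions read the same 256 elements; the only difference is the order in which the 16 cache lines are visited.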
Edit: To make this clear with respect to SSE/AVX, let's assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache so there will not be a cache miss)?
Edit: I'm interested in all processors since Intel Core 2 and Nehalem, but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.