Simple loop tiling example for matrix multiplication

Question

I'm trying to understand what really happens, step by step, when I use loop tiling (blocking) to multiply two matrices. For example, I understand what the code at http://en.wikipedia.org/wiki/Loop_tiling does, but I can't picture what happens within the cache. Let's say I want to multiply two 4x4 matrices: A x B = C.

Now I want to split each of A and B into four 2x2 submatrices, so A = [A1 A2 ; A3 A4] and B = [B1 B2 ; B3 B4]. All elements of C are initialized to zero in memory, e.g. using calloc.
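For reference, the block products implied by this split can be written out as below. The names C1..C4 for the 2x2 blocks of C are an assumption that follows the same layout as A and B.

```latex
C = \begin{bmatrix} C_1 & C_2 \\ C_3 & C_4 \end{bmatrix}, \qquad
\begin{aligned}
C_1 &= A_1 B_1 + A_2 B_3, & C_2 &= A_1 B_2 + A_2 B_4, \\
C_3 &= A_3 B_1 + A_4 B_3, & C_4 &= A_3 B_2 + A_4 B_4.
\end{aligned}
```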

1) Let's assume the matrices are stored in memory in row-major ordering: row1,row2,row3,row4 ...

2) Let's assume I have two cache lines holding 4 matrix elements each. When performing the naive matrix multiplication for the first element of C, C[0,0], I will have a memory access for A[0,0], which copies a whole matrix row into the first cache line. Then I have a second memory access for B[0,0], which fills the second cache line. Then C[0,0] = A[0,0] * B[0,0] + C[0,0]. The next step would be C[0,0] = A[0,1] * B[1,0] + C[0,0]. Since A[0,1] is in the first cache line, I will have a cache hit. However, B[1,0] is not in the second cache line, so I will have another memory access.
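The trace described above can be replayed with a small simulation. This is only a sketch under assumptions that are not in the question: the cache is fully associative with LRU replacement, A is placed at element addresses 0..15 and B at 16..31, and the running sum for C[0,0] is kept in a register so it never touches the cache. The names `trace_c00`, `BASE`, `LINE` and `NUM_LINES` are made up for illustration.

```python
from collections import OrderedDict

N = 4            # 4x4 matrices, row-major
LINE = 4         # elements per cache line (one full matrix row)
NUM_LINES = 2    # the two-line cache from the question

# Assumed layout: A occupies element addresses 0..15, B occupies 16..31.
BASE = {"A": 0, "B": N * N}

def trace_c00():
    """Return (operand, hit) pairs for the loads needed to compute C[0,0]."""
    cache = OrderedDict()   # line tag -> None, ordered least to most recently used
    events = []
    for k in range(N):
        for name, (r, c) in (("A", (0, k)), ("B", (k, 0))):
            tag = (BASE[name] + r * N + c) // LINE
            hit = tag in cache
            if hit:
                cache.move_to_end(tag)          # refresh LRU position
            else:
                if len(cache) == NUM_LINES:
                    cache.popitem(last=False)   # evict the least recently used line
                cache[tag] = None
            events.append((f"{name}[{r},{c}]", hit))
    return events

events = trace_c00()
for operand, hit in events:
    print(operand, "hit" if hit else "miss")
```

Under these assumptions the trace shows 5 misses in 8 loads: A[0,0] misses once and then hits, while every access to B misses, because consecutive elements of a column of B sit in different rows and therefore in different cache lines.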

Would loop tiling be of any help in this example? Could anyone explain, step by step, what happens within the cache and why memory accesses are reduced? If this example is not suitable, could anyone make up one where the benefits of blocking are visible?
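To make the loop structure in question concrete, here is a minimal sketch of the naive and the tiled loop nests for the 4x4 case with 2x2 tiles. It only demonstrates that both orders compute the same product; Python lists say nothing about real cache behaviour, and the function names and test data are made up for illustration.

```python
N = 4   # matrix dimension
T = 2   # tile edge length, so each matrix splits into four 2x2 blocks

def matmul_naive(A, B):
    C = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B):
    C = [[0] * N for _ in range(N)]
    # The three outer loops select one tile each of C, A and B.
    for ii in range(0, N, T):
        for jj in range(0, N, T):
            for kk in range(0, N, T):
                # The three inner loops multiply the selected A and B tiles
                # and accumulate the result into the selected C tile.
                for i in range(ii, ii + T):
                    for j in range(jj, jj + T):
                        for k in range(kk, kk + T):
                            C[i][j] += A[i][k] * B[k][j]
    return C

# Arbitrary integer test data, so the comparison below is exact.
A = [[i * N + j for j in range(N)] for i in range(N)]
B = [[(i + 2 * j) % 5 for j in range(N)] for i in range(N)]
print(matmul_tiled(A, B) == matmul_naive(A, B))
```

The tiled version performs exactly the same multiply-adds as the naive one; only the order in which elements are visited changes.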

Thanks in advance.

An answer that goes step-by-step through the cache accesses would be tedious. But the general reason tiling works is that it reduces the size of your *working set*, ideally to the point where it fits entirely within cache. — Oliver Charlesworth, Dec 28 '13 at 14:37
Thanks for the quick answer. So how is it saved within the cache: the block itself, or still the adjacent memory elements? And if the former is true, how does the cache know that it should copy the block? — user3142067, Dec 28 '13 at 14:39
The cache doesn't "know" anything. It just saves stuff that's read from memory, until that stuff is evicted by other stuff. I'd suggest reading https://en.wikipedia.org/wiki/CPU_cache first. — Oliver Charlesworth, Dec 28 '13 at 14:40
Yes, I know, that's what I assumed. I know how caches work in theory, but I have difficulty picturing why blocking is able to reduce the working set. OK, you have submatrices, and accessing the matrix blockwise will first create compulsory misses for each "subblock-row" (if the matrix is large enough, of course). And you might then have a whole submatrix within the cache. But you will still have to access the other submatrices. So I figured that whether blocking is any good depends on the design of the cache itself (size, associativity, etc.) and the problem at hand. — user3142067, Dec 28 '13 at 14:50
Yes you do have to access the other submatrices. But once you access the 2nd one, you never have to access the 1st one again. So you get a new set of compulsory misses, but no other misses. If your submatrix is bigger than the cache (the limit being submatrix == whole matrix), you get non-compulsory cache misses as you iterate *within* each submatrix. — Oliver Charlesworth, Dec 28 '13 at 14:53
Just to be sure I understood you correctly: C1 = A1*B1 + A2*B3. For this I only have to access each submatrix once. If they happen to be small enough to fit within the cache, I won't have conflict misses etc. However, if I calculate C2, I will have to access A1 again, and depending on how big the cache is and on the cache replacement policy, the cache might have already evicted A1. But when looking at the whole matrix multiplication, blocking exploits the fact that elements are used repeatedly, and accesses them in an order that reduces "later" accesses and increases the probability of cache hits. — user3142067, Dec 28 '13 at 15:18
Yes, if you like, blocking massively increases the *temporal locality* of your access pattern. Between any two consecutive accesses to any particular matrix element, the number of *other* elements that are accessed is reduced, thus reducing (or eliminating) the possibility of eviction. — Oliver Charlesworth, Dec 28 '13 at 15:21
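The working-set argument in the comments above can be checked with a toy miss counter. This is a sketch under assumptions chosen to make the effect visible, none of which come from the thread: 8x8 matrices, 4 elements per cache line, 4x4 tiles, a fully associative LRU cache of 8 lines (just enough for one tile of A plus one tile of B), and only loads of A and B counted, with C assumed to stay in registers. All names are made up for illustration.

```python
from collections import OrderedDict

N, LINE, T = 8, 4, 4     # matrix size, elements per cache line, tile edge
CAPACITY = 8             # cache lines: room for one A tile plus one B tile
BASE = {"A": 0, "B": N * N}   # assumed element addresses of A and B

def count_misses(accesses):
    """Replay (matrix, row, col) loads through an LRU cache and count misses."""
    cache = OrderedDict()
    misses = 0
    for name, r, c in accesses:
        tag = (BASE[name] + r * N + c) // LINE
        if tag in cache:
            cache.move_to_end(tag)
        else:
            misses += 1
            if len(cache) == CAPACITY:
                cache.popitem(last=False)   # evict least recently used line
            cache[tag] = None
    return misses

def naive_order():
    for i in range(N):
        for j in range(N):
            for k in range(N):
                yield ("A", i, k)
                yield ("B", k, j)

def tiled_order():
    for ii in range(0, N, T):
        for jj in range(0, N, T):
            for kk in range(0, N, T):
                for i in range(ii, ii + T):
                    for j in range(jj, jj + T):
                        for k in range(kk, kk + T):
                            yield ("A", i, k)
                            yield ("B", k, j)

naive_misses = count_misses(naive_order())
tiled_misses = count_misses(tiled_order())
print("naive:", naive_misses, "tiled:", tiled_misses)
```

Both orders issue the same 1024 loads. In the naive order each element of C touches 10 distinct lines (2 of A, 8 of B), which exceeds the 8-line cache, so the B lines keep evicting each other. In the tiled order each tile product touches only 8 distinct lines, so each line is loaded at most once per tile product and reused from cache for the rest of it.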


0 Answers