I wrote a matrix-matrix multiplication function for 32-bit floats in C++ using intrinsics, for large matrices (8192x8192); the minimum data size is 32B for every read and write operation.
I am going to change the algorithm to a blocking one: it reads an 8x8 block into 8 YMM registers, does the multiplications against the rows of the target block (using another YMM register as the target), and finally accumulates the 8 results in another register before storing it to memory.
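To be concrete, here is roughly what I have in mind for the 8x8 micro-kernel. This is only an illustrative sketch, not my actual code: the function name, the leading-dimension parameters and the row-major layout are assumptions for the example.

```cpp
#include <immintrin.h>

// Illustrative 8x8 micro-kernel: C[8][8] += A[8][8] * B[8][8],
// where the blocks live inside row-major N x N matrices (N = 8192).
// ldA/ldB/ldC are row strides in floats (hypothetical parameters).
static void kernel8x8(const float* A, const float* B, float* C,
                      int ldA, int ldB, int ldC)
{
    // Load the 8 rows of the B block: 8 x 32B into 8 YMM registers.
    __m256 b[8];
    for (int k = 0; k < 8; ++k)
        b[k] = _mm256_loadu_ps(B + k * ldB);

    for (int i = 0; i < 8; ++i)
    {
        // Load the current 32B row of C so results accumulate across blocks.
        __m256 acc = _mm256_loadu_ps(C + i * ldC);
        for (int k = 0; k < 8; ++k)
        {
            // Broadcast A[i][k] and multiply with row k of the B block.
            __m256 a = _mm256_broadcast_ss(A + i * ldA + k);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(a, b[k]));
        }
        // One 32B store per C row.
        _mm256_storeu_ps(C + i * ldC, acc);
    }
}
```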
Question: does it matter if it gets the 32B chunks from non-contiguous addresses? Does performance change drastically if it reads like this:
read 32B from p, compute, read 32B from p+8192 (this is the next row of the block), compute,
read and compute until all 8 rows are done, then write 32B to the target matrix row p3
instead of
read 32B from p, compute, read 32B from p+32, compute, read 32B from p+64, ...
I am asking about the read speed from memory, not from the cache.
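To make sure I am asking about the right thing, here is a minimal sketch of the two read patterns I am comparing. The "compute" step is replaced by a dummy add, and the function names and the stride parameter are just illustrative:

```cpp
#include <immintrin.h>

// Pattern 1: strided reads, one 32B load per matrix row
// (addresses p, p + ld, p + 2*ld, ... with ld = 8192 floats).
static __m256 read_block_strided(const float* p, int ld)
{
    __m256 acc = _mm256_setzero_ps();
    for (int r = 0; r < 8; ++r)                 // 8 rows of the block
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + r * ld));
    return acc;
}

// Pattern 2: contiguous reads (p, p + 32B, p + 64B, ...).
static __m256 read_block_contiguous(const float* p)
{
    __m256 acc = _mm256_setzero_ps();
    for (int r = 0; r < 8; ++r)                 // 8 consecutive 32B chunks
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + r * 8));
    return acc;
}
```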
Note: I'm using an FX-8150 and I don't know whether it can read more than 32B in a single operation.