
I wrote a matrix-matrix multiplication function (32-bit floats) in C++ using intrinsics, for large matrices (8192x8192). The minimum data size is 32 B for every read and write operation.

I am going to change the algorithm into a blocking one, so that it reads an 8x8 block into 8 YMM registers, does the multiplications against the target block's rows (with another YMM register as the target), and finally accumulates the 8 results in another register and stores it to memory.
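For reference, a minimal sketch of the kind of 8x8 micro-kernel I mean (the function and index names are illustrative only; it assumes row-major N x N matrices with N = 8192 and uses plain AVX mul + add rather than FMA):

```cpp
#include <immintrin.h>

// Multiply-accumulate one 8x8 block: C[i0..i0+7][j0..j0+7] +=
// A[i0..i0+7][k0..k0+7] * B[k0..k0+7][j0..j0+7], row-major, stride N.
static void mul8x8_block(const float* A, const float* B, float* C,
                         int N, int i0, int j0, int k0)
{
    // Load the 8 rows of the B block into 8 YMM registers (8 floats = 32 B each).
    __m256 b[8];
    for (int k = 0; k < 8; ++k)
        b[k] = _mm256_loadu_ps(&B[(k0 + k) * N + j0]);

    // For each row of the C block: broadcast one A element per step,
    // multiply it with the matching B row and accumulate.
    for (int i = 0; i < 8; ++i) {
        __m256 acc = _mm256_loadu_ps(&C[(i0 + i) * N + j0]);  // C must start zeroed
        for (int k = 0; k < 8; ++k) {
            __m256 a = _mm256_broadcast_ss(&A[(i0 + i) * N + (k0 + k)]);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(a, b[k]));
        }
        _mm256_storeu_ps(&C[(i0 + i) * N + j0], acc);
    }
}
```

The outer loops over i0, j0 and k0 are omitted; their ordering is exactly what determines the access pattern asked about below.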

Question: does it matter if it fetches the 32 B chunks from non-contiguous addresses? Does it change performance drastically if it reads like this:

Read 32 B from p, compute, read 32 B from p + 8192 floats (the next row of the block), compute,
read and compute until all 8 rows are done, then write 32 B to the target matrix row p3

instead of

Read 32 B from p, compute, read the next 32 B (p + 32 bytes), compute, read the next 32 B (p + 64 bytes), ...

I am asking about the read speed of main memory, not of the cache.
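To make the two patterns concrete, something like the following (the helper names and the dummy weight vector w are only for illustration; offsets are in floats, with the byte distances noted in the comments):

```cpp
#include <immintrin.h>

// Pattern A: strided reads, one 32 B load from each of 8 consecutive rows,
// i.e. 8192 floats (32 KiB) between successive loads.
static __m256 read_block_strided(const float* p, __m256 w)
{
    __m256 acc = _mm256_setzero_ps();
    for (int r = 0; r < 8; ++r)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(p + r * 8192), w));
    return acc;
}

// Pattern B: contiguous reads, eight consecutive 32 B chunks of one row,
// i.e. 8 floats (32 B) between successive loads.
static __m256 read_block_contiguous(const float* p, __m256 w)
{
    __m256 acc = _mm256_setzero_ps();
    for (int c = 0; c < 8; ++c)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(p + c * 8), w));
    return acc;
}
```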

Note: I'm using an FX-8150 and I don't know if it can read more than 32 B in a single operation.

huseyin tugrul buyukisik
  • So long as you have a consistent stride the automatic prefetch on most modern CPUs will do a good job. There may be some inefficiency due to use of incomplete cache lines, but as with everything the only real way to know for certain is to implement and benchmark both methods. – Paul R Jul 27 '13 at 20:38
  • Can it automatically prefetch strides as far apart as 32 kB? – huseyin tugrul buyukisik Jul 27 '13 at 20:41
  • Why do you care about memory speed and not the cache? You're on x86, so you should optimize for cache and therefore avoid the 8x8 blocks. 8x8 would be an option for the 2D coherent caches found mostly in graphics processors – a.lasram Jul 27 '13 at 20:41
  • Just trying the blocking in registers. I will benchmark it, but I'm not sure if it's worth it. – huseyin tugrul buyukisik Jul 27 '13 at 20:43
  • CPUs can have multiple active prefetches going on at once. To maximize bandwidth usage (with this amount of data, bandwidth is almost certainly going to be your bottleneck) it might be better to take advantage of this with a hybrid -- compute 2-4 contiguous regions at once in the same loop body. – Cory Nelson Jul 27 '13 at 20:46
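A rough illustration of the hybrid idea in the last comment, walking two (or more) contiguous streams in the same loop body so several hardware prefetchers can be active at once (the names and the two-stream split are illustrative, not from the original code):

```cpp
#include <immintrin.h>

// Process two independent contiguous regions per iteration so the
// hardware prefetchers can track both sequential streams in parallel.
static __m256 sum_two_streams(const float* p0, const float* p1,
                              int n /* multiple of 8 */)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(p0 + i));  // stream 0, sequential
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(p1 + i));  // stream 1, sequential
    }
    return _mm256_add_ps(acc0, acc1);
}
```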

1 Answer


It will probably give you better performance to have one contiguous buffer (at the very least, it's not worse!).

How big the performance difference is will depend on a large number of factors. Of course, if you allocate a bunch of 32-byte blocks, you are quite likely to get "close-together" lumps of memory, so the caching benefit will still be there. The worst case is if every block sits in a different 4 KB segment of memory; if there are only a few bytes of "empty space" between the blocks, it's not that big a deal.

Like so many other performance questions, it has a lot to do with the exact details of what the code does, the memory type, the processor type, etc. The only way to REALLY find out is to benchmark the different options...
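One common way to get that contiguous buffer is to pack each strided block into a small aligned scratch area once, so the inner kernel only sees sequential 32 B loads. A minimal sketch, with the function name and buffer layout chosen purely for illustration:

```cpp
#include <immintrin.h>

// Copy a strided 8x8 block (row stride of N floats) into a contiguous,
// 32-byte-aligned scratch buffer of 64 floats. The packing cost is paid
// once per block and amortized over every multiplication that reuses it.
static void pack_block8x8(const float* src, int N, float* dst)
{
    for (int r = 0; r < 8; ++r)
        _mm256_store_ps(dst + r * 8, _mm256_loadu_ps(src + r * N));
}
```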

Mats Petersson
  • The stride is on the order of 8192 elements (a 32 kB difference), sometimes as small as 512 elements (a 2 kB difference) – huseyin tugrul buyukisik Jul 27 '13 at 20:40
  • How bad is accessing a different segment each time? – huseyin tugrul buyukisik Jul 27 '13 at 20:46
  • It is worse, but it's hard to say how much without knowing exactly what the memory controller does and what type of memory you have (e.g. 27-9-9-9-9 memory will do better than 32-12-12-12-12 memory). The further apart the accesses are, the more likely it is that you have to "open a new page" on the memory controller - typically, a page is 4 KB (no, not the same as a "page" in virtual memory handling). It also depends on how long the actual math using the data takes - and even if you showed me the code, and I had the exact timings of each instruction, often it's hard to judge how much the processor will block instructions – Mats Petersson Jul 27 '13 at 20:57
  • I just tried some buggy code (but at least it accesses memory the way I showed), and it was 100x worse. Did I make a serious error, or is that a natural outcome of opening new pages? – huseyin tugrul buyukisik Jul 27 '13 at 21:01
  • 100x worse does seem a bit high. I'd expect around 10-100%, but not 10000% (100x). – Mats Petersson Jul 27 '13 at 21:08
  • For the '100x worse' problem: are you initializing the memory with valid floating-point values? Random data will result in some denormal values, which can have a huge performance impact (depending on the flush-to-zero setting). Also, consider using software prefetch. Some boards even have a BIOS setting to disable hardware prefetch. The real problem with this type of tuning is that results will vary across different Intel and AMD processor models. – Jul 28 '13 at 00:16
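If the denormal theory in the last comment applies, here is a hedged sketch of the two knobs it mentions, flush-to-zero / denormals-are-zero and an explicit software prefetch (the function name and the prefetch distance are illustrative only):

```cpp
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE, _mm_prefetch
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

static void tune_fp_and_prefetch(const float* next_chunk)
{
    // Treat subnormal inputs and results as zero so accidental denormals
    // (e.g. from uninitialized memory) don't cause a huge slowdown.
    // Both settings are per thread.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    // Optional software prefetch of an upcoming 32 B chunk; the distance
    // and hint level are things to tune by benchmarking.
    _mm_prefetch(reinterpret_cast<const char*>(next_chunk), _MM_HINT_T0);
}
```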