
I am creating a simple matrix multiplication procedure for the Intel Xeon Phi architecture. The procedure looks like this (the parameters are A, B, and C), and the timing doesn't include initialization:

//start timing
for (int i = 0; i < size; i++) {
    for (int k = 0; k < size; k++) {
        register TYPE aik = A[i][k];
        for (int j = 0; j < size; j++) {
            C[i][j] += aik * B[k][j];
        }
    }
}
//end timing

I am using `restrict`, aligned data, and so on. However, if the matrices are allocated with dynamic memory (posix_memalign), the computation incurs a severe slowdown: for TYPE=float and 512x512 matrices, it takes ~0.55 s in the dynamic case versus ~0.25 s in the static case. On a different architecture (an Intel Xeon E5) there is also a slowdown, but it is barely noticeable (about 0.002 s).
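
For concreteness, the dynamic allocation is along these lines (a minimal sketch; the 64-byte alignment and the row-pointer layout are illustrative assumptions, not my exact code):

#include <stdlib.h>

#define TYPE float
#define ALIGNMENT 64   /* illustrative: one cache line */

/* Allocate a size x size matrix as an array of row pointers,
   each row aligned with posix_memalign. Returns NULL on failure. */
TYPE **alloc_matrix(size_t size)
{
    TYPE **m = malloc(size * sizeof(TYPE *));
    if (!m) return NULL;
    for (size_t i = 0; i < size; i++) {
        void *row = NULL;
        if (posix_memalign(&row, ALIGNMENT, size * sizeof(TYPE)) != 0) {
            while (i--) free(m[i]);
            free(m);
            return NULL;
        }
        m[i] = row;
    }
    return m;
}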

Any help is appreciated!

  • What are the sizes of all these arrays? Maybe they won't all fit in the caches, or rows span cache lines, or something? Have you tried switching the two outer loops? – Some programmer dude Oct 23 '14 at 16:11
  • Also, the [`register` storage specifier](http://en.cppreference.com/w/cpp/language/storage_duration) has been deprecated. – Some programmer dude Oct 23 '14 at 16:14
  • If you want to do fast matrix multiplication, you really ought to use a BLAS library rather than code this yourself (hint: the naive algorithm isn't the fastest way to do this!). I'm sure Intel has one highly tuned for the Xeon Phi. – Ira Baxter Oct 23 '14 at 16:22
  • @JoachimPileborg, every row is 2048 bytes (4 bytes per float, 512 elements). Thanks, I didn't know `register` was deprecated. In general the matrices won't fit in the caches (32 KB L1, 512 KB L2); what seems very strange, though, is the huge difference in behavior between the two storage schemes. @IraBaxter, thanks, but I need to code it myself, sequentially, since it's just a "sketch" to evaluate the performance of to-be-implemented parallel solutions. – alessandrolenzi Oct 23 '14 at 17:08
  • Your "to-be-implemented" parallel solutions should be compared to the best implementations that are already available. Comparing them to an algorithm that is poorly organized with respect to modern architectures may show that you can code something faster than a naive solution, but you won't get brownie points for that. – Ira Baxter Oct 24 '14 at 02:25
  • Time many calls to this instead of just one, so that you don't measure the page faulting that happens when you put memory on the heap (see the timing sketch after these comments). It is possible that the overhead is much lower for automatic variable allocation. – Jeff Hammond Oct 24 '14 at 18:05
  • What alignment are you using? Please share the ENTIRE code if you want detailed analysis. – Jeff Hammond Oct 24 '14 at 18:07
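
A minimal sketch of the repeated-timing idea suggested in the comments (matmul stands for the loop nest in the question; TYPE=float as above):

#include <time.h>

#define TYPE float

/* matmul stands for the loop nest in the question. */
void matmul(TYPE **A, TYPE **B, TYPE **C, int size);

/* One untimed warm-up call pays the first-touch page faults; the
   average over several repetitions smooths out the remaining noise. */
double time_matmul(TYPE **A, TYPE **B, TYPE **C, int size, int reps)
{
    struct timespec t0, t1;
    matmul(A, B, C, size);                      /* warm-up, not timed */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        matmul(A, B, C, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec)) / reps;
}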

2 Answers


What happens to the timing differences if you make the matrix a different size? (e.g. 513x513)

The reason I ask is that I think you might be seeing this effect because you exceed the cache's way associativity and evict elements of C[i][] from L2 as you loop over B in the loop over k. If B and C are aligned and the sizes are powers of 2, you can get cache super-alignment that causes this issue.

If B and C are on the stack or are otherwise not aligned, you don't see this effect, because fewer addresses are power-of-2 aligned.
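
If that is the cause, padding the leading dimension breaks the super-alignment without changing the logical size. A sketch, assuming a flat array indexed through an explicit stride (the 16-element pad is one 64-byte line of floats; error checks omitted):

/* Flat layout with a padded stride: each row stays 64-byte aligned, but
   consecutive rows of B and C no longer map to the same cache sets. */
size_t stride = size + 16;            /* 16 floats = one extra 64-byte line */
TYPE *B, *C;
posix_memalign((void **)&B, 64, size * stride * sizeof(TYPE));
posix_memalign((void **)&C, 64, size * stride * sizeof(TYPE));
/* ... then index with B[k * stride + j] and C[i * stride + j] ... */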

amckinley

In the "non-dynamic" case, are the arrays just global variables? If so, they end up in BSS and when the ELF is loaded, the OS will initialize them to zero by default - that's how BSS works. If you allocate them dynamically, independent of what method you use (i.e. malloc, new, posix_memalign, exception is mmap(MAP_POPULATE)), you'll cause faults in the OS when you touch the memory. Fault handling is always expensive. It is relatively more expensive on the Coprocessor because you're running on a tiny little core from a single threaded performance standpoint.