i am creating a simple matrix multiplication procedure, operating on the Intel Xeon Phi architecture.The procedure looks like this (parameters are A, B, C), and the timing doesn't include initialization:
//start timing
for(int i = 0; i < size; i++){
for(int k = 0; k < size; k++) {
register TYPE aik = A[i][k];
for(int j = 0; j < size; j++) {
C[i][j] += aik * B[k][j];
}
}
}
//end timing
I am using restrict, aligned data and so on. However, if the matrices are allocated using dynamic memory (posix_memalign), the computation incurs in a severe slow down, i.e. for TYPE=float and 512x512 matrices takes ~0.55s in the dynamic case while in the other case ~0.25. On a different architecture (Intel Xeon E5), there is also a slow down, but it is barely noticeable (about 0.002 s).
Any help is apreciated!