Efficient complex matrix-matrix multiplication: matrix shapes and memory

Question

In my code the deepest loop level contains a BLAS routine for matrix-matrix multiplication. Since this operation is the most expensive operation (regarding computation time) I would like to know what is important to make a matrix-matrix multiplication with complex matrix elements as fast as possible?

I use Fortran with ScaLAPACK. I will use the CGEMM routine.

I have the following specific questions:

Is it important how the matrices are stored in memory? At the moment, I work with a three dimensional arrays where one index is fixed in each loop cycle such that the three-dimensional arrays reduces to two-dimensional matrices. But I have the feeling that this is innefficient since then the matrix elements aren't physically close together in memory. So, is it better to copy my matrix elemts into a temporary two-dimensional array to pass it to CGEMM?
In Fortran the first array index is the fastest index. Is there an optimal way how the arrays (matrices) should be shaped in order to achieve fast matrix multiplications? For example I have to perform matrix multiplications A*B where A is a complex 200x4000 matrix and B is a complex 4000x50 matrix. So should I better create A as a 4000x200 array since then the "large" index is the fastes index? Of course then I have to tell CGEMM that A needs to be transposed in order to obtain correct results.
Are there any well know pitfalls in unsing BLAS routines for matrix-matrix multiplications? I know this is a very general question, but maybe someone knows a good document where some DOs and DONTs are summarized.

Efficient complex matrix-matrix multiplication: matrix shapes and memory

0 Answers0