In my code the deepest loop level contains a BLAS routine for matrix-matrix multiplication. Since this operation is the most expensive operation (regarding computation time) I would like to know what is important to make a matrix-matrix multiplication with complex matrix elements as fast as possible?
I use Fortran with ScaLAPACK. I will use the CGEMM routine.
I have the following specific questions:
- Is it important how the matrices are stored in memory? At the moment, I work with a three dimensional arrays where one index is fixed in each loop cycle such that the three-dimensional arrays reduces to two-dimensional matrices. But I have the feeling that this is innefficient since then the matrix elements aren't physically close together in memory. So, is it better to copy my matrix elemts into a temporary two-dimensional array to pass it to CGEMM?
- In Fortran the first array index is the fastest index. Is there an optimal way how the arrays (matrices) should be shaped in order to achieve fast matrix multiplications? For example I have to perform matrix multiplications A*B where A is a complex 200x4000 matrix and B is a complex 4000x50 matrix. So should I better create A as a 4000x200 array since then the "large" index is the fastes index? Of course then I have to tell CGEMM that A needs to be transposed in order to obtain correct results.
- Are there any well know pitfalls in unsing BLAS routines for matrix-matrix multiplications? I know this is a very general question, but maybe someone knows a good document where some DOs and DONTs are summarized.