
When multiplying large matrices (say A and B, via A.dot(B)), does numpy exploit spatial locality by computing the transpose of B and using row-wise multiplication, or does it access the elements of B column-wise, which would lead to many cache misses? I have observed that memory bandwidth becomes a bottleneck when I run multiple instances of the same program: for example, if I run 4 independent instances of a program that multiplies large matrices on a 20-core machine, I only see a 2.3x speedup.
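For reference, a minimal sketch of what each instance runs (the matrix size is illustrative):

```python
import numpy as np

# One instance of the workload; I launch several of these as
# independent processes. The matrix size here is illustrative.
N = 4096
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = A.dot(B)  # the multiplication in question
```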

Bharat
    One of the old optimization tricks is to use Fortran striding rather than C striding for arrays you intend to matrix-multiply on the right. Which implies that it doesn't do this optimization automatically—or maybe just that it didn't back in the early days, and old people haven't learned to stop using the trick. :) – abarnert May 06 '15 at 01:29
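A minimal sketch of what the striding trick in the comment above amounts to (whether it still pays off depends on the numpy build; the sizes here are just illustration):

```python
import numpy as np

B = np.random.rand(3, 3)      # C-ordered: rows are contiguous
Bf = np.asfortranarray(B)     # Fortran-ordered: columns are contiguous

print(B.strides, Bf.strides)  # (24, 8) vs (8, 24) for float64

# The trick: keep the right-hand operand Fortran-ordered so the
# columns that a matrix multiply walks down are contiguous in memory.
A = np.random.rand(3, 3)
C = A.dot(Bf)
```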

1 Answer


Numpy's dot is implemented in multiarraymodule.c as PyArray_MatrixProduct2. The implementation it actually uses depends on a number of factors.

If you have numpy linked to a BLAS implementation, your dtypes are all double, cdouble, float, or cfloat, and your arrays each have 2 or fewer dimensions, then numpy hands the arrays off to the BLAS implementation. What it does from there depends on the BLAS package you're using.
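As a rough sketch, those conditions amount to something like the following predicate (a hypothetical helper for illustration; the authoritative checks live in numpy's C source):

```python
import numpy as np

def probably_uses_blas(a, b):
    """Approximate the conditions under which dot hands off to BLAS:
    supported dtypes and at most 2 dimensions per array."""
    blas_dtypes = (np.float32, np.float64, np.complex64, np.complex128)
    return (a.dtype.type in blas_dtypes and b.dtype.type in blas_dtypes
            and a.ndim <= 2 and b.ndim <= 2)

# np.__config__.show() prints which BLAS (if any) numpy was built against.
```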

Otherwise, no, it doesn't do this. However, at least on my machine, doing the multiplication with an explicit transpose and einsum (or computing a dot product that way in general) is ten times slower than just using dot, because dot hands the work off to BLAS.
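For concreteness, a sketch of that comparison (exact ratios vary by machine and BLAS):

```python
import numpy as np
from timeit import timeit

N = 1024
A = np.random.rand(N, N)
B = np.random.rand(N, N)
Bt = np.ascontiguousarray(B.T)  # explicit transpose; rows of Bt are columns of B

t_dot = timeit(lambda: A.dot(B), number=10)                       # BLAS path
t_ein = timeit(lambda: np.einsum('ij,kj->ik', A, Bt), number=10)  # numpy's own loops
print(t_dot, t_ein)  # expect einsum to be roughly an order of magnitude slower
```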

cge
  • I have numpy linked to BLAS. I tried using the C- and Fortran-style orderings as abarnert suggested, and I could not observe any performance difference; in fact, using C ordering for the left matrix and Fortran ordering for the right matrix was slower. I think the question then becomes: does BLAS do this internally? Probably yes... – Bharat May 06 '15 at 07:03
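A sketch of the comparison Bharat describes, assuming a BLAS-backed numpy (absolute timings are machine-dependent):

```python
import numpy as np
from timeit import timeit

N = 2048
A = np.random.rand(N, N)    # C-ordered left operand
B = np.random.rand(N, N)    # C-ordered right operand
Bf = np.asfortranarray(B)   # same values, Fortran-ordered

t_cc = timeit(lambda: A.dot(B), number=5)   # C x C
t_cf = timeit(lambda: A.dot(Bf), number=5)  # C x Fortran
print(t_cc, t_cf)  # with a good BLAS the difference is typically small
```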