6

I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:

import numpy as np

large_dimension = 1000

a = np.random.random((1,))
b = np.random.random((1, large_dimension))

c = np.matmul(a, b)

In other words, this multiplies the matrix b by the scalar a[0].

I am looking for the most efficient way to compute this, since this operation is repeated millions of times.

I tested the performance of the two trivial ways to do this, and they are practically equivalent:

%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Is there a more efficient way to compute this?

  • Note: I cannot move these computations to a GPU since the program is using multiprocessing and many such computations are done in parallel.
Nur L
  • Can you pack your a values into a diagonal matrix A and your b values into a 2d matrix B and just do AB? If you want this to be faster I think you're gonna have to rework whatever brought you to this point – CJR Apr 23 '21 at 13:28
  • Does `a*b` time any different? – hpaulj Apr 23 '21 at 14:04
  • Try a 3rd test, with `d=np.random.random()` , `%timeit d*b` I found this to be 33% faster than `%timeit a[0] * b` – kcw78 Apr 23 '21 at 14:06
  • `numpy` has provided `*` element-wise multiplication forever, and it handles a mix of dimensions efficiently using `broadcasting`. `@` was added to provide batched `dot`, that is `sum-of-products`. Where possible it passes the task to fast BLAS (or related) code. It works here because the `summed` dimension is size 1. Relative speeds may vary with your `numpy` environment - which BLAS or related libraries you have installed. – hpaulj Apr 23 '21 at 15:12
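To make hpaulj's point concrete, here is a quick check (not from the original thread; the sizes are just the ones used in the question):

import numpy as np

a = np.random.random((1,))
b = np.random.random((1, 1000))

elementwise = a * b            # broadcasting: result has shape (1, 1000)
batched_dot = np.matmul(a, b)  # matmul sums over the size-1 dimension: shape (1000,)
print(np.allclose(elementwise, batched_dot))  # True: same values either way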

3 Answers

6
import numpy as np

large_dimension = 1000

a = np.random.random((1,))
B = np.random.random((1, large_dimension))

%timeit np.matmul(a, B)
5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit a[0] * B
5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Use a plain Python float instead of a NumPy scalar:

%timeit float(a[0]) * B
3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To avoid the memory allocation, write into a preallocated output buffer:

buffer = np.empty_like(B)

%timeit np.multiply(float(a[0]), B, buffer)
2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To avoid the repeated attribute lookup, use an alias:

mul = np.multiply

%timeit mul(float(a[0]), B, buffer)
2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I don't recommend using NumPy scalars at all; doing the conversion to a Python float once, outside the timed code, is faster:

a_float = float(a[0])

%timeit mul(a_float, B, buffer)
1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Furthermore, if possible, initialize the buffer outside the loop, once (assuming, of course, that you have something like a loop):

rng = range(1000)

%%timeit
for i in rng:
    pass
24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
for i in rng:
    mul(a_float, B, buffer)
1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So,

best_iteration_time = (1.91 ms - 0.02 ms) / 1000 iterations ≈ 1.89 µs

speedup = 5.43 µs / 1.89 µs ≈ 2.87

4

In this case, it is probably faster to work with an element-wise multiplication, but the time you see is mostly the overhead of Numpy (calling C functions from the CPython interpreter, wrapping/unwrapping types, making checks, doing the operation, allocating arrays, etc.).

"since this operation is repeated millions of times"

This is the problem. The CPython interpreter is very bad at doing things with low latency. This is especially true when you work on Numpy types, since calling C code and performing checks for a trivial operation is much slower than doing it in pure Python, which is itself much slower than compiled native C/C++ code. If you really need this, and you cannot vectorize your code using Numpy (because you have a loop iterating over timesteps), then you should move away from CPython, or at least away from pure Python code. Instead, you can use Numba or Cython to mitigate the cost of doing C calls, wrapping types, etc. If this is not enough, then you will need to write native C/C++ code (or any similar language), unless you find a dedicated Python package doing exactly that for you. Note that Numba is fast only when it works on native types or Numpy arrays (containing native types). If you work with a lot of pure Python types and do not want to rewrite your code, then you can try the PyPy JIT.


Here is a simple example in Numba avoiding the (costly) creation/allocation of a new array (as well as many Numpy internal checks and calls), written specifically for your case:

import numba as nb
import numpy as np

@nb.njit('void(float64[::1],float64[:,::1],float64[:,::1])')
def fastMul(a, b, out):
    # Multiply the single row of b by the scalar a[0], writing into the
    # preallocated `out` array (a and b as defined in the question).
    val = a[0]
    for i in range(b.shape[1]):
        out[0,i] = b[0,i] * val

res = np.empty(b.shape, dtype=b.dtype)
%timeit fastMul(a, b, res)
# 397 ns ± 0.587 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

At the time of writing, this solution is faster than all the others. As most of the time is spent in calling Numba and performing some internal checks, using Numba directly for the function containing the iteration loop should result in even faster code.
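As a rough illustration of that last point (not part of the original answer; the batch size and shapes below are made up), moving the repetition loop itself into a Numba-compiled function pays the calling overhead once per batch instead of once per multiplication:

import numba as nb
import numpy as np

@nb.njit('void(float64[::1], float64[:,::1], float64[:,::1])')
def scale_batch(scalars, b, out):
    # One scalar per repetition: out[k, :] = scalars[k] * b[0, :].
    for k in range(scalars.shape[0]):
        val = scalars[k]
        for i in range(b.shape[1]):
            out[k, i] = b[0, i] * val

scalars = np.random.random(10_000)              # hypothetical batch of scalars
b = np.random.random((1, 1000))
out = np.empty((scalars.shape[0], b.shape[1]))
scale_batch(scalars, b, out)                    # one Numba call for the whole batch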

Jérôme Richard
  • Thank you. Can you elaborate on editing the question title to "small" matrix? What do you consider a large matrix? – Nur L Apr 26 '21 at 20:22
  • @NurL There is no threshold, but I consider a 1x1 matrix as small. The same goes for a vector that can fit in a CPU cache. A matrix of size 1000x1000 starts to be quite big to me. However, this is a bit subjective. I just added that to highlight that the matrix is small enough that the overhead of managing matrix objects is much higher than the actual computation. If you disagree with that, feel free to change the title ;) . – Jérôme Richard Apr 26 '21 at 21:10
  • Very nice contribution; it looks like you indeed passed the data types to get a bit more speed. However, assuming the array creation `res` is included in the timing, it is actually slower. I get 1.9 µs for the numpy-only version, 0.73 µs for the numba code I posted, 0.56 µs for yours, and 1.17 µs for yours when including the array creation. – Mike Apr 27 '21 at 01:23
  • @Mike Thanks. I am not so surprised by the results since almost everything matters at this granularity, especially the number of parameters. The new parameter needs to be checked by Numba (from a CPython interpreter object type). Normally, you should not have to create a new matrix in the loop every time. Even when the size of your matrix changes, you can "cheat" with Numpy slices and one big allocated matrix, as long as the shape is bounded at runtime by a (small) value (see the sketch below). Keep in mind that many Numba/Numpy type checks will be mostly removed if Numba is used to compile the caller function. – Jérôme Richard Apr 27 '21 at 07:17
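A minimal sketch of the slicing trick from the last comment (the bound MAX_N and the helper below are hypothetical, not from the original program): allocate one big buffer once and reuse views of it when the sizes vary.

import numpy as np

MAX_N = 4096                        # assumed upper bound on the varying dimension
big_buffer = np.empty((1, MAX_N))   # allocated once, outside the hot loop

def scaled_view(value, row):
    # Reuse a slice of the preallocated buffer instead of allocating a new
    # (1, n) array on every call; `row` has shape (1, n) with n <= MAX_N.
    n = row.shape[1]
    out = big_buffer[:, :n]
    np.multiply(value, row, out)
    return out

b = np.random.random((1, 1000))
c = scaled_view(2.5, b)             # a view into big_buffer, no fresh allocation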
2
import numpy as np
import numba

def matmult_numpy(matrix, c):
    # Baseline: NumPy matmul of the length-1 array c with the (1, N) matrix.
    return np.matmul(c, matrix)

@numba.jit(nopython=True)
def matmult_numba(matrix, c):
    # Numba-compiled version: broadcasts c over matrix element-wise.
    return c * matrix

if __name__ == "__main__":
    large_dimension = 1000
    a = np.random.random((1, large_dimension))
    c = np.random.random((1,))

This gives about a factor of 3 speedup using Numba. Numba cognoscenti may be able to do better by explicitly casting the parameter "c" as a scalar (see the sketch after the timings below).

Check the results:

%timeit matmult_numpy(a, c)
2.32 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit matmult_numba(a, c)
763 ns ± 6.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
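A minimal sketch of the "pass c as a scalar" idea (an untimed assumption about what that would look like, not part of the original answer):

import numba
import numpy as np

@numba.jit(nopython=True)
def matmult_numba_scalar(matrix, c):
    # `c` arrives as a native float, so Numba specializes for a scalar
    # argument instead of unpacking a length-1 array on every call.
    return c * matrix

a = np.random.random((1, 1000))
c = np.random.random((1,))
result = matmult_numba_scalar(a, float(c[0]))   # pass the scalar explicitly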

Mike