Why is there no speed benefit of in-place multiplication when returning a numpy array?

Question

I have defined two functions as a minimal working example.

In [2]: A = np.random.random(10_000_000)

In [3]: def f():
   ...:     return A.copy() * np.pi
   ...: 

In [4]: def g():
   ...:     B = A.copy()
   ...:     B *= np.pi
   ...:     return B

Both of them return the same result:

In [5]: assert all(f() == g())

but I would expect g() to be faster, as augmented assignment is (for A, once the cost of A.copy() is subtracted) more than 4 times as fast as multiplication:

In [7]: %timeit B = A.copy(); B * np.pi
82.2 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit B = A.copy(); B *= np.pi
55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit B = A.copy()
46.3 ms ± 664 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sadly, there is no speedup:

In [10]: %timeit f()
54.5 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit g()
54.6 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Of course dis.dis(g) shows some overhead when compared to dis.dis(f) (2 * STORE_FAST + 2 * LOAD_FAST):

In [26]: dis.dis(f)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 LOAD_GLOBAL              2 (np)
              8 LOAD_ATTR                3 (pi)
             10 BINARY_MULTIPLY
             12 RETURN_VALUE

In [27]: dis.dis(g)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 STORE_FAST               0 (B)

  3           8 LOAD_FAST                0 (B)
             10 LOAD_GLOBAL              2 (np)
             12 LOAD_ATTR                3 (pi)
             14 INPLACE_MULTIPLY
             16 STORE_FAST               0 (B)

  4          18 LOAD_FAST                0 (B)
             20 RETURN_VALUE

but for A = np.random.random(1) the overhead (difference in execution time) is less than 2 µs.

To make things even more confusing, I defined a third function h(), which behaves as expected (it is slower than f()):

In [19]: def h():
    ...:     B = A.copy()
    ...:     return B * np.pi
    ...: 

In [20]: %timeit h()
81.9 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

but dis.dis(h) gives me no insight into why:

In [28]: dis.dis(h)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 STORE_FAST               0 (B)

  3           8 LOAD_FAST                0 (B)
             10 LOAD_GLOBAL              2 (np)
             12 LOAD_ATTR                3 (pi)
             14 BINARY_MULTIPLY
             16 RETURN_VALUE

Why is there no speed benefit of in-place multiplication when returning a numpy array? Or, put differently, why does f() get the speed benefit despite using binary multiplication?

I use Python 3.7.12 and numpy 1.21.6.

"I use Python 2.7.12 and numpy 1.21.6." - are you sure? NumPy 1.21.6 doesn't support Python 2. — user2357112, May 08 '23 at 17:16
I suggest tag `performance` instead of `python-3.7`, unless you have proved that python version is relevant — dankal444, May 08 '23 at 17:20
@user2357112 Fixed, thanks! I should use copy-paste instead. — abukaj, May 08 '23 at 17:20
`f()` also seems like it should be slower because multiply creates a new array -- you don't need `A.copy()` — Barmar, May 08 '23 at 17:21
@Barmar That is my point. It should be slower but it is not. `B.copy()` is a stub of actual function call in my code. — abukaj, May 08 '23 at 17:24
`ufunc.at` gives insight into the buffering normally used with `*=` operations. https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html — hpaulj, May 08 '23 at 18:14

user2357112 · Accepted Answer · 2023-05-08T17:52:27.010

Your f is benefiting from a temporary elision optimization introduced in NumPy 1.13.

When NumPy can tell that one of the operands of an arithmetic operator has no other references, it may reuse that operand's memory for the result array. That's the case in your f function - the A.copy() array has no other references.

However, detecting this situation is very expensive and not always possible. Checking that the refcount is 1 is easy, but NumPy has to inspect the C-level call stack to make sure that the operation is being invoked by the bytecode evaluation loop (which will discard the operand references) instead of an extension module (which might hold onto those references).

On platforms without the backtrace function, NumPy cannot perform this optimization. Even on platforms with backtrace, the cost of the stack inspection means that NumPy only tries the optimization for arrays of size at least 256 KiB.
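As a rough illustration of the size cutoff, the sketch below checks whether an array is large enough to be considered for elision. The constant and the helper `large_enough_for_elision` are hypothetical names made up for this example (they are not part of NumPy's API), and the threshold value is simply the 256 KiB figure quoted above.

```python
import numpy as np

# Assumption: threshold taken from the figure quoted in this answer (256 KiB).
# This name is illustrative only; it is not exposed by NumPy.
ELIDE_THRESHOLD_BYTES = 256 * 1024


def large_enough_for_elision(arr):
    """Return True if arr meets the size cutoff described above (hypothetical helper)."""
    return arr.nbytes >= ELIDE_THRESHOLD_BYTES


small = np.random.random(1)           # 8 bytes of float64 data
large = np.random.random(10_000_000)  # 80 MB of float64 data

print(large_enough_for_elision(small))  # False
print(large_enough_for_elision(large))  # True
```

For float64 data the cutoff corresponds to 32,768 elements, so the single-element array from the question is far too small to trigger the optimization, while the 10,000,000-element array easily qualifies.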

You can see the implementation in numpy/core/src/multiarray/temp_elide.c.
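One way to observe the effect without relying on timings is to compare peak memory with `tracemalloc`, which NumPy reports its array allocations to. The following is a minimal sketch, not a definitive test: `peak_bytes` is a helper invented for this example, the array is smaller than in the question to keep it quick, and whether elision actually happens depends on the platform, the Python version, and the NumPy version.

```python
import tracemalloc

import numpy as np

A = np.random.random(1_000_000)  # 8 MB, well above the 256 KiB cutoff


def f():
    # The temporary from A.copy() has no other references,
    # so NumPy may reuse its buffer for the product.
    return A.copy() * np.pi


def h():
    # B keeps a reference to the copy, so the product needs a new buffer.
    B = A.copy()
    return B * np.pi


def peak_bytes(func):
    """Peak traced memory (in bytes) while func runs (hypothetical helper)."""
    tracemalloc.start()
    try:
        result = func()  # keep the result alive until the peak is read
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


print("f peak:", peak_bytes(f))
print("h peak:", peak_bytes(h))
```

If the optimization applies, the peak for f should be roughly the size of one array, while h should need about two; if both peaks are similar, elision is probably not taking place in that environment.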
