I have defined two functions as a minimal working example.
In [1]: import numpy as np
In [2]: A = np.random.random(10_000_000)
In [3]: def f():
   ...:     return A.copy() * np.pi
   ...:
In [4]: def g():
   ...:     B = A.copy()
   ...:     B *= np.pi
   ...:     return B
Both of them return the same result:
In [5]: assert all(f() == g())
but I would expect g() to be faster, because for A the augmented assignment is more than 4 times as fast as the plain multiplication once the cost of the copy is subtracted:
In [7]: %timeit B = A.copy(); B * np.pi
82.2 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit B = A.copy(); B *= np.pi
55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit B = A.copy()
46.3 ms ± 664 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
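To spell out where the factor of 4 comes from, here is the arithmetic on the three timings above as a small sketch. The variable names are mine, and it assumes the copy costs the same in every measurement, which is only approximately true.

```python
# Mean timings reported above, in milliseconds.
copy_only = 46.3            # B = A.copy()
copy_then_multiply = 82.2   # B = A.copy(); B * np.pi
copy_then_inplace = 55.0    # B = A.copy(); B *= np.pi

# Subtract the cost of the copy to isolate the multiplication itself.
multiply_cost = copy_then_multiply - copy_only
inplace_cost = copy_then_inplace - copy_only
ratio = multiply_cost / inplace_cost

print(f"B * np.pi : {multiply_cost:.1f} ms")
print(f"B *= np.pi: {inplace_cost:.1f} ms")
print(f"ratio     : {ratio:.1f}x")
```

This gives roughly 35.9 ms for the plain multiplication against 8.7 ms for the in-place one.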
Sadly, there is no speedup:
In [10]: %timeit f()
54.5 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: %timeit g()
54.6 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Of course, dis.dis(g) shows some overhead compared to dis.dis(f) (2 * STORE_FAST + 2 * LOAD_FAST):
In [26]: dis.dis(f)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 LOAD_GLOBAL              2 (np)
              8 LOAD_ATTR                3 (pi)
             10 BINARY_MULTIPLY
             12 RETURN_VALUE
In [27]: dis.dis(g)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 STORE_FAST               0 (B)

  3           8 LOAD_FAST                0 (B)
             10 LOAD_GLOBAL              2 (np)
             12 LOAD_ATTR                3 (pi)
             14 INPLACE_MULTIPLY
             16 STORE_FAST               0 (B)

  4          18 LOAD_FAST                0 (B)
             20 RETURN_VALUE
but for A = np.random.random(1) the overhead (difference in execution time) is less than 2 µs.
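The extra instructions can also be listed programmatically instead of comparing the two listings by eye. This is only a sketch using dis.get_instructions and collections.Counter from the standard library; the names f_ops, g_ops and extra are mine. Opcode names depend on the Python version (for example, newer versions no longer have BINARY_MULTIPLY and INPLACE_MULTIPLY), so the exact output may differ from the 3.7 listings above.

```python
import dis
from collections import Counter


# Same bodies as f() and g() above. They are only compiled and disassembled
# here, never called, so the globals A and np do not need to exist.
def f():
    return A.copy() * np.pi


def g():
    B = A.copy()
    B *= np.pi
    return B


f_ops = [ins.opname for ins in dis.get_instructions(f)]
g_ops = [ins.opname for ins in dis.get_instructions(g)]

# Counter subtraction keeps only the opcodes that g() contains more often.
extra = Counter(g_ops) - Counter(f_ops)
for opname, count in sorted(extra.items()):
    print(f"{count} x {opname}")
```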
To make things even more confusing, I defined a third function, h(), which behaves as expected (it is slower than f()):
In [19]: def h():
    ...:     B = A.copy()
    ...:     return B * np.pi
    ...:
In [20]: %timeit h()
81.9 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
but dis.dis(h) gives me no insight into why:
In [28]: dis.dis(h)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 STORE_FAST               0 (B)

  3           8 LOAD_FAST                0 (B)
             10 LOAD_GLOBAL              2 (np)
             12 LOAD_ATTR                3 (pi)
             14 BINARY_MULTIPLY
             16 RETURN_VALUE
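For anyone who wants to reproduce the comparison outside IPython, here is a minimal standalone sketch using the standard timeit module. The array size and the repeat counts are arbitrary choices of mine, reduced so the script finishes quickly, and it reports the best run per call rather than the mean ± std. dev. that %timeit prints, so the absolute numbers will not match the ones above.

```python
import timeit

import numpy as np

# Smaller than the 10_000_000 elements used above so the script runs quickly.
A = np.random.random(1_000_000)


def f():
    return A.copy() * np.pi


def g():
    B = A.copy()
    B *= np.pi
    return B


def h():
    B = A.copy()
    return B * np.pi


# All three variants must agree before their speed is compared.
assert np.array_equal(f(), g()) and np.array_equal(f(), h())

results = {}
for func in (f, g, h):
    # Best of 3 repeats with 5 calls each, converted to milliseconds per call.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    results[func.__name__] = best * 1e3
    print(f"{func.__name__}(): {results[func.__name__]:.2f} ms per call")
```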
Why is there no speed benefit from in-place multiplication when returning a numpy array? Or, put the other way round, why does f() get the speed benefit despite using binary multiplication?
I use Python 3.7.12 and numpy 1.21.6.