In pure-Python code:

Case A:

retimg = np.zeros((dstH, dstW, 3), dtype=np.uint8)
A = img[x % (scrH - 1), y % (scrW - 1)]
B = img[x % (scrH - 1), y1 % (scrW - 1)]
C = img[x1 % (scrH - 1), y % (scrW - 1)]
D = img[x1 % (scrH - 1), y1 % (scrW - 1)]
retimg[i, j] = A * (1 - mu) * (1 - nu) + B * mu * (1 - nu) + C * (1 - mu) * nu + D * mu * nu

Case B:

retimg = np.zeros((dstH, dstW, 3), dtype=np.uint8)
A = img[x % (scrH - 1), y % (scrW - 1)]
B = img[x % (scrH - 1), y1 % (scrW - 1)]
C = img[x1 % (scrH - 1), y % (scrW - 1)]
D = img[x1 % (scrH - 1), y1 % (scrW - 1)]
(r, g, b) = (
          A[0] * (1 - mu) * (1 - nu) + B[0] * mu * (1 - nu) + C[0] * (1 - mu) * nu + D[0] * mu * nu,
          A[1] * (1 - mu) * (1 - nu) + B[1] * mu * (1 - nu) + C[1] * (1 - mu) * nu + D[1] * mu * nu,
          A[2] * (1 - mu) * (1 - nu) + B[2] * mu * (1 - nu) + C[2] * (1 - mu) * nu + D[2] * mu * nu)
retimg[i, j] = (r, g, b)

Case A is much faster than Case B.
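For context, the fragments above come from a per-pixel bilinear-interpolation loop. A minimal runnable version of the Case A pattern might look like this (the loop structure, the source-coordinate mapping, and the assignment of `mu`/`nu` to the column/row fractions are assumptions inferred from the fragments):

```python
import numpy as np

def bilinear_resize(img, dstH, dstW):
    """Resize an (scrH, scrW, 3) uint8 image with bilinear interpolation."""
    scrH, scrW, _ = img.shape
    retimg = np.zeros((dstH, dstW, 3), dtype=np.uint8)
    for i in range(dstH):
        for j in range(dstW):
            # Map the destination pixel back into source coordinates
            srcx = i * (scrH / dstH)
            srcy = j * (scrW / dstW)
            x, y = int(srcx), int(srcy)
            nu, mu = srcx - x, srcy - y   # fractional row/column offsets
            x1, y1 = x + 1, y + 1
            # The four neighbouring pixels, with the original wrap-around indexing
            A = img[x % (scrH - 1), y % (scrW - 1)]
            B = img[x % (scrH - 1), y1 % (scrW - 1)]
            C = img[x1 % (scrH - 1), y % (scrW - 1)]
            D = img[x1 % (scrH - 1), y1 % (scrW - 1)]
            # One whole-array expression per pixel (the Case A style)
            retimg[i, j] = A * (1 - mu) * (1 - nu) + B * mu * (1 - nu) \
                         + C * (1 - mu) * nu + D * mu * nu
    return retimg
```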

Then I use Cython to speed up the execution.

Case C:

cdef np.ndarray[DTYPEU8_t, ndim=3] retimg = np.zeros((dstH, dstW, 3), dtype=np.uint8)
cdef np.ndarray[DTYPEU8_t, ndim=1] A,B,C,D
A = img[x % (scrH - 1), y % (scrW - 1)]
B = img[x % (scrH - 1), y1 % (scrW - 1)]
C = img[x1 % (scrH - 1), y % (scrW - 1)]
D = img[x1 % (scrH - 1), y1 % (scrW - 1)]
retimg[i, j] = A * (1 - mu) * (1 - nu) + B * mu * (1 - nu) + C * (1 - mu) * nu + D * mu * nu

Case D:

cdef np.ndarray[DTYPEU8_t, ndim=3] retimg = np.zeros((dstH, dstW, 3), dtype=np.uint8)
cdef float r,g,b
cdef np.ndarray[DTYPEU8_t, ndim=1] A,B,C,D
A = img[x % (scrH - 1), y % (scrW - 1)]
B = img[x % (scrH - 1), y1 % (scrW - 1)]
C = img[x1 % (scrH - 1), y % (scrW - 1)]
D = img[x1 % (scrH - 1), y1 % (scrW - 1)]
(r, g, b) = (
                A[0] * (1 - mu) * (1 - nu) + B[0] * mu * (1 - nu) + C[0] * (1 - mu) * nu + D[0] * mu * nu,
                A[1] * (1 - mu) * (1 - nu) + B[1] * mu * (1 - nu) + C[1] * (1 - mu) * nu + D[1] * mu * nu,
                A[2] * (1 - mu) * (1 - nu) + B[2] * mu * (1 - nu) + C[2] * (1 - mu) * nu + D[2] * mu * nu)

retimg[i, j] = (r, g, b)

Case C is much slower than Case D.

Why does Numpy array multiplication behave so differently in Cython than in pure Python? Theoretically, Case C should be faster than Case D.

Jérôme Richard
bin381
    "Theoretically Case C should faster than Case D" - why? (There's a few repeated terms that you could probably factor out of Case D of course, but apart from those...) – DavidW May 24 '21 at 09:17

2 Answers


The reason Case C is slower than Case D here is the type of the temporary variables. In Case C, many temporary arrays are implicitly created and deleted, which results in a lot of memory allocations. Relative to the cost of the CPython interpreter, memory allocation is fairly cheap. But once the code is optimized with Cython, the allocations become prohibitively expensive, since each one is much slower than a few multiplications. Moreover, with Cython, scalar expressions can be optimized to use processor registers, while array-based expressions are usually not optimized and go through the slow memory hierarchy (optimizing them is very hard). Not to mention the Numpy calls themselves may add significant overhead.

On my machine, a single allocation/deallocation takes more time than computing the full expression.
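One rough way to observe this per-operation overhead from pure Python is to time the whole-array expression against the same arithmetic on plain floats (an illustrative micro-benchmark, not from the answer; absolute timings vary by machine):

```python
import numpy as np
import timeit

A = np.array([10, 20, 30], dtype=np.uint8)
mu, nu = 0.25, 0.5

# One whole-array expression: several temporary 3-element arrays per evaluation
t_array = timeit.timeit(lambda: A * (1 - mu) * (1 - nu), number=10_000)

# The same arithmetic on plain Python floats: no array allocation at all
a0, a1, a2 = float(A[0]), float(A[1]), float(A[2])
t_scalar = timeit.timeit(
    lambda: (a0 * (1 - mu) * (1 - nu),
             a1 * (1 - mu) * (1 - nu),
             a2 * (1 - mu) * (1 - nu)),
    number=10_000,
)

print(f"array expression : {t_array:.4f}s")
print(f"scalar expression: {t_scalar:.4f}s")
```

On tiny 3-element arrays, most of the array expression's time goes to ufunc dispatch and temporary-array management rather than the multiplications themselves.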

One way to avoid the allocations is to tell Numpy the destination array of each operation, avoiding temporary array-based operations as much as possible. Here is an untested example:

# tmp is a pre-allocated temporary float array and res the resulting array,
# both created once outside the per-pixel loop
np.multiply(A, (1 - mu) * (1 - nu), out=res)
np.multiply(B, mu * (1 - nu), out=tmp)
np.add(tmp, res, out=res)
np.multiply(C, (1 - mu) * nu, out=tmp)
np.add(tmp, res, out=res)
np.multiply(D, mu * nu, out=tmp)
np.add(tmp, res, out=res)

Note that this solution still does not address the other issues (register usage and the per-call Numpy overhead), whereas Case D avoids them.
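Filling in the missing pieces of the untested example above (the `tmp`/`res` buffers and the sample pixel values are placeholders), a self-contained check of this scheme could look like:

```python
import numpy as np

# Placeholder neighbouring pixels and interpolation fractions
A = np.array([10, 20, 30], dtype=np.uint8)
B = np.array([40, 50, 60], dtype=np.uint8)
C = np.array([70, 80, 90], dtype=np.uint8)
D = np.array([5, 15, 25], dtype=np.uint8)
mu, nu = 0.25, 0.5

# Allocate the buffers once, outside any per-pixel loop
res = np.empty(3, dtype=np.float64)
tmp = np.empty(3, dtype=np.float64)

np.multiply(A, (1 - mu) * (1 - nu), out=res)
np.multiply(B, mu * (1 - nu), out=tmp)
np.add(tmp, res, out=res)
np.multiply(C, (1 - mu) * nu, out=tmp)
np.add(tmp, res, out=res)
np.multiply(D, mu * nu, out=tmp)
np.add(tmp, res, out=res)

# Matches the naive expression that allocates temporaries
expected = (A * (1 - mu) * (1 - nu) + B * mu * (1 - nu)
            + C * (1 - mu) * nu + D * mu * nu)
assert np.allclose(res, expected)
```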

Jérôme Richard
  • thanks Richard. I tested your code on my machine; it's faster than `Case C` but slower than `Case D`. It really seems that many temporary arrays are implicitly created and deleted – bin381 May 24 '21 at 14:08

The only thing that typing as np.ndarray achieves in Cython is making the indexing of individual elements quicker. Array slicing, whole-array operations (such as * and +) and other Numpy function calls are not accelerated.

For Case D, A[0], B[0], C[0], A[1], etc. are indexed efficiently and multiplied directly as C floats, so the calculation is very quick. In contrast, Case C performs a series of whole-array multiplications, each of which proceeds as a normal Python function call. Since the arrays are small (3 elements long), the cost of those Python-level calls is significant.

`retimg[i, j] = (r, g, b)` is probably better written as:

retimg[i,j,0] = r
retimg[i,j,1] = g
retimg[i,j,2] = b

to take advantage of the indexing (i.e. what Cython does well). Cython may optimize the tuple assignment towards that form naturally, though probably not quite that far.
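The two forms store the same values either way (the uint8 cast truncates the floats in both); a quick pure-Python sanity check with made-up values (the Cython speed difference comes from typed indexing, which plain Python cannot show):

```python
import numpy as np

r, g, b = 35.625, 45.625, 55.625
i = j = 0

# Tuple assignment: one call, Numpy converts the tuple to an array first
retimg = np.zeros((2, 2, 3), dtype=np.uint8)
retimg[i, j] = (r, g, b)

# Per-element assignment: three scalar stores (what typed Cython indexes quickly)
other = np.zeros((2, 2, 3), dtype=np.uint8)
other[i, j, 0] = r
other[i, j, 1] = g
other[i, j, 2] = b

print(retimg[i, j], other[i, j])  # both truncate the floats to [35 45 55]
assert (retimg[i, j] == other[i, j]).all()
```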


In summary: typing things as np.ndarray is pointless unless you're doing single-element indexing. If you aren't, it actually wastes time on unnecessary type checks.

DavidW