Bad performance of numpy slicing in function

Question

I have example code like

import numpy as np

nbins = 1301
init = np.ones(nbins*2+1)*(-1)
init[0] = 0
init[1::2] = 0
z0 = np.array(init, dtype=np.complex128, ndmin=1)
initn = z0.view(np.float64)
deltasf = np.linspace(-10, 10, nbins)
gsf = np.linspace(1, 2, nbins)

def jacobian_mbes_test(Y, t):
     leny = len(Y)
     J = np.zeros((leny, leny))
     J[0,0] = -10
     J[0,1] = 0
     J[0, 2::4] = gsf
     J[1,0] = 0
     J[1,1] = -10
     J[1,3::4] = gsf
     J[2::4, 0] = gsf*Y[4::4]
     J[3::4, 0] = gsf*Y[5::4]
     J[4::4, 0] = -4*gsf*Y[2::4]
     J[2::4, 1] = -gsf*Y[5::4]
     J[3::4, 1] = gsf*Y[4::4]
     J[4::4, 1] = -4*gsf*Y[3::4]
     J[(range(2, leny, 4), range(2, leny, 4))] = -0.8
     J[(range(3, leny, 4), range(3, leny, 4))] = -0.8
     J[(range(4, leny, 4), range(4, leny, 4))] = -0.001
     J[(range(5, leny, 4), range(5, leny, 4))] = -0.001
     J[(range(2, leny, 4), range(3, leny, 4))] = deltas
     J[(range(3, leny, 4), range(2, leny, 4))] = -deltas
     J[(range(2, leny, 4), range(4, leny, 4))] = gsf*Y[0]
     J[(range(2, leny, 4), range(5, leny, 4))] = -gsf*Y[1]
     J[(range(3, leny, 4), range(4, leny, 4))] = gsf*Y[1]
     J[(range(3, leny, 4), range(5, leny, 4))] = gsf*Y[0]
     J[(range(4, leny, 4), range(2, leny, 4))] = -4*gsf*Y[0]
     J[(range(4, leny, 4), range(3, leny, 4))] = -4*gsf*Y[1]
     return J

which computes a Jacobian for a set of differential equations using scipy.integrate.odeint

Unfortunately the performance of the function jacobian_mbes_test is not very good. %timeit jacobian_mbes_test(initn, 0) gives 12.2ms even though I tried to do everything efficiently with slicing(?).

Changing the function to (since I guess the more complicated index assingment takes most time)

def jacobian_mbes_test2(Y, t):
     leny = len(Y)
     J = np.zeros((leny, leny))
     J[0,0] = -10
     J[0,1] = 0
     J[0, 2::4] = gsf
     J[1,0] = 0
     J[1,1] = -10
     J[1,3::4] = gsf
     J[2::4, 0] = gsf*Y[4::4]
     J[3::4, 0] = gsf*Y[5::4]
     J[4::4, 0] = -4*gsf*Y[2::4]
     J[2::4, 1] = -gsf*Y[5::4]
     J[3::4, 1] = gsf*Y[4::4]
     J[4::4, 1] = -4*gsf*Y[3::4]
     return J

reduces this time to 4.6ms, which is still not great. The funny thing is that if I time these assignments outside of a function like

J = np.zeros((len(initn), len(initn))
%timeit J[4::4, 1] = -4*gsf*initn[3::4]

I get at most 10µs, so I would expect at most (!) sth like 100µs for the second function. Is there some copying around in memory going on in this function I am not aware of?

Is there some more efficient way to assign values to certain indices?

Interpreted languages are usually slow because they "run" every line (read, analyze, and execute) at runtime. Having more lines, extra variables, and function stack adds up to it. — Jishan Shaikh, Jul 09 '21 at 11:53

score 1 · Accepted Answer · answered Jul 09 '21 at 19:56

The problem comes from the matrix size (206 MiB). Indeed, the matrix is pretty big and (virtually) filled with zeros every time the function is called. Assuming the all values would be physically written to memory in 12.2 ms, the throughput would be 5206**2*8/12.2e-3/1024**3 = 16.5 GiB/s which is pretty good for a sequential workload (especially on an average personal computer which often barely reach more than 25 GiB/s in sequential). In practice, the memory is not set to zero and most of the computation time comes from the management of virtual memory.

np.zeros((leny, leny)) internally request your operating systems (OS) to reserve some space and initialize it with zero. Your OS will virtually allocate many pages (chunks of typically 4 KiB) and mark them as to be set to zero later (delayed initialization). As a result, this process is very cheap. However, when J[0,0] = -10 is executed later, it write data in given first-touched page that need to be initialized. This is called a page fault. This process is expensive (especially in sequential) as it stop the execution of your process, find some place in physical memory, fill a whole page, and then resume your process, for each page! In your case, many array cells to be initialized cause pages to be filled which is much more costly than just filling actually the cells.

CPU caches also matter a lot in this code as page filling will be much slower when the computation is done in RAM. This happens when the amount of data is to big to fit in the last level-cache (of typically 1-64 MiB).

The memory access pattern strongly impacts performance on quite big data structures (ie. not fitting in CPU caches). Indeed, a random access pattern or strided access with huge stride are really bad for performance since the processor cannot easily predict the location of the RAM chunks to load ahead of time (not to mention the RAM latency is quite big). Contiguous accesses are much faster.

All of this should be taken into account when you measure the time on simpler cases.

One simple way to see the impact of virtual memory is to use np.full((leny, leny), 0.0) instead of np.zeros(leny, leny). This is 3 times slower on my machine. All the pages needs to be filled in that case. You can also create measure the time to fille multiple time the J matrix. The first time should be significantly slower (2.2 times slower on my machine).

One solution is to allocate and initialize the array once and fill/recycle only some values but this is a bit tricky to do. In the worst case, most of the matrix need to be zeroised which will be too slow in your case since it will be bound by your memory throughput.

One simple solution is to use a space matrix. Space matrices can be created using the module scipy.sparse. They are often a slower to compute, but faster and much more compact if there are a lot of zeros (which is your case). Here is an example:

from scipy.sparse.lil import lil_matrix

def jacobian_mbes_test(Y, t):
     leny = len(Y)
     J = lil_matrix((leny, leny))
     J[0,0] = -10
     J[0,1] = 0
     J[0, 2::4] = gsf
     J[1,0] = 0
     J[1,1] = -10
     J[1,3::4] = gsf
     J[2::4, 0] = gsf*Y[4::4]
     J[3::4, 0] = gsf*Y[5::4]
     J[4::4, 0] = -4*gsf*Y[2::4]
     J[2::4, 1] = -gsf*Y[5::4]
     J[3::4, 1] = gsf*Y[4::4]
     J[4::4, 1] = -4*gsf*Y[3::4]
     J[(range(2, leny, 4), range(2, leny, 4))] = -0.8
     J[(range(3, leny, 4), range(3, leny, 4))] = -0.8
     J[(range(4, leny, 4), range(4, leny, 4))] = -0.001
     J[(range(5, leny, 4), range(5, leny, 4))] = -0.001
     J[(range(2, leny, 4), range(3, leny, 4))] = deltas
     J[(range(3, leny, 4), range(2, leny, 4))] = -deltas
     J[(range(2, leny, 4), range(4, leny, 4))] = gsf*Y[0]
     J[(range(2, leny, 4), range(5, leny, 4))] = -gsf*Y[1]
     J[(range(3, leny, 4), range(4, leny, 4))] = gsf*Y[1]
     J[(range(3, leny, 4), range(5, leny, 4))] = gsf*Y[0]
     J[(range(4, leny, 4), range(2, leny, 4))] = -4*gsf*Y[0]
     J[(range(4, leny, 4), range(3, leny, 4))] = -4*gsf*Y[1]
     return J

This is 2 times faster on my machine. There are plenty of different types of sparse matrices, each adapted to some specific use-cases. There is probably a better type of sparse matrix than LIL ones.

LIL spaces matrices are quite good for space matrices with a dynamic structure. CSR are quite good for static ones. Thus, you can for example build a CSR matrix ahead of time filled with NaN values at the location where you plan to write some data and then copy it and fill the new space matrix with the new values when needed. This is 10-15% faster on my machine. It is probably possible to reduce the time even more using a better access pattern.

Bad performance of numpy slicing in function

1 Answers1