
I am writing an application that requires very low latency. It will run on an Intel Xeon processor with MKL-DNN and the AVX instruction set available. The following code takes about 22 milliseconds when executed on an Intel 9750H processor.

def func(A, B):
    result = 0
    for ind in range(len(B)):
        # Rows of A that are componentwise <= the current B row in the first three columns
        index = (A[:, 0] <= B[ind, 0]) & (A[:, 1] <= B[ind, 1]) & (A[:, 2] <= B[ind, 2])
        # Weight the matched rows' fourth column by B's fourth column
        result += A[index, 3].sum() * B[ind, 3]
        # Drop the matched rows so they are not counted again
        A = A[~index]
    return result
%timeit func(A,B)
21.5 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Is there a way to improve the code so that the execution time decreases? Anything less than 5 milliseconds would be great. For reference, matrix A has shape (80000, 4) and matrix B has shape (32, 4), and both are sorted on their first three columns. Can any component be parallelized? The application can use 16 cores.
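For anyone who wants to reproduce the timing, a minimal setup with random data of the stated shapes could look like this (the seed and value ranges are assumptions; the real data will differ):

import numpy as np

rng = np.random.default_rng(0)   # arbitrary fixed seed
A = rng.random((80000, 4))       # shapes as stated in the question
B = rng.random((32, 4))
# Sort both arrays lexicographically on their first three columns, as described
A = A[np.lexsort((A[:, 2], A[:, 1], A[:, 0]))]
B = B[np.lexsort((B[:, 2], B[:, 1], B[:, 0]))]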


1 Answer


Instead of your function, use:

import numpy as np

def func2(A, B):
    # x[i] holds the fourth-column value of the first B row that "claims" A row i
    x = np.zeros(A.shape[0], dtype=B.dtype)
    for bInd in range(len(B)):
        # Mark only rows that are still unclaimed (x == 0) and satisfy the condition
        x[np.where(x, False, np.all(A[:, 0:3] <= B[bInd, 0:3], axis=1))] = B[bInd, 3]
    return (A[:, 3] * x).sum()
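As a sanity check, the rewrite should return the same value as the original on the same inputs (assuming A and B are already defined; func only rebinds a local name, so it does not modify the caller's A):

assert np.isclose(func(A, B), func2(A, B))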

The speed gain is smaller than you might expect. Using A of shape (10, 4) and B of shape (4, 4), I got an execution time about 15% shorter than for your function.

But maybe on bigger source arrays the speed gain will be more apparent. Try it on your own data.
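If you want to time it on the full 80000 x 4 / 32 x 4 arrays outside of IPython, a minimal sketch using the standard timeit module (the number of repeats is chosen arbitrarily):

import timeit

runs = 10
per_call_ms = timeit.timeit(lambda: func2(A, B), number=runs) / runs * 1000
print(f"func2: {per_call_ms:.2f} ms per call")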
