
I have been testing the following block to measure the speed-up from Numba:

import numpy as np
import timeit
from numba import njit
import numba

@numba.guvectorize(["void(float64[:], float64[:], float64[:], float64, float64, float64[:])"],
                   "(m),(m),(m),(),()->(m)", nopython=True, target="parallel")
def func_diff_calc_numba_v2(X, refY, Y, lower, upper, arr):
    fac = 1000
    for i in range(len(X)):
        if X[i] >= lower and X[i] < upper:
            diff = Y[i] - refY[i]
            arr[i] = diff**2 * fac
        else:
            arr[i] = 0

@numba.vectorize(["float64(float64, float64, float64, float64, float64)"],
                 nopython=True, target="parallel")
def func_diff_calc_numba_v3(X, refY, Y, lower, upper):
    fac = 1000
    if X >= lower and X < upper:
        return (Y - refY)**2 * fac
    else:
        return 0.0

@njit
def func_diff_calc_numba(X, refY, Y, lower, upper):
    fac = 1000
    arr = np.zeros(len(X))
    for i in range(len(X)):
        if X[i] >= lower and X[i] < upper:
            arr[i] = (Y[i] - refY[i])**2 * fac
        else:
            arr[i] = 0
    return arr

np.random.seed(69)
X=np.arange(10000)
refY = np.random.rand(10000)
Y = np.random.rand(10000)

lower=1
upper=10000

print("func_diff_calc_numba: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v2: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v2(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v3: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v3(X,refY,Y,lower,upper)", number=10000, globals=globals())))

The timings for v2 and v3 are significantly different:

func_diff_calc_numba: 0.58257
func_diff_calc_numba_v2: 0.49573
func_diff_calc_numba_v3: 1.07519

and if I change the number of iterations from 10,000 to 100,000 then:

func_diff_calc_numba: 1.67251
func_diff_calc_numba_v2: 4.85828
func_diff_calc_numba_v3: 11.63361

I was expecting vectorize and guvectorize to give almost the same speed-up, but while njit and guvectorize take roughly the same time, vectorize is ~2x slower than guvectorize and ~10x slower than njit. Is there something wrong in my implementation, or is something else going on?

mykd

1 Answer

The task (function + inputs) is probably too small/simple to be parallelized effectively, so the overhead of parallelization increases the total runtime. If you compile both for the default "cpu" target, the difference should disappear.

Because your input is 1D, with the given ufunc signature the guvectorize version doesn't parallelize anything: there is only a single task to distribute.

A like-for-like parallel comparison can be made by setting the signature to "(),(),(),(),()->()", which tells guvectorize to apply the function element-wise, just like vectorize does. Those results should be very close again, but then you'll see that the overhead of parallelization makes both slower in this case.

For me timings are:

  1. Using target="parallel" for both, and "(m),(m),(m),(),()->(m)":

    numba_guvec    : 0.26364
    numba_vec      : 3.26960
    
  2. Using target="cpu" for both, and "(m),(m),(m),(),()->(m)":

    numba_guvec    : 0.21886
    numba_vec      : 0.26198
    
  3. Using target="parallel" for both, and "(),(),(),(),()->()":

    numba_guvec    : 3.05748
    numba_vec      : 3.15587
    

You'll probably find similar behavior if you also compare @njit(parallel=True) together with numba.prange.

In the end, parallelizing something simply involves extra work, and that's only worth it for a sufficiently large (slow) task.

Rutger Kassies