I'm using Numba (version 0.37.0) to optimize code for the GPU. I would like to combine vectorized functions (using Numba's @vectorize decorator).

Imports & Data:

import numpy as np
from math import sqrt
from numba import vectorize, guvectorize

angles = np.random.uniform(-np.pi, np.pi, 10)
coords = np.stack([np.cos(angles), np.sin(angles)], axis=1)

This works as expected:

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm(vec, out):
    acc = 0.0
    for value in vec:
        acc += value**2
    out[0] = sqrt(acc)

l2_norm(coords)

Output:

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)

But I'd like to avoid the `for` loop inside `l2_norm` by calling another vectorized function.

I've tried this:

@vectorize(["float32(float32)"], target="cuda")
def power(value):
    return value**2

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm_power(vec, out):
    acc = 0.0
    acc = power(vec)
    acc = acc.sum()
    out[0] = sqrt(acc)

l2_norm_power(coords)

But it raises a TypingError:

TypingError: Failed at nopython (nopython frontend)
Untyped global name 'power': cannot determine Numba type of
<class 'numba.cuda.dispatcher.CUDAUFuncDispatcher'>

Any idea how to perform this combination?

Any suggestions for other ways to optimize l2_norm with Numba?

jetxeberria

1 Answer

I think you can only call device=True functions from other CUDA functions:

3.13.2. Example: Calling Device Functions

All CUDA ufunc kernels have the ability to call other CUDA device functions:

 from numba import vectorize, cuda
 # define a device function
 @cuda.jit('float32(float32, float32, float32)', device=True, inline=True)
 def cu_device_fn(x, y, z):
     return x ** y / z
 # define a ufunc that calls our device function
 @vectorize(['float32(float32, float32, float32)'], target='cuda')
 def cu_ufunc(x, y, z):
     return cu_device_fn(x, y, z)

Note that you can call cuda.jit functions declared with device=True:

from numba import cuda

@cuda.jit(device=True)
def sum_of_squares(arr):
    acc = 0
    for item in arr:
        acc += item ** 2
    return acc

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm_power(vec, out):
    acc = sum_of_squares(vec)
    out[0] = sqrt(acc)

l2_norm_power(coords)

But that probably defeats the purpose of splitting it.

Since numba.vectorize doesn't support device=True, that's not possible for these functions. But that's actually a good thing: vectorize allocates an array to put the values in, which would be an unnecessary intermediate array, and allocating arrays on the GPU is very inefficient (and forbidden in numba):

3.5.5. Numpy support

Due to the CUDA programming model, dynamic memory allocation inside a kernel is inefficient and is often not needed. Numba disallows any memory allocating features. This disables a large number of NumPy APIs. For best performance, users should write code such that each thread is dealing with a single element at a time.

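If you do want to factor out the per-element work, you can instead make the device function scalar, so no intermediate array is ever needed. A minimal sketch (square and l2_norm_device are illustrative names; it assumes the imports from the question plus from numba import cuda):

@cuda.jit('float32(float32)', device=True, inline=True)
def square(value):
    return value ** 2

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm_device(vec, out):
    # Each thread still reduces its row element by element,
    # so no temporary array is allocated on the device.
    acc = 0.0
    for value in vec:
        acc += square(value)
    out[0] = sqrt(acc)

l2_norm_device(coords)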

Given all that, I would simply use your original approach:

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm(vec, out):
    acc = 0.0
    for value in vec:
        acc += value**2
    out[0] = sqrt(acc)

l2_norm(coords)
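
As for other ways to optimize it: for an input this small, a hypothetical CPU variant with target='parallel' may well outperform the CUDA target, because it avoids the host-to-device transfer (a sketch under that assumption; l2_norm_cpu is an illustrative name):

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='parallel')
def l2_norm_cpu(vec, out):
    # Same kernel, compiled for multi-threaded CPU execution instead.
    acc = 0.0
    for value in vec:
        acc += value**2
    out[0] = sqrt(acc)

l2_norm_cpu(coords)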
MSeifert
  • Let's say you wanted to perform quicksort. Would you then just hard-code and copy-paste the sorting code as many times as it's needed in the top-level function? – Sterling Aug 19 '21 at 02:02
  • Or is the example you gave with `cuda.jit(device=True)` and `@guvectorize` OK in terms of efficiency? – Sterling Aug 19 '21 at 02:40
  • @Sterling I cannot really say. I haven't worked with numba in a while and am not exactly sure what you're asking. Maybe better to ask a new question with your explicit details? :) – MSeifert Aug 19 '21 at 07:20