
I'm trying to complete Nvidia's "Fundamentals of Accelerated Computing with CUDA Python" course and have been given a task to refactor a simple version of some code that performs the work needed to create a hidden layer in a neural network:

import numpy as np
from numba import cuda, vectorize

n = 1000000

greyscales = np.floor(np.random.uniform(0, 255, n).astype(np.float32))
weights = np.random.normal(.5, .1, n).astype(np.float32)

from numpy import exp

def normalize(grayscales):
    return grayscales / 255

def weigh(values, weights):
    return values * weights
    
def activate(values):
    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated

arguments = {"n":n,
            "greyscales": greyscales,
            "weights": weights,
            "exp": exp,
            "normalize": normalize,
            "weigh": weigh,
            "activate": activate}

a = create_hidden_layer(**arguments)
print(a)

I have transformed the code a little bit, and after the modifications it looks like this:

from math import exp

@vectorize(['float32(float32)'],target='cuda')
def normalize(grayscales):
    return grayscales / 255

@vectorize(['float32(float32,float32)'],target='cuda')
def weigh(values, weights):
    return values * weights

@vectorize(['float32(float32)'],target='cuda')
def activate(values):
    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated

greyscales = cuda.to_device(greyscales)
weights = cuda.to_device(weights)

normalized = cuda.device_array(shape=(n,), dtype=np.float32)
weighted = cuda.device_array(shape=(n,), dtype=np.float32)
activated = cuda.device_array(shape=(n,), dtype=np.float32)

activated = activated.copy_to_host()

arguments = {"n":n,
            "greyscales": greyscales,
            "weights": weights,
            "exp": exp,
            "normalize": normalize,
            "weigh": weigh,
            "activate": activate}

a = create_hidden_layer(**arguments)
print(a)

The code seems to work fine after all the transformations, but there is one problem: it's not fast enough. The task states that the code should run in less than 1 s, while my code runs in 1.23 s...

Does anyone know how I could refactor my code further, or notice any silly mistakes I have made? I would be very grateful for any help!

kndrtt
  • I can't compare your `cuda` implementation using `float32` because of this [`numba 0.56` bug](https://github.com/numba/numba/issues/8400). In my benchmarks with `float64`, `@cuda.jit` kernels are ~20x faster than `@vectorize` (722 **µs** vs 15.8 **ms**, `numpy` 82.1 **ms**, excluding `copy_to_host`). I don't know how you got *1.23 s* runtime for the provided examples. `float32` is 5x faster with `cuda.jit`; the results might also vary for `@vectorize`. – Michael Szczesny Sep 19 '22 at 22:07
  • How did you measure the results? Imports take some time, as does the CUDA runtime initialization performed during the first CUDA call, not to mention the compilation time of the first call. Besides, I get ~13 ms for the first code. It can easily be optimized to reach 9 ms, and 5 ms with Numba in parallel on my CPU. The GPU code should be bound by overheads (and otherwise certainly by the GPU memory), but it is faster. – Jérôme Richard Sep 19 '22 at 22:56
  • You might want to cache `exp(values)` and `exp(-values)`. – Guimoute Sep 19 '22 at 23:32
  • @MichaelSzczesny I have tried to use @cuda.jit, @jit or @cuda.jit(device=True) instead of @vectorize, but I kept getting errors that I didn't know how to eliminate (I'm completely new to this). And that 1.23 s result I got after running these lines: `from assessment import assess`, `assess(create_hidden_layer, arguments)` (while `%%timeit` showed completely different results). – kndrtt Sep 20 '22 at 17:59
  • @JérômeRichard While writing the code I was using `%%timeit` to check the performance. But to complete the whole assessment (which is mandatory to complete the course) I needed to run `from assessment import assess`, `assess(create_hidden_layer, arguments)`, which gave completely different results compared to what `%%timeit` showed... – kndrtt Sep 20 '22 at 18:06
  • @Guimoute how could I do that? – kndrtt Sep 20 '22 at 18:08
  • This is a [colab notebook](https://colab.research.google.com/drive/1NKF4uew_nLv2qI8_TYUA9cVEMARYUcoh?usp=sharing) with the cuda kernels I used for the benchmark. Please choose the GPU runtime to run the cells. – Michael Szczesny Sep 20 '22 at 18:10
  • @MichaelSzczesny Thank you for such effort and patience in helping me! I have run the code in the colab notebook you sent and tried to do the same in the Jupyter notebook (where the assignment is held). Firstly, I noticed that the runtime of the same code differs (136 µs vs. 306 µs; in Jupyter it's roughly 2x slower). And secondly, even though the code you sent works fine, I start getting errors when trying to assess it the way it's required... – kndrtt Sep 20 '22 at 19:30
  • @MichaelSzczesny At the moment the biggest headache is that I don't know how to include or reflect these changes in the following lines of code: `arguments = {"n":n, "greyscales": greyscales, "weights": weights, "exp": exp, "normalize": normalize, "weigh": weigh, "activate": activate}`, `a = create_hidden_layer(**arguments)`, `from assessment import assess`, `assess(create_hidden_layer, arguments)` – kndrtt Sep 20 '22 at 19:39
  • @kndrtt `a = exp(values); b= exp(-values); return (a - b)/(a + b)` in the function `activate`. – Guimoute Sep 22 '22 at 10:03
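
As a concrete illustration of Guimoute's suggestion, the cached version of `activate` would look roughly like the sketch below, assuming the same `@vectorize` decorator and `exp` import as in the question:

@vectorize(['float32(float32)'], target='cuda')
def activate(values):
    # compute each exponential once and reuse it
    a = exp(values)
    b = exp(-values)
    return (a - b) / (a + b)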

2 Answers

greyscales = cuda.to_device(greyscales)
weights = cuda.to_device(weights)

normalized = cuda.device_array(shape=(n,), dtype=np.float32)
weighted = cuda.device_array(shape=(n,), dtype=np.float32)
activated = cuda.device_array(shape=(n,), dtype=np.float32)

activated = activated.copy_to_host()

Move this section inside the `create_hidden_layer` function. I did that and it ran in ~0.5 s.
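
For illustration only, the refactor might look roughly like the sketch below. It assumes `greyscales` and `weights` are passed in as the original host NumPy arrays, and that the `@vectorize` ufuncs are called with their `out=` argument so they write into the preallocated device arrays:

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    # transfers and allocations now happen inside the timed function,
    # and only the final result is copied back to the host
    d_greyscales = cuda.to_device(greyscales)
    d_weights = cuda.to_device(weights)

    normalized = cuda.device_array(shape=(n,), dtype=np.float32)
    weighted = cuda.device_array(shape=(n,), dtype=np.float32)
    activated = cuda.device_array(shape=(n,), dtype=np.float32)

    normalize(d_greyscales, out=normalized)
    weigh(normalized, d_weights, out=weighted)
    activate(weighted, out=activated)

    return activated.copy_to_host()

This keeps all intermediate results on the GPU and copies only the final activations back to the host.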

Ani

Here are some things that you could try to speed up your code:

  1. Use @cuda.jit to compile your kernel (a minimal sketch is shown after this list).
  2. In your kernel, use cuda.grid(2) to get the 2D thread index and use cuda.blockDim.x to get the number of threads in a block. Use those to calculate the 1D index of your array and store it in a shared memory array.
  3. In your kernel, once all threads have reached the shared memory array, use cuda.synchronize() to wait for all threads to reach that point in the kernel. Then, use the shared memory array to access the data from global memory.
  4. Use cuda.shared.array() and cuda.shared.to_device() to create and copy the shared memory array to the GPU.
  5. Once your kernel is done, use cuda.synchronize() to wait for all threads to reach the end of the kernel. Then, use cuda.from_device() to copy the data back to the CPU.
  6. You can also use cuda.to_device() and cuda.from_device() to copy data between the CPU and GPU, if you want to.
  7. It is also possible to use cuda.device_array_like() to create an array on the GPU that is similar to an array on the CPU.
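
For point 1, a single fused `@cuda.jit` kernel might look roughly like the sketch below (the kernel name and launch configuration are only examples, and it does not use the shared-memory staging from the later points):

import math
import numpy as np
from numba import cuda

@cuda.jit
def hidden_layer_kernel(greyscales, weights, out):
    # one thread per element: normalize, weigh, then apply the tanh activation
    i = cuda.grid(1)
    if i < out.shape[0]:
        v = (greyscales[i] / 255.0) * weights[i]
        out[i] = (math.exp(v) - math.exp(-v)) / (math.exp(v) + math.exp(-v))

threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block

d_greyscales = cuda.to_device(greyscales)
d_weights = cuda.to_device(weights)
d_out = cuda.device_array(n, dtype=np.float32)

hidden_layer_kernel[blocks, threads_per_block](d_greyscales, d_weights, d_out)
result = d_out.copy_to_host()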
James