
I am trying to use CuPy to accelerate Python functions that currently mostly use NumPy. I have installed CuPy on a Jetson AGX Xavier with CUDA 10.0.

The CuPy functions seem to be working fine; however, they are a lot slower than their NumPy counterparts. For example, I ran the first example from here with devastating results:

import numpy as np
import cupy as cp
import time

### Numpy and CPU
s = time.time()
x_cpu = np.ones((1000,1000,1000))
e = time.time()
print(e - s) # output: 0.9008722305297852

### CuPy and GPU
s = time.time()
x_gpu = cp.ones((1000,1000,1000))
cp.cuda.Stream.null.synchronize()
e = time.time()
print(e - s) # output: 4.973184823989868

I also ran other functions (e.g. np./cp.nonzero), but they gave similar or worse results. How is this possible?

I want to do image processing (ca. 2500x2000 greyscale/mono images) for a lane detection algorithm and cannot really use the CUDA functions from OpenCV for this, since the only part of my code that is implemented in their library is cv2.cuda.warpPerspective(), and it would likely not make much sense to upload/download the image to/from the GPU just for that one call (see the sketch below). Where do I go from here? Use Numba? (Probably not a good fit, since the compute-intensive parts of my algorithm mostly consist of NumPy function calls.) Implement the whole thing in C++? (I doubt my C++ code would be faster than the optimized NumPy functions.)
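For reference, that round trip would look roughly like this (a sketch, assuming an OpenCV build with the CUDA modules enabled; img and M are placeholder inputs):

import cv2
import numpy as np

img = np.zeros((2000, 2500), dtype=np.uint8)  # placeholder mono image
M = np.eye(3, dtype=np.float64)               # placeholder perspective matrix

gpu_src = cv2.cuda_GpuMat()
gpu_src.upload(img)                                           # host -> device
gpu_dst = cv2.cuda.warpPerspective(gpu_src, M, (2500, 2000))
warped = gpu_dst.download()                                   # device -> host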

Sidenote: CuPy was installed using pip3 install cupy because the recommended pip3 install cupy-cuda100 failed with the output:

ERROR: Could not find a version that satisfies the requirement cupy-cuda100
ERROR: No matching distribution found for cupy-cuda100

1 Answer


First: no official CuPy for ARM

Your error comes from the fact that there is no binary distribution of CuPy for ARM64 (aarch64) in the official pip repository. As the CuPy install docs state:

Wheels (precompiled binary packages) are available for Linux (x86_64) and Windows (amd64).
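A quick way to see why pip cannot resolve the wheel (nothing board-specific assumed here):

import platform

# Prints 'aarch64' on Jetson boards; the cupy-cuda100 wheel only exists for x86_64.
print(platform.machine())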

For NVIDIA L4T / JetPack, you can find an official NVIDIA Docker image including CuPy that runs on the Xavier here: https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml. That works for me and did improve my performance.

If you have a solution for running CuPy on the Xavier with CUDA actually being used, without running that Docker image, I'm interested. I haven't tried it, but since they managed to compile CuPy inside the Docker image, it should also be possible on a fresh native OS. Here is a report of success on the Nano: https://forums.developer.nvidia.com/t/cupy-installation-on-the-nano/189099

Try an install from source: https://docs.cupy.dev/en/stable/install.html

Second: no real computation in your test

Is your test really relevant for measuring acceleration? The first call to CUDA functions is expected to be slow because of context initialization and just-in-time kernel compilation, and your test only measures allocation plus a trivially parallel fill rather than real computation.
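For a fairer comparison, time the steady state after a warm-up call, something like this (a sketch; the array shape and the clip/multiply operation are arbitrary stand-ins):

import time
import cupy as cp

x_gpu = cp.random.random((2000, 2500), dtype=cp.float32)

# Warm-up: pays the one-time context creation / kernel compilation cost.
y = cp.clip(x_gpu * 2.0 - 0.5, 0.0, 1.0)
cp.cuda.Stream.null.synchronize()

s = time.time()
for _ in range(100):
    y = cp.clip(x_gpu * 2.0 - 0.5, 0.0, 1.0)
cp.cuda.Stream.null.synchronize()  # wait for the GPU before stopping the clock
print((time.time() - s) / 100)     # steady-state time per call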

Update 10/03/21

Here are the instructions they run in Docker to build the image:

#
# CuPy
#
ARG CUPY_VERSION=v9.2.0
ARG CUPY_NVCC_GENERATE_CODE="arch=compute_53,code=sm_53;arch=compute_62,code=sm_62;arch=compute_72,code=sm_72"

RUN git clone -b ${CUPY_VERSION} --recursive https://github.com/cupy/cupy cupy && \
    cd cupy && \
    pip3 install --no-cache-dir fastrlock && \
    python3 setup.py install --verbose && \
    cd ../ && \
    rm -rf cupy

You can easily execute the same steps in a native environment. You may want to adapt the CUPY_NVCC_GENERATE_CODE variable to your board for better performance (the AGX Xavier is compute capability 7.2, i.e. sm_72, which that list already covers); I didn't set anything specific when I tried.

I just tried a from-source install on my Nano board (git clone cupy, then pip3 install --no-cache-dir -vvvv . without sudo, then reboot the board). I am processing VGA images with several sum, subtract, divide, multiply, clip, ... operations. Don't forget to use the sudo jetson_clocks command to pin your board to fixed high clock frequencies. NumPy CPU time = 125 ms/img vs. CuPy time = 13 ms/img, after some rework of the code using the NVIDIA profiler.
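The main point of the rework was keeping intermediates on the GPU and only transferring at the edges, roughly like this (a sketch with made-up operations, not my actual processing code):

import numpy as np
import cupy as cp

host_img = np.random.rand(480, 640).astype(np.float32)  # stand-in for a VGA frame

img_gpu = cp.asarray(host_img)              # upload once
x = (img_gpu - img_gpu.mean()) / 255.0      # elementwise ops stay on the GPU
x = cp.clip(x * 1.5 + 0.1, 0.0, 1.0)
result = cp.asnumpy(x)                      # download once, at the end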

Use nvprof -o file.out python3 mycupyscript.py together with a with cp.cuda.profile(): block in your script to better understand the bottlenecks. Then load file.out in nvvp to explore the performance graphically. That will let you adapt your computation approach to fit the GPU well.
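On the script side that looks roughly like this (a sketch; the work inside the with block is what gets marked for the profiler):

import cupy as cp

x = cp.random.random((2000, 2500), dtype=cp.float32)

with cp.cuda.profile():
    y = cp.clip(x * 2.0 - 0.5, 0.0, 1.0)
    cp.cuda.Stream.null.synchronize()  # finish the GPU work inside the profiled region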
