
I am trying to use CuPy to perform a task on the GPU - here is the code:

import numpy as np
import cupy as cp

# on CPU
x_cpu = np.array([1, 2, 3])
%timeit l2_cpu = np.linalg.norm(x_cpu)

# on GPU
x_gpu = cp.array([1, 2, 3])
%timeit l2_gpu = cp.linalg.norm(x_gpu)

here is the output:

4 µs ± 18 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
48.7 µs ± 86.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Question:

My question is: why does CuPy work more slowly than NumPy in my case? I expected CuPy to be quicker than NumPy. What did I do wrong, and can somebody advise me how to fix it?

Environment:

  • OS: Ubuntu 20.04
  • Video:
 > nvidia-smi
 Wed Sep 15 22:11:36 2021       
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
 | 41%   33C    P8     1W / 260W |    184MiB / 11019MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
                                                                              
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0   N/A  N/A    627367      C   ...conda3/envs/t1/bin/python      181MiB |
 +-----------------------------------------------------------------------------+

Also, I use Python 3.8 and I have installed:

  • cupy 8.3.0
  • cupy-cuda114 9.4.0
  • cudatoolkit 10.1.243 h6bb024c_0
    and so on.

UPDATED

I also used an array with 1023272 items - here is the result:

  • 175 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • 579 µs ± 97.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Also, I checked GPU utilization using nvidia-smi and I can confirm the GPU was involved in the calculation.

DL-Newbie
    You are giving it a *tiny* task, and there is setup overhead involved in communicating with the GPU. – Karl Knechtel Sep 16 '21 at 05:48
  • that was my first thought - therefore on a next step i used array with 1023272 items - here is a result: - 175 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) - 579 µs ± 97.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) – DL-Newbie Sep 16 '21 at 06:44
  • Normalization is low on compute to data ratio right? You may be measuring data copying efficiency instead of copying where numpy already has data in ram but cupy needs to do extra copy to gpu. – huseyin tugrul buyukisik Sep 16 '21 at 20:43

1 Answer


You are measuring 4-8 GB/s data-copying performance in the CuPy test. The L2 norm is a reduction (a sum of squares followed by a square root), so it does very little computation per byte transferred over PCIe.

If you really want to compare compute performance, do something like matrix-matrix multiplication with big enough matrix sizes, e.g. 2048x2048 times 2048x2048.
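A compute-bound comparison could be sketched like this (it falls back to CPU-only when CuPy is not installed, and calls `cp.cuda.Device().synchronize()` so the GPU timing includes the actual kernel work, not just the launch):

```python
import time

import numpy as np

try:
    import cupy as cp  # optional; requires a CUDA-capable GPU
except ImportError:
    cp = None

n = 2048
a_cpu = np.random.rand(n, n).astype(np.float32)
b_cpu = np.random.rand(n, n).astype(np.float32)

# CPU matmul timing
t0 = time.perf_counter()
c_cpu = a_cpu @ b_cpu
cpu_time = time.perf_counter() - t0
print(f"NumPy matmul: {cpu_time * 1e3:.1f} ms")

if cp is not None:
    a_gpu = cp.asarray(a_cpu)  # copy once, outside the timed region
    b_gpu = cp.asarray(b_cpu)
    cp.cuda.Device().synchronize()  # make sure the copies have finished
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    cp.cuda.Device().synchronize()  # wait for the kernel to finish
    gpu_time = time.perf_counter() - t0
    print(f"CuPy matmul:  {gpu_time * 1e3:.1f} ms")
```

Note that the host-to-device copies are kept outside the timed region, so only the compute is compared.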

In your current test, NumPy is bottlenecked by RAM bandwidth while CuPy is bottlenecked by PCIe bandwidth. You could open a CUDA profiler and look at the exact compute time of the kernel and use that (it should already be faster than the CPU).
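Instead of a full profiler, CuPy ships a benchmarking helper, `cupyx.profiler.benchmark` (available in recent CuPy versions), which reports CPU (launch) time and GPU (kernel) time separately; a sketch, guarded so it only runs when CuPy is present:

```python
try:
    import cupy as cp
    from cupyx.profiler import benchmark  # available in recent CuPy versions
except ImportError:
    cp = None

if cp is not None:
    # Data is created directly on the GPU, so no PCIe copy is timed.
    x_gpu = cp.random.rand(1_000_000, dtype=cp.float32)

    # benchmark() synchronizes around each run and reports CPU and GPU
    # times separately, so async launches can't hide the kernel cost.
    print(benchmark(cp.linalg.norm, (x_gpu,), n_repeat=100))
else:
    print("CuPy not available; run this on a machine with a CUDA GPU")
```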

In the first test there are only 3 elements, so the NumPy result is probably just the function-call overhead of Python. The CuPy result, then, must be the kernel-launch overhead plus the Python function overhead plus the extra overhead of calling the data-copying function. Even launching an empty kernel in CUDA costs 10-20 microseconds.
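The launch-overhead point can be checked directly: CuPy calls return asynchronously, so timing one without an explicit synchronize measures only the launch, not the kernel. A sketch (again guarded so it only runs when CuPy is present):

```python
import time

try:
    import cupy as cp
except ImportError:
    cp = None

if cp is not None:
    x = cp.random.rand(1_000_000, dtype=cp.float32)
    cp.linalg.norm(x)  # warm-up: compile/cache the kernel first

    t0 = time.perf_counter()
    cp.linalg.norm(x)                 # async: returns right after the launch
    launch = time.perf_counter() - t0

    t0 = time.perf_counter()
    cp.linalg.norm(x)
    cp.cuda.Device().synchronize()    # wait for the kernel to finish too
    full = time.perf_counter() - t0

    print(f"launch only: {launch * 1e6:.0f} µs, "
          f"launch + kernel: {full * 1e6:.0f} µs")
```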

huseyin tugrul buyukisik