
I am experimenting with how to use CUDA from Numba, but I have encountered something different from what I expected. Here is my code:

import numba
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    d = cuda.local.array((3, 3), dtype=numba.float64)  # scratch space (unused here)
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

This is the matrix-multiplication kernel I defined myself for testing numba.cuda. Before running the tests, I also loaded the arrays onto the device with the following code:

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = np.empty((2000, 2000))

a1 = cuda.to_device(a)
b1 = cuda.to_device(b)
c1 = cuda.to_device(c)
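
For reference, the result can later be read back from the device; a minimal sketch using Numba's device-array API:

# Copy the device result back into a host ndarray.
# This synchronous copy blocks until the GPU work writing c1 has finished.
c = c1.copy_to_host()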

Then I used the following code for the experiment:

from time import time

count = 0
start = time()
for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
    count += 1
    print(count)

The for loop ran very fast for the first 1028 iterations, but became very slow after the 1028th. What exactly caused this, and how do I fix it? I am running on Windows 10, by the way.

Here is my CUDA device information, queried through numba.cuda:

from numba import cuda
gpu = cuda.get_current_device()
print("name = %s" % gpu.name)
print("maxThreadsPerBlock = %s" % str(gpu.MAX_THREADS_PER_BLOCK))
print("maxBlockDimX = %s" % str(gpu.MAX_BLOCK_DIM_X))
print("maxBlockDimY = %s" % str(gpu.MAX_BLOCK_DIM_Y))
print("maxBlockDimZ = %s" % str(gpu.MAX_BLOCK_DIM_Z))
print("maxGridDimX = %s" % str(gpu.MAX_GRID_DIM_X))
print("maxGridDimY = %s" % str(gpu.MAX_GRID_DIM_Y))
print("maxGridDimZ = %s" % str(gpu.MAX_GRID_DIM_Z))
print("maxSharedMemoryPerBlock = %s" % 
str(gpu.MAX_SHARED_MEMORY_PER_BLOCK))
print("asyncEngineCount = %s" % str(gpu.ASYNC_ENGINE_COUNT))
print("canMapHostMemory = %s" % str(gpu.CAN_MAP_HOST_MEMORY))
print("multiProcessorCount = %s" % str(gpu.MULTIPROCESSOR_COUNT))
print("warpSize = %s" % str(gpu.WARP_SIZE))
print("unifiedAddressing = %s" % str(gpu.UNIFIED_ADDRESSING))
print("pciBusID = %s" % str(gpu.PCI_BUS_ID))
print("pciDeviceID = %s" % str(gpu.PCI_DEVICE_ID))

and the output is:

name = b'GeForce GTX 1050 Ti'
maxThreadsPerBlock = 1024
maxBlockDimX = 1024
maxBlockDimY = 1024
maxBlockDimZ = 64
maxGridDimX = 2147483647
maxGridDimY = 65535
maxGridDimZ = 65535
maxSharedMemoryPerBlock = 49152
asyncEngineCount = 2
canMapHostMemory = 1
multiProcessorCount = 6
warpSize = 32
unifiedAddressing = 1
pciBusID = 3
pciDeviceID = 0

Peter Deng

1 Answer


This is caused by the asynchronous launch queue associated with GPU kernel launches.

When you tell numba to submit a GPU kernel:

matmul[(256, 256), (16, 16)](a1, b1, c1)

This request goes into a queue, and the CPU thread (i.e. python) that issued that kernel call can continue, even though the GPU kernel has not completed or even started yet.

The CUDA runtime queues up these requests and issues them as the GPU is ready for more work.

What you are witnessing initially during the very fast incrementing of your for-loop is the queue filling up with work requests. That is not representative of the actual time that the GPU requires to perform the work.

Eventually the queue fills, and the CUDA runtime halts the CPU thread (i.e. python) at the point of kernel launch, until a queue slot opens up. At that point, the for-loop is allowed to proceed for one more iteration. It is at this point (perhaps around 1028 iterations) that you start to see a "slow down". Thereafter, the for-loop proceeds at approximately the rate at which GPU kernels are executed and removed from the processing queue.
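
One way to observe this directly is to timestamp every iteration: the gaps between consecutive timestamps are tiny while the queue is filling, then jump to roughly one kernel's execution time once it is full. A minimal sketch, reusing matmul, a1, b1, and c1 from the question:

from time import time

timestamps = []
for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
    timestamps.append(time())

# Gaps stay tiny while the launch queue is filling, then jump to
# roughly the per-kernel execution time once the queue is full.
gaps = [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]
print("first 5 gaps:", gaps[:5])
print("last 5 gaps:", gaps[-5:])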

There is nothing to fix here; this is expected behavior.

If you want the for-loop to proceed only at the rate at which GPU kernels are actually executed, then you should insert a synchronizing function in your for-loop.

For example, numba provides numba.cuda.synchronize(). So if you modify your for-loop as follows:

for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
    cuda.synchronize()
    count += 1
    print(count)

You will see the for-loop proceed at the actual rate of GPU work completion, instead of the "queue-filling" rate.
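
(As an aside: if the goal is to time the kernel itself rather than pace the loop, CUDA events are an alternative to host-side timing. A minimal sketch using Numba's event API, again reusing the kernel and arrays from the question:)

start_evt = cuda.event(timing=True)
end_evt = cuda.event(timing=True)

start_evt.record()
matmul[(256, 256), (16, 16)](a1, b1, c1)
end_evt.record()
end_evt.synchronize()  # wait until the kernel and the end event have completed

# Elapsed time between the two events, in milliseconds.
print("kernel time: %.3f ms" % cuda.event_elapsed_time(start_evt, end_evt))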

Robert Crovella
  • Thanks very much! Now I understand, but why is this even slower than pure numpy? Shouldn't the GPU be faster than the CPU for an array of this size? – Peter Deng Sep 10 '18 at 19:19
  • It's slower than pure numpy because numpy uses a nicely tuned implementation for matrix multiply, perhaps using a library like Intel MKL to do the work on the CPU, whereas this GPU code is a horrible matrix multiply implementation, from a tuning/speed perspective. If you want a fast GPU matrix multiply operation, you should use a similarly well-written library such as CUBLAS. You can access CUBLAS from numba's [pyculib](https://pyculib.readthedocs.io/en/latest/cublas.html) which should be generally interoperable with numpy/numba cuda. – Robert Crovella Sep 10 '18 at 19:31
  • Alternatively, for an apples-to-apples comparison, write your own matrix multiply function, and compare against that. – Robert Crovella Sep 10 '18 at 19:32
  • If you do decide to compare your GPU using pyculib, you should also be aware that the GPU you are running on doesn't have a high level of FP64 floating point capability, so if you want to see your GPU run as fast as possible using pyculib/CUBLAS, then you might want to try it using `numpy.dtype=float32` throughout. – Robert Crovella Sep 10 '18 at 19:36
  • Pyculib has apparently now been deprecated. From https://pyculib.readthedocs.io/en/latest/index.html : "Pyculib is currently available for archival purposes and is not receiving updates. We think that CuPy provides a much more complete interface to standard GPU algorithms, and CuPy arrays work with Numba-compiled GPU kernels now". – Matti Wens Apr 29 '19 at 08:45
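
Following the CuPy suggestion in the last comment, here is a minimal sketch of the cuBLAS-backed comparison (assumes CuPy is installed; float32 is used per the dtype advice above):

import numpy as np
import cupy as cp

a = np.random.rand(2000, 2000).astype(np.float32)
b = np.random.rand(2000, 2000).astype(np.float32)

a_gpu = cp.asarray(a)  # host -> device copies
b_gpu = cp.asarray(b)

c_gpu = a_gpu @ b_gpu           # matmul dispatches to cuBLAS on the GPU
cp.cuda.Device().synchronize()  # wait for the GPU before timing or reading

c = cp.asnumpy(c_gpu)           # device -> host copy of the result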