
Consider the following Python code:

from numpy import float64
from pycuda import compiler, gpuarray
import pycuda.autoinit

# N > 960 is crucial!
N = 961
code = """
__global__ void kern(double *v)
{
    double a = v[0]*v[2];
    double lmax = fmax(0.0, a), lmin = fmax(0.0, -a);
    double smax = sqrt(lmax),   smin = sqrt(lmin);

    if(smax > 0.2) {
        smax = fmin(smax, 0.2)/smax ;
        smin = (smin > 0.0) ? fmin(smin, 0.2)/smin : 0.0;
        smin = lmin + smin*a;

        v[0] = v[0]*smin + smax*lmax;
        v[2] = v[2]*smin + smax*lmax;
    }
}
"""
kernel_func = compiler.SourceModule(code).get_function("kern")
kernel_func(gpuarray.zeros(3, float64), block=(N,1,1))

Executing this gives:

Traceback (most recent call last):
  File "test.py", line 25, in <module>
    kernel_func(gpuarray.zeros(3, float64), block=(N,1,1))
  File "/usr/lib/python3.5/site-packages/pycuda/driver.py", line 402, in function_call
    func._launch_kernel(grid, block, arg_buf, shared, None)
pycuda._driver.LaunchError: cuLaunchKernel failed: too many resources requested for launch

My setup: Python v3.5.2 with pycuda==2016.1.2 and numpy==1.11.1 on Ubuntu 16.04.1 (64-bit), kernel 4.4.0, nvcc V7.5.17. The graphics card is an Nvidia GeForce GTX 480.

Can you reproduce this on your machine? Do you have any idea what causes this error message?

Remark: I know that, in principle, there is a race condition because all threads try to change v[0] and v[2]. But the threads shouldn't reach the inside of the if block anyway! Moreover, I'm able to reproduce the error without the race condition, but that example is much more complicated.

thomas

1 Answer


It is almost certain that you are hitting a registers-per-block limit.

According to the relevant documentation, your device has a limit of 32K 32-bit registers per block. When the block size is larger than 960 threads (30 warps), your kernel requires more registers than are available and the launch fails. NVIDIA supply an Excel spreadsheet and advice on how to determine the per-thread register requirement of your kernel and the largest block size at which your kernel can launch on your device.
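
You can also check this directly from PyCUDA. Below is a rough sketch (registers are actually allocated with a hardware rounding granularity, so treat the result as an estimate rather than an exact limit); it assumes kernel_func is the compiled function from your question:

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
regs_per_block  = dev.get_attribute(drv.device_attribute.MAX_REGISTERS_PER_BLOCK)  # 32768 on a GTX 480
warp_size       = dev.get_attribute(drv.device_attribute.WARP_SIZE)                # 32
regs_per_thread = kernel_func.num_regs  # per-thread register count reported by the driver

# Registers are handed out per warp, so estimate how many whole warps fit
# into one block's share of the register file.
warps_that_fit = regs_per_block // (regs_per_thread * warp_size)
print(warps_that_fit * warp_size)  # largest block size that should still launch

With the 34 registers per thread reported in the comments below, this gives 30 warps, i.e. 960 threads, which matches the boundary you observed.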

talonmies
  • How can your answer explain that the error vanishes when replacing all occurrences of v[2] by v[1]? – thomas Sep 28 '16 at 17:37
  • @thomas: Because you change the code, you change the register footprint of the kernel the compiler emits. The CUDA compiler is extremely aggressive with optimisation strategies, and even small changes can trigger idiom and pattern identification paths which result in smaller code. Moving the locations you modify to adjacent positions in memory might allow a register to be saved. You can easily check this for yourself by following the instructions I linked to. – talonmies Sep 28 '16 at 18:43
  • Thanks for the explanation! I tested the register usage with --ptxas-options=-v and could verify that the kernel uses 34 registers, which is too many for my device with more than 960 threads per block. Now I have to understand what causes this heavy register use and how I can reduce it. – thomas Sep 29 '16 at 09:20
  • It is possible to avoid this error message by providing the compiler with information about the size of your block using the `__launch_bounds__` qualifier. The compiler will then constrain register usage enough to allow the block to fit, though this may compromise performance. – Jez Sep 29 '16 at 12:55
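
To make the suggestions in the last two comments concrete, here is a minimal sketch (the kernel body is copied from the question; the 1024 is just the intended maximum block size, not a value from the original post, and --ptxas-options=-v makes ptxas print the per-thread register count during compilation):

from pycuda import compiler

code = """
__global__ void __launch_bounds__(1024) kern(double *v)
{
    double a = v[0]*v[2];
    double lmax = fmax(0.0, a), lmin = fmax(0.0, -a);
    double smax = sqrt(lmax),   smin = sqrt(lmin);

    if(smax > 0.2) {
        smax = fmin(smax, 0.2)/smax;
        smin = (smin > 0.0) ? fmin(smin, 0.2)/smin : 0.0;
        smin = lmin + smin*a;

        v[0] = v[0]*smin + smax*lmax;
        v[2] = v[2]*smin + smax*lmax;
    }
}
"""
# __launch_bounds__(1024) tells the compiler the kernel must be launchable
# with blocks of up to 1024 threads, so it caps register usage accordingly;
# --ptxas-options=-v reports the resulting per-thread register count.
mod = compiler.SourceModule(code, options=["--ptxas-options=-v"])
kernel_func = mod.get_function("kern")

The trade-off is that capping registers can force spills to local memory, which is the possible performance cost Jez mentions.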