`Out of resources` error while doing loop unrolling

Question

When I increase the unrolling from 8 to 9 loops in my kernel, it breaks with an out of resources error.

I read in How do I diagnose a CUDA launch failure due to being out of resources? that a mismatch of parameters and an overuse of registers could be a problem, but that seems not be the case here.

My kernel calculates the distance between n points and m centroids and selects for each point the closest centroid. It works for 8 dimensions but not for 9. When I set dimensions=9 and uncomment the two lines for the distance calculation, I get an pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources.

What do you think, could cause this behavior? What other iusses can cause an out of resources*?

I use an Quadro FX580. Here is the minimal(ish) example. For the unrolling in the real code I use templates.

import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import pycuda.autoinit


## preference
np.random.seed(20)
points = 512
dimensions = 8
nclusters = 1

## init data
data = np.random.randn(points,dimensions).astype(np.float32)
clusters = data[:nclusters]

## init cuda
kernel_code = """

      // the kernel definition 
    __device__ __constant__ float centroids[16384];

    __global__ void kmeans_kernel(float *idata,float *g_centroids,
    int * cluster, float *min_dist, int numClusters, int numDim) {
    int valindex = blockIdx.x * blockDim.x + threadIdx.x ;
    float increased_distance,distance, minDistance;
    minDistance = 10000000 ;
    int nearestCentroid = 0;
    for(int k=0;k<numClusters;k++){
      distance = 0.0;
      increased_distance = idata[valindex*numDim] -centroids[k*numDim];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+1] -centroids[k*numDim+1];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+2] -centroids[k*numDim+2];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+3] -centroids[k*numDim+3];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+4] -centroids[k*numDim+4];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+5] -centroids[k*numDim+5];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+6] -centroids[k*numDim+6];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+7] -centroids[k*numDim+7];
      distance = distance +(increased_distance * increased_distance);
      //increased_distance =  idata[valindex*numDim+8] -centroids[k*numDim+8];
      //distance = distance +(increased_distance * increased_distance);

      if(distance <minDistance) {
        minDistance = distance ;
        nearestCentroid = k;
        } 
      }
      cluster[valindex]=nearestCentroid;
      min_dist[valindex]=sqrt(minDistance);
    } 
 """
mod = compiler.SourceModule(kernel_code)
centroids_adrs = mod.get_global('centroids')[0]    
kmeans_kernel = mod.get_function("kmeans_kernel")
clusters_gpu = gpuarray.to_gpu(clusters)
cluster = gpuarray.zeros(points, dtype=np.int32)
min_dist = gpuarray.zeros(points, dtype=np.float32)

driver.memcpy_htod(centroids_adrs,clusters)

distortion = gpuarray.zeros(points, dtype=np.float32)
block_size= 512

## start kernel
kmeans_kernel(
    driver.In(data),driver.In(clusters),cluster,min_dist,
    np.int32(nclusters),np.int32(dimensions),
    grid = (points/block_size,1),
    block = (block_size, 1, 1),
)
print cluster
print min_dist

Could you verify how many registers your kernel is using? Loop unrolling, especially with templates, can increase the register pressure. Unfortunately the CUDA compiler does not come with the best optimisation strategies: they try to reduce the number of operations, not the number of used registers. — CygnusX1, Sep 30 '11 at 18:05

Jared Hoberock · Accepted Answer · 2011-10-01T02:24:30.050

8

You're running out of registers because your block_size (512) is too large.

ptxas reports that your kernel uses 16 registers with the commented lines:

$ nvcc test.cu -Xptxas --verbose
ptxas info    : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info    : Used 16 registers, 24+16 bytes smem, 65536 bytes cmem[0]

Uncommenting the lines increases register use to 17 and an error at runtime:

$ nvcc test.cu -run -Xptxas --verbose
ptxas info    : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info    : Used 17 registers, 24+16 bytes smem, 65536 bytes cmem[0]
error: too many resources requested for launch

The number of physical registers used by each thread of a kernel limits the size of blocks you can launch at runtime. An SM 1.0 device has 8K registers that can be used by a block of threads. We can compare that to your kernel's register demands: 17 * 512 = 8704 > 8K. At 16 registers, your original commented kernel just squeaks by: 16 * 512 = 8192 == 8K.

When no architecture is specified, nvcc compiles kernels for an SM 1.0 device by default. PyCUDA may work the same way.

To fix your problem, you could either decrease block_size (to say, 256) or find a way to configure PyCUDA to compile your kernel for an SM 2.0 device. SM 2.0 devices such as your QuadroFX 580 provide 32K registers, more than enough for your original block_size of 512.

edited Oct 01 '11 at 02:24

answered Oct 01 '11 at 02:16

Jared Hoberock

11,118
3
40
76

But where do the 16 registers come from and why do they increase? I thought only variables that I declare in the kernel are saved as registers. So I count only 6 registers named: `int valindex`, `float increased_distance,distance, minDistance`,`int nearestCentroid` and `int k`. – Framester Oct 01 '11 at 10:06
2

Registers come from the compiler's [register allocation algorithm](http://en.wikipedia.org/wiki/Register_allocation), which is essentially a black box to the programmer. There's not necessarily a correlation between the number of declared variables and the number of registers allocated. Instead, what is important is the number of variables which are "live" at the same time. Register allocation is hard, and in practice heuristics are used. It may be that your example straddles a threshold within the compiler's heuristics, causing it to "give up" a search and allocate an additional register. – Jared Hoberock Oct 01 '11 at 10:21
Okay, I did not know that. Is there anything else to do than trial and error? – Framester Oct 01 '11 at 10:42
2

To reduce the number of live variables, you could always try to eliminate the number of named variables you use, but it may not be possible. By far the easiest thing to do in your situation would be to simply decrease the ```block_size``` and just launch more thread blocks. Failing that, try to get PyCUDA to compile your kernel for sm_20, where you will have much more registers at your disposal. – Jared Hoberock Oct 01 '11 at 11:06
Super thanks for this insight. Do you know why nobody answered in the first place? – Framester Oct 01 '11 at 17:02
1

The PyCUDA tag might have been a bridge too far for some who are otherwise knowledgable about CUDA. – Jared Hoberock Oct 01 '11 at 17:21

`Out of resources` error while doing loop unrolling

1 Answers1