
I have a CUDA program that seems to be hitting some sort of resource limit, but I can't figure out what that resource is. Here is the kernel function:

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, 
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

        // compare this segment against every later, non-adjacent segment
        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // legit if the candidate segment belongs to a different polyline
            // and is not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex && 
                currentPolylineIndex >= 0);      

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            // record a hit when the intersection point lies strictly inside
            // both segments' bounding boxes
            result = result | (isLegit && !isParallel && 
                intersectionPoint.x > min(p1.x, p2.x) && 
                intersectionPoint.x > min(p3.x, p4.x) && 
                intersectionPoint.x < max(p1.x, p2.x) && 
                intersectionPoint.x < max(p3.x, p4.x) && 
                intersectionPoint.y > min(p1.y, p2.y) && 
                intersectionPoint.y > min(p3.y, p4.y) && 
                intersectionPoint.y < max(p1.y, p2.y) && 
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}
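
A note on dot(): it is not a core CUDA built-in for float2, so it is assumed here to come from the SDK's cutil_math.h or an equivalent helper along these lines:

// Assumed helper (equivalent to the cutil_math.h version): 2D dot product.
inline __host__ __device__ float dot(float2 a, float2 b)
{
    return a.x * b.x + a.y * b.y;
}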

Here is the call to execute the kernel function:

DoCheck<<<702, 32>>>(
    (float2*)devicePoints, 
    deviceSegmentsToPolylineIndexMap, 
    numSegments, 
    deviceOutput);

The sizes of the parameters are as follows:

  • devicePoints = 22,464 float2s = 179,712 bytes
  • deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
  • numSegments = 1 int = 4 bytes
  • deviceOutput = 22,463 ints = 89,852 bytes

When I execute this kernel, it crashes the video card. It would appear that I am hitting some sort of limit, because the kernel works if I execute it as DoCheck<<<300, 32>>>(...); instead. Just to be clear, the parameters are the same; only the number of blocks is different.

Any idea why one crashes the video driver and the other doesn't? The one that fails still seems to be within the card's limit on the number of blocks.
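
For reference, a minimal error-checking sketch around the launch (runtime API; cudaThreadSynchronize on toolkits of this era, cudaDeviceSynchronize on newer ones) that reports the actual error code rather than just a driver crash looks roughly like this:

// Sketch: check both the launch configuration and the kernel's completion.
DoCheck<<<702, 32>>>((float2*)devicePoints,
                     deviceSegmentsToPolylineIndexMap,
                     numSegments,
                     deviceOutput);

cudaError_t err = cudaGetLastError();    // launch/configuration errors
if(err == cudaSuccess)
    err = cudaThreadSynchronize();       // errors from the kernel run itself
if(err != cudaSuccess)
    printf("DoCheck failed: %s\n", cudaGetErrorString(err));

If a watchdog is the culprit, the error string usually indicates that the launch timed out.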

Update: More information on my system configuration:

  • Video Card: nVidia 8800GT
  • CUDA Version: 1.1
  • OS: Windows Server 2008 R2

I also tried it on a laptop with the following configuration, but got the same results:

  • Video Card: nVidia Quadro FX 880M
  • CUDA Version: 1.2
  • OS: Windows 7 64-bit
Jonathan DeCarlo
  • Probably wall clock time if this is a display card. The display driver has a watchdog timer that will kill kernels which take more than a few seconds to complete. Implementation details and work-arounds are OS specific. What OS, card and CUDA version are you using? – talonmies Aug 02 '11 at 14:25
  • Interesting. Ok, I will update the question with that information. – Jonathan DeCarlo Aug 02 '11 at 14:34
  • Is the watchdog timer still an issue on Windows? If so, your kernel might be taking too long to execute. – Eric Aug 02 '11 at 14:45
  • Those CUDA versions are the compute capability of your card, not the version of CUDA you are using... But you are definitely hitting the display driver watchdog timer limit - I take it you are getting a "driver crashed and was reset" message when executing? – talonmies Aug 02 '11 at 14:47
  • @Eric: yes it is, unless using a Tesla card with the non-WDDM compute driver. – talonmies Aug 02 '11 at 14:48
  • @talonmies: Yes, that is the error. So, if this is the problem, how would I get around it? – Jonathan DeCarlo Aug 02 '11 at 15:02
  • As a side point - you launch far too small a number of threads per block. 32 threads is a single warp, so performance will be eaten by latencies etc. You may try launching a much larger number of threads per block. – Maciej Piechotka Jul 15 '12 at 09:39
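
Following up on that last comment, a launch with a larger block size might look roughly like this (256 threads per block is an illustrative choice, not a value from the question; the grid size is then derived from numSegments):

// Sketch only: a larger block size, with the grid size computed so that
// every segment still gets one thread.
const int threadsPerBlock = 256;   // illustrative value
int blockCount = (numSegments + threadsPerBlock - 1) / threadsPerBlock;

DoCheck<<<blockCount, threadsPerBlock>>>(
    (float2*)devicePoints,
    deviceSegmentsToPolylineIndexMap,
    numSegments,
    deviceOutput);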

1 Answer


The resource which is being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer which will kill any kernel which takes more than a few seconds to execute. Running code on a card which is running a display is subject to this limit.

On the WDDM Windows platforms you are using, there are three possible solutions/work-arounds:

  1. Get a Tesla card and use the TCC driver, which eliminates the problem completely
  2. Try modifying registry settings to increase the timer limit (google for the TdrDelay registry key for more information, but I am not a Windows user and can't be more specific than that)
  3. Modify your kernel code to be "re-entrant" and process the data parallel work load in several kernel launches rather than one. Kernel launch overhead isn't all that large and processing the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using.
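
A sketch of option 3, for illustration only (the baseSegment parameter and the chunk size below are additions, not part of the original code): give the kernel a starting offset and process the segments a slice at a time, so that no single launch runs long enough to trip the watchdog.

// The kernel gains an offset so each launch covers only a slice of segments:
__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap,
                        int baseSegment, int segmentCount, int* output)
{
    int segmentIndex = baseSegment + threadIdx.x + blockIdx.x * blockDim.x;
    if(segmentIndex >= segmentCount)
        return;
    // ... body unchanged ...
}

// Host side: loop over the slices, yielding to the display driver in between.
const int blocksPerLaunch   = 100;                 // illustrative chunk size
const int threadsPerBlock   = 32;
const int segmentsPerLaunch = blocksPerLaunch * threadsPerBlock;

for(int base = 0; base < numSegments; base += segmentsPerLaunch)
{
    DoCheck<<<blocksPerLaunch, threadsPerBlock>>>(
        (float2*)devicePoints,
        deviceSegmentsToPolylineIndexMap,
        base,
        numSegments,
        deviceOutput);
    cudaThreadSynchronize();    // flush each slice before queuing the next
}

With roughly 22,500 segments this turns the single long launch into about eight shorter ones.
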
talonmies
  • For solution #2, this seems like a good page: http://msdn.microsoft.com/en-us/windows/hardware/gg487368 – Jonathan DeCarlo Aug 02 '11 at 15:58
  • One more question... So what does the timeout look at with respect to CUDA? The whole kernel call? Each block? Each thread? – Jonathan DeCarlo Aug 02 '11 at 16:01
  • Any host API function that doesn't yield the GPU to the display driver inside the watchdog limit will trigger the driver reset event. In your case that means the whole kernel (technically on WDDM platforms in recent CUDA versions, it could also mean any other API operation which gets batched with the kernel too). – talonmies Aug 02 '11 at 16:10
  • Quadro cards can work in TCC too, if they are not attached to a monitor. – huseyin tugrul buyukisik Jul 26 '18 at 12:21