5

I have this code here (modified based on the answer below).

Info

32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]
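
For reference, the register count the driver actually assigns can also be read back from the compiled kernel at run time; a small sketch, assuming the mod and kernel name from the code snapshot further down:

# Sketch: read the per-thread register count of the JIT-compiled kernel;
# this is the device-code figure, which can differ from the ptxas/PTX report.
func = mod.get_function("computeEHfields")
print("registers/thread: %d" % func.num_regs)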

I don't know what else I need to take into consideration in order to make the code work for different combinations of point counts "numPointsRs" and "numPointsRp".

When, for example, I run the code with Rs=10000 and Rp=100000 with block=(128,1,1), grid=(200,1), it's fine.

My computations:

46 registers * 128 threads = 5888 registers per block.

My card has a limit of 32768 registers, so 32768 / 5888 = 5 + some remainder => 5 blocks/SM
(my card has a limit of 6).

With the occupancy calculator I found that using 128 threads/block gives me 42% occupancy, which is within the limits of my card.

Also, the number of threads per MP is 640 (the limit is 1536).
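
The same arithmetic, written out as a quick sanity check; this ignores register allocation granularity and shared memory, so the real occupancy calculator may differ slightly:

# Rough occupancy arithmetic for this CC 2.1 card, reproducing the numbers above.
threads_per_block  = 128
regs_per_thread    = 46
regs_per_sm        = 32768     # register file per SM on CC 2.x
max_threads_per_sm = 1536      # resident-thread limit on CC 2.x

regs_per_block   = regs_per_thread * threads_per_block         # 5888
blocks_by_regs   = regs_per_sm // regs_per_block                # 5
resident_threads = blocks_by_regs * threads_per_block           # 640
occupancy        = resident_threads / float(max_threads_per_sm)
print("blocks/SM: %d, threads/SM: %d, occupancy: %.0f%%"
      % (blocks_by_regs, resident_threads, 100 * occupancy))    # 5, 640, 42%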

Now, if I try to use Rs=100000 and Rp=100000 (with the same threads and blocks), it gives me the message in the title, with:

cuEventDestroy failed: launch timeout

cuModuleUnload failed: launch timeout

1) I don't know/understand what else needs to be computed.

2) I can't understand how to choose/compute the number of blocks. I can see that mostly people use (points + threads - 1)/threads, but that still doesn't work.
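
For what it's worth, that formula is just a round-up division so every point gets a thread; a minimal sketch with the numbers from this question. The grid-stride loop in the kernel below also tolerates a smaller grid, it just makes each thread process more points:

numPointsRp       = 100000
threads_per_block = 128

# round up: enough blocks so threads_per_block * blocks >= numPointsRp
blocks = (numPointsRp + threads_per_block - 1) // threads_per_block
print(blocks)   # 782, i.e. block=(128,1,1), grid=(782,1)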

--------------UPDATED----------------------------------------------

After using driver.Context.synchronize(), the code works for many points (1,000,000)!

But what impact does this addition have on the code? (For many points the screen freezes for a minute or more.) Should I use it or not?

--------------UPDATED2----------------------------------------------

Now the code doesn't work again, without my having changed anything!

Snapshot of code:

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t


# ... (definitions of Rs, Rp, numPointsRs, numPointsRp, kp and eta omitted in the question) ...

#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)

J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

#-----------------------------------------------------------------------------------    
mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

typedef  pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__device__ __constant__ float Pi;

extern "C"{  


    __device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,
         cp3 J[],
         cp3 M[],
         fp3 Rp,
         cmplx kp,
         cmplx eta,
         cmplx *Evec,
         cmplx *Hvec, cmplx *All)
    {
            int c=0;                 // loop counter over the source points Rs
            while (c<numPointsRs){

        ...

                c++;
            }
    }


    __global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
        float *Rp_mat_, int numPointsRp,
        cmplx *J_,
        cmplx *M_,
        cmplx kp,
        cmplx eta,
        cmplx E[][3],
        cmplx H[][3], cmplx *All )
    {
        fp3 * Rs_mat=(fp3 *)Rs_mat_;
        fp3 * Rp_mat=(fp3 *)Rp_mat_;
        cp3 * J=(cp3 *)J_;
        cp3 * M=(cp3 *)M_;

        // grid-stride loop over the observation points Rp
        int k=threadIdx.x+blockIdx.x*blockDim.x;

        while (k<numPointsRp)
        {
            computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
            k+=blockDim.x*gridDim.x;
        }
    }
}

""" ,no_extern_c=1,options=['--ptxas-options=-v'])


#call the function(kernel)
func = mod.get_function("computeEHfields")

func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))


#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()
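
One way to see how close a launch gets to the watchdog limit (roughly 5 seconds on a GPU that also drives a display) is to time it with CUDA events; a sketch reusing func and the gpuarrays from the snapshot above:

start, end = drv.Event(), drv.Event()
start.record()
func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
     J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
     Evec_gpu, Hvec_gpu, All_gpu,
     block=(128,1,1), grid=(200,1))
end.record()
end.synchronize()                       # wait for the kernel to finish
print("kernel time: %.1f ms" % start.time_till(end))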

My card:

Device 0: "GeForce GTX 560"
  CUDA Driver Version / Runtime Version          4.20 / 4.10
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073283072 bytes)
  ( 7) Multiprocessors x (48) CUDA Cores/MP:    336 CUDA Cores   // deviceQuery printed 0 here; 336 cores => 7 MPs and 48 cores/MP
George
  • Did you check for errors on CUDA API calls and kernel launches? Although I've not worked with pycuda, according to a quick google search it seems to be an uncaught exception. Getting the last error will help to identify the problem. – pQB Sep 05 '12 at 13:54
  • @pQB:pycuda has "Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.". – George Sep 05 '12 at 14:33
  • And did you catch the exceptions to know what happened? – pQB Sep 05 '12 at 14:44
  • @pQB: Sorry, I updated; it's cuEventDestroy failed: launch timeout, cuModuleUnload failed: launch timeout – George Sep 05 '12 at 14:58
  • "launch timeout" would appear to indicate that the kernel ran too long and was killed by the watchdog timer. This can happen on GPUs that are also used for graphics output (e.g. a graphical desktop), where the task of the watchdog timer is to prevent the desktop from locking up for more than a few seconds. If possible, avoid using this GPU for the desktop and/or other graphics (e.g. don't run X if you are on Linux). Otherwise, reduce the kernel run time so it does not trigger the watchdog timer. Best I can recall the watchdog time limit is on the order of 5 seconds or thereabouts. – njuffa Sep 06 '12 at 00:04
  • @njuffa: Hello, I tried driver.Context.synchronize() from http://documen.tician.de/pycuda/driver.html?highlight=synchronize#pycuda.driver.Context.synchronize and it works! But 1) what impact does this addition have on the code? (for many points the screen freezes for 1 minute or more) and 2) how can I reduce the kernel time? I didn't find anything. – George Sep 06 '12 at 12:27
  • @njuffa: I also tried to run it with X disabled and it runs fine, without the Context.synchronize(). Does this mean my code is OK? And also, does this mean that when my code runs on a system with X, it will freeze the display for as long as the process runs? – George Sep 06 '12 at 16:15
  • Sorry, I do not have any specific advice as I do not use pyCUDA or the driver API it is based on (I was the first user of CUDART and have never used the driver API since that day). At any given moment, the GPU can either run graphics, or CUDA, so the watchdog timer is needed when running a GUI. If running without graphics isn't an option, to reduce kernel execution time to avoid hitting watchdog timer kernel termination, you will have to do less work per kernel launch, optimize the code so the kernel runs faster for the same amount of work, or deploy a faster GPU. – njuffa Sep 10 '12 at 20:19
  • And this is the answer, folks. First to copy and paste it gets a bounty. :) – harrism Sep 11 '12 at 22:06
  • @harrism: Hello, I didn't understand the meaning of your comment. You mean the answer is what njuffa says? Because I have other problems too. If you could help... – George Sep 12 '12 at 08:14
  • If you have more problems, post more (separate) questions. Don't lump everything on one question. If you have the answer to this question, though, please post it and accept it. – harrism Sep 13 '12 at 01:29
  • @harrism: I have copied the most relevant parts of my comments into an answer now, I hope it is useful in this form. – njuffa Sep 14 '12 at 00:58

3 Answers

3

"launch timeout" would appear to indicate that the kernel ran too long and was killed by the watchdog timer. This can happen on GPUs that are also used for graphics output (e.g. a graphical desktop), where the task of the watchdog timer is to prevent the desktop from locking up for more than a few seconds. Best I can recall the watchdog time limit is on the order of 5 seconds or thereabouts.

At any given moment, the GPU can either run graphics, or CUDA, so the watchdog timer is needed when running a GUI to prevent the GUI from locking up for an extended period of time, which renders the machine inoperable through the GUI.

If possible, avoid using this GPU for the desktop and/or other graphics (e.g. don't run X if you are on Linux). If running without graphics isn't an option, to reduce kernel execution time to avoid hitting watchdog timer kernel termination, you will have to do less work per kernel launch, optimize the code so the kernel runs faster for the same amount of work, or deploy a faster GPU.
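
As a quick check of whether the watchdog applies to a given GPU, the corresponding device attribute can be queried from PyCUDA; a small sketch, independent of the code in the question:

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
# 1 means the device drives a display and kernels are subject to the watchdog timer
timeout_enabled = dev.get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT)
print("kernel exec timeout enabled: %s" % bool(timeout_enabled))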

njuffa
3

There are quite a few issues that you have to deal with. Answer 1 provided by @njuffa is the best general solution. I'll provide more feedback based upon the limited data you have provided.

  1. PTX output of 46 registers is not the number of registers used by your kernel. PTX is an intermediate representation. The offline or JIT compiler will convert this to device code. Device code may use more or less registers. Nsight Visual Studio Edition, the Visual Profiler, and the CUDA command line profiler can all provide you the correct register count.

  2. The occupancy calculation is not simply RegistersPerSM / RegistersPerThread. Registers are allocated based upon a granularity. For CC 2.1 the granularity is 4 registers per thread per warp (128 registers). 2.x devices can actually allocate at a 2 register granularity but this can lead to fragmentation later in the kernel.

  3. In your occupancy calculation you state

My card has limit 32768registers,so 32768/5888=5 +some => 5 block/SM (my card has limit 6).

I'm not sure what 6 means. Your device has 7 SMs. The maximum blocks per SM for 2.x devices is 8 blocks per SM.

  4. You have provided an insufficient amount of code. If you provide pieces of code, please provide the size of all inputs, the number of times each loop will be executed, and a description of the operations per function. Looking at the code you may be doing too many loops per thread. Without knowing the order of magnitude of the outer loop we can only guess.

  5. Given that the launch is timing out you should probably approach debugging as follows:

a. Add a line to the beginning of the code

if (blockIdx.x > 0) { return; }

Run the exact code you have in one of the previously mentioned profilers to estimate the duration of a single block. Using the launch information provided by the profiler (registers per thread, shared memory, ...), use the occupancy calculator in the profiler or the xls to determine the maximum number of blocks that you can run concurrently. For example, if the theoretical block occupancy is 3 blocks per SM, and the number of SMs is 7, then you can run 21 blocks at a time, which for your launch of 200 blocks is about 10 waves. NOTE: this assumes equal work per thread. Change the early exit code to allow 1 wave (21 blocks). If this launch times out then you need to reduce the amount of work per thread. If this passes then calculate how many waves you have and estimate when you will time out (2 sec on Windows, ? on Linux).

b. If you have too many waves then you have to reduce the launch configuration. Given that you already index by gridDim.x and blockDim.x, you can do this by passing these dimensions in as parameters to your kernel. This will require you to minimally change your indexing code. You will also have to pass a blockIdx.x offset. Change your host code to launch multiple kernels back to back. Since there should be no conflict, you can launch these in multiple streams to benefit from overlap at the end of each wave.
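
A rough host-side sketch of (b). It assumes the kernel is modified to take two extra int32 parameters, a hypothetical blockOffset and totalBlocks, and to index as k = threadIdx.x + (blockIdx.x + blockOffset) * blockDim.x with a stride of blockDim.x * totalBlocks, so that each partial launch covers only its share of the points. The multiple-streams overlap mentioned above is left out to keep the sketch simple:

total_blocks      = 200   # the original grid=(200,1)
blocks_per_launch = 21    # roughly one "wave": 7 SMs * 3 blocks/SM

for first_block in range(0, total_blocks, blocks_per_launch):
    nblocks = min(blocks_per_launch, total_blocks - first_block)
    func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
         J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
         Evec_gpu, Hvec_gpu, All_gpu,
         np.int32(first_block), np.int32(total_blocks),   # hypothetical extra kernel args
         block=(128,1,1), grid=(nblocks,1))
    drv.Context.synchronize()   # let the display update between partial launches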

Greg Smith
  • Thanks for the help. Can you please check out the code (it's here http://stackoverflow.com/questions/12159709/cuda-out-of-memory-threads-and-blocks-issue-address-is-out-of-bounds/12230988#comment16400460_12230988)? Maybe with your experience you'll find something. If you use numPointsRs=100000 and numPointsRp=100000, it shows me "launch timed out" (in a console environment, without Xserver, it runs OK). I use Linux. Also, about your points "a" and "b", I got lost somewhere. – George Sep 14 '12 at 09:01
  • With the occupancy calculator I found that using 128 threads and 46 registers (I have 42%), I have 5 blocks/SM and 640 threads/SM. So, must I use 5*7=35 blocks? Because that too leads to "launch timed out". – George Sep 14 '12 at 09:06
  • @George computeEHfields does multiple loops with numPointsRs iterations. A large value for numPointsRs will definitely exceed the watchdog timer. I can only estimate the duration of each thread to be in the billions of cycles. I need to run this in a tool just to try to get a scope of the bytes transferred per thread and instructions executed per thread. If you need to handle large values of numPointsRs you need to redesign the algorithm and break it into multiple kernels. – Greg Smith Sep 14 '12 at 14:47
  • @Greg Smith: Thanks for helping. Either large Rs points or large Rp points cause the launch timeout. I really can't think of any other way to redesign it! It is very difficult for me at this point. Also, about multiple kernels, I have no idea how to do this. If you can help a little more, OK. (I updated the code in this post to the one I am using.) – George Sep 14 '12 at 15:31
  • If you are not comfortable changing the algorithm then you need to disable the timeout. This can be done using a Quadro or Tesla card on Windows with the TCC driver or by running on Linux without a display. On Windows you can also extend the TDR timeout but I would discourage disabling it completely. For development I tend to run with a 30s timeout. – Greg Smith Sep 14 '12 at 19:18
0

To add to @njuffa's answer: on Windows systems you can increase the launch timeout, or TDR (Timeout Detection & Recovery), by following these steps:

1: Open the options in Nsight Monitor.


2: Set an appropriate value for WDDM TDR Delay


CAUTION: If this value is small you may get a timeout error, and for higher values your screen will stay frozen until the kernel finishes its job.

source

Mohsen