CUDA profiling inside kernel

Question

Is there any way to profile part of a CUDA kernel rather than the kernel as a whole? I have several device function invocations and I want to measure how long each one takes. Are there any flags, events, or instructions that I can set so that the timings show up in NVIDIA Visual Profiler? Or do I need to do it manually by inserting cudaEventCreate and similar functions?

score 3 · Answer 1 · answered May 30 '13 at 11:35

You can time specific parts of your kernel manually using the clock() or clock64() functions:

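__device__ unsigned long long time_spent;  // total cycles, summed over all threads

__global__ void kernel(...)
{
    unsigned int t1, t2;
    // ...
    t1 = clock();
    // code of interest
    t2 = clock();
    // unsigned subtraction stays correct if the counter wraps between the two reads
    atomicAdd(&time_spent, (unsigned long long)(t2 - t1));
}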

`clock()` officially returns a `clock_t`, but I prefer the explicit use of unsigned int to make it obvious how the above code correctly handles wraparound of clock values (as long as the timed code does not take more than 2^32-1 cycles to complete).
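The wraparound argument can be checked in isolation. The sketch below is an illustration, not part of the original answer: `elapsed_cycles` is a hypothetical helper name, and the sample values stand in for counter readings taken just before and just after a 32-bit overflow.

```cuda
// Hypothetical helper (not from the answer): cycles between two 32-bit
// clock() readings. Unsigned subtraction is performed modulo 2^32, so the
// result is still right if the counter wrapped once between the reads.
__host__ __device__ constexpr unsigned int elapsed_cycles(unsigned int start,
                                                          unsigned int stop)
{
    return stop - start;
}

// Counter wrapped: going from 0xFFFFFFF0 to 0x00000010 is 0x20 (32) cycles.
static_assert(elapsed_cycles(0xFFFFFFF0u, 0x00000010u) == 0x20u,
              "difference survives a single wraparound");

// No wrap: the plain difference.
static_assert(elapsed_cycles(100u, 250u) == 150u, "ordinary case");
```

If the timed region could exceed 2^32-1 cycles, the same pattern should work with `clock64()` and `unsigned long long` variables.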

Make sure to also time the code with

    t1 = clock();
    t2 = clock();

back-to-back so you can subtract the timing overhead.
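Putting both measurements together, a minimal end-to-end sketch might look like the following. Everything named here is an assumption made for illustration: `region_cycles`, `overhead_cycles`, `work`, `timed_kernel`, and `measure` are hypothetical, and `work` is only a placeholder for the device function of interest. Error handling is kept to a minimum, and the compiler is free to move instructions across the `clock()` reads, so it is worth inspecting the generated code before trusting small numbers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical accumulators, summed over all threads.
__device__ unsigned long long region_cycles;    // timed region
__device__ unsigned long long overhead_cycles;  // back-to-back clock() pairs

// Placeholder for the device function being measured.
__device__ float work(float x)
{
    for (int i = 0; i < 64; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

__global__ void timed_kernel(float* out)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Calibration: two reads with nothing in between.
    unsigned int c1 = (unsigned int)clock();
    unsigned int c2 = (unsigned int)clock();

    // Region of interest.
    unsigned int t1 = (unsigned int)clock();
    float r = work((float)i);
    unsigned int t2 = (unsigned int)clock();

    out[i] = r;  // store the result so the compiler cannot drop work()

    atomicAdd(&overhead_cycles, (unsigned long long)(c2 - c1));
    atomicAdd(&region_cycles, (unsigned long long)(t2 - t1));
}

// Launches the kernel and copies both cycle totals back to the host.
cudaError_t measure(int blocks, int threads,
                    unsigned long long* region, unsigned long long* overhead)
{
    float* d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, sizeof(float) * blocks * threads);
    if (err != cudaSuccess) return err;

    // Reset the accumulators so repeated calls do not add up.
    const unsigned long long zero = 0;
    cudaMemcpyToSymbol(region_cycles, &zero, sizeof(zero));
    cudaMemcpyToSymbol(overhead_cycles, &zero, sizeof(zero));

    timed_kernel<<<blocks, threads>>>(d_out);
    err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(region, region_cycles, sizeof(*region));
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(overhead, overhead_cycles, sizeof(*overhead));

    cudaFree(d_out);
    return err;
}

int main()
{
    const int blocks = 4, threads = 128;
    unsigned long long region = 0, overhead = 0;

    cudaError_t err = measure(blocks, threads, &region, &overhead);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const double n = (double)blocks * threads;
    printf("average cycles per thread: region %.1f, overhead %.1f, net %.1f\n",
           region / n, overhead / n, (region - (double)overhead) / n);
    return 0;
}
```

The totals are per-thread cycle counts added together, so dividing by the number of threads gives an average rather than a wall-clock time for the kernel.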

tera


Doesn't `clock()` return a counter that is local to each SM? – KiaMorot May 30 '13 at 11:55
In addition to this answer: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#time-function – KiaMorot May 30 '13 at 12:00
Yes, it does. But that does not matter as long as we only take differences between times from the same SM. (Dynamic parallelism would indeed create a problem here; for simplicity I've just assumed that the code of interest does not launch any other kernels.) – tera May 30 '13 at 14:44
