
Is it possible to use nvprof to count the number of CUDA kernel executions (ie how many kernels are launched)?

Right now when I run nvprof, this is what I see:

==537== Profiling application: python tf.py
==537== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 51.73%  91.294us        20  4.5640us  4.1280us  6.1760us  [CUDA memcpy HtoD]
 43.72%  77.148us        20  3.8570us  3.5840us  4.7030us  [CUDA memcpy DtoH]
  4.55%  8.0320us         1  8.0320us  8.0320us  8.0320us  [CUDA memset]

==537== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 90.17%  110.11ms         1  110.11ms  110.11ms  110.11ms  cuDevicePrimaryCtxRetain
  6.63%  8.0905ms         1  8.0905ms  8.0905ms  8.0905ms  cuMemAlloc
  0.57%  700.41us         2  350.21us  346.89us  353.52us  cuMemGetInfo
  0.55%  670.28us         1  670.28us  670.28us  670.28us  cuMemHostAlloc
  0.28%  347.01us         1  347.01us  347.01us  347.01us  cuDeviceTotalMem
...
Alex Rothberg

1 Answer


Yes, it's possible. In case you're not aware, there is both documentation and command-line help available (nvprof --help).

What you're asking for is provided by the simplest usage of nvprof:

nvprof ./my_application

This will output (among other things) a list of kernels by name, the number of times each one was launched, and the percentage of overall GPU time each one accounted for.

Here's an example:

$ nvprof ./t1288
==12904== NVPROF is profiling process 12904, command: ./t1288
addr@host: 0x402add
addr@device: 0x8
run on device
func_A is correctly invoked!
run on host
func_A is correctly invoked!
==12904== Profiling application: ./t1288
==12904== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.93%  195.28us         1  195.28us  195.28us  195.28us  run_on_device(Parameters*)
  1.07%  2.1120us         1  2.1120us  2.1120us  2.1120us  assign_func_pointer(Parameters*)

==12904== Unified Memory profiling result:
Device "Tesla K20Xm (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       1  4.0000KB  4.0000KB  4.0000KB  4.000000KB  3.136000us  Host To Device
       6  32.000KB  4.0000KB  60.000KB  192.0000KB  34.20800us  Device To Host
Total CPU Page faults: 3

==12904== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.08%  321.35ms         1  321.35ms  321.35ms  321.35ms  cudaMallocManaged
  0.93%  3.0613ms       364  8.4100us     278ns  286.84us  cuDeviceGetAttribute
  0.42%  1.3626ms         4  340.65us  331.12us  355.60us  cuDeviceTotalMem
  0.38%  1.2391ms         2  619.57us  113.13us  1.1260ms  cudaLaunch
  0.08%  251.20us         4  62.798us  57.985us  70.827us  cuDeviceGetName
  0.08%  246.55us         2  123.27us  21.343us  225.20us  cudaDeviceSynchronize
  0.03%  98.950us         1  98.950us  98.950us  98.950us  cudaFree
  0.00%  8.9820us        12     748ns     278ns  2.2670us  cuDeviceGet
  0.00%  6.0260us         2  3.0130us     613ns  5.4130us  cudaSetupArgument
  0.00%  5.7190us         3  1.9060us     490ns  4.1130us  cuDeviceGetCount
  0.00%  5.2370us         2  2.6180us  1.2100us  4.0270us  cudaConfigureCall
$

In the above example, run_on_device and assign_func_pointer are the kernel names; the Calls column shows that each was launched once. There is also example output in the documentation I linked.
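If you want the launch counts programmatically rather than by eyeballing the table, one option is to parse the profiler's text output. Here is a hypothetical helper (the function name and parsing heuristic are mine, not part of nvprof): it scans the "Profiling result" table, sums the Calls column per kernel, and skips the bracketed memcpy/memset rows, which are memory operations rather than kernels.

```python
def count_kernel_launches(profile_text):
    """Sum the Calls column for kernel rows in an nvprof summary.

    Rows whose Name is bracketed (e.g. [CUDA memcpy HtoD]) are memory
    operations, not kernel launches, and are skipped.
    """
    in_result = False
    launches = {}
    for line in profile_text.splitlines():
        if line.startswith("==") and "Profiling result" in line:
            in_result = True
            continue
        if in_result:
            if line.startswith("==") or not line.strip():
                break  # end of the GPU summary table
            # Name may contain spaces, so split into at most 7 fields.
            fields = line.split(None, 6)
            # Data rows start with a percentage, e.g. "98.93%";
            # the header row ("Time(%) ...") does not.
            if len(fields) == 7 and fields[0].endswith("%"):
                name = fields[6].strip()
                if not name.startswith("["):  # skip memcpy/memset rows
                    launches[name] = launches.get(name, 0) + int(fields[2])
    return launches
```

Run on the summary above, this would report one launch each for run_on_device and assign_func_pointer; run on the output in your question, it would report no kernels at all, only memory operations.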

Robert Crovella
  • I updated the question with what I am seeing when running nprof. I don't see anything called out as a kernel. – Alex Rothberg Mar 10 '17 at 15:54
  • I can think of two possibilities: 1. Your python code is not making any (successful) kernel calls - are you doing proper error checking? Do you know for sure that kernels are being called? 2. You may need to tell nvprof to profile child processes - how to do this is covered in the documentation I linked. This will depend on what kind of work exactly you are issuing in your `tf.py` - probably tensorflow. – Robert Crovella Mar 10 '17 at 15:58
  • Okay, turns out no kernels were being called. – Alex Rothberg Mar 10 '17 at 17:09
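Following up on the comment about child processes: if the profiled Python program forks workers, nvprof must be told to follow them with its --profile-child-processes flag. A minimal sketch of building that invocation from Python (the helper name is mine; actually running the command requires nvprof on the PATH and a CUDA-capable GPU):

```python
def nvprof_child_cmd(app_cmd):
    """Build an nvprof invocation that also profiles child processes.

    --profile-child-processes makes nvprof follow processes forked by
    the profiled application, e.g. workers spawned by a framework.
    """
    return ["nvprof", "--profile-child-processes"] + list(app_cmd)

# Example: nvprof_child_cmd(["python", "tf.py"])
# -> ["nvprof", "--profile-child-processes", "python", "tf.py"]
```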