
I know how to time the execution of one CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often made up of a series of kernels (the cub::DeviceRadixSort algorithms, for example, launch many kernels to get the job done). If you're running your algorithm on a system with a lot of other streams and kernels also in flight, the gaps between your individual kernel launches can vary widely depending on what other work gets scheduled in between on your stream. If I'm trying to make my algorithm faster, I don't care so much about how long it spends sitting around waiting for resources; I care about the time it spends actually executing.

So the question is: is there some way to do something like the event API, inserting a marker in the stream before the first kernel launches and reading it back after the last kernel finishes, that tells you the actual amount of time spent executing on the stream rather than the total end-to-end wall-clock time? Maybe something in CUPTI can do this?
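
For concreteness, here is a minimal sketch of the event-based bracketing described above (kernelA and kernelB are placeholder kernels, not the real algorithm). cudaEventElapsedTime reports the wall-clock time between the two recorded events, so any gaps between the kernels are counted too, which is exactly the part I'd like to exclude.

```cuda
// Minimal sketch: CUDA events bracketing several kernel launches on one stream.
// kernelA/kernelB are placeholder kernels, not from the real algorithm.
// cudaEventElapsedTime reports wall-clock time between the two events,
// so any gaps between the kernels are counted too.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
__global__ void kernelB(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);          // marker before the first kernel
    kernelA<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    kernelB<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaEventRecord(stop, stream);           // marker after the last kernel
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // end-to-end time, gaps included
    printf("elapsed (incl. gaps): %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```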

Baxissimo
  • When trying to make an algorithm work faster, I would benchmark it on its own (e.g. with [nvbench](https://github.com/NVIDIA/nvbench)). What you are describing sounds more like something one would need for displaying nice stats during/after runtime of a production code. – paleonix Jun 12 '22 at 22:28
  • Yeah, if there's no handy way to get just the kernel time on a stream, writing a benchmark is what I'll do. It's just a pain because I have to come up with data to feed it that is realistic enough for the benchmark's performance to be meaningful. It seems technically very possible for there to be an API that works like CUDA event timers but only counts busy time between start and stop. The CUPTI API can call you back and tell you when kernels start and stop, so it should be possible to build such a thing using that (see the sketch below). – Baxissimo Jun 13 '22 at 06:34
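
A rough sketch of what that CUPTI-based approach could look like, based on the CUPTI Activity API: register buffer callbacks, enable kernel activity records, and accumulate end minus start over every kernel record. This is an untested illustration, not a confirmed solution; the kernel record struct version (CUpti_ActivityKernel4 here) varies between CUPTI releases, and the startKernelTiming/stopKernelTimingMs helpers are names made up for this sketch.

```cpp
// Hedged sketch of the comment's idea: use the CUPTI Activity API to receive a
// record for every kernel execution and sum only the GPU-side busy time
// (end - start), rather than wall-clock time between two events.
// Link with -lcupti. Error handling omitted; CUpti_ActivityKernel4 is the
// record struct in CUDA 11.x-era headers and may differ in other releases.
#include <atomic>
#include <cstdlib>
#include <cupti.h>

static std::atomic<unsigned long long> g_busyNs{0};   // summed kernel time in ns
static const size_t kBufSize = 8 * 1024 * 1024;

// CUPTI asks for a buffer to fill with activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords) {
    *buffer = (uint8_t *)malloc(kBufSize);
    *size = kBufSize;
    *maxNumRecords = 0;   // let CUPTI pack as many records as fit
}

// CUPTI hands back a completed buffer; walk the records and sum kernel durations.
static void CUPTIAPI bufferCompleted(CUcontext, uint32_t, uint8_t *buffer, size_t, size_t validSize) {
    CUpti_Activity *record = nullptr;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL ||
            record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
            auto *k = (CUpti_ActivityKernel4 *)record;            // struct version is release-dependent
            g_busyNs += (unsigned long long)(k->end - k->start);  // busy time only
            // k->streamId could be used here to filter to one stream.
        }
    }
    free(buffer);
}

// Hypothetical helpers (names made up for this sketch).
void startKernelTiming() {
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    g_busyNs = 0;
}

double stopKernelTimingMs() {
    cuptiActivityFlushAll(0);   // deliver any outstanding buffers
    cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    return g_busyNs.load() / 1.0e6;
}
```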

1 Answer


You can use Nsight Systems or Nsight Compute. (https://developer.nvidia.com/tools-overview)

In Nsight Systems, you can profile the timeline of each stream. You can also use Nsight Compute to profile the details of each CUDA kernel. I guess Nsight Compute is better because you can inspect various metrics about GPU performance and get hints for kernel optimization.

  • That sounds like a fairly manual way to find the information. Sure, the info is there, but I don't think there's any way for those tools to tell me what the average iteration time is or give me a list of iteration times. When timing kernels using CUDA events I can print out how long each event took and pipe those numbers into something that gives me stats like min, max, mean, and variance, and it's all fairly automated. I could be wrong, but I think with Nsight Systems I'd have to go manually inspect how long each multi-kernel iteration is taking. – Baxissimo Jun 13 '22 at 05:49
  • Nsight Compute is good for digging into the performance of one kernel, and I use it for that. But if I reorganize the work among several kernels, combining some or splitting others, it's not so useful for telling me whether I made things faster overall. – Baxissimo Jun 13 '22 at 05:52
  • @Baxissimo You can export an SQLite db file from the .nsys-rep results file. I didn't try that, but I guess it can help you. (https://docs.nvidia.com/nsight-systems/UserGuide/index.html#exporter-sqlite-schema) – Hyunwoo Kim Jun 13 '22 at 07:39
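
Following up on that last comment, here is one hedged way the SQLite export could be turned into an automated per-run number: export with something like `nsys export --type sqlite report.nsys-rep`, then sum kernel durations from the exported database. The CUPTI_ACTIVITY_KIND_KERNEL table and its start/end columns are assumptions taken from the exporter schema documentation linked above and may differ between nsys versions.

```cpp
// Hedged sketch, not a verified recipe: after exporting the profile to SQLite,
// sum per-kernel durations from the exported database. The
// CUPTI_ACTIVITY_KIND_KERNEL table and its start/end columns (nanosecond
// timestamps) are assumptions based on the documented exporter schema;
// verify them against your nsys version. Build with -lsqlite3.
#include <cstdio>
#include <sqlite3.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s report.sqlite\n", argv[0]); return 1; }

    sqlite3 *db = nullptr;
    if (sqlite3_open(argv[1], &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    // Total GPU-side kernel busy time across all streams, in nanoseconds.
    // ("end" is quoted because it is an SQL keyword.)
    const char *sql = "SELECT SUM(\"end\" - \"start\") FROM CUPTI_ACTIVITY_KIND_KERNEL;";

    sqlite3_stmt *stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK &&
        sqlite3_step(stmt) == SQLITE_ROW) {
        printf("total kernel time: %.3f ms\n", sqlite3_column_int64(stmt, 0) / 1.0e6);
    } else {
        fprintf(stderr, "query failed: %s\n", sqlite3_errmsg(db));
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```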