I'm working on a CUDA application I'd like to profile. Up to now all I've used is the command line profiler, nvprof
, which just displayes the summarized statistics.
I thought about using the GUI profiler, NVVP. The problem is that the remote Linux node I'm running the application on doesn't have anything GUI (even X.org). Moreover, even if I managed to get some X11 stack on the remote node, keeping my own laptop alive for the whole time of the profiling would be, well, tedious.
I tried collecting all the needed information in the following way:
nvprof --analysis-metrics -o application.nvprof ./myapplication
Then I copy the output file onto my laptop and view it in NVVP. This has three problems, though.
First of all, I don't get any file transfer information when I load the output file into NVVP. It's not shown at all in the NVVP window.
Secondly, the call graph is completely distorted. The gaps between kernel launches are at least 100x bigger than the kernel durations, which makes any dependency and flow analysis impossible.
Lastly, my application uses a lot of the GPU memory. During the profiling the device gets out of memory, which is not the case during the standalone run.
How should I properly profile my CUDA application on a headless node?