How to profile CUDA code on a headless node?

Question

I'm working on a CUDA application that I'd like to profile. So far I've only used the command-line profiler, nvprof, which just displays summarized statistics.

I thought about using the GUI profiler, NVVP. The problem is that the remote Linux node I run the application on has no GUI at all (not even X.org). Moreover, even if I managed to get an X11 stack working on the remote node, keeping my own laptop awake for the whole profiling run would be, well, tedious.

I tried collecting all the needed information in the following way:

 nvprof --analysis-metrics -o application.nvprof ./myapplication

Then I copy the output file onto my laptop and view it in NVVP. This has three problems, though.

First of all, I don't get any file transfer information when I load the output file into NVVP. It's not shown at all in the NVVP window.

Secondly, the call graph is completely distorted. The gaps between kernel launches are at least 100x longer than the kernels themselves, which makes any dependency and flow analysis impossible.

Lastly, my application uses a lot of GPU memory. During profiling the device runs out of memory, which does not happen in a standalone run.

How should I properly profile my CUDA application on a headless node?

The method you are describing is the proper method. There are potentially limitations in the nvprof->nvvp method that may not be there when using nvvp only. I don't know what you mean exactly by "file transfer information", but when you capture metrics with `nvprof`, there are other types of data which are not collected. Study the nvprof documentation and understand that it has a limitation on what it will capture depending on which mode you select. I've never seen the second issue you report. For the third issue: limit the scope of profiling. — Robert Crovella, Nov 07 '17 at 22:59
Not everything that can be done in nvvp can be done with an nvprof import into nvvp. For example, the guided analysis is quite limited. — Robert Crovella, Nov 07 '17 at 23:00
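
One possible way to limit the scope, as a sketch under assumptions: collect the timeline and the metrics in two separate runs, and restrict the metrics run to a single kernel. `myKernel` is a hypothetical kernel name, and `achieved_occupancy` and `ipc` are only example metrics; `nvprof --query-metrics` lists what a given GPU supports. The script prints the commands by default and only executes them with `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch only: prints the nvprof commands by default; set DRY_RUN=0 to run them.
APP="${APP:-./myapplication}"   # application name taken from the question
KERNEL="${KERNEL:-myKernel}"    # hypothetical kernel name, replace with your own

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pass 1: timeline only (kernel launches and memory copies), no metrics.
timeline_pass() {
    run nvprof -f -o timeline.nvprof "$APP"
}

# Pass 2: a few metrics, collected for one kernel only.
metrics_pass() {
    run nvprof -f --kernels "$KERNEL" --metrics achieved_occupancy,ipc \
        -o metrics.nvprof "$APP"
}

timeline_pass
metrics_pass
```

Both files can then be imported together in NVVP (File -> Import, nvprof, single process), the first as the timeline data file and the second as the event/metric data file. Metric collection replays kernels, so keeping it out of the timeline run should reduce the inflated gaps between launches, and profiling fewer kernels may ease the memory pressure; how much depends on the application.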

score 3 · Accepted Answer · answered Nov 08 '17 at 04:54

NVVP supports headless nodes as a first-class citizen. Remote profiling is a major feature of NVVP.

The way this works is that NVVP runs on your local GUI-enabled host machine and invokes nvprof on the headless machine, generates the required files there, copies the files over, and opens them. All of this happens transparently and automatically. You can run further analyses from NVVP as usual and it will repeat these steps for you.
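
For illustration, the steps NVVP automates could be done by hand roughly as sketched below. This is not what NVVP literally executes; the login `user@headless-node` and the directory `/path/to/app` are placeholders, not values from the question. The script prints each command by default and only executes it with `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch only: prints each command by default; set DRY_RUN=0 to run it.
REMOTE="${REMOTE:-user@headless-node}"     # hypothetical SSH login
REMOTE_DIR="${REMOTE_DIR:-/path/to/app}"   # hypothetical directory on the node
PROFILE="${PROFILE:-application.nvprof}"   # output file name from the question

# Print the command in dry-run mode (quoting is not shown), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: run nvprof on the headless machine and write the profile there.
profile_remote() {
    run ssh "$REMOTE" "cd $REMOTE_DIR && nvprof -f -o $PROFILE ./myapplication"
}

# Step 2: copy the generated file to the current local directory.
fetch_profile() {
    run scp "$REMOTE:$REMOTE_DIR/$PROFILE" .
}

profile_remote
fetch_profile
```

After running it with `DRY_RUN=0`, the copied file can be opened in NVVP through File -> Import.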

To use remote profiling, open NVVP, then File->New Session. Add a Connection instead of using Local, putting in details of the headless machine. Click on Manage... to point NVVP to the toolkit path on the remote machine. Once this one-time setup is done, enter the path to the executable and run as usual.

You can read about remote profiling in the relevant documentation.
