
I am trying to figure out what a profiling result means before I start to optimize. I am very new to CUDA and to profiling in general, and I am confused by the result.

Specifically, I want to know what is happening during seemingly unoccupied chunks of the timeline. When I look from top to bottom across the CPU and GPU rows, there appear to be large portions of the run where nothing is happening: columns with nothing in Thread 1 and nothing in GeForce. Is this normal? What's happening here?

The run was done on a multicore machine under no load, profiled with nvprof. The GPU code was compiled with -arch=sm_20 -m32 -g -G under CUDA 5.
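For reference, the build and profiling steps looked roughly like this (`app.cu` and the binary name are placeholders, not the actual project files):

```
# Debug build, as described above
nvcc -arch=sm_20 -m32 -g -G -o app app.cu

# Timeline collected with nvprof
nvprof ./app
```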

[Screenshot: NVIDIA Visual Profiler timeline showing gaps across the Thread 1 and GeForce rows]

Mikhail
  • Is this windows or linux? If windows, is the GPU in question also hosting the display? – Robert Crovella Dec 12 '12 at 13:14
  • @RobertCrovella Linux, I don't believe there is a physical display on this machine although vnc is turned on. – Mikhail Dec 12 '12 at 13:23
  • I guess X is enabled since you mention vnc. If so, is the GPU in question called out with a relevant Display section in the xorg.conf file? – Robert Crovella Dec 12 '12 at 13:35
  • @RobertCrovella Is this really an explanation for this effect? It would seem that the CUDA portion occupies significantly less than half the GPU. I could see this being the case for a machine that is under OpenGL load, but not my machine. – Mikhail Dec 12 '12 at 13:52
  • It could be that those gaps are caused by the profiler, as it is collecting the metrics and saving them to disk. – Roger Dahl Dec 12 '12 at 15:14
  • @Mikhail The Thread and GeForce rows are parent rows and never contain content. The Thread row is a parent row for the Runtime API, Driver API, and NVTX rows. The GeForce row is a parent row for your CUDA device. The sub-rows show multiple views of memory copies and kernel launches. If you look at only the Runtime API, Profiling Overhead, and Stream 2 rows, you can get a good idea of your execution. At ~400 µs/pixel there is insufficient detail. The GPU work is close to the CPU work, which leads me to believe the CPU thread is busy doing non-CUDA work. Can you post the code or an API trace? – Greg Smith Dec 12 '12 at 17:28
  • @Roger Dahl, the Visual Profiler shows when it is adding overhead in the Profiling Overhead row, so I don't think that is the cause. It's definitely a good point to consider, as tool overhead can impact the performance of a CPU- or disk-bound application. – Greg Smith Dec 13 '12 at 01:59
  • Did you test this in release? (without those debug flags) – BenC Dec 13 '12 at 13:44
  • @BenC Should I? Does the profiler work without `-G`? There are a few things on my plate and I expect to get back to this question in a few days. – Mikhail Dec 13 '12 at 16:36
  • @Mikhail: yes it does (at least with CUDA 5.0). – BenC Dec 13 '12 at 16:49
  • @BenC Your suggestion worked, can you post a formal answer to this question? – Mikhail Feb 25 '13 at 23:47
  • @Mikhail: absolutely, thanks for the feedback :-) – BenC Feb 26 '13 at 04:51

1 Answer


The error here was profiling the code in debug mode (the -G compiler flag: "Generate debug information for device code"). With -G, the behavior and performance of the generated device code change deeply, since most device optimizations are disabled; a debug build should not be used to profile and optimize one's code.
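A minimal sketch of a profiling-friendly build, assuming the same target architecture as in the question (file names are placeholders): drop `-G`, and add `-lineinfo` if you still want source-level correlation in the profiler:

```
# Release build for profiling: no -G, so device code stays optimized;
# -lineinfo adds source correlation without heavily changing code generation
nvcc -arch=sm_20 -m32 -O2 -lineinfo -o app app.cu
nvprof ./app
```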

One other thing: thorough documentation of nvcc's debug mode is hard to find. With -G, nvcc probably spills registers/shared memory to global memory for easier host access and debugging, which may in turn hide problems such as race conditions in shared memory (cf. the discussion here: https://stackoverflow.com/a/10726970/1043187). Thus, tools such as cuda-memcheck --tool racecheck should be run against release builds too.
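As an illustration (a hypothetical kernel, not from the question), racecheck can flag a shared-memory hazard such as the missing barrier below:

```cuda
#include <cstdio>

// Hypothetical example: reverse one block's data through shared memory.
// The commented-out __syncthreads() between the write and the read is a
// race that cuda-memcheck --tool racecheck reports as a hazard.
__global__ void reverse(int *d, int n)
{
    extern __shared__ int s[];
    int t = threadIdx.x;
    s[t] = d[t];
    // __syncthreads();   // forgetting this barrier creates the race
    d[t] = s[n - 1 - t];
}

int main()
{
    const int n = 64;
    int h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = i;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    reverse<<<1, n, n * sizeof(int)>>>(d, n);  // n threads, n ints of shared memory
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[0] = %d (expected %d)\n", h[0], n - 1);
    return 0;
}
```

Built in release mode and run as `cuda-memcheck --tool racecheck ./app`, the missing `__syncthreads()` shows up as a reported hazard; hence the suggestion above to run racecheck against release builds rather than `-G` builds.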

BenC