18

A piece of code that takes well over a minute on the command line finishes in a matter of seconds in NVIDIA Visual Profiler (running the same .exe). So the natural question is: why? Is something wrong with the command line run, or does Visual Profiler do something different and not really execute everything as it would on the command line?

I'm using CUBLAS, Thrust and cuRAND.

Incidentally, there's been a noticeable slowdown in compiled code on my machine very recently, even in old code that previously ran quickly, which is why I'm getting suspicious.

Update:

  • I have checked that the calculated output on the command line and in Visual Profiler is identical, i.e. all required code has been run in both cases.
  • GPU-shark indicated that my performance state was unchanged at P0 when I switched from command line to Visual Profiler.
  • However, GPU usage was reported at 0.0% when run with Visual Profiler, but went as high as 98% when run from the command line.
  • Moreover, far less memory is used with Visual Profiler. When run from the command line, Task Manager indicates usage of 650-700 MB of memory (spiking at the first `cudaFree(0)` call). In Visual Profiler that figure drops to ~100 MB.
mchen
  • 9,808
  • 17
  • 72
  • 125
  • Posting the piece of code can help a lot. – Ian Medeiros May 15 '13 at 02:27
  • 2
    Well, the piece of code in question is actually a project spanning 15 interdependent files, so probably out of the scope of this question. I was simply wondering if anyone else had encountered the Visual Profiler phenomenon and had an explanation for it. – mchen May 15 '13 at 02:46
  • 12
    The CUDA profilers (Nsight VSE, Visual Profiler, nvprof, and the CUDA command line profiler) put the GPU in the highest P-state to make sure the results are consistent. This should not make a difference of more than a few percent. The more likely cause is that your application is failing when you run the profiler. Can you confirm that your application runs to completion and that no errors occur? – Greg Smith May 15 '13 at 03:50
  • @gregsmith - yes, it runs to completion. At each stage in the code I print to the console, and all console messages are present. I notice that when I purposely `throw` an error in a function, only the remaining code in that *function* seems to get skipped. – mchen May 15 '13 at 10:12
  • 3
    And what is a P-state? – mchen May 15 '13 at 10:13
  • GPU-shark indicated that my performance state was unchanged at P0 when I switched from command line to Visual Profiler. However, GPU usage was reported at 0.0% when run with Visual Profiler, but went as high as 98% when run off command line. – mchen May 15 '13 at 10:35
  • Moreover, *far* less memory is used with Visual Profiler. When run off command line, task manager indicates usage of 650-700MB of memory (spikes at the first `cudaFree(0)` call). In Visual Profiler that figure goes down to ~100MB. – mchen May 15 '13 at 10:42
  • @MiloChen: Using the system and code I have in front of me, I find that running the application either in `nvprof` or in `nvvp` produces a completely predictable *slowdown* of about 0.5s on an application which takes about 40 wallclock seconds to run to completion. This is in an application that allocates >2 GB of memory on the device and executes >120 cublas calls. What you are asking about is a property of your code or your system, not something general. – talonmies May 15 '13 at 11:15
  • @talonmies - OK, thanks. So then the natural question is, what property of the code could possibly cause it to run over 10x faster with Visual Profiler? – mchen May 15 '13 at 11:18
  • @MiloChen: The obvious one is coverage differences. Despite your protests to the contrary, this is the most likely explanation. Occam's razor is rarely wrong in these situations. – talonmies May 15 '13 at 11:40
  • @talonmies - thanks, but what do you mean by "coverage difference"? – mchen May 15 '13 at 11:51
  • 1
    I mean that not all the code in your application is running (or running the same volume of instructions) when you are profiling it. – talonmies May 15 '13 at 11:56
  • @talonmies - ok, there's an easy way to check that all the code is running in visual profiler (i.e. no cheating): print the final calculated output. In doing so, the final output matrix is *identical* when run on command line or visual profiler, yet visual profiler is well over 10x (if not 100x) faster. I cannot think of what could be the cause. – mchen May 16 '13 at 10:21
  • 1
    If you can provide your code somewhere so we can try it, we can help investigate further. Meanwhile, I've marked this question as "too broad" because without something to test, we can't confirm or deny the hypothesis that not all of your code is running when in the profiler. – harrism Jul 03 '13 at 02:01
  • 1
    @MiloChen: You are now the owner of the most upvoted unanswered CUDA question on [SO]. Did you ever resolve this? Are you still working on the problem? It would be good if this could be resolved one way or another to get it off the unanswered question list. – talonmies Aug 06 '13 at 19:07
  • Wow didn't expect so many people to have the same problem. I'm working late in the office today, but I'll try and get the code uploaded tomorrow evening. Thanks for your diligent moderating. – mchen Aug 06 '13 at 20:34
  • 2
    If the code runs 10x faster in the tool while the GPU usage is so much lower at the same time, the only decent idea I have is that some emulation mode is used when you run under the tool. And for this particular workload, running on the CPU yields better performance, which is not rare given that much of the caching happens automatically on the CPU side, while it requires explicit thought and explicit work in environments like CUDA and OpenCL. I'd recommend taking a look at the various build options and tool settings to see if anything mentions an emulation mode. – Alexey Alexandrov Sep 24 '13 at 05:56
  • Do you have multiple streams? Visual Profiler, for example, disables parallel stream processing as well as copy/compute overlap. Also, does your code contain some kind of numerical optimization loop that runs until some numerical condition is reached? Because if you have random memory corruption (due to erroneous copy/compute overlap, for example), it can cause more work for such codes. – Hugo Maxwell Sep 24 '13 at 18:45
  • Key question: How do the runtimes scale as you repeat the calculation 5x, 10x, 20x? Maybe Visual Profiler is performing start-up stuff that you are not including in your timing. Using wall-clock time, how long does it take to start up Visual Profiler, load your program, and run it? And then how long to run the program from command line? – dmm Oct 02 '13 at 14:24
  • If you are running on Windows and you installed Nsight VSE please make sure that you did not set the environment variable to make the application attachable by the Nsight VSE CUDA Debugger. Enabling the ability to attach can significantly slow down the application. The environment variable is NSIGHT_CUDA_DEBUGGER=1. – Greg Smith Oct 09 '13 at 21:32

3 Answers

7

This is an old question, but I've just finished chasing the same issue (though the cause may not be the same).

Namely: my app achieved between 900 and 1100 frames (synchronous launches) per second when running under NVVP, but around 100-120 when running outside of the profiler.

The cause appears to be a status message I was printing to the console via cout. I had intended for this to only happen about once every 100-200 frames. Instead, it was printing the status message for every frame, and the console IO became the bottleneck.

Once I printed the status message only every 100 frames (though the optimal interval here would depend on your application), the frame rate jumped back up to match what I was seeing in NVVP. Of course, the console output could also be handled in a separate CPU thread if that sort of overhead is unacceptable in your circumstances.
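
For illustration, here is a minimal sketch of that kind of throttling; the interval, variable names, and per-frame work are placeholders, not the original application's code:

```cpp
// Minimal sketch: print a status line only every kStatusInterval frames.
#include <iostream>

int main()
{
    const int kStatusInterval = 100;   // placeholder; tune for your application

    for (int frame = 0; frame < 10000; ++frame)
    {
        // ... per-frame GPU work (kernel launches, cuBLAS calls, etc.) ...

        if (frame % kStatusInterval == 0)
        {
            // '\n' rather than std::endl avoids forcing a flush on every status line
            std::cout << "frame " << frame << " done\n";
        }
    }
    return 0;
}
```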

NVVP has to redirect stdout to its own internal buffer in order to capture the application's output (which it shows in its console tab). That mechanism appears to have significantly less overhead than letting the operating system handle the output directly: NVVP seems to buffer everything and either display it from a separate thread, or simply accumulate output until some threshold is reached before appending it to its console tab.

So, my advice would be to disable any console IO and see whether (and how much) that affects things.
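
One quick, non-invasive way to test that is to discard stdout entirely for a run and compare the wall-clock time. A sketch, assuming a Windows build (the question mentions Task Manager); `NUL` is the Windows null device, so use `/dev/null` on Linux:

```cpp
// Diagnostic hack only: route stdout to the null device so console IO costs vanish.
#include <cstdio>
#include <iostream>

int main()
{
    // After this, printf output (and, with the default sync_with_stdio setting,
    // typically std::cout as well) is discarded instead of hitting the console.
    std::freopen("NUL", "w", stdout);

    std::printf("this line is discarded\n");
    std::cout << "so is this one" << std::endl;

    // ... run the rest of the application as usual and time it ...
    return 0;
}
```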

(It didn't help that VS2012 refused to profile my CUDA app. It would have been nice to see that 80% of the execution time was spent on console IO.)

Hope this helps!

3Dave
  • 28,657
  • 18
  • 88
  • 151
0

This should not happen. I've never seen anything like it; probably something in your setup.

GaryO
  • 5,873
  • 1
  • 36
  • 61
0

It could be that some JIT-compile step is skipped by the profiler, which could also explain the difference in memory usage. Try creating a fat binary?
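
For reference, a fat binary is produced by passing several `-gencode` pairs to `nvcc`, so that SASS for each target architecture (plus PTX for forward compatibility) is embedded at build time and no JIT compilation is needed when the application starts. A sketch only; the file name and compute capabilities below are placeholders, so substitute the ones matching your GPUs:

```
nvcc -O2 myapp.cu -o myapp \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35
```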

damienfrancois
  • 52,978
  • 9
  • 96
  • 110