
I'm having some issues with the CUDA nvprof profiler. Some of the metrics on the documentation site are named differently than they are in the profiler, and the variables don't seem to be explained anywhere on the site, or for that matter anywhere on the web (I wasn't able to find any valid reference).

I decoded most of them (here: calculating gst_throughput and gld_throughput with nvprof), but I'm still not sure about these two:

elapsed_cycles
max_warps_per_sm

Does anyone know precisely how to compute these?

I'm trying to use nvprof to assess some 6000 different kernels from the command line, so using the Visual Profiler is not really viable for me.
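For illustration, here is a minimal sketch of the kind of batch run I mean (kernels.txt, the log file, and the chosen event are placeholders, not my actual setup):

# minimal sketch of the batch invocation; kernels.txt lists one
# binary per line, and the event name is just an example
while read kernel_binary; do
    nvprof --events active_cycles "./$kernel_binary" >> profile_log.txt 2>&1
done < kernels.txt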

Any help appreciated. Thanks very much!

EDIT: What I'm using:

CUDA 5.0 and a GTX 480, which is compute capability 2.0.

What I've already done:

I've made a script that scrapes the formula for each metric from the profiler documentation site, resolves the dependencies of any given metric, collects the required events through nvprof, and then computes the results from them. This involved a (rather large) sed script that rewrites every variable name appearing on the site into the equivalent name actually accepted by the profiler (sketched below). In effect, I've emulated querying metrics via nvprof. I'm only having problems with these two variables:
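To give an idea of the rewriting involved, here is an illustrative fragment of that normalization step (the two name pairs below are made-up examples, not entries from my real table):

# illustrative fragment of the name-normalization sed script;
# the documentation-name -> nvprof-name pairs are placeholders
sed -e 's/\binstructions_issued\b/inst_issued/g' \
    -e 's/\bglobal_store_transactions\b/gst_request/g' \
    formulas.txt > formulas_nvprof.txt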

Why there is a problem with those concrete variables:

max_warps_per_sm - I can't tell whether this is a fixed bound determined by the compute capability, or some other metric/event specific to my program that I am somehow missing (which wouldn't be a surprise, as some variables in the profiler documentation have 3 (!) different names, all for the same thing).

elapsed_cycles - There is no elapsed_cycles in the output of nvprof --query-events, nothing even containing the word "elapse", and the only event containing "cycle" is "active_cycles". Could that be it? Is there any other way to compute it? Is there any harm in using "gputime" instead of this variable? I don't need absolute numbers; I'm using the metrics to find correlations and analyze code, so if "gputime" = "elapsed_cycles" * CONSTANT, I'm perfectly okay with that.
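To make that substitution concrete, this is the back-of-the-envelope conversion I have in mind; it assumes gputime is reported in microseconds and uses the GTX 480's ~1401 MHz shader clock, both of which are assumptions on my part:

# hypothetical conversion from gputime (assumed to be in microseconds)
# to clock cycles; 1401 MHz is the GTX 480 shader clock
awk -v us="$gputime_us" 'BEGIN { print us * 1e-6 * 1401e6 }'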

    Which version of CUDA are you using? The profiling tools evolve, so we need that information in order to help you. – BenC May 02 '13 at 01:17

1 Answer


You can use the following command that lists all the events available on each device:

nvprof --query-events

The descriptions are not very complete, but they are a good starting point for understanding what these events/metrics are. For instance, with CUDA 5.0 and a CC 3.0 GPU, we get:

elapsed_cycles_sm: Elapsed clocks

elapsed_cycles_sm is the number of elapsed clock cycles per multiprocessor. If you want to measure this metric for your program:

nvprof --events elapsed_cycles_sm ./your_program

max_warps_per_sm is quite straightforward: it is the maximum number of resident warps per multiprocessor. This value depends on the compute capability (see the chart here). It is a hardware limit: no matter what your kernels are, at any given time you will never have more resident warps per multiprocessor than this value.
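For your CC 2.0 card this limit is 48 warps (1536 threads per multiprocessor / 32 threads per warp). If you want to confirm the value on your own hardware, the deviceQuery sample that ships with the CUDA samples prints the per-SM thread limit (this assumes you have the samples built, and the exact wording of the output line may vary between CUDA versions):

# run from the compiled CUDA samples; divide the reported thread
# limit by the warp size (32) to obtain max_warps_per_sm
./deviceQuery | grep -i "threads per multiprocessor"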

Also, more information is available in the profiler's online documentation, with descriptions and formulae.

UPDATE

According to this answer:

active_cycles: Number of cycles a multiprocessor has at least one active warp.
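If that event is what your card exposes instead of elapsed_cycles_sm, you can collect it with the same pattern as above:

nvprof --events active_cycles ./your_program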

  • Hi! First, thanks for your answer. Second, I use CUDA 5.0 with a GTX 480, which is cc 2.0. I've of course looked through the profiler documentation and nvprof --query-events. I used – Melchior May 04 '13 at 23:26
  • Uah ... took too long to edit the comment I accidentally submitted :) The problem is that the profiler documentation names differ from the nvprof --query-events names, and there is no explanation on the site. As to the variables - I was unsure whether "max_warps_per_sm" means how many warps of my code can run on an SM (which could be bounded by shared memory usage) or some static compute-capability bound, thanks for clarifying! However, I tried nvprof --query-events | grep elapsed (or grepping for sm, cycles) and found nothing resembling what I want. Any ideas how to compute this from something else? – Melchior May 04 '13 at 23:37
  • @Melchior: the events available through `nvprof` depend on your hardware. `max_warps_per_sm` is not a runtime limit, it is a hardware limit: no matter what your kernels are, this is the maximum number of resident warps per multiprocessor. `active_cycles` is the number of cycles a multiprocessor has at least one active warp (see [this](http://stackoverflow.com/a/14886806/1043187)). – BenC May 06 '13 at 05:44
  • Also, some events may not be available in older cards, and it is not always possible to compute some particular metrics. What exactly would you like to measure/compute? – BenC May 06 '13 at 05:51
  • Sorry for the late response; I used gputime instead of elapsed_cycles, and max_warps_per_sm as you instructed. Just to answer your question: I'm profiling approximately 6k fused kernels of simple linear algebra functions and getting as much info as I can. Therefore I'm interested in getting *any* metrics I can :) Anyway, accepting. – Melchior May 21 '13 at 13:45
  • @Melchior: I see! Glad I could help! `gputime` is one of the default outputs. It can be enough when optimizing your code, unless you know exactly what you are trying to minimize/maximize. Even then, the optimization done by the compiler makes it difficult to evaluate a priori the effect any change to the code may have on the metrics. – BenC May 21 '13 at 14:03