
I'm trying to get some benchmark timings in my CUDA program with nvprof, but unfortunately it doesn't seem to be profiling any API calls or kernels. I looked for a simple beginner's example to make sure I was doing it right and found one on the NVIDIA dev blogs here:

https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

Code:

#include <cstdlib>
#include <cstring>

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);   // host buffer
    int *d_a;
    cudaMalloc((void**)&d_a, bytes);  // device buffer

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // device -> host

    return 0;
}

Command line:

-bash-4.2$ nvcc profile.cu -o profile_test
-bash-4.2$ nvprof ./profile_test

So I replicated it word for word, line by line, and ran it with identical command-line arguments. Unfortunately, the result was the same:

-bash-4.2$ nvprof ./profile_test
==85454== NVPROF is profiling process 85454, command: ./profile_test
==85454== Profiling application: ./profile_test
==85454== Profiling result:
No kernels were profiled.

==85454== API calls:
No API activities were profiled. 

I am running the NVIDIA CUDA Toolkit 7.5.

If anyone knows what I'm doing wrong, I'd be grateful to know the answer.

-----EDIT-----

So I modified the code to be

#include <cuda_profiler_api.h>
#include <cstdlib>
#include <cstring>

int main()
{
    cudaProfilerStart();   // explicitly start profiling
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((void**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaProfilerStop();    // stop profiling and flush buffered data
    return 0;
}

Unfortunately it did not change things.

theKunz
  • Out of the blue, what is the kernel you are trying to profile? – Florent DUGUET May 01 '16 at 19:46
  • @FlorentDUGUET It's an implementation of a compressed sparse row matrix compression algorithm. I'm trying to get some metrics on its performance. – theKunz May 01 '16 at 20:00
  • You should check the return values of all your API calls; it's most likely that you have an error you are not catching. You could also run it through `cuda-memcheck`, which will report errors from API calls, but it's best practice to *always* check the return values from *any* API (see the sketch after these comments). – Tom May 01 '16 at 20:14
  • @Tom Thanks, that helped. Every call had an error of "No CUDA capable device is detectable". – theKunz May 01 '16 at 21:22
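
A minimal sketch of the return-value checking Tom recommends (the checkCuda macro name is my own convention here, not part of the original code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message whenever a CUDA runtime call fails.
#define checkCuda(call)                                               \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   checkCuda(cudaMalloc((void**)&d_a, bytes));
//   checkCuda(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));

Wrapping every runtime call this way would have surfaced the "no CUDA-capable device is detected" error immediately.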

2 Answers


It's a bug with unified memory profiling; running nvprof with the flag

nvprof --unified-memory-profiling off ./profile_test

resolves all problems for me.

Kratos
  • This answer should be the accepted one; it's the only one that actually worked for me, and it worked even when I didn't include the cudaProfilerStart and cudaProfilerStop API calls that the other answer suggested. – xdavidliu Oct 05 '19 at 15:54
  • Thanks a lot for the help Kratos. I had the same issue under Ubuntu 20.04 and CUDA 10.1. This additional flag solved this issue for me! – Jürgen Brauer Jun 08 '21 at 11:37

You need to call cudaProfilerStop() (for the Runtime API) before exiting from the thread. This allows nvprof to collect all the necessary data.

According to the CUDA documentation:

To avoid losing profile information that has not yet been flushed, the application being profiled should make sure, before exiting, that all GPU work is done (using CUDA synchronization calls), and then call cudaProfilerStop() or cuProfilerStop(). Doing so forces buffered profile information on corresponding context(s) to be flushed.
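
A minimal sketch of that shutdown sequence (the trivial dummy kernel is my own addition, so the profiler has something to record):

#include <cuda_profiler_api.h>

__global__ void dummy() {}    // trivial kernel so nvprof has something to record

int main()
{
    cudaProfilerStart();
    dummy<<<1, 1>>>();
    cudaDeviceSynchronize();  // make sure all GPU work is done...
    cudaProfilerStop();       // ...then flush the buffered profile data
    return 0;
}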

Grzegorz Szpetkowski
  • Alternatively, calling `cudaDeviceReset` on exit will trigger a profile buffer flush during the explicit context destruction (sketched after these comments). – talonmies May 01 '16 at 19:15
  • I tried your suggestion; unfortunately it's still not profiling (see the edited code). – theKunz May 01 '16 at 19:37
  • It might be that the compiler optimizes the code away, or that there are issues with the API calls (i.e., check the error codes). You may also try adding at least one kernel to profile. – Grzegorz Szpetkowski May 01 '16 at 19:57
  • For the onlooker, `cudaProfilerStop()` is declared in `#include "cuda_profiler_api.h"` (CUDA 10.1). – interestedparty333 Jun 19 '19 at 04:46
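
A minimal sketch of talonmies' cudaDeviceReset alternative (again with a trivial kernel of my own, for illustration):

__global__ void dummy() {}

int main()
{
    dummy<<<1, 1>>>();
    cudaDeviceReset();  // explicit context destruction triggers a profile buffer flush
    return 0;
}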