
I am new to CUDA development and am trying to profile my code. I expected to use the nvprof tool for profiling, but it prints the following warning:

======== Warning: This version of nvprof doesn't support the underlying device, GPU profiling skipped

After some searching, I found that nvprof is a legacy tool and that profiling should now be done with the Nsight Systems CLI. Running `nsys nvprof ./myapp` generates two files: `report1.nsys-rep` and `report1.sqlite`. How can I use them to obtain profiling information about my code?

Environment:

WSL with Ubuntu 20.04

NVIDIA Nsight Systems version 2023.1.2.43-32377213v0

Nvprof: Release version 10.1.243 (21)

NVCC: Cuda compilation tools, release 10.1, V10.1.243

I expect to obtain information similar to what nvprof provides.

So far I have only tried `nsys nvprof ./myapp`. Is this the correct command, or is there a better variant?

Output of `nsys profile --stats=true ./diverged`:

Generating '/tmp/nsys-report-04e5.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: .../sum_reduction/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)       Name
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  --------------
     74.7        364907400          6  60817900.0  72485919.0   4489170  100201745   42231058.9  poll
     24.3        118728446        345    344140.4     81962.0       541   10034413    1039273.8  ioctl
      0.6          2840826          9    315647.3    449904.0      2254     535093     236455.8  read
      0.2           920219          2    460109.5    460109.5    105991     814228     500799.2  sem_timedwait
      0.1           471795          2    235897.5    235897.5     70382     401413     234074.3  pthread_create
      0.1           310682         25     12427.3      8907.0      2785      95078      18330.8  mmap
      0.0            84580          9      9397.8     10049.0      1473      15419       4316.1  open
      0.0            80611         13      6200.8      4559.0      1382      17002       5451.1  fopen
      0.0            65704          3     21901.3     21310.0     20649      23745       1630.5  write
      0.0            48833         26      1878.2        70.5        60      46898       9182.3  fgets
      0.0            18413          6      3068.8      1738.0      1182       8455       2815.7  fclose
      0.0             8245          1      8245.0      8245.0      8245       8245          0.0  pipe2
      0.0             7233          2      3616.5      3616.5      1853       5380       2494.0  munmap
      0.0             6662          5      1332.4      1533.0       351       1853        579.3  fcntl

[5/8] Executing 'cuda_api_sum' stats report
SKIPPED: .../sum_reduction/report2.sqlite does not contain CUDA trace data.
[6/8] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: .../sum_reduction/report2.sqlite does not contain CUDA kernel data.
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: .../sum_reduction/report2.sqlite does not contain GPU memory data.
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: .../sum_reduction/report2.sqlite does not contain GPU memory data.
dru10
  • @dru10 Is the profiled application using CUDA? – Zois Tasoulas May 19 '23 at 23:15
  • Does `nsys profile --stats=true ./test` output CUDA calls information? – Zois Tasoulas May 19 '23 at 23:44
  • Please don't post pictures of terminal output. Instead copy the text into your question, i.e. as a code block. – paleonix May 20 '23 at 06:55
  • `nsys profile --stats=true ./test` doesn't seem to output any CUDA calls information. Updated the question to display the output – dru10 May 20 '23 at 08:02
  • In this case, either the application is not using CUDA or you could be facing a bug with CUDA tracing on WSL. Make sure you check the CUDA calls in your source for returned errors. You can also open the report in the GUI and look at the Diagnostics Summary view, which can provide hints if there is an issue with the app or the system. – Zois Tasoulas May 20 '23 at 16:20
  • Profiling on WSL2 is [only supported](https://docs.nvidia.com/cuda/wsl-user-guide/index.html#nvidia-compute-software-support-on-wsl-2) on Volta or newer GPUs and requires Windows 11 and a recent driver. It will not work with the driver shipped with CUDA 10.1 – Robert Crovella May 20 '23 at 17:56
  • @RobertCrovella I have Windows 11 and a GeForce RTX 3050. Is it worth changing the driver and CUDA versions? – dru10 May 21 '23 at 18:17
  • For the moment I have solved my issues programmatically with the [CUDA Event API](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html). This outputs only execution times though. – dru10 May 21 '23 at 18:21

2 Answers


nvprof is a legacy tool and will not be receiving new features. It would be best to switch to Nsight Systems or Nsight Compute, depending on your profiling goals.

Unless you have a specific profiling goal, the suggested strategy is to start with Nsight Systems to determine system bottlenecks and identify the kernels that affect performance the most. As a second step, you can use Nsight Compute to profile the identified kernels and find ways to optimize them.

If you are familiar with nvprof and want to keep using it, Nsight Systems supports the `nvprof` command. You can find more information in the documentation section "Migrating from NVIDIA nvprof" or by running `nsys nvprof --help`.

> When running nsys nvprof ./myapp 2 files are generated: report1.nsys-rep and report1.sqlite. How can I make use of these to obtain profiling information about my code?

Regarding the `.nsys-rep` file, you can view its content using the Nsight Systems GUI, which is available for Windows, Linux (x86_64, SBSA), and macOS. That means you can collect a profile on your target machine, share it, and view it on other machines too. For example, you can download the Windows Host to install the GUI.

You can extract profiling information in a terminal by using the `nsys stats` and `nsys analyze` commands. Both commands accept either an `.nsys-rep` file or an `.sqlite` file as input.
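
As a rough illustration of scripting the terminal workflow, the sketch below wraps a single `nsys stats` call in Python. The helper names `build_stats_command` and `run_stats` are hypothetical, and `cuda_gpu_kern_sum` is simply one of the report names that appear in the `--stats=true` output in the question. It assumes `nsys` is on the `PATH` and returns `None` otherwise.

```python
import shutil
import subprocess


def build_stats_command(report_file, report_name="cuda_gpu_kern_sum"):
    """Return the argument list for one `nsys stats` report (hypothetical helper)."""
    return ["nsys", "stats", "--report", report_name, report_file]


def run_stats(report_file, report_name="cuda_gpu_kern_sum"):
    """Run `nsys stats` and return its text output, or None if nsys is not installed."""
    if shutil.which("nsys") is None:
        return None
    result = subprocess.run(
        build_stats_command(report_file, report_name),
        capture_output=True,
        text=True,
        check=False,  # the caller inspects the output instead of catching an exception
    )
    return result.stdout
```

Calling `print(run_stats("report1.nsys-rep"))` would print the kernel summary when the report contains CUDA kernel data. If the report has none, as in the question, expect the same "SKIPPED" message.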

`.sqlite` files can also be used as conventional database files, which is probably only needed for more advanced use cases.
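
A minimal sketch of treating the export as an ordinary database, using Python's standard `sqlite3` module. The function name `list_tables` is hypothetical, and the sketch makes no assumption about which tables Nsight Systems writes. It only lists whatever tables the file contains, which is a reasonable first step before writing queries.

```python
import sqlite3


def list_tables(sqlite_path):
    """Return the names of all tables in an SQLite file, sorted by name."""
    # Note: sqlite3.connect creates an empty file if the path does not exist.
    connection = sqlite3.connect(sqlite_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

For example, `list_tables("report1.sqlite")` shows which tables were exported. If no CUDA-related table appears, that matches the "does not contain CUDA trace data" messages from `nsys stats`.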

Zois Tasoulas
  • `nsys stats report1.nsys-rep` also does not extract any information about CUDA data. How can I include such information in my compiled binary so that I can profile it? Current binary is obtained by `nvcc diverged.cu -o diverged` – dru10 May 20 '23 at 08:07
  • You don't need to include something special in your source for the profiler to trace CUDA. If a binary makes CUDA calls, Nsight Systems will trace these calls. No special compilation flag is needed. – Zois Tasoulas May 20 '23 at 16:16
  • Compiling with `-lineinfo` may improve the experience when using Nsight Compute, but that doesn't apply to anything here. – Robert Crovella May 20 '23 at 17:57

You need to run something like `nsys profile -t cuda ./test` for CUDA profiling.

Tyler2P