I am trying to get some insight of why my CUDA kernel has a relatively low performance and I am hoping to get some answers with the NVIDIA profiler.
My CUDA program is a 'boiled down' version of a larger application, isolating and exercising the kernel in question. The program launches the kernel several times in order to measure it's execution time as a mean over multiple launches. After the timing loop a memory copy from device to host is issued to make sure all kernel calls have finished. The program is written in CUDA C++.
This is how I built the program:
main.o: main.cu
nvcc -res-usage -arch=sm_61 -c $<
main: main.o stopwatch.o
g++ -o $@ $^ -lcudart -L/usr/local/cuda-11.0/lib64
This test was done on a PC with Intel CPU and an NVIDIA GeForce GTX 1070. The OS is Ubuntu 20.04 with a freshly installed CUDA 11 from the NVIDIA website along with driver 450.51.06:
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 On | 00000000:01:00.0 On | N/A |
| 28% 38C P8 8W / 151W | 317MiB / 8111MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The following command was used to generate the profiling file:
sudo /usr/local/cuda-11.0/bin/nvprof -o main.nvvp --profile-from-start off ./main
I also tried with profiling from start but it leads to the same issue below.
The following command was used to launch the visual profiler:
nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java main.nvvp
The Visual profiler walks me through several steps and when it comes to "Perform Kernel Analysis" the program tells me:
Insufficient kernel bounds data. The data needed to calculate compute, memory, and latency bounds for the kernel could not be collected
Is this sort of detailed profiling not available on my GPU? (maybe because it's a gamer card)