nvprof is a command-line profiler that enables you to collect and view CPU and GPU timers and events in CUDA programs.
Questions tagged [nvprof]
89 questions
1
vote
1 answer
What would cause nvprof to return no data?
I have a Fortran MPI code instrumented with OpenACC. It is a big code. No way I can provide any meaningful snippets here. It runs fine under Cray aprun:
aprun -n 15 ./mycode
I want to profile it with nvprof. I try:
aprun -n 15 -b nvprof…

bob.sacamento
1
vote
1 answer
CUDA concurrent kernel launch not working
I'm writing a CUDA program for image processing. The same kernel, "processOneChannel", will be launched for each of the R, G and B channels.
Below I try to specify streams for the three kernel launches so they can be processed concurrently. But nvprof says they are still…

jszair
0
votes
2 answers
Nsys CLI profiling guidance
I am just getting into CUDA development and am now trying to profile my code. I expected to run the nvprof tool for profiling, but I get the following error:
======== Warning: This version of nvprof doesn't support the underlying device, GPU…

dru10
0
votes
0 answers
Matrix addition in CUDA C: the code executes, but when profiling it with nvprof it says no kernels were profiled
nvprof profiles the API just fine, but says no kernels were profiled. It shows these two warning messages: "
==525867== Warning: 4 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the…

Fasil
0
votes
1 answer
nvprof Warning: The path to CUPTI and CUDA Injection libraries might not be set in LD_LIBRARY_PATH
I get the message in the subject when I try to run a program I developed with OpenACC through Nvidia's nvprof profiler like this:
nvprof ./SFS 4
If I run nvprof with -o [output_file] the warning message doesn't appear, but the output file is not…

Bojan Niceno
0
votes
1 answer
Meaning of the "flop_count_sp" and "inst_fp_32" metric in CUDA Profiler
According to the profiler user guide:
flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The…

Booo
0
votes
1 answer
NVIDIA Visual Profiler: Insufficient kernel bounds data
I am trying to get some insight into why my CUDA kernel has relatively low performance, and I am hoping to get some answers with the NVIDIA profiler.
My CUDA program is a 'boiled down' version of a larger application, isolating and exercising the…

ritter
0
votes
1 answer
Why don't I get "thread_inst_executed"
When I list nvprof's metrics with
nvprof --query-events
I see:
thread_inst_executed: Number of instructions executed by the active threads. For each instruction it increments by number of threads, including predicated-off threads, that execute…

Richard
0
votes
1 answer
dram_write_bytes result on P100
I used nvprof to profile a simple vecadd example (n=1024) on a P100 but observed that dram_write_bytes is only 256 (rather than the 1024*4 that I expected). Can someone explain why this number is so small? What other metrics do I need to add in to count for…

llodds
0
votes
1 answer
How to stop running TensorRT server without using ctrl-c (for profiling with nvprof)
I'm running nvprof to profile GPU usage of a TensorRT server-client model.
Here's what I'm doing:
Run nvprof on terminal 1 within a docker container with TensorRT enabled, nvprof --profile-all-processes -o results%p.nvvp
Run TensorRT server on…

WannabeArchitect
0
votes
0 answers
What is the reason for K80 versus Pascal performance differences in this program that adds two arrays?
I followed the example on this page to get started with CUDA programming. It uses addition of two arrays with a million elements each for illustration with different execution configurations.
I used a Tesla P100 (Pascal architecture) to run the code…

Rajesh Shashi Kumar
0
votes
1 answer
nvprof warning on CUDA_VISIBLE_DEVICES
When I use os.environ['CUDA_VISIBLE_DEVICES'] in PyTorch, I get the following message:
Warning: Device on which events/metrics are configured are different than the device on which it is being profiled. One of the possible reason is setting…

Di Huang
0
votes
1 answer
No GPU activities in profiling with nvprof
I ran nvprof.exe on a function that initializes data, calls three kernels and frees the data. Everything was profiled as it should be, and I got a result like this:
==7956== Profiling application: .\a.exe
==7956== Profiling result:
GPU activities: 52.34% 25.375us …

Егор Лебедев
0
votes
1 answer
Do the SMs shown in the "occupancy graph" correspond to `blockIdx.x` or register `%smid`?
Do the SMs shown in the "occupancy graph" correspond to blockIdx.x or register %smid?
Here's an example of such a graph
And here's some sample output from when I print the blockIdx.x as the "logical" block, and print register %smid (accessed via…

interestedparty333
0
votes
1 answer
nvprof - profiling data are not recorded
I am trying to profile my CUDA program, using the nvprof tool.
Here is my code:
#include
#include
#include
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float…

PintoDoido