nvprof is a command-line profiler that enables you to collect and view CPU and GPU timers and events in CUDA programs.
Questions tagged [nvprof]
89 questions
0
votes
1 answer
NVIDIA Visual Profiler: Encountered invalid option : --openacc-profiling
Running a simple application in the NVIDIA Visual Profiler shows the error:
Encountered invalid option : --openacc-profiling
======== Use "nvprof --help" to get more information.
Any GPU application I try to profile gets the same error.
I tried to…

Rodolfo
0
votes
1 answer
nvprof is using all available GPUs when profiling a Python script
I am using a remote machine with 2 GPUs to execute a Python script that contains CUDA code. To find where I can improve the performance of my code, I am trying to use nvprof.
I have set in my code that I only want to use one…

Filipe Aleixo
0
votes
1 answer
Using nvprof to Count CUDA Kernel Executions
Is it possible to use nvprof to count the number of CUDA kernel executions (i.e., how many kernels are launched)?
Right now when I run nvprof, what I am seeing is:
==537== Profiling application: python tf.py
==537== Profiling result:
Time(%) Time …

Alex Rothberg
0
votes
2 answers
Is there some in-code profiling for CUDA programs?
In the OpenCL world there is a function, clGetEventProfilingInfo, which returns all profiling info for an event, such as queued, submitted, start, and end times in nanoseconds. It is quite convenient because I can printf that info whenever I want.
For example…

petRUShka
0
votes
1 answer
Profiling result doesn't appear in event/metric summary mode of nvprof
According to the documentation for event/summary mode of nvprof, the output looks like:
==6461== Profiling application: matrixMul
==6461== Profiling result:
==6461== Event result:
//The outputs
==6461== Metric result:
//The outputs
The…

user3813674
0
votes
1 answer
Global load transaction count with coalesced memory access
I've created a simple kernel to test coalesced memory access by observing the transaction counts on an NVIDIA GTX 980 card. The kernel is:
__global__
void copy_coalesced(float * d_in, float * d_out)
{
int tid = threadIdx.x +…

BAdhi
0
votes
1 answer
nvprof with MPICH
I am trying to profile an MPI/OpenACC Fortran code. I found a site that details how to run nvprof with MPI here. The examples given are for OpenMPI. However, I am limited to MPICH and I can't figure out the equivalent. Anyone know what it would…

bob.sacamento
0
votes
1 answer
Is there any difference in the output of nvvp (visual) and nvprof (command line)?
To measure metrics/events for CUDA programs, I have tried using the command line like:
nvprof --metrics <>
I also measured the same metrics on the Visual profiler nvvp. I noticed no difference in the values I get.
I noticed a…

Kajal
0
votes
1 answer
Where can I find the missing formulas in the latest NVIDIA CUDA Profiler User Guide?
I found that in the previous version of the Profiler User Guide, formulas for the metrics were provided.
For example,
Metric Name: branch_efficiency
Description: Ratio of non-divergent branches to total branches
Formula: 100 * (branch -…

Steven Huang
0
votes
1 answer
What exactly does the nvprof power profile measure?
I have used nvprof to get the power profile of a Kepler-architecture NVIDIA GPU. My question is: what exactly are we seeing? If I understand correctly, there are 12 V and 3.3 V rails feeding the GPU, and the GPU can draw power from the PCI bus. Is the…

travelingbones
0
votes
1 answer
My CUDA nvprof 'API Trace' and 'GPU Trace' are not synchronized - what to do?
I'm using the CUDA 7.0 profiler, nvprof, to profile some process making CUDA calls:
$ nvprof -o out.nvprof /path/to/my/app
Later, I generate two traces: the 'API trace' (what happens on the host CPU, e.g. CUDA runtime calls and ranges you mark) and…

einpoklum
-1
votes
1 answer
Why are operations in two CUDA streams not overlapping?
My program is a pipeline that contains multiple kernels and memcpys. Each task goes through the same pipeline with different input data. The host code first chooses a Channel, an encapsulation of scratchpad memory and CUDA objects, when it…

StrikeW
-1
votes
1 answer
How to print API calls per thread with nvprof
I am profiling a CUDA application and dumping the logs to a file, say target.prof.
My application uses multiple threads to dispatch kernels, and I want to observe the API calls from just one of those threads.
I tried using nvprof -i target.prof…

Tapan Chugh
-1
votes
1 answer
CUDA logarithm: nvprof detects single precision operations in double precision
I'm computing "log(x)" in double precision in CUDA, but when I profile, nvprof detects single precision operations using the metric "flop_count_sp_special".
I'm compiling with "-arch=sm_30" to ensure compute capability 3.0 and double precision arithmetic,…

Jesse Chan