Questions tagged [nvprof]

nvprof is a command-line profiler that enables you to collect and view CPU and GPU timers and events in CUDA programs.

89 questions
0
votes
1 answer

nvidia visual profiler Encountered invalid option : --openacc-profiling

Running a simple application in the NVIDIA Visual Profiler shows the error: Encountered invalid option : --openacc-profiling ======== Use "nvprof --help" to get more information. Any GPU application I try to profile gets the same error. I tried to…
Rodolfo
  • 1,091
  • 3
  • 13
  • 35
0
votes
1 answer

nvprof is using all available GPUs when profiling python script

I am using a remote machine, which has 2 GPUs, in order to execute a Python script which has CUDA code. In order to find where I can improve the performance of my code, I am trying to use nvprof. I have set in my code that I only want to use one…
Filipe Aleixo
  • 3,924
  • 3
  • 41
  • 74
0
votes
1 answer

Using nvprof to Count CUDA Kernel Executions

Is it possible to use nvprof to count the number of CUDA kernel executions (i.e., how many kernels are launched)? Right now when I run nvprof what I am seeing is: ==537== Profiling application: python tf.py ==537== Profiling result: Time(%) Time …
Alex Rothberg
  • 10,243
  • 13
  • 60
  • 120
0
votes
2 answers

Is there some in-code profiling for CUDA programs?

In the OpenCL world there is a function clGetEventProfilingInfo which returns all profiling info of an event, like queued, submitted, start and end times in nanoseconds. It is quite convenient because I'm able to printf that info whenever I want. For example…
petRUShka
  • 9,812
  • 12
  • 61
  • 95
0
votes
1 answer

Profiling Result doesn't appear in event/metric summary mode nvprof

According to the documentation for event/summary mode of nvprof, the output looks like: ==6461== Profiling application: matrixMul ==6461== Profiling result: ==6461== Event result: //The outputs ==6461== Metric result: //The outputs The…
user3813674
  • 2,553
  • 2
  • 15
  • 26
0
votes
1 answer

Global load transaction count when in coalesced memory access

I've created a simple kernel to test coalesced memory access by observing the transaction counts on an NVIDIA GTX 980 card. The kernel is: __global__ void copy_coalesced(float * d_in, float * d_out) { int tid = threadIdx.x +…
BAdhi
  • 420
  • 7
  • 19
0
votes
1 answer

nvprof with MPICH

I am trying to profile an MPI/OpenACC Fortran code. I found a site that details how to run nvprof with MPI here. The examples given are for OpenMPI. However, I am limited to MPICH and I can't figure out the equivalent. Anyone know what it would…
bob.sacamento
  • 6,283
  • 10
  • 56
  • 115
0
votes
1 answer

Is there any difference in the output of nvvp (visual) and nvprof (command line)?

To measure metrics/events for CUDA programs, I have tried using the command line like: nvprof --metrics <> I also measured the same metrics in the Visual Profiler, nvvp. I noticed no difference in the values I get. I noticed a…
Kajal
  • 581
  • 11
  • 24
0
votes
1 answer

Where can I find the missing formulas in the latest NVIDIA CUDA Profiler User Guide?

I found that in the previous version of the profiler user guide, formulas for the metrics are provided. For example, Metric Name: branch_efficiency Description: Ratio of non-divergent branches to total branches Formula: 100 * (branch -…
Steven Huang
  • 153
  • 1
  • 13
0
votes
1 answer

What exactly does NVPROF Power Profile measure?

I have used NVPROF to get the power profile of a Kepler-architecture NVIDIA GPU. My question is what exactly are we seeing? If I understand correctly there is a 12V and 3.3V rail feeding the GPU and the GPU can draw power from the PCI Bus. Is the…
travelingbones
  • 7,919
  • 6
  • 36
  • 43
0
votes
1 answer

My CUDA nvprof 'API Trace' and 'GPU Trace' are not synchronized - what to do?

I'm using the CUDA 7.0 profiler, nvprof, to profile some process making CUDA calls: $ nvprof -o out.nvprof /path/to/my/app Later, I generate two traces: the 'API trace' (what happens on the host CPU, e.g. CUDA runtime calls and ranges you mark) and…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
-1
votes
1 answer

Why are operations in two CUDA streams not overlapping?

My program is a pipeline, which contains multiple kernels and memcpys. Each task will go through the same pipeline with different input data. The host code will first choose a Channel, an encapsulation of scratchpad memory and CUDA objects, when it…
StrikeW
  • 501
  • 1
  • 4
  • 11
-1
votes
1 answer

How to print API calls per thread with nvprof

I am profiling a CUDA application and dumping the logs to a file, say target.prof. My application uses multiple threads to dispatch kernels and I want to observe the API calls from just one of those threads. I tried using nvprof -i target.prof…
Tapan Chugh
  • 354
  • 2
  • 4
-1
votes
1 answer

CUDA logarithm: nvprof detects single precision operations in double precision

I'm computing "log(x)" in double precision in CUDA, but when I profile, it detects single precision operations using metric "flop_count_sp_special". I'm compiling with "-arch=sm_30" to ensure compute capability 3.0 and double precision arithmetic,…
Jesse Chan
  • 168
  • 9