Questions tagged [nvprof]

nvprof is a command-line profiler that enables you to collect and view CPU and GPU timers and events in CUDA programs.

89 questions
3
votes
1 answer

What is a transaction and a request in the 'gld_transactions_per_request' metric of the CUDA profiler?

For perfectly coalesced accesses to an array of 4096 doubles, each 8 bytes, nvprof reports the following metrics on an Nvidia Tesla V100: global_load_requests: 128, gld_transactions: 1024, gld_transactions_per_request: 8.000000. I cannot find a…
anroesti
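The three figures quoted in the gld_transactions_per_request question above are consistent with one another, which a few lines of arithmetic can show. This is only a sketch: it assumes a warp of 32 threads, one 8-byte double loaded per thread, one request per warp-level load, and a 32-byte transaction size. None of these assumptions is stated in the excerpt, and the variable names are illustrative.

```python
# Assumptions (not taken from the excerpt): 32 threads per warp,
# one double loaded per thread, one request per warp-level load,
# and 32-byte transactions.
NUM_DOUBLES = 4096
BYTES_PER_DOUBLE = 8
WARP_SIZE = 32
TRANSACTION_BYTES = 32

# One request per warp covering 32 consecutive doubles.
requests = NUM_DOUBLES // WARP_SIZE

# Each request touches 32 * 8 = 256 contiguous bytes.
bytes_per_request = WARP_SIZE * BYTES_PER_DOUBLE

# 256 bytes split into 32-byte transactions.
transactions_per_request = bytes_per_request // TRANSACTION_BYTES

transactions = requests * transactions_per_request

print(requests, transactions, transactions_per_request)
```

Under these assumptions the script reproduces the reported values of 128 requests, 1024 transactions, and 8 transactions per request.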
3
votes
2 answers

Get the execution time in nvprof

Is there a way to get the kernel execution time in nvprof, as with a metric? For example, to get the DRAM read transactions I type: nvprof --metrics dram_read_transactions ./myprogram. My question is: is there something like nvprof --metrics…
user352102
3
votes
2 answers

Numba and guvectorize for CUDA target: Code running slower than expected

Notable details: large datasets (10 million x 5), (200 x 10 million x 5); NumPy mostly; takes longer after every run; using Spyder3; Windows 10. First thing is attempting to use guvectorize with the following function. I am passing in a bunch of numpy…
Bryce Booze
3
votes
0 answers

What does a slice mean in CUDA?

I'm new to CUDA programming. I have to profile my application on the GPU using nvprof. I found a metric, l2_subp0_write_sector_misses, that means the number of write requests sent to DRAM from slice 0 of the L2 cache. But I don't know what a slice…
kh.chung
3
votes
1 answer

What exactly are the transaction metrics reported by NVPROF?

I'm trying to figure out what exactly each of the metrics reported by "nvprof" is. More specifically, I can't figure out which transactions are System Memory and Device Memory reads and writes. I wrote a very basic program just to help figure this…
B.Md
2
votes
1 answer

nvprof --metrics works with C++ executable but not with Fortran executable

I am trying to learn CUDA and I am now stuck at running a simple nvprof command. I am testing a simple script in both C++ and Fortran using CUDA. The CUDA kernels test two different ways of performing a simple task with the intent to show the…
2
votes
0 answers

CUDA nvprof on Windows: "Warning: unable to locate profiling library, GPU profiling skipped" (NOT cupti64_102.dll)

I am trying to use nvprof on a CUDA/C++ program, but I get the output: ======== Warning: unable to locate profiling library, GPU profiling skipped ... my output ... ======== Warning: No CUDA application was profiled, exiting My command: nvprof.exe…
nonsence90
2
votes
1 answer

Running nvprof --metrics command under Windows gives an error: CUDA profiling error

Running the nvprof --metrics command under Windows gives an error: ==6580== NVPROF is profiling process 6580, command: Project1.exe ==6580== Error: Internal profiling error 4292:1. ======== Error: CUDA profiling error. error1 If I only use the nvprof…
bourbon
2
votes
1 answer

Why does nvprof not have metrics on floating-point division operations?

Using nvprof to measure floating-point operations of my sample kernels, it seems that there is no flop_count_dp_div metric, and the actual double-precision division operations are measured in terms of add/mul/fma of double-precision and even…
bruin
2
votes
1 answer

Where is the boundary between the start and end of the CPU launch and the GPU launch in the Nvidia profiler nvprof?

What is the definition of start and end of kernel launch in the CPU and GPU (yellow block)? Where is the boundary between them? Please notice that the start, end, and duration of those yellow blocks in CPU and GPU are different. Why CPU invocation…
skytree
2
votes
0 answers

nvprof produces unexpected branch efficiency results

I followed the examples (the following code) of warp divergence in the textbook "Professional CUDA C Programming". __global__ void math_kernel1(float *c) { int tid = blockIdx.x * blockDim.x + threadIdx.x; float a, b; a = b = 0.f; if…
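As background for the branch efficiency question above: nvprof's branch_efficiency metric is the ratio of non-divergent branches to total branches, expressed as a percentage. The toy model below sketches that ratio for a single conditional per warp. It is not a reproduction of the truncated kernel: the two predicates, the helper name `branch_efficiency`, and the thread count are hypothetical, and the model ignores compiler optimizations such as replacing short branches with predicated instructions, which is one reason measured values can differ from a hand count.

```python
WARP_SIZE = 32  # assumed warp width


def branch_efficiency(predicate, num_threads):
    """Percentage of warp-level branches in which every thread agrees.

    Toy model: each warp executes exactly one branch on `predicate`.
    """
    branches = 0
    divergent = 0
    for start in range(0, num_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, num_threads)
        outcomes = {bool(predicate(tid)) for tid in range(start, end)}
        branches += 1
        if len(outcomes) > 1:  # threads in this warp took different paths
            divergent += 1
    return 100.0 * (branches - divergent) / branches


# Hypothetical predicates, not taken from the truncated excerpt.
def per_thread(tid):
    return tid % 2 == 0  # alternates within a warp, so every warp diverges


def per_warp(tid):
    return (tid // WARP_SIZE) % 2 == 0  # uniform within each warp


print(branch_efficiency(per_thread, 64))
print(branch_efficiency(per_warp, 64))
```

In this model the per-thread condition yields 0% and the warp-aligned condition yields 100%; real kernels usually contain additional uniform branches and compiler transformations, so profiler output rarely matches such a simple count.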
2
votes
0 answers

nvprof shows error with TensorFlow

I am trying to run nvprof with cifar10_multigpu_train.py. I am using the following command: /home/ibm/tensorflow/third_party/gpus/cuda/bin/nvprof python cifar10_multi_gpu_train.py It starts the application, but after some time it shows the following errors…
Khayam Gondal
2
votes
2 answers

Unable to import nvprof generated profile data

I am trying to profile TensorFlow-based code using nvprof. I am using the following command for this: nvprof python ass2.py The program runs successfully, but at the end it shows the following error. ==49791== Profiling application: python…
Khayam Gondal
2
votes
1 answer

Data Size to Instructions per Warp relationship in CUDA

I tried to see the number of instructions executed in a kernel when the size of the data type changed. In order to get a custom-sized data structure, I created a struct as follows: #define DATABYTES 40 __host__ __device__ struct floatArray { …
BAdhi
2
votes
2 answers

nvprof not picking up any API calls or kernels

I'm trying to get some benchmark timings in my CUDA program with nvprof, but unfortunately it doesn't seem to be profiling any API calls or kernels. I looked for a simple beginner's example to make sure I was doing it right and found one on the…
theKunz