Highest Voted 'nvprof' Questions

1

vote

1 answer

What would cause nvprof to return no data?

I have a Fortran MPI code instrumented with OpenACC. It is a big code, so I cannot provide any meaningful snippets here. It runs fine under Cray aprun: `aprun -n 15 ./mycode`. I want to profile it with nvprof, so I try: aprun -n 15 -b nvprof…

asked Jun 08 '16 at 19:09

bob.sacamento

6,283
10
56
115

1

vote

1 answer

CUDA concurrent kernel launch not working

I'm writing a CUDA program for image processing. The same kernel, "processOneChannel", is launched once for each of the R, G and B channels. Below I try to specify streams for the three kernel launches so they can be processed concurrently. But nvprof says they are still…

c++ image-processing cuda nvprof

asked Apr 09 '16 at 21:53

jszair

55
4

0

votes

2 answers

Nsys CLI profiling guidance

I am new to CUDA development and am trying to profile my code. I expected to run the nvprof tool for profiling, but I get the following error: ======== Warning: This version of nvprof doesn't support the underlying device, GPU…

cuda profiling nsight nvprof nsight-systems

asked May 19 '23 at 19:29

dru10

13
5

0

votes

0 answers

Matrix addition in CUDA C: code executes, but when profiling it with nvprof, it says no kernels are profiled

nvprof profiles the API just fine, but says "No kernels were profiled". It shows these 2 warning messages: " ==525867== Warning: 4 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the…

c matrix cuda gpgpu nvprof

asked Oct 19 '21 at 16:35

Fasil

1
2

0

votes

1 answer

nvprof Warning: The path to CUPTI and CUDA Injection libraries might not be set in LD_LIBRARY_PATH

I get the message in the subject when I try to run a program I developed with OpenACC through Nvidia's nvprof profiler like this: `nvprof ./SFS 4`. If I run nvprof with -o [output_file], the warning message doesn't appear, but the output file is not…

nvidia openacc nvprof

asked Sep 24 '20 at 10:30

Bojan Niceno

113
1
1
11
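
The warning in the question above refers to `LD_LIBRARY_PATH`. As a rough diagnostic sketch (not taken from the question or its answer), the snippet below lists which directories on a search path contain a file whose name starts with a given prefix. The helper name `dirs_with_library` and the prefix `libcupti` are assumptions chosen for illustration; where CUPTI actually lives depends on the CUDA installation.

```python
import os


def dirs_with_library(name_prefix, search_path):
    """Return the directories in search_path holding a file that starts with name_prefix.

    search_path uses the platform separator (':' on Linux), like LD_LIBRARY_PATH.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries and entries that are not existing directories.
        if not directory or not os.path.isdir(directory):
            continue
        try:
            entries = os.listdir(directory)
        except OSError:
            # Unreadable directory: treat it as containing nothing useful.
            continue
        if any(entry.startswith(name_prefix) for entry in entries):
            hits.append(directory)
    return hits


# "libcupti" is an assumed file-name prefix for the CUPTI shared library.
found = dirs_with_library("libcupti", os.environ.get("LD_LIBRARY_PATH", ""))
print(found if found else "no directory on LD_LIBRARY_PATH contains libcupti*")
```

An empty result would be consistent with the warning, although it does not prove that this is the cause.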

0

votes

1 answer

Meaning of the "flop_count_sp" and "inst_fp_32" metric in CUDA Profiler

According to the profiler user guide: flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The…

cuda gpu profiler nvprof nvvp

asked Sep 09 '20 at 17:06

Booo

493
3
13
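
The counting rule quoted in the question above (adds and multiplies count once, each multiply-accumulate contributes 2) can be written out as a small worked example. This is only an arithmetic sketch of that quoted rule; the function name and the per-type operation counts are made up for illustration and are not profiler output.

```python
def flop_count_sp(adds, multiplies, fmas):
    """Single-precision FLOP count under the quoted rule.

    Each add and each multiply counts as 1; each multiply-accumulate counts as 2.
    The inputs are operation counts already summed over non-predicated threads.
    """
    return adds + multiplies + 2 * fmas


# Illustrative numbers only: 10 adds, 5 multiplies, 20 multiply-accumulates.
print(flop_count_sp(adds=10, multiplies=5, fmas=20))  # 10 + 5 + 2*20 = 55
```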

0

votes

1 answer

NVIDIA Visual Profiler: Insufficient kernel bounds data

I am trying to get some insight into why my CUDA kernel performs relatively poorly, and I am hoping to get some answers with the NVIDIA profiler. My CUDA program is a 'boiled down' version of a larger application, isolating and exercising the…

cuda nvprof nvvp

asked Aug 18 '20 at 22:08

ritter

7,447
7
51
84

0

votes

1 answer

Why don't I get "thread_inst_executed"

When I list nvprof's events with `nvprof --query-events`, I see: thread_inst_executed: Number of instructions executed by the active threads. For each instruction it increments by number of threads, including predicated-off threads, that execute…

cuda gpu profiling nvidia nvprof

asked Jul 30 '20 at 06:20

Richard

56,349
34
180
251

0

votes

1 answer

dram_write_bytes result on P100

I used nvprof to profile a simple vecadd example (n=1024) on a P100 but observed that dram_write_bytes is only 256 (rather than the 1024*4 that I expected). Can someone explain why this number is so small? What other metrics do I need to add to account for…

cuda nvprof

asked Jul 14 '20 at 03:45

llodds

153
1
2
11

0

votes

1 answer

How to stop running TensorRT server without using ctrl-c (for profiling with nvprof)

I'm running nvprof to profile GPU usage of a TensorRT server-client model. Here's what I'm doing: Run nvprof in terminal 1 within a Docker container with TensorRT enabled: `nvprof --profile-all-processes -o results%p.nvvp`. Run TensorRT server on…

docker tensorrt nvidia-docker nvprof nvvp

asked Mar 16 '20 at 07:30

WannabeArchitect

1,058
2
11
22

0

votes

0 answers

What is the reason for K80 versus Pascal performance differences in this program that adds two arrays?

I followed the example on this page to get started with CUDA programming. It uses addition of two arrays with a million elements each for illustration with different execution configurations. I used a Tesla P100 (Pascal architecture) to run the code…

cuda gpu nvidia nvprof

asked Feb 27 '20 at 15:44

Rajesh Shashi Kumar

137
10

0

votes

1 answer

nvprof warning on CUDA_VISIBLE_DEVICES

When I use os.environ['CUDA_VISIBLE_DEVICES'] in PyTorch, I get the following message: Warning: Device on which events/metrics are configured are different than the device on which it is being profiled. One of the possible reason is setting…

python cuda pytorch nvprof

asked Dec 20 '19 at 03:34

Di Huang

63
8
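
`CUDA_VISIBLE_DEVICES` is read when the CUDA runtime initializes, so one approach sometimes suggested for situations like the question above is to set it in the environment of the launched process rather than inside the script. The sketch below only illustrates that idea and is not the accepted answer. `profiling_env`, `run_under_nvprof` and the script name passed to it are hypothetical, and the launcher assumes `nvprof` and `python` are on `PATH`.

```python
import os
import subprocess


def profiling_env(device_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_ids)
    return env


def run_under_nvprof(script, device_ids):
    """Hypothetical launcher: the profiler and the script inherit the same device list."""
    return subprocess.run(
        ["nvprof", "python", script],
        env=profiling_env(device_ids),
        check=False,
    )


# Only builds the environment here; nothing is launched.
env = profiling_env([1], base_env={"PATH": "/usr/bin"})
print(env["CUDA_VISIBLE_DEVICES"])  # prints 1
```

Calling `run_under_nvprof("train.py", [1])` (with `train.py` as a placeholder name) would start the profiler with device 1 as the only visible GPU.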

0

votes

1 answer

No GPU activities in profiling with nvprof

I ran nvprof.exe on a function that initializes data, calls three kernels, and frees the data. Everything was profiled as it should be, and I got a result like this: ==7956== Profiling application: .\a.exe ==7956== Profiling result: GPU activities: 52.34% 25.375us …

c++ c cuda nvidia nvprof

asked Nov 03 '19 at 17:25

Егор Лебедев

1,161
1
10
26

0

votes

1 answer

Do the SMs shown in the "occupancy graph" correspond to `blockIdx.x` or register `%smid`?

Do the SMs shown in the "occupancy graph" correspond to blockIdx.x or register %smid? Here's an example of such a graph. And here's some sample output from when I print the blockIdx.x as the "logical" block, and print register %smid (accessed via…

cuda nvprof

asked Aug 15 '19 at 21:01

interestedparty333

2,386
1
21
35

0

votes

1 answer

nvprof - profiling data are not recorded

I am trying to profile my CUDA program, using the nvprof tool. Here is my code: #include #include #include // Kernel function to add the elements of two arrays __global__ void add(int n, float *x, float…

cuda nvcc nvprof

asked Jul 01 '19 at 10:36

PintoDoido

1,011
16
35
