CUDA zero-copy performance

Question

Does anyone have experience analyzing the performance of CUDA applications that use the zero-copy memory model (reference here: Default Pinned Memory Vs Zero-Copy Memory)?

I have a kernel that uses the zero-copy feature and with NVVP I see the following:

Running the kernel on an average problem size, I get an instruction replay overhead of 0.7%, so nothing major, and all of that 0.7% is global memory replay overhead.

When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.

However, the global load efficiency and global store efficiency are the same for both the normal problem size run and the very large problem size run. I'm not really sure what to make of this combination of metrics.

The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero copy feature. Any ideas of what type of statistics I should be looking at?

score 5 · Accepted Answer · answered Dec 14 '12 at 03:08

Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:

The memory operation used a size specifier (vector type) that requires multiple transactions to perform the address divergence calculation and communicate data to/from the L1 cache.
The memory operation had thread address divergence, requiring access to multiple cache lines.
The memory transaction missed the L1 cache. When the missed data is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
The LSU resources are full and the instruction needs to be replayed when the resources become available.

The approximate latency to each level is:

L2: 200-400 cycles
device memory (DRAM): 400-800 cycles
zero-copy memory over PCIe: 1000s of cycles

The replay overhead is increasing because of the increase in misses and in contention for LSU resources that comes with the increased latency.

The global load efficiency does not change because it is the ratio of the ideal amount of data that would need to be transferred for the executed memory instructions to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache line boundary (for a full warp, a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, and a 128-bit operation is 4 cache lines). Accessing zero-copy memory is slower and less efficient, but it does not increase or change the amount of data transferred.
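To make the ratio concrete, here is a small host-side sketch in C++ that models one warp's access pattern. It is a simplified model under stated assumptions (32 threads per warp, 128-byte cache lines, thread t accessing `base + t * stride_bytes`), not the profiler's actual implementation. The names `lines_touched` and `load_efficiency` are hypothetical helpers made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumptions for this sketch: a full warp of 32 threads and 128-byte lines.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kLineBytes = 128;

// Count the distinct cache lines touched when thread t accesses
// access_bytes bytes starting at base + t * stride_bytes.
std::size_t lines_touched(std::uint64_t base, std::size_t access_bytes,
                          std::size_t stride_bytes) {
    std::set<std::uint64_t> lines;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::uint64_t first = base + t * stride_bytes;
        const std::uint64_t last = first + access_bytes - 1;
        // An access may straddle a line boundary, so walk every line it covers.
        for (std::uint64_t line = first / kLineBytes; line <= last / kLineBytes; ++line) {
            lines.insert(line);
        }
    }
    return lines.size();
}

// Requested bytes divided by bytes moved in whole lines, as a percentage.
double load_efficiency(std::uint64_t base, std::size_t access_bytes,
                       std::size_t stride_bytes) {
    const double requested = static_cast<double>(kWarpSize * access_bytes);
    const double transferred =
        static_cast<double>(lines_touched(base, access_bytes, stride_bytes) * kLineBytes);
    return 100.0 * requested / transferred;
}
```

In this model, `lines_touched(0, 4, 4)`, `lines_touched(0, 8, 8)` and `lines_touched(0, 16, 16)` give 1, 2 and 4 lines, matching the ideal cases above, while a 128-byte stride with 4-byte accesses touches 32 lines. Nothing in the calculation depends on whether the addresses refer to device memory or to zero-copy host memory, which is why the efficiency metric stays the same while latency and replay overhead grow.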

The profiler exposes the following counters:

gld_throughput
l1_cache_global_hit_rate
dram_{read, write}_throughput
l2_l1_read_hit_rate

In the zero copy case all of these metrics should be much lower.

The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).

As for Nsight, isn't that only for Windows? If it is, is it possible to profile on a remote machine and view the PCIe activity on that machine? — user926914, Dec 14 '12 at 12:00
Nsight Visual Studio Edition (Windows) has a different CUDA profiler and CUDA trace tool than the Visual Profiler. Nsight VSE supports remote profiling and trace. Nsight Eclipse Edition (Mac and Linux) integrates the Visual Profiler. Nsight EE does not support remote profiling and tracing. Neither Nsight VSE 3.0 nor the Visual Profiler in CUDA 5.0 supports monitoring of PCIe traffic to/from the GPU. — Greg Smith, Dec 14 '12 at 16:32
