I have a problem profiling the L2 cache on my CUDA card of compute capability 3.5. In Kepler (3.x), loads from global memory are cached only in L2 and never in L1. My question is: how do I use nvprof (the command-line profiler) to find the hit rate my global loads achieve in the L2 cache? I have queried all the metrics I can collect, and the ones involving the L2 cache are:
l2_l1_read_hit_rate: Hit rate at L2 cache for all read requests from L1 cache
l2_texture_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
l2_l1_read_throughput: Memory read throughput seen at L2 cache for read requests from L1 cache
l2_texture_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
l2_utilization: The utilization level of the L2 cache relative to the peak utilization
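For reference, I am listing the available metrics and collecting them along these lines (./my_app is just a placeholder for my binary):

    nvprof --query-metrics
    nvprof --metrics l2_l1_read_hit_rate,l2_read_transactions,l2_read_throughput ./my_app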
The only hit rate I get is for read requests coming from L1. But reads from global memory should never come from L1, since they are not cached there! Or am I wrong here, and that is exactly the metric I want?
Surprisingly (or not), there is a metric giving the L1 hit rate for global loads:
l1_cache_global_hit_rate: Hit rate in L1 cache for global loads
Can this ever be non-zero for Kepler?
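For concreteness, this is the kind of toy kernel I would profile to see whether that metric can move off zero: each thread issues several global loads, and neighbouring blocks re-read the same elements, so any cache level that keeps the data should register hits. (toy_l2.cu and ./toy_l2 are just placeholder names; this is a minimal sketch, not my real code.)

    // toy_l2.cu -- minimal test kernel with repeated global loads
    // Compile:  nvcc -arch=sm_35 toy_l2.cu -o toy_l2
    // Profile:  nvprof --metrics l2_l1_read_hit_rate,l1_cache_global_hit_rate ./toy_l2
    #include <cuda_runtime.h>

    __global__ void repeatedLoads(const float *in, float *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;

        float acc = 0.0f;
        // 16 global loads per thread; thread t of block b reads the elements
        // that blocks b+1, b+2, ... read in their earlier iterations, so the
        // same data is requested repeatedly across blocks.
        for (int iter = 0; iter < 16; ++iter)
            acc += in[(idx + iter * blockDim.x) % n];

        out[idx] = acc;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));

        repeatedLoads<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }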
Cheers!