I have a problem profiling the L2 cache on my CUDA card of compute capability 3.5. In Kepler (3.x), loads from global memory are cached only in L2 and never in L1. My question is: how do I use nvprof (the command-line profiler) to find the hit rate my global loads achieve in the L2 cache? I have queried all the metrics I can collect, and the ones involving the L2 cache are:
l2_l1_read_hit_rate: Hit rate at L2 cache for all read requests from L1 cache
l2_texture_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
l2_l1_read_throughput: Memory read throughput seen at L2 cache for read requests from L1 cache
l2_texture_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
l2_utilization: The utilization level of the L2 cache relative to the peak utilization
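For reference, I am listing the available metrics and collecting them along these lines (./my_app is just a placeholder for my binary):

    nvprof --query-metrics
    nvprof --metrics l2_l1_read_hit_rate,l2_read_transactions,l2_read_throughput ./my_app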
The only hit rate I get is for read requests coming from L1. But reads from global memory should never come from L1, since they are not cached there! Or am I wrong here, and that is exactly the metric I want?
Surprisingly (or not), there is a metric giving the L1 hit rate for global loads:
l1_cache_global_hit_rate: Hit rate in L1 cache for global loads
Can this ever be non-zero for Kepler?
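For concreteness, this is the kind of toy kernel I would profile to see whether that metric can move off zero: each thread issues several global loads, and neighbouring blocks re-read the same elements, so any cache level that keeps the data should register hits. (toy_l2.cu and ./toy_l2 are just placeholder names; this is a minimal sketch, not my real code.)

    // toy_l2.cu -- minimal test kernel with repeated global loads
    // Compile:  nvcc -arch=sm_35 toy_l2.cu -o toy_l2
    // Profile:  nvprof --metrics l2_l1_read_hit_rate,l1_cache_global_hit_rate ./toy_l2
    #include <cuda_runtime.h>

    __global__ void repeatedLoads(const float *in, float *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;

        float acc = 0.0f;
        // 16 global loads per thread; thread t of block b reads the elements
        // that blocks b+1, b+2, ... read in their earlier iterations, so the
        // same data is requested repeatedly across blocks.
        for (int iter = 0; iter < 16; ++iter)
            acc += in[(idx + iter * blockDim.x) % n];

        out[idx] = acc;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));

        repeatedLoads<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }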
Cheers!