
When profiling some CUDA applications, I see that the local hit rate (the local_hit_rate metric) is 0%.

I want to distinguish between the following two cases for that value.

  1. The application makes no accesses to the local cache.

  2. All accesses to the local cache were misses.

How can I find out which case applies? Since the values of inst_compute_ld_st, ldst_issued and ldst_executed are non-zero, is it safe to rule out the first case? Or is there something else to consider?

The device is an M2000, which is CC5.2.

talonmies
mahmood

1 Answer


nvprof supports both events (raw counters) and metrics. These can be queried using the following commands:

    nvprof --query-events
    nvprof --query-metrics

CC5.*/6.* Local Memory Metrics

  • local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
  • local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
  • local_load_transactions: Number of local memory load transactions
  • local_store_transactions: Number of local memory store transactions
  • local_hit_rate: Hit rate for local loads and stores
  • local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches expressed as percentage
  • local_load_throughput: Local memory load throughput
  • local_store_throughput: Local memory store throughput
  • inst_executed_local_loads: Warp level instructions for local loads
  • inst_executed_local_stores: Warp level instructions for local stores
  • l2_local_load_bytes: Bytes read from L2 for misses in Unified Cache for local loads
  • l2_local_global_store_bytes: Bytes written to L2 from Unified Cache for local and global stores. This does not include global atomics.
  • local_load_requests: Total number of local load requests from Multiprocessor
  • local_store_requests: Total number of local store requests from Multiprocessor
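As a rough illustration of how the per-request metrics relate to the counts in this list, the sketch below divides a transaction count by a request count. The function name and the sample numbers are hypothetical, and the profiler's exact internal computation is an assumption here; the sketch only mirrors the "average transactions per load/store" description above.

```python
def transactions_per_request(transactions, requests):
    """Average transactions per request, or None when there were no requests.

    Hypothetical helper: 'transactions' stands for local_load_transactions
    (or local_store_transactions) and 'requests' for local_load_requests
    (or local_store_requests), as reported by the profiler.
    """
    if requests == 0:
        # No local requests were made, so an average is undefined.
        return None
    return transactions / requests


# Made-up counts, purely for illustration.
print(transactions_per_request(64, 16))
print(transactions_per_request(0, 0))
```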

local_*_requests is the number of instructions executed to local memory via the generic address space or the local address space. On CC5.*/6.* I do not recall whether this includes fully predicated-off instructions.

local_*_transactions is the number of cache accesses that occurred due to the size (32-bit, 64-bit, ...) of the request and the address divergence of the request. If this is non-zero then local memory was accessed.
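The check described above can be sketched as a small helper that separates "no local access" from "every access missed". It assumes the three values were already collected, for example with nvprof --metrics local_load_transactions,local_store_transactions,local_hit_rate; the function name and the returned strings are hypothetical, not part of any profiler API.

```python
def classify_local_hit_rate(load_transactions, store_transactions, hit_rate):
    """Interpret a local_hit_rate reading using the local transaction counts.

    Hypothetical helper: the arguments are the values of
    local_load_transactions, local_store_transactions and local_hit_rate
    (in percent) taken from a profiler run.
    """
    total = load_transactions + store_transactions
    if total == 0:
        # No local transactions: a 0% hit rate only means nothing was accessed.
        return "no local memory access"
    if hit_rate == 0.0:
        # Local memory was accessed, yet no access hit in the cache.
        return "local memory accessed, all accesses missed"
    return "local memory accessed, some accesses hit"


# Made-up values, purely for illustration.
print(classify_local_hit_rate(0, 0, 0.0))
print(classify_local_hit_rate(128, 32, 0.0))
```

Running it on made-up inputs shows the two cases from the question yielding different answers even though local_hit_rate is 0% in both.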

l2_local_*_bytes is the number of bytes of data loaded/stored to the L2 cache.

Greg Smith