local cache hit metric in cuda profiler

Question

When profiling some CUDA applications, I see that the local hit rate (the local_hit_rate metric) is 0%.

I want to distinguish between the following two cases, either of which could produce that value:

The application never accesses the local cache.
All accesses to local cache were misses.

How can I tell which one applies? Since inst_compute_ld_st, ldst_issued and ldst_executed are all non-zero, is it safe to rule out the first case, or is there something else I should check?

The device is an M2000, which is CC5.2 (corrected from CC5.3).


M2000 is not cc5.3 – Robert Crovella Apr 17 '19 at 21:20

score 3 · Accepted Answer · answered Apr 18 '19 at 13:49

nvprof supports both events (raw counters) and metrics. These can be queried using the following commands:

nvprof --query-events
nvprof --query-metrics

CC5.*/6.* Local Memory Metrics

local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
local_load_transactions: Number of local memory load transactions
local_store_transactions: Number of local memory store transactions
local_hit_rate: Hit rate for local loads and stores
local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches expressed as percentage
local_load_throughput: Local memory load throughput
local_store_throughput: Local memory store throughput
inst_executed_local_loads: Warp level instructions for local loads
inst_executed_local_stores: Warp level instructions for local stores
l2_local_load_bytes: Bytes read from L2 for misses in Unified Cache for local loads
l2_local_global_store_bytes: Bytes written to L2 from Unified Cache for local and global stores. This does not include global atomics.
local_load_requests: Total number of local load requests from Multiprocessor
local_store_requests: Total number of local store requests from Multiprocessor
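The metrics listed above can be collected in a single profiling run by passing them as a comma-separated list to nvprof's --metrics option. The following is a minimal Python sketch that only builds such a command line. The helper name build_nvprof_command, the chosen subset of metrics, and the application path ./my_app are illustrative assumptions, not part of the original answer.

```python
import shlex

# Subset of the local memory metrics listed above (the selection is an assumption).
LOCAL_METRICS = [
    "local_load_transactions",
    "local_store_transactions",
    "local_hit_rate",
    "inst_executed_local_loads",
    "inst_executed_local_stores",
]


def build_nvprof_command(app_args, metrics=LOCAL_METRICS):
    """Return an argv list that runs nvprof with the given metrics.

    app_args is the application command line as a list, e.g. ["./my_app"]
    (a hypothetical binary name).
    """
    return ["nvprof", "--metrics", ",".join(metrics), *app_args]


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell.
    print(shlex.join(build_nvprof_command(["./my_app"])))
```

The resulting list can be passed to subprocess.run on a machine where nvprof is installed, or printed and run by hand.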

local_*_requests is the number of instructions executed to local memory via the generic address space or the local address space. On CC5.*/6.* I do not recall whether this includes fully predicated-off instructions.

local_*_transactions is the number of cache accesses that occurred due to the size (32-bit, 64-bit, ...) of the request and the address divergence of the request. If this is non-zero then local memory was accessed.
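Applied to the two cases in the question, the rule above can be sketched as a small decision function. This is only an illustration of the reasoning, assuming the values have already been read from the profiler output by hand. The function name classify_local_access and the returned strings are made up for this example.

```python
def classify_local_access(load_transactions, store_transactions, hit_rate_percent):
    """Interpret a local_hit_rate value using the transaction counts.

    load_transactions  -> value of local_load_transactions
    store_transactions -> value of local_store_transactions
    hit_rate_percent   -> value of local_hit_rate, in percent
    """
    # No transactions at all: local memory was never accessed,
    # so a 0% hit rate carries no information.
    if load_transactions == 0 and store_transactions == 0:
        return "no local memory access"
    # Transactions occurred but none of them hit in the cache.
    if hit_rate_percent == 0.0:
        return "local memory accessed, all accesses missed"
    return "local memory accessed, some hits"


if __name__ == "__main__":
    # Example values are invented for illustration only.
    print(classify_local_access(0, 0, 0.0))
    print(classify_local_access(1200, 300, 0.0))
```

Checking the transaction counts first matters because a 0% hit rate on its own looks identical in both cases.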

l2_local_*_bytes is the number of bytes of data loaded from or stored to the L2 cache.
