I have written some simple benchmarks that perform a series of global memory accesses. When I measure the L1 and L2 cache statistics, I've found out that (in GTX580 that has 16 SMs):
total L1 cache misses * 16 != total L2 cache queries
Indeed the right side is much higher than the left side (around five times). I've heard that some register spilling can be put into L2 too. But my kernel has only less than 28 registers, not that many. I wonder what would be the source of this difference? Or am I misinterpreting the meaning of those performance counters?
Thanks