6

I have written some simple benchmarks that perform a series of global memory accesses. When I measure the L1 and L2 cache statistics, I find that on a GTX 580 (which has 16 SMs):

 total L1 cache misses * 16 != total L2 cache queries

In fact the right-hand side is much larger than the left-hand side (around five times). I've heard that register spills can also go through L2, but my kernel uses fewer than 28 registers, so spilling should not be an issue. What could be the source of this difference? Or am I misinterpreting the meaning of those performance counters?
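To give an idea, here is a minimal sketch of the kind of benchmark I mean (the names and sizes are illustrative, not my exact code):

    // Sketch only: illustrative names and sizes, not the exact benchmark.
    // 16 blocks so that each of the GTX 580's 16 SMs gets one block.
    #include <cuda_runtime.h>

    __global__ void readBench(const float *in, float *out, int iters)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        // A single loop of global memory reads.
        for (int i = 0; i < iters; ++i)
            acc += in[tid + i * blockDim.x * gridDim.x];
        out[tid] = acc;  // keep the reads from being optimized away
    }

    int main()
    {
        const int blocks = 16, threads = 256, iters = 1024;
        const int n = blocks * threads * iters;   // one element per read
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, blocks * threads * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        readBench<<<blocks, threads>>>(in, out, iters);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }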

Thanks

Zk1001
  • How do you measure the cache statistics? I'm wondering whether your kernel is using all 16 SMs at 100%; 28 registers may limit occupancy. – pQB Sep 20 '11 at 13:13
  • The code is extremely simple: just a single for loop inside the kernel, each iteration of which reads from global memory. By the way, I'm pretty sure my kernel uses all the available SMs. There are 16 blocks, so they are divided equally across the 16 SMs. There is no divergence. Cache statistics are measured using performance counters. Occupancy is 0.833 (I don't think that matters here, though). – Zk1001 Sep 21 '11 at 06:15
  • What I am wondering is whether the display output also uses the L2 cache. – Zk1001 Sep 21 '11 at 07:04
  • What is your memory access pattern? A single fetch instruction may be split into several memory transactions (see the sketch after these comments). As long as your kernel does not use local memory, you have no register spills. – CygnusX1 Nov 12 '11 at 12:37
  • @thanhtuan I am working on an answer for this but it depends on what tool you are using for measurement. Are you using the CUDA visual profiler (or command line profiler), or Parallel NSight? – harrism Nov 24 '11 at 04:51
  • @harrism Yes, I am using the CUDA command-line profiler (I guess it gives the same numbers as the visual profiler). If you get an answer, feel free to drop a note here :) – Zk1001 Mar 18 '12 at 14:50
  • Since you waited months to reply I have completely lost the context. – harrism Mar 18 '12 at 23:28
  • It's weird that nobody can give a satisfactory answer to this interesting question. – dalibocai May 02 '12 at 16:06
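To illustrate CygnusX1's point about a single fetch splitting into several transactions, here is a small sketch (the stride values are hypothetical, not taken from the question):

    // Hypothetical example, not the question's benchmark.
    __global__ void stridedRead(const float *in, float *out, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // stride == 1:  a warp's 32 four-byte loads fall within one
        //               128-byte line -> one memory transaction.
        // stride == 32: the 32 loads touch 32 different 128-byte
        //               lines -> one instruction, 32 transactions.
        out[tid] = in[tid * stride];
    }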

2 Answers

2

From the CUDA C Programming Guide, section G.4.2:

Global memory accesses are cached. Using the -dlcm compilation flag, they can be configured at compile time to be cached in both L1 and L2 (-Xptxas -dlcm=ca) (this is the default setting) or in L2 only (-Xptxas -dlcm=cg). A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
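In practice the flag is passed to nvcc like this (the file name bench.cu is just an example):

    nvcc -Xptxas -dlcm=ca bench.cu   # cache in both L1 and L2 (default)
    nvcc -Xptxas -dlcm=cg bench.cu   # cache in L2 only (32-byte transactions)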

Gaszton
1

It could be due to the fact that reads from L1 are 128 bytes long while reads from L2 are 32 bytes long.
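If that is right, the arithmetic lines up roughly with the question, assuming the L2 counter counts 32-byte transactions:

    1 L1 miss = 1 x 128-byte line = 4 x 32-byte L2 transactions
    => expected: total L2 queries ≈ 4 x (total L1 misses x 16)

which is in the same ballpark as the observed factor of five.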

Ravi
  • Reference for the L2 cache line length? All the documentation I have says the L1 and L2 cache line length is 128 bytes on Fermi. – talonmies Nov 28 '11 at 06:56
  • @talonmies I really doubt this. I think the L2 cache line is 32 bytes, and an L1 cache miss will result in 4 memory requests to L2, or something like that. Or maybe I am wrong? It would be great if you could point to some reliable documents giving the numbers. – Zk1001 Mar 18 '12 at 14:53