6

I have written some simple benchmarks that perform a series of global memory accesses. When I measure the L1 and L2 cache statistics, I find that on a GTX 580 (which has 16 SMs):

 total L1 cache misses * 16 != total L2 cache queries

In fact the right-hand side is much larger than the left-hand side (around five times). I've heard that register spills can also go through L2, but my kernel uses fewer than 28 registers, so spilling should not be an issue. What could be the source of this difference? Or am I misinterpreting the meaning of those performance counters?
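To give an idea, here is a minimal sketch of the kind of benchmark I mean (the names and sizes are illustrative, not my exact code):

    // Sketch only: illustrative names and sizes, not the exact benchmark.
    // 16 blocks so that each of the GTX 580's 16 SMs gets one block.
    #include <cuda_runtime.h>

    __global__ void readBench(const float *in, float *out, int iters)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        // A single loop of global memory reads.
        for (int i = 0; i < iters; ++i)
            acc += in[tid + i * blockDim.x * gridDim.x];
        out[tid] = acc;  // keep the reads from being optimized away
    }

    int main()
    {
        const int blocks = 16, threads = 256, iters = 1024;
        const int n = blocks * threads * iters;   // one element per read
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, blocks * threads * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        readBench<<<blocks, threads>>>(in, out, iters);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }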

Thanks

Zk1001
  • How do you measure the cache statistics? I'm wondering whether your kernel is using all 16 SMs at 100%; 28 registers may limit occupancy. – pQB Sep 20 '11 at 13:13
  • The code is extremely simple: just a single for loop inside the kernel, each iteration of which reads from global memory. By the way, I'm pretty sure my kernel uses all the available SMs. There are 16 blocks, so they are divided equally across the 16 SMs. There is no divergence. Cache statistics are measured using performance counters. Occupancy is 0.833 (I don't think that matters here, though). – Zk1001 Sep 21 '11 at 06:15
  • What I am wondering is whether the display output also uses the L2 cache. – Zk1001 Sep 21 '11 at 07:04
  • What is your memory access pattern? A single fetch instruction may be split into several memory transactions (see the sketch after these comments). As long as your kernel does not use local memory, you have no register spills. – CygnusX1 Nov 12 '11 at 12:37
  • @thanhtuan I am working on an answer for this but it depends on what tool you are using for measurement. Are you using the CUDA visual profiler (or command line profiler), or Parallel NSight? – harrism Nov 24 '11 at 04:51
  • @harrism Yes, I am using the CUDA command-line profiler (I guess it gives the same numbers as the visual profiler). If you get an answer, feel free to drop a note here :) – Zk1001 Mar 18 '12 at 14:50
  • Since you waited months to reply I have completely lost the context. – harrism Mar 18 '12 at 23:28
  • It's weird that nobody can give a satisfactory answer to this interesting question. – dalibocai May 02 '12 at 16:06
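To illustrate CygnusX1's point about a single fetch splitting into several transactions, here is a small sketch (the stride values are hypothetical, not taken from the question):

    // Hypothetical example, not the question's benchmark.
    __global__ void stridedRead(const float *in, float *out, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // stride == 1:  a warp's 32 four-byte loads fall within one
        //               128-byte line -> one memory transaction.
        // stride == 32: the 32 loads touch 32 different 128-byte
        //               lines -> one instruction, 32 transactions.
        out[tid] = in[tid * stride];
    }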

2 Answers

2

From the CUDA C Programming Guide, section G.4.2:

Global memory accesses are cached. Using the -dlcm compilation flag, they can be configured at compile time to be cached in both L1 and L2 (-Xptxas -dlcm=ca) (this is the default setting) or in L2 only (-Xptxas -dlcm=cg). A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
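In practice the flag is passed to nvcc like this (the file name bench.cu is just an example):

    nvcc -Xptxas -dlcm=ca bench.cu   # cache in both L1 and L2 (default)
    nvcc -Xptxas -dlcm=cg bench.cu   # cache in L2 only (32-byte transactions)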

Gaszton
1

It could be due to the fact that reads from L1 are 128 bytes long while reads from L2 are 32 bytes long.
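If that is right, the arithmetic lines up roughly with the question, assuming the L2 counter counts 32-byte transactions:

    1 L1 miss = 1 x 128-byte line = 4 x 32-byte L2 transactions
    => expected: total L2 queries ≈ 4 x (total L1 misses x 16)

which is in the same ballpark as the observed factor of five.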

Ravi
  • Reference for the L2 cache line length? All the documentation I have says the L1 and L2 cache line length is 128 bytes on Fermi. – talonmies Nov 28 '11 at 06:56
  • @talonmies I really doubt this. I think the L2 cache line is 32 bytes, and an L1 cache miss will result in 4 memory requests to L2, or something like that. Or maybe I am wrong? It would be great if you could point to some reliable documents giving the numbers. – Zk1001 Mar 18 '12 at 14:53