While studying the shared L2 cache in NVIDIA's Fermi GPUs, I assumed the L2 cache was located on-chip, together with the L1 caches and the SMs. However, some CUDA material I have seen describes the L2 cache as off-chip memory. What confuses me further is that it takes more than 100 cycles to access the L2 cache, which seems slow for an on-chip cache.
Could anyone help me understand how the L2 cache is organized in NVIDIA GPUs?