The microbenchmarking papers I have found, such as [1] and [2], report L2 bandwidths of 1200 GB/s and 900 GB/s, respectively. I'm developing a kernel that attempts to leverage the L2 cache for global read and write operations.
So far, I have not been able to achieve a significant performance boost when writing to the L2 cache, as opposed to just writing to global memory.
Am I misunderstanding how the L2 cache operates? Is it unreasonable to assume that read and write bandwidth are equivalent? Or is some non-straightforward methodology needed to exercise the L2 write bandwidth properly?
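To make the comparison concrete, here is a minimal sketch of the kind of measurement in question. It is not my actual kernel. It assumes that "writing to L2" means repeatedly storing to a buffer small enough to stay resident in L2, and that "writing to global memory" means storing to a buffer many times larger than L2. The names `write_kernel`, `measure`, and `gb_per_s`, the launch configuration (1024 blocks of 256 threads), the buffer sizes (L2/4 and 16x L2), and the iteration counts are all illustrative choices, not tuned values.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(e));                             \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Grid-stride store loop: the whole buffer is rewritten `iters` times
// with 16-byte vector stores. Assumption: the compiler does not remove
// the repeated stores (worth confirming in the generated SASS).
__global__ void write_kernel(float4* buf, size_t n_vec, int iters) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    for (int it = 0; it < iters; ++it) {
        float v = (float)it;
        for (size_t i = tid; i < n_vec; i += stride)
            buf[i] = make_float4(v, v, v, v);
    }
}

// Convert a byte count and an elapsed time in milliseconds to GB/s.
double gb_per_s(size_t bytes, float ms) {
    return (double)bytes / ((double)ms * 1e6);
}

// Time `iters` full rewrites of a `bytes`-sized buffer; returns GB/s.
double measure(size_t bytes, int iters) {
    size_t n_vec = bytes / sizeof(float4);
    float4* buf = nullptr;
    CHECK(cudaMalloc(&buf, n_vec * sizeof(float4)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    write_kernel<<<1024, 256>>>(buf, n_vec, 1);  // warm-up pass
    CHECK(cudaEventRecord(start));
    write_kernel<<<1024, 256>>>(buf, n_vec, iters);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(buf));
    return gb_per_s(n_vec * sizeof(float4) * (size_t)iters, ms);
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    size_t l2 = (size_t)prop.l2CacheSize;  // L2 size in bytes
    if (l2 == 0) { fprintf(stderr, "device reports no L2 cache\n"); return 1; }

    printf("L2 size: %zu bytes\n", l2);
    printf("L2-resident buffer (L2/4): %.1f GB/s\n", measure(l2 / 4, 2000));
    printf("DRAM-sized buffer (16x L2): %.1f GB/s\n", measure(l2 * 16, 20));
    return 0;
}
```

Compiled with something like `nvcc -O3 l2_write.cu`, this prints one write throughput figure for each buffer size. If the two numbers come out close to each other, that would reproduce what I am seeing; if the small buffer is clearly faster, then my real kernel's working set or access pattern is probably what keeps the stores from being absorbed by L2.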