Suppose you have a memory-bound GPU kernel: how close can you get to the stated theoretical bandwidth of the GPU? Even in Mark Harris's "Optimizing Parallel Reduction in CUDA" presentation he 'only' gets 63 GB/s, which is about 73% of the 86.4 GB/s peak bandwidth he quotes for his test GPU (a G80). Could Harris have optimised his kernel further? Are there other techniques that were perhaps too advanced/out of scope for the presentation, e.g. __shfl-type instructions? Why didn't he achieve a higher bandwidth?
This article claims, using a test machine with a Tesla C2050:
"throughput is memory-bandwidth limited, sustaining around 75% of the 144 GB/s peak memory bandwidth, compared to a practical limit of 85% of peak when accounting for overheads such as DRAM refresh."
Is this correct? The authors don't provide a source for the "85% practical bandwidth limit", and I haven't been able to find anything else that mentions it. If it is correct, what other factors (assuming you have a very well-optimised kernel) would prevent you from reaching theoretical peak bandwidth?
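For reference, here is a quick sanity check of the peak figures quoted above. Peak DRAM bandwidth is just bus width times effective memory data rate; the bus widths and clocks below are the published specs for these boards (my numbers, not taken from the article or the slides):

```python
# Peak DRAM bandwidth = (bus width in bytes) * (effective memory data rate).
def peak_bandwidth_gb_s(bus_width_bits, effective_clock_hz):
    return bus_width_bits / 8 * effective_clock_hz / 1e9

# G80 (8800 GTX): 384-bit bus, 900 MHz GDDR3 -> 1.8 GHz effective data rate
g80 = peak_bandwidth_gb_s(384, 1.8e9)    # 86.4 GB/s

# Tesla C2050: 384-bit bus, GDDR5 at 3.0 GHz effective data rate
c2050 = peak_bandwidth_gb_s(384, 3.0e9)  # 144.0 GB/s

print(g80, c2050)
print(63 / g80)       # Harris's 63 GB/s as a fraction of G80 peak (~0.73)
print(0.85 * c2050)   # the claimed "practical limit" on the C2050 (~122 GB/s)
```

So the two articles are reporting roughly the same achieved fraction (~73-75%) on very different hardware, which is partly why I suspect there is a common underlying limit rather than a kernel-specific one.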