Suppose you have a memory-bound GPU kernel: how close can you get to the stated theoretical bandwidth of the GPU? Even in Mark Harris's "Optimizing Parallel Reduction in CUDA" presentation he 'only' gets 63 GB/s, which is about 73% of the 86.4 GB/s peak bandwidth he quotes for his test GPU (a G80). Could Harris have optimised his kernel further? Are there other techniques that were perhaps too advanced/out of scope for the presentation, e.g. __shfl-type instructions? Why didn't he achieve a higher bandwidth?
This article claims, using a test machine with a Tesla C2050:
"throughput is memory-bandwidth limited, sustaining around 75% of the 144 GB/s peak memory bandwidth, compared to a practical limit of 85% of peak when accounting for overheads such as DRAM refresh."
Is this correct? The authors don't provide a source for the "85% practical bandwidth limit", and I haven't been able to find anything else that mentions it. If it is correct, what other factors (assuming you have a very well-optimised kernel) would prevent you from reaching theoretical peak bandwidth?
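For reference, here is a quick sanity check of the peak figures quoted above. Peak DRAM bandwidth is just bus width times effective memory data rate; the bus widths and clocks below are the published specs for these boards (my numbers, not taken from the article or the slides):

```python
# Peak DRAM bandwidth = (bus width in bytes) * (effective memory data rate).
def peak_bandwidth_gb_s(bus_width_bits, effective_clock_hz):
    return bus_width_bits / 8 * effective_clock_hz / 1e9

# G80 (8800 GTX): 384-bit bus, 900 MHz GDDR3 -> 1.8 GHz effective data rate
g80 = peak_bandwidth_gb_s(384, 1.8e9)    # 86.4 GB/s

# Tesla C2050: 384-bit bus, GDDR5 at 3.0 GHz effective data rate
c2050 = peak_bandwidth_gb_s(384, 3.0e9)  # 144.0 GB/s

print(g80, c2050)
print(63 / g80)       # Harris's 63 GB/s as a fraction of G80 peak (~0.73)
print(0.85 * c2050)   # the claimed "practical limit" on the C2050 (~122 GB/s)
```

So the two articles are reporting roughly the same achieved fraction (~73-75%) on very different hardware, which is partly why I suspect there is a common underlying limit rather than a kernel-specific one.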