
I have a CUDA kernel that multiplies two matrices whose width and height are multiples of the block size I am using.

The NVIDIA Quadro FX 3800 I am using has a theoretical bandwidth of 50 GB/s, and I am getting some strange results (an effective bandwidth larger than the theoretical bandwidth).

Here are some results:

With block size 2:

[10][10] * [10][10] -> BW = 0.02 GB/s
[1000][1000] * [1000][1000] -> BW = 69.4 GB/s

With block size 64:

[1000][1000] * [1000][1000] -> BW = 486.4 GB/s
[10000][10000] * [10000][10000] -> BW = 45072.12 GB/s

I took the effective bandwidth formula from the NVIDIA Best Practices Guide (I have simplified it, but it is equivalent unless I made a silly mistake). I think the kernel is fine, as it is very similar (if not identical) to some NVIDIA lectures I have read, and because it appears to work correctly as far as I can tell.
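
For reference, the guide computes effective bandwidth in GB/s as

    BW_effective = (B_r + B_w) / (10^9 * t)

where B_r is the number of bytes read, B_w the number of bytes written, and t the elapsed time in seconds. Counting each matrix once (M and N read, P written) and taking t in milliseconds, as cudaEventElapsedTime reports it, this reduces to the bytes / (time * 10^6) expression at the end of the code below.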

#include <stdio.h>
#include <stdlib.h>

#define blocksize 64
#define HM (10000)
#define WM (10000)
#define WN (10000)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

__global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
{
    // one tile of M and one tile of N, staged in shared memory
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
    int rowM = ty + by * blocksize;
    int colN = tx + bx * blocksize;
    float Pvalue = 0;

    for (int m = 0; m < uWM / blocksize; m++) {
        MS[ty][tx] = M[rowM * uWM + (m * blocksize + tx)];
        NS[ty][tx] = N[colN + uWN * (m * blocksize + ty)];
        __syncthreads();
        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k] * NS[k][tx];
        __syncthreads(); // make sure the tile is done with before it is overwritten
    }
    P[rowM * WP + colN] = Pvalue;
}
int main()
{
    cudaEvent_t evstart, evstop;
    cudaEventCreate(&evstart);
    cudaEventCreate(&evstop);

    float *M = (float*)malloc(sizeof(float) * HM * WM);
    float *N = (float*)malloc(sizeof(float) * HN * WN);

    for (int i = 0; i < WM * HM; i++)
        M[i] = (float)i;
    for (int i = 0; i < WN * HN; i++)
        N[i] = (float)i;

    float *P = (float*)malloc(sizeof(float) * HP * WP);

    float *Md, *Nd, *Pd;
    cudaMalloc((void**)&Md, HM * WM * sizeof(float));
    cudaMalloc((void**)&Nd, HN * WN * sizeof(float));
    cudaMalloc((void**)&Pd, HP * WP * sizeof(float));

    cudaMemcpy(Md, M, HM * WM * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, HN * WN * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize, blocksize);            // (tile_width, tile_width)
    dim3 dimGrid(WN / dimBlock.x, HM / dimBlock.y); // (width/tile_width, width/tile_width)

    cudaEventRecord(evstart, 0);

    nonsquare<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WM, WN);

    cudaEventRecord(evstop, 0);
    cudaEventSynchronize(evstop);
    float time;
    cudaEventElapsedTime(&time, evstart, evstop);   // elapsed time in milliseconds

    cudaMemcpy(P, Pd, WP * HP * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);

    printf("\ntime spent:%f", time);
    // bytes moved (M read, N read, P written) / (ms * 10^6) = GB/s
    float Bandwidth = (HM * WM * 4 + WN * HN * 4 + HP * WP * 4) / (time * 1000000);
    printf("\nEffective Bandwidth:%f GB/s\n", Bandwidth);
}

Thanks in advance

Bernardo
  • How does the effective bandwidth surpass the theoretical one? I thought the theoretical bandwidth was the maximum the graphics card could achieve, or am I wrong? – Bernardo Mar 22 '11 at 18:54

3 Answers

2

I think the kernel is just silently failing.

  1. Did you check for errors after the kernel invocation? (See the sketch below.)

  2. Does the code work?

  3. What results do you get for the timings?
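
A minimal check right after the launch (a sketch using only standard CUDA runtime calls, not the asker's exact code) could look like this:

nonsquare<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WM, WN);

// catches launch-configuration problems (e.g. too many threads per block)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// catches errors raised while the kernel itself executes
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));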

fabrizioM
  • I have found the problem. I was not paying attention to the maximum number of threads per block. Since I was using a block size of 64 I did not think about it, but it is a 2D block, so 64*64 = 4096 threads per block. That was the error. – Bernardo Mar 28 '11 at 14:18
  • I imagined that the kernel was indeed failing :) – fabrizioM Mar 28 '11 at 15:24
  • Thanks once again for replying to one of my questions ;) – Bernardo Mar 28 '11 at 17:20
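
As a footnote to the resolution in the comments above: the per-block limits can be queried at run time with cudaGetDeviceProperties (a small sketch; device 0 is assumed):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // query device 0 (assumed)
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max block dims: %d x %d x %d\n",
       prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
// a 64 x 64 block is 4096 threads, which exceeds this limit,
// so the launch fails before the kernel ever runs
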
1

Note that by using shared memory, texture memory, etc., it is sometimes possible to exceed the theoretical bandwidth. That often means you are tapping into dedicated, hardware-supported functionality (such as built-in bilinear texture interpolation), perhaps unintentionally.

Besides the reasons that Robert Harvey mentioned, there is also the possibility of a factory overclock applied by the board vendor (although this is more common for GeForce cards than for Quadros).

Overall, I'd say that you're doing well if you get close to or exceed the theoretical bandwidth (either in memory or compute).

peakxu
  • I am a real noob when it comes to graphics cards; until 3 months ago they only served the purpose of gaming for me. However, I am now getting results of 200 GB/s (4x more than the theoretical bandwidth), and I have read a lot of lectures on matrix-matrix multiplication (from which I took the code) and they do not get bandwidth values like these. So I might be miscalculating those values, but I cannot really find the error. – Bernardo Mar 23 '11 at 15:07
0

I can think of a number of explanations:

  1. Changes to the baseline code that adversely affect the measurements
  2. Invalid performance assumptions
  3. Unidentified micro-optimizations
  4. Unrealistic benchmarks

You say your code is simplified. I would try using the original benchmark code, and see what happens. If the numbers are more realistic, you can compare the original benchmark code with your simplified code to identify the differences.

Robert Harvey
  • When I say simplified, it is just a matter of basic math (I simply combined the 10^9 factor and the 10^-3 factor that converts to seconds, nothing else). As for the explanations you pointed out, I will check the code again and see if I missed something. – Bernardo Mar 22 '11 at 19:28