
I have a CUDA kernel that multiplies two matrices whose width and height are multiples of the block size I am using.

The NVIDIA Quadro FX 3800 I am using has a theoretical bandwidth of 50 GB/s, and I am getting some strange results (an effective bandwidth larger than the theoretical bandwidth).

Here are some results:

With block size 2:

[10][10] * [10][10] -> BW = 0.02 GB/s
[1000][1000] * [1000][1000] -> BW = 69.4 GB/s

With block size 64:

[1000][1000] * [1000][1000] -> BW = 486.4 GB/s
[10000][10000] * [10000][10000] -> BW = 45072.12 GB/s

I took the effective bandwidth formula from the NVIDIA Best Practices Guide (I have simplified it, but it is equivalent unless I made a silly mistake). I think the kernel is fine, as it is very similar (if not identical) to some NVIDIA lectures I have read, and because it appears to work correctly as far as I can tell.
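
For reference, the guide computes effective bandwidth in GB/s as

    BW_effective = (B_r + B_w) / (10^9 * t)

where B_r is the number of bytes read, B_w the number of bytes written, and t the elapsed time in seconds. Counting each matrix once (M and N read, P written) and taking t in milliseconds, as cudaEventElapsedTime reports it, this reduces to the bytes / (time * 10^6) expression at the end of the code below.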

#include <stdio.h>
#include <stdlib.h>

#define blocksize 64
#define HM (10000)
#define WM (10000)
#define WN (10000)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

__global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
{
    // one tile of M and one tile of N, staged in shared memory
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
    int rowM = ty + by * blocksize;
    int colN = tx + bx * blocksize;
    float Pvalue = 0;

    for (int m = 0; m < uWM / blocksize; m++) {
        MS[ty][tx] = M[rowM * uWM + (m * blocksize + tx)];
        NS[ty][tx] = N[colN + uWN * (m * blocksize + ty)];
        __syncthreads();
        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k] * NS[k][tx];
        __syncthreads(); // make sure the tile is done with before it is overwritten
    }
    P[rowM * WP + colN] = Pvalue;
}
int main()
{
    cudaEvent_t evstart, evstop;
    cudaEventCreate(&evstart);
    cudaEventCreate(&evstop);

    float *M = (float*)malloc(sizeof(float) * HM * WM);
    float *N = (float*)malloc(sizeof(float) * HN * WN);

    for (int i = 0; i < WM * HM; i++)
        M[i] = (float)i;
    for (int i = 0; i < WN * HN; i++)
        N[i] = (float)i;

    float *P = (float*)malloc(sizeof(float) * HP * WP);

    float *Md, *Nd, *Pd;
    cudaMalloc((void**)&Md, HM * WM * sizeof(float));
    cudaMalloc((void**)&Nd, HN * WN * sizeof(float));
    cudaMalloc((void**)&Pd, HP * WP * sizeof(float));

    cudaMemcpy(Md, M, HM * WM * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, HN * WN * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize, blocksize);            // (tile_width, tile_width)
    dim3 dimGrid(WN / dimBlock.x, HM / dimBlock.y); // (width/tile_width, width/tile_width)

    cudaEventRecord(evstart, 0);

    nonsquare<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WM, WN);

    cudaEventRecord(evstop, 0);
    cudaEventSynchronize(evstop);
    float time;
    cudaEventElapsedTime(&time, evstart, evstop);   // elapsed time in milliseconds

    cudaMemcpy(P, Pd, WP * HP * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);

    printf("\ntime spent:%f", time);
    // bytes moved (M read, N read, P written) / (ms * 10^6) = GB/s
    float Bandwidth = (HM * WM * 4 + WN * HN * 4 + HP * WP * 4) / (time * 1000000);
    printf("\nEffective Bandwidth:%f GB/s\n", Bandwidth);
}

Thanks in advance

Bernardo
  • How does the effective bandwidth surpass the theoretical one? I thought the theoretical bandwidth was the maximum the graphics card could achieve, or am I wrong? – Bernardo Mar 22 '11 at 18:54

3 Answers

2

I think the kernel is just silently failing.

  1. Did you check for errors after the kernel invocation? (See the sketch below.)

  2. Does the code work?

  3. What results do you get for the timings?
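
A minimal check right after the launch (a sketch using only standard CUDA runtime calls, not the asker's exact code) could look like this:

nonsquare<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WM, WN);

// catches launch-configuration problems (e.g. too many threads per block)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// catches errors raised while the kernel itself executes
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));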

fabrizioM
  • I have found the problem. I was not paying attention to the maximum number of threads per block. Since I was using a block size of 64 I did not think about it, but it is a 2D block, so 64*64 = 4096 threads per block. That was the error. – Bernardo Mar 28 '11 at 14:18
  • I imagined that the kernel was indeed failing :) – fabrizioM Mar 28 '11 at 15:24
  • Thanks once again for replying to one of my questions ;) – Bernardo Mar 28 '11 at 17:20
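
As a footnote to the resolution in the comments above: the per-block limits can be queried at run time with cudaGetDeviceProperties (a small sketch; device 0 is assumed):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // query device 0 (assumed)
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max block dims: %d x %d x %d\n",
       prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
// a 64 x 64 block is 4096 threads, which exceeds this limit,
// so the launch fails before the kernel ever runs
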
1

Note that by using shared memory, texture memory, etc., it is sometimes possible to exceed the theoretical bandwidth. That often means you are tapping into dedicated, hardware-supported functionality (such as built-in bilinear texture interpolation), perhaps unintentionally.

Besides the reasons that Robert Harvey mentioned, there is also the possibility of a factory overclock applied by the board vendor (although this is more common for GeForce cards than for Quadros).

Overall, I'd say that you're doing well if you get close to or exceed the theoretical bandwidth (either in memory or compute).

peakxu
  • I am a real noob when it comes to graphics cards; until 3 months ago they only served the purpose of gaming for me. However, I am now getting results of 200 GB/s (4x more than the theoretical bandwidth), and I have read a lot of lectures on matrix-matrix multiplication (from which I took the code) and they do not get bandwidth values like these. So I might be miscalculating those values, but I cannot really find the error. – Bernardo Mar 23 '11 at 15:07
0

I can think of a number of explanations:

  1. Changes to the baseline code that adversely affect the measurements
  2. Invalid performance assumptions
  3. Unidentified micro-optimizations
  4. Unrealistic benchmarks

You say your code is simplified. I would try using the original benchmark code, and see what happens. If the numbers are more realistic, you can compare the original benchmark code with your simplified code to identify the differences.

Robert Harvey
  • When I say simplified, it is just a matter of basic math (I simply combined the 10^9 factor and the 10^-3 factor that converts to seconds, nothing else). As for the explanations you pointed out, I will check the code again and see if I missed something. – Bernardo Mar 22 '11 at 19:28