Here is an illustration of a 4*4 matrix-addition program in CUDA. It should give you an idea of how threads are launched and how they operate.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N); // defined below

int main()
{
    float *a, *b, *c;    // host copies of matrices A and B; the result goes into C
    float *ad, *bd, *cd; // device (GPU) copies of the same matrices
    int i, j;
    int N = 4;           // number of rows and columns
    size_t size = sizeof(float) * N * N;

    a = (float*)malloc(size); // allocate host memory for matrix A
    b = (float*)malloc(size); // allocate host memory for matrix B
    c = (float*)malloc(size); // allocate host memory for the result C

    // allocate memory on the device
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);
    // initialize host memory: A holds its own indices, B their negations
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            a[i * N + j] = (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }
    // copy data from host memory to device memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // execution configuration: one block of N * N = 16 threads,
    // each thread computing one element
    dim3 grid(1, 1, 1);
    dim3 block(16, 1, 1);
    add_matrices<<<grid, block>>>(ad, bd, cd, N);

    // copy the result back (cudaMemcpy waits for the kernel to finish)
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
printf("Matrix A was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
printf("%f ",a[i*N+j]);
printf("\n");
}
printf("\nMatrix B was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
printf("%f ",b[i*N+j]);
printf("\n");
}
printf("\nAddition of A and B gives C----\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
printf("%f ",c[i*N+j]); //if correctly evaluated, all values will be 0
printf("\n");
}
    // deallocate device and host memories
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);
    free(a);
    free(b);
    free(c);
    return 0;
}
/////Kernel Part
__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N * N) // guard against any surplus threads
        cd[index] = ad[index] + bd[index];
}
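One thing the listing above skips for brevity: CUDA calls fail silently unless you check their return codes. A minimal sketch of how you might wrap the launch (CHECK is a hypothetical helper macro of mine, not part of the CUDA API; cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are standard runtime calls):

#define CHECK(call)                                           \
    do {                                                      \
        cudaError_t err = (call);                             \
        if (err != cudaSuccess) {                             \
            fprintf(stderr, "CUDA error: %s\n",               \
                    cudaGetErrorString(err));                 \
            exit(1);                                          \
        }                                                     \
    } while (0)

// usage around the launch in main():
add_matrices<<<grid, block>>>(ad, bd, cd, N);
CHECK(cudaGetLastError());      // catches bad launch configurations
CHECK(cudaDeviceSynchronize()); // catches errors raised while the kernel runs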
Let's walk through the example: the addition of two 4*4 matrices.
You have two matrices A and B, each of dimension 4*4, i.e. 16 elements each.
First of all, you have to decide your thread configuration.
You launch a kernel function, which performs the parallel computation of the matrix addition and gets executed on your GPU device.
Now, one grid is launched per kernel invocation.
A grid can hold blocks arranged in up to 3 dimensions. On older devices each grid dimension is capped at 65,535 blocks; from compute capability 3.0 onward the x-dimension extends to 2^31 - 1.
Every block can hold at most 1024 threads, which can themselves be arranged in up to 3 dimensions, with per-dimension limits of (1024, 1024, 64). You can query all of these limits on your own card, as sketched just below.
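Here is a minimal sketch of querying those limits at runtime with cudaGetDeviceProperties (a standard CUDA runtime call; device 0 is assumed):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}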
Back to our problem: the addition of 4 * 4 matrices.
A | 1  2  3  4 |     B | 1  2  3  4 |     C | 1  2  3  4 |
  | 5  6  7  8 |  +    | 5  6  7  8 |  =    | 5  6  7  8 |
  | 9 10 11 12 |       | 9 10 11 12 |       | 9 10 11 12 |
  | 13 14 15 16 |      | 13 14 15 16 |      | 13 14 15 16 |
(The numbers label the 16 element positions, one per thread; they are not the values the program stores.)
We need 16 threads to perform the computation, one per element:
A(1,1) + B(1,1) = C(1,1)
A(1,2) + B(1,2) = C(1,2)
...
A(4,4) + B(4,4) = C(4,4)
All these threads execute simultaneously.
So we need a block with 16 threads.
For convenience we arrange the threads as (16 * 1 * 1) within the block.
Since 16 threads fit comfortably in a single block, one block is all we need.
So the grid configuration is dim3 grid(1, 1, 1),
i.e. the grid contains only one block,
and the block configuration is dim3 block(16, 1, 1),
i.e. the block contains 16 threads laid out along its x-dimension.
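Nothing forces the 1-D layout; a 2-D block often maps more naturally onto a matrix. A sketch of an equivalent configuration under the same 4*4 assumption (add_matrices_2d is my own name for this variant, not from the program above):

// Equivalent 2-D configuration: a single 4 x 4 block of threads,
// where thread (x, y) handles element (row = y, col = x).
__global__ void add_matrices_2d(float *ad, float *bd, float *cd, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        cd[row * N + col] = ad[row * N + col] + bd[row * N + col];
}

// launched as:
// dim3 grid(1, 1, 1);
// dim3 block(4, 4, 1);
// add_matrices_2d<<<grid, block>>>(ad, bd, cd, N);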
The program above should give you a clear idea of how this executes.
Understanding the indexing part (i.e. threadIdx, blockDim, blockIdx) is the important part. You need to go through the CUDA literature. Once you have a clear idea about indexing, you will have won half the battle! So spend some time with the CUDA books... :-)
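To see why the indexing matters, here is a sketch of how the same kernel scales once the matrix no longer fits in one block (N = 1024 is an assumed example size; ad, bd and cd would have to be allocated for that size):

// Split the N * N elements across many 1-D blocks; each thread derives
// one global index from its block and thread IDs.
int N = 1024;
int total = N * N;
int threadsPerBlock = 256;                                    // a common choice
int blocks = (total + threadsPerBlock - 1) / threadsPerBlock; // round up
add_matrices<<<blocks, threadsPerBlock>>>(ad, bd, cd, N);
// Inside the kernel, index = blockIdx.x * blockDim.x + threadIdx.x now
// ranges over all N * N elements, and the (index < N * N) guard discards
// the surplus threads in the last block.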