3

I have several questions about the way threads are formed and executed.

Firstly, the documentation describes GPU threads as lightweight threads. Suppose I wish to multiply two 100*100 matrices. This would require 100*100 threads if each element were calculated by a different thread. However, my GPU (NVIDIA GT 640M LE) has two SMs, each of which can support only 2048 threads. How is it possible to calculate the rest of the elements in parallel, given that my GPU can't support so many threads?

Also consider the basic vector add code. Suppose I invoke a kernel with 1 block and 64 threads to add two arrays of 100 elements each as follows:

    __global__ void add(int* a, int* b, int* c)
    {
        int i = threadIdx.x;
        for (i < 100)
        {
            c[i] = a[i] + b[i];
        }
    }

Since only 64 threads were launched, I assume 64 elements are added in parallel.

  • How are the remaining elements added?
  • How does the warp scheduler decide which threads to assign to add the last 36 elements?

My main problem is:

I don't understand how a thread knows which elements to operate on.


3 Answers

6

Your card has compute capability 3.0, see here.

From Table 12 of the CUDA C Programming Guide, the figure of 2048 threads you are mentioning for your compute capability refers to the maximum number of resident threads per multiprocessor. This does not mean that you cannot launch more than 2048 threads overall. For example, a few rows above in that table you can read that the maximum x-dimension of a grid of thread blocks is 2^31-1. This means that it is perfectly legal to launch, for example, a 1d thread grid of 8192 threads. The reason is that the card performs context switching between thread warps, as indicated in this post: What is the context switching mechanism in GPU?.
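As a concrete illustration, here is a minimal sketch (the kernel name scale, the block size of 256 and the constant 2.0f are my own choices for the example, not from the question): it launches one thread per element of a 100*100 matrix, i.e. 10,000 threads, even though at most 2 * 2048 = 4096 threads can be resident on your two SMs at any instant; the excess blocks simply wait until the scheduler swaps them in.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)                    // guard against the rounded-up tail
            data[idx] *= factor;
    }

    int main()
    {
        const int n = 100 * 100;        // 10,000 elements, one thread each
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 40

        // 40 blocks * 256 threads = 10,240 threads are launched; only a
        // subset is resident at once, the rest run as SMs free up.
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }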

Concerning the second part of your question, your implementation of the add function is conceptually wrong. You are using the index i both as the thread index and as a for loop index. A more correct implementation is the following:

__global__ void add(int* a,int* b,int* c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

The above means the following: each thread will execute the two statements, namely

    int i = threadIdx.x;
    c[i] = a[i] + b[i];

Now, for example, for thread #3 the value of the threadIdx.x variable will be 3. Thus, thread #3 will deal with a local variable i, private to its memory space, whose value will be set to 3. Furthermore, it will load a[3] and b[3] from global memory, add them up, and store the result in c[3] in global memory. Accordingly, when you launch the grid, you cannot fill the whole array of 100 elements with only 64 threads: you will need 100 threads.
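For completeness, here is a minimal sketch of a launch that does cover all 100 elements (the extra parameter n, the bounds guard, and the block size of 128 are my additions, not part of the code above; d_a, d_b and d_c are assumed to be device pointers that have already been allocated and filled):

    __global__ void add(int* a, int* b, int* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                  // threads with i >= 100 simply do nothing
            c[i] = a[i] + b[i];
    }

    // host side: round the launch up to a whole number of blocks
    const int n = 100;
    const int threadsPerBlock = 128;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 1
    add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);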

Note that the above explanation is oversimplified. I recommend you read some basic textbook, such as the famous CUDA By Example.

Vitality
  • Thanks for the answer. I understood the part about launching lots of threads. – mastercheif141 Feb 13 '14 at 10:27
  • Coming to the second part, I still don't get how the remaining elements are added. Since only 64 threads were launched, only two warps are present in the block. After the 64 elements are added, will the remaining elements be added? If so, then by which threads? – mastercheif141 Feb 13 '14 at 10:42
  • @mastercheif141 I have tried to better explain your second question. See the edited answer. – Vitality Feb 13 '14 at 18:39
1

This will give you an illustration of a 4*4 matrix addition program in CUDA. It might give you an idea of how threads are launched and operated.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N);

int main()
{
    float *a, *b, *c;       // matrices A and B in host RAM; result goes in C
    float *ad, *bd, *cd;    // the same matrices in the GPU's RAM
    int N = 4;              // number of rows and columns
    int i, j;

    size_t size = sizeof(float) * N * N;

    a = (float*)malloc(size);   // allocate host space for matrix A
    b = (float*)malloc(size);   // allocate host space for matrix B
    c = (float*)malloc(size);   // allocate host space for the result C

    // allocate memory on the device
    cudaMalloc(&ad, size);
    cudaMalloc(&bd, size);
    cudaMalloc(&cd, size);

    // initialize host memory with its own indices
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            a[i * N + j] = (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }

    // copy data from host memory to device memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // calculate the execution configuration
    dim3 grid(1, 1, 1);
    dim3 block(16, 1, 1);

    // the single block contains N * N = 16 threads; each thread calculates one element
    add_matrices<<<grid, block>>>(ad, bd, cd, N);

    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    printf("Matrix A was---\n");
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
            printf("%f ", a[i * N + j]);
        printf("\n");
    }

    printf("\nMatrix B was---\n");
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
            printf("%f ", b[i * N + j]);
        printf("\n");
    }

    printf("\nAddition of A and B gives C----\n");
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
            printf("%f ", c[i * N + j]);   // if correctly evaluated, all values will be 0
        printf("\n");
    }

    // deallocate device and host memories
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);

    free(a);
    free(b);
    free(c);

    return 0;
}

///// Kernel part

__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    cd[index] = ad[index] + bd[index];
}

Let's take the example of adding two 4*4 matrices: you have matrices A and B, each of dimension 4*4.

First of all, you have to decide your thread configuration. You are supposed to launch a kernel function that will perform the parallel computation of your matrix addition and that will get executed on your GPU device.

Now, one grid is launched per kernel invocation. A grid can have at most 65,535 blocks per dimension, arranged in up to three dimensions (65535 * 65535 * 65535); on compute capability 3.0 and later the x-dimension limit rises to 2^31-1.

Every block in the grid can have at most 1024 threads, which can also be arranged in three dimensions (1024 * 1024 * 64), with the total per block not exceeding 1024.
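These limits depend on the compute capability; rather than memorizing them, you can query them at runtime with the standard cudaGetDeviceProperties API (a small sketch, assuming device 0):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("max block dims: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("max grid dims: %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }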

Now our problem is the addition of 4 * 4 matrices:

A | 1  2  3  4 |        B | 1  2  3  4 |      C| 1  2  3  4 |
  | 5  6  7  8 |   +      | 5  6  7  8 |   =   | 5  6  7  8 | 
  | 9 10 11 12 |          | 9 10 11 12 |       | 9 10 11 12 |  
  | 13 14 15 16|          | 13 14 15 16|       | 13 14 15 16|

We need 16 threads to perform the computation.

i.e. A(1,1) + B (1,1) = C(1,1)
     A(1,2) + B (1,2) = C(1,2) 
     .        .          .
     .        .          . 
     A(4,4) + B (4,4) = C(4,4) 

All these threads will get executed simultaneously, so we need a block with 16 threads. For our convenience we will arrange the threads in a (16 * 1 * 1) way in the block. As the number of threads is 16, we need only one block to hold them.

So the grid configuration will be dim3 grid(1,1,1), i.e. the grid will have only one block, and the block configuration will be dim3 block(16,1,1), i.e. the block will have 16 threads arranged along the x-dimension.

The program above will give you a clear idea of the execution. Understanding the indexing part (i.e. threadIdx, blockDim, blockIdx) is the important part. You need to go through the CUDA literature; once you have a clear idea about indexing, you will have won half of the battle! So spend some time with the CUDA books... :-)
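As an aside, the same 16 threads could equally well be arranged two-dimensionally, which often maps more naturally onto matrices. A minimal sketch of that variant (my own addition, not part of the program above):

    // one thread per element of the 4 x 4 matrix, addressed by (row, column)
    __global__ void add_matrices_2d(float *ad, float *bd, float *cd, int N)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < N && col < N)
            cd[row * N + col] = ad[row * N + col] + bd[row * N + col];
    }

    // launch with a 4 x 4 block instead of a 16 x 1 one:
    //   dim3 grid(1, 1, 1);
    //   dim3 block(4, 4, 1);
    //   add_matrices_2d<<<grid, block>>>(ad, bd, cd, 4);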

sandeep.ganage
0

The for loop is very wrong here: any thread with threadIdx.x < 100 would run forever, since i never changes. For newbies it can be explained this way: threadIdx is a value predefined by the system that identifies the current thread. Each thread takes its value from a and from b and writes the sum into c, so it would be

int i = threadIdx.x;
c[i] = a[i] + b[i];

If you have an array size like 100 that is not a multiple of the block size of 64, launch enough blocks to cover the whole array and, so that no thread reads or writes out of bounds, do:

int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < 100) {
    c[i] = a[i] + b[i];
}

You will have divergence only in the very last block. That is probably what you wanted.
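Concretely (my numbers, matching the question's 100 elements and 64-thread blocks): launching two blocks gives 128 threads with global indices 0..127, and only threads 100..127, all in the second block, fail the guard:

    // 2 blocks * 64 threads = 128 threads cover all 100 elements;
    // d_a, d_b and d_c are assumed to be device pointers already filled in
    add<<<2, 64>>>(d_a, d_b, d_c);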

J J