
The following source code is from the book Programming in Parallel with CUDA, Page: 25-26:

15.1     __global__ void gpu_sin(float *sums, int steps, int terms, float step_size)
15.2     {
15.3         int step = blockIdx.x*blockDim.x+threadIdx.x;
15.4         if(step<steps){
15.5            float x = step_size*step;
15.6            sums[step] = sinsum(x,terms); // store values
15.7         }
15.8     }

The book gives an explanation:

Line 15.3 declares a variable step equivalent to the for loop index variable of the same name in line 24 of Example 1.1. It is set to a value defined by the built-in variables blockDim.x, blockIdx.x and threadIdx.x. The values of these variables depend on the launch parameters used in the host call to the kernel as follows:

blockDim.x will be set to threads, i.e., the thread block size used by the kernel.
blockIdx.x will be set to the rank of the thread block to which the current thread belongs and will be in the range [0,blocks-1].
threadIdx.x will be set to the rank of the current thread within its thread block and will be in the range [0,threads-1].
step = blockDim.x * blockIdx.x + threadIdx.x is in range [0, threads × blocks - 1].

I haven't understood the last point.

How is this being calculated: step = blockDim.x * blockIdx.x + threadIdx.x ?

What is the rationale?

talonmies
user366312
    You have two indices, `blockIdx.x` and `threadIdx.x`, that you want to map into a single linear index, with consecutive threads of the same block being mapped to consecutive indices, e.g. for coalescing memory access. So you multiply the outer index `blockIdx.x` by the size of the range of the inner index `threadIdx.x`. The size of that range is `blockDim.x`. Maybe just try it out for a small block size on paper. It is quite straightforward. – paleonix Mar 25 '23 at 01:31

1 Answer


It is a canonical method to calculate a 1D globally unique thread index.

CUDA is a massively multithreaded computational system. In order to have each thread do something that is slightly different (for instance, calculate a unique output point in a vector), it is helpful/useful for each thread to be able to have a unique ordinal.

That is the purpose of that calculation, and you will find it extensively used in CUDA codes. It serves a similar purpose as the rank identification in MPI, or the thread identification in multi-threading systems like OpenMP (omp_get_thread_num()), std::thread (std::thread::id), etc.

To learn more about the components of this calculation, see here and here, also noting they both link to the CUDA programming guide. The set of all threads associated with a kernel launch (the grid) is hierarchically decomposed first into blocks, each block having an index (blockIdx.x) and each block into a uniform number of threads, each thread having an index in the block (threadIdx.x). Combined with the number of threads per block (blockDim.x), the globally unique index can be calculated by each thread.

You can get an orderly introduction to CUDA here. The first unit covers this calculation in some detail, and gives a rationale for it.

In the code example you have shown, the intent of that kernel is to populate a particular vector according to a particular formula or calculation. The kernel code intends to do that "in parallel" by assigning one thread to each output position (step). The calculation of step ensures that each thread will be assigned to a unique position in the output vector (sums).
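This one-thread-per-element pattern also explains the `if (step < steps)` guard in the kernel: the host typically rounds the block count up so the grid covers all elements, and surplus threads in the last block simply do nothing. A sketch of that host-side sizing arithmetic (the helper name `blocks_for` is invented for illustration):

```cpp
#include <cassert>

// Round-up integer division: the smallest block count whose total
// thread count (blocks * threads) covers all 'steps' output elements.
int blocks_for(int steps, int threads) {
    return (steps + threads - 1) / threads;
}

int main() {
    // e.g. 1000 output elements with 256-thread blocks need 4 blocks;
    // the kernel's if (step < steps) guard idles the 24 surplus threads.
    assert(blocks_for(1000, 256) == 4);
    assert(blocks_for(1024, 256) == 4); // exact fit, no surplus threads
    return 0;
}
```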

In the multidimensional case, each thread instead of having a single index (x) has an ordered pair (x,y) or an ordered 3-tuple (x,y,z) that describes its unique position in the multidimensional grid. The calculations shown in that other answer can provide a canonical set of calculations to assign each thread a unique tuple.
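The same per-axis formula applies independently in each dimension, which can again be simulated on the host (the grid and block shapes below are arbitrary example values):

```cpp
#include <cassert>
#include <set>
#include <utility>

int main() {
    // Hypothetical 2D launch: a 2x2 grid of blocks, each 4x2 threads.
    const int gridDim_x = 2, gridDim_y = 2;
    const int blockDim_x = 4, blockDim_y = 2;

    std::set<std::pair<int, int>> seen;
    for (int by = 0; by < gridDim_y; ++by)
        for (int bx = 0; bx < gridDim_x; ++bx)
            for (int ty = 0; ty < blockDim_y; ++ty)
                for (int tx = 0; tx < blockDim_x; ++tx) {
                    // Each thread's (x, y) position in the whole grid:
                    // the 1D formula applied per axis.
                    int x = blockDim_x * bx + tx;
                    int y = blockDim_y * by + ty;
                    seen.insert({x, y});
                }
    // All 4 blocks * 8 threads = 32 positions are distinct pairs,
    // covering x in [0, 7] and y in [0, 3].
    assert(seen.size() == 32u);
    return 0;
}
```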

Robert Crovella