
I'm trying to write a program in CUDA, but I have a problem with synchronization between threads in the same block.

Here is a model situation:

    __global__ void gpu_test_sync()
    {
        __shared__ int t;
        int tid = threadIdx.x;

        t = 0;
        __threadfence();
        __syncthreads();

        // for(int i=0; i<1000000 && t<tid; i++); // with fuse
        while(t<tid);

        t++;
        __threadfence();
    }

    void f_cpu()
    {
        printf("TEST ... ");
        int blocks = 1;
        int threads = 2;
        gpu_test_sync<<< blocks , threads >>>();
        printf("OK\n");
    }

If threads = 1, everything is OK. If threads > 1, it cycles forever.

Why? The `__threadfence()` function should make the value of `t` visible to the other threads.

How can I solve it?

3 Answers


I don't believe your kernel can do what you are trying to do, because the divergent branch at `while(t<tid)` causes all threads of the warp to loop indefinitely and never arrive at the line `t++`.

Long explanation

Scroll to 'The important part' for the important stuff if you already know about threads, blocks and warps.

(I have no experience with the Kepler architecture, yet. Some of these numbers may be different if not using Fermi.)

Some terms need to be explained before the next section. The following terms relate to the logical (as in software constructs) grouping of threads:

  • thread – a single thread of execution.
  • block – a group of multiple threads that execute the same kernel.
  • grid – a group of blocks.

The following terms relate to the physical (as in hardware-architecture-dependent) grouping of threads:

  • core – a single compute core, one core runs exactly one instruction at a time.
  • warp – a group of threads that execute in parallel on the hardware, a warp consists of 32 threads on current generation CUDA hardware.

Kernels are executed by one or more Streaming Multiprocessors (SM). A typical mid-to-high-end GeForce card from the Fermi family (GeForce 400 and GeForce 500 series) has 8-16 SMs on a single GPU [Fermi whitepaper]. Each SM consists of 32 CUDA Cores (cores). Threads are scheduled for execution by the warp schedulers; each SM has two warp scheduler units that work in a lockstep fashion. The smallest unit that a warp scheduler can schedule is called a warp, which consists of 32 threads on all CUDA hardware released so far at the time of writing. Only one warp may execute at a time on each SM.
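As an aside (not part of the original answer), these per-device numbers can be queried at runtime through the standard CUDA runtime API; a minimal sketch for device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal sketch: print the hardware properties discussed above for device 0.
    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("name:                %s\n",       prop.name);
        printf("streaming MPs (SMs): %d\n",       prop.multiProcessorCount);
        printf("warp size:           %d\n",       prop.warpSize);
        printf("max threads / block: %d\n",       prop.maxThreadsPerBlock);
        printf("shared mem / block:  %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }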

Threads in CUDA are much more lightweight than CPU threads and context switches are cheaper. All threads of a warp execute the same instruction, or have to wait while the other threads in the warp execute the instruction; this is called Single Instruction Multiple Thread (SIMT) and is similar to traditional CPU Single Instruction Multiple Data (SIMD) instructions such as SSE, AVX, NEON, Altivec etc. This has consequences when using conditional statements, as described further down.

To allow for problems which demand more than 32 threads to solve, the CUDA threads are arranged into logical groups called blocks and grids, of sizes defined by the software developer. A block is a 3-dimensional collection of threads; each thread in the block has its own individual 3-dimensional identification number to allow the developer to distinguish between the threads in the kernel code. Threads within a single block can share data through shared memory, which reduces the load on global memory. Shared memory has a much lower latency than global memory but is a limited resource: the user can choose between (per block) 16 kB shared memory and 48 kB L1 cache, or 48 kB shared memory and 16 kB L1 cache.
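To illustrate that last point, the shared memory / L1 split can be requested per kernel through the runtime API. This is a hedged sketch, not from the original answer; `my_kernel` is a placeholder name, and the driver treats the setting as a preference rather than a guarantee:

    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* placeholder kernel */ }

    int main()
    {
        // Request the 48 kB shared memory / 16 kB L1 configuration for this kernel.
        // cudaFuncCachePreferL1 would request the opposite split instead.
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

        my_kernel<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }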

Several blocks of threads in turn can be grouped into a grid. Grids are 3-dimensional arrays of blocks. The maximum block size is tied to the available hardware resources, while grids can be of (almost) arbitrary size. Blocks within a grid can only share data through global memory, which is the on-GPU memory with the highest latency.
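A hedged sketch (my illustration, with made-up names and sizes) of how those 3-dimensional blocks and grids are expressed at launch time:

    #include <cuda_runtime.h>

    // Each thread computes its global 2D coordinates from its block and thread IDs.
    __global__ void fill_2d(float *data, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y * width + x] = 1.0f;
    }

    int main()
    {
        const int width = 1024, height = 768;
        float *d_data;
        cudaMalloc(&d_data, width * height * sizeof(float));

        dim3 block(16, 16);                       // 256 threads per block
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        fill_2d<<<grid, block>>>(d_data, width, height);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }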

A Fermi GPU can have 48 warps (1536 threads) active at once per SM, given that the threads use little enough local and shared memory to fit all at the same time. Context switches between threads are fast since registers are allocated to the threads, hence there is no need for saving and restoring registers and shared memory between thread switches. The result is that it is actually desirable to over-allocate the hardware, since it will hide memory stalls inside the kernels by letting the warp schedulers switch the currently active warp whenever a stall occurs.

The important part

The thread warp is a hardware group of threads that execute on the same Streaming Multiprocessor (SM). Threads of a warp can be compared to sharing a common program counter between the threads, hence all threads must execute the same line of program code. If the code has some branching statements such as if ... then ... else, the warp must first execute the threads that enter the first block while the other threads of the warp wait, then the threads that enter the next block execute while the other threads wait, and so on. Because of this behaviour, conditional statements should be avoided in GPU code if possible. When threads of a warp follow different lines of execution it is known as having divergent threads. While conditional blocks should be kept to a minimum inside CUDA kernels, it is sometimes possible to reorder statements so that all threads of the same warp follow only a single path of execution in an if ... then ... else block and mitigate this limitation.

The while and for statements are branching statements too, so this is not limited to if (see the sketch below).
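As a hedged illustration (not from the original answer) of the reordering idea, here is the difference between a condition that diverges within a warp and one that is uniform per warp:

    __global__ void branch_example(const float *in, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Divergent: even and odd lanes of the same warp take different paths,
        // so the warp has to execute both branches one after the other.
        if (tid % 2 == 0)
            out[tid] = in[tid] * 2.0f;
        else
            out[tid] = in[tid] + 1.0f;

        // Uniform per warp: all 32 threads of a given warp evaluate the condition
        // the same way (warpSize is 32), so the branches are not serialized.
        if ((tid / warpSize) % 2 == 0)
            out[tid] = in[tid] * 2.0f;
        else
            out[tid] = in[tid] + 1.0f;
    }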

Joakim Nohlgård
  • I thought that threads are independent in the block. Is the __syncthreads() function for synchronization between threads from different blocks? – user1826543 Nov 15 '12 at 12:32
  • Every thread in a warp tries to execute the code in lockstep - the entire purpose of CUDA is to do roughly the same thing lots and lots of times all at the same time. – 3Pi Nov 15 '12 at 12:33
  • All threads of a warp execute the same instruction, or wait, they can not execute different parts of a kernel in parallel – Joakim Nohlgård Nov 15 '12 at 12:34
  • OK. What is the difference between a block and a warp? For example, if I call gpu_f<<<1,64>>>, does the block then have two warps (of 32 threads)? – user1826543 Nov 15 '12 at 12:54
  • Yes, `gpu_f<<<1,64>>>` should execute using two warps. A block is a logical grouping of threads, the warp is the low level hardware grouping that comes from design choices when implementing the GPU in silicon. I've updated my answer to give an introduction to blocks, warps and threads. – Joakim Nohlgård Nov 15 '12 at 13:21
  • Great answer but unrelated to his actual problem. The real problem is that the code contains an infinite loop because of the semicolon after `while(t<tid)`. – tera Nov 15 '12 at 17:21
  • @JoakimNohlgård "Threads of a warp can be compared to sharing a common program counter between the threads, hence all threads must execute the same line of program code." Really? See: http://stackoverflow.com/questions/25473593/gpgpu-consequence-of-having-a-common-pc-in-a-warp#comment62050641_25473593 – Niklas Peter May 16 '16 at 19:49

You have an infinite loop when you launch the kernel with more than one thread because `while(t<tid);` is an infinite loop for any thread with a tid greater than zero.

At this point your problem is not related to the synchronization of threads but to the loop you have implemented.

pQB

If what you are trying to do is get a series of threads to execute in serial, then you are abusing CUDA.

It is also not going to work, because any threads past the first will never receive an updated `t` - you have to call `__syncthreads()` for shared variables to refresh, but you can only do that if all of the threads are executing the same thing, i.e. not waiting.
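For completeness, a hedged sketch (my illustration, not code from the question) of a pattern that does work when you really need the threads of one block to take turns: every thread executes the same loop and reaches the same `__syncthreads()` barrier, and only the "current" thread does work in each iteration:

    __global__ void take_turns()
    {
        __shared__ int t;
        int tid = threadIdx.x;

        if (tid == 0)
            t = 0;
        __syncthreads();

        // Every thread runs the full loop, so all of them reach the barrier.
        for (int i = 0; i < blockDim.x; ++i)
        {
            if (tid == i)
                t++;            // only one thread per iteration touches t
            __syncthreads();    // make the update visible before the next turn
        }
    }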

3Pi