I read in a book that in a wavefront or warp, all threads share a common program counter. What are the consequences of this? Why does it matter?
-
Can somebody provide a reference to prove the statement "a wavefront/warp has a common program counter"? – Niklas Peter May 16 '16 at 19:37
-
This book claims the opposite "Each GPU thread has its own scalar registers, thread private memory, thread execution state, thread ID, independent execution and branch path, and effective program counter, and can address memory independently. Although a group of threads (e.g., a warp of 32 threads) executes more efficiently when the PCs for the threads are the same, this is not necessary." (See https://books.google.de/books?id=3b63x-0P3_UC&lpg=SL1-PA26&ots=Nzij1_waO4&dq=cuda%20thread%20program%20counter&hl=de&pg=SL1-PA72#v=onepage&q=cuda%20thread%20program%20counter&f=false) – Niklas Peter May 16 '16 at 19:44
-
I found the answer myself. The important word in the quotation above is "effective". In fact, there is just one program counter per warp/wavefront, because the hardware is organized as SIMD and can only issue one instruction for all SIMD lanes (disabling the lanes whose control path diverged and executing them in a later cycle). Logically, however, it appears as if there were one PC per CUDA thread/OpenCL work-item. That is why SIMT is sometimes called SPMD on SIMD hardware. (Computer Architecture: A Quantitative Approach and http://haifux.org/lectures/267/Introduction-to-GPUs.pdf) – Niklas Peter May 19 '16 at 08:04
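A minimal CUDA sketch that makes this masking visible (my own illustration, assuming a single warp of 32 threads and CUDA 9+ for __activemask()): inside each side of a divergent branch, __activemask() reports which lanes are currently enabled, and the two masks typically come out complementary.

#include <cstdio>

__global__ void show_divergence()
{
    unsigned mask;
    if (threadIdx.x < 16) {
        mask = __activemask();        // typically 0x0000ffff: lower 16 lanes enabled
        if (threadIdx.x == 0)
            printf("lower half mask: 0x%08x\n", mask);
    } else {
        mask = __activemask();        // typically 0xffff0000: upper 16 lanes enabled
        if (threadIdx.x == 16)
            printf("upper half mask: 0x%08x\n", mask);
    }
}

int main()
{
    show_divergence<<<1, 32>>>();     // launch exactly one warp
    cudaDeviceSynchronize();
    return 0;
}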
4 Answers
NVIDIA GPUs execute 32 threads at a time (warps) and AMD GPUs execute 64 threads at a time (wavefronts). Sharing the control logic, instruction fetch, and data paths reduces area and increases perf/area and perf/watt.
In order to take advantage of this design, programming languages and developers need to understand how to coalesce memory accesses and how to manage control-flow divergence. If each thread in a warp/wavefront takes a different execution path, or if each thread accesses significantly divergent memory, then the benefits of the design are lost and performance degrades significantly.
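To make that concrete, here is a minimal CUDA sketch (hypothetical kernels, not from the answer) contrasting a branch that diverges inside a warp with one that is uniform per warp, and a coalesced access with a strided one:

__global__ void divergent_branch(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Odd and even lanes of the same warp take different paths, so the
    // hardware runs both paths with the non-matching lanes masked off.
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

__global__ void uniform_branch(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // With a block size that is a multiple of 32, all lanes of a warp
    // evaluate this condition the same way, so each warp runs one path only.
    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

__global__ void coalesced_copy(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];          // neighbouring lanes read neighbouring addresses
}

__global__ void strided_copy(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride]; // lanes scatter across memory, wasting bandwidth
}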

This means that all threads run the same instructions at the same time. This is very important for ensuring that all threads have completed the previous line before processing the current line. For instance, if you need to pass data from one thread to another, you need to make sure that the data was already written by the first thread. Because the program counter is shared, you know that once the line writing the data completes, the data exists in all threads.
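As a minimal CUDA sketch of the kind of exchange described above (hypothetical kernel, block size assumed to be at most 256), each thread publishes a value to shared memory and then reads its neighbour's; the explicit barrier is included because, as the comment below points out, portable code should not rely on implicit warp-level lockstep:

__global__ void neighbour_exchange(const float *in, float *out)
{
    __shared__ float buf[256];                 // one slot per thread in the block
    int t = threadIdx.x;

    buf[t] = in[blockIdx.x * blockDim.x + t];  // each thread publishes its value

    // Within a single warp the shared program counter means every lane has
    // finished the write before any lane reaches the read below; across warps
    // (and on hardware with independent thread scheduling) an explicit
    // barrier is required.
    __syncthreads();

    // Each thread reads the value written by its neighbour.
    out[blockIdx.x * blockDim.x + t] = buf[(t + 1) % blockDim.x];
}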

-
Warp-synchronous programming is an implementation-specific optimization. Developers can realize significant gains by leveraging it. However, warp-synchronous programming is not well defined by most GPGPU compute languages, and developers are encouraged to use all required thread fences and barriers as if all threads executed independently. – Greg Smith Aug 25 '14 at 04:07
As some of the other answers have stated, threads execute in lockstep within a warp/wavefront. To a developer this means you need to pay special attention to any branching / conditional logic, because if at least one work-item in that group hits the 'else' condition, the other work-items are paused while that code is executed.
So why would GPU manufacturers design things this way? The lack of individual program counters, branch prediction, and large cache memory saves a lot of silicon, which can be spent on more Arithmetic Logic Units (ALUs) in the chip. More ALUs means more work-groups or concurrent threads.
Related: CPU vs GPU hardware.
As usual, knowing how things work under the hood helps you increase performance. From the OCL developer's point of view, we only know that:
The work-items in a given work-group execute concurrently on the processing elements of a single compute unit. (OCL specs 1.2 - section 3.2).
This, together with the way SIMT architectures work nowadays, leads to this kind of consideration when speaking about branches (from this post):
Executing both branches happens only if the condition is not coherent between threads in a local work group, that means if the condition evaluates to different values between work items in a local work group, current generation GPUs will execute both branches, but only the correct branches will write values and have side effects.
This is quite correct but doesn't give you any insight into how to avoid divergence (note that here we're still at the work-group level).
But knowing that a work-group is composed of 1 or more warps within which work-items are sharing a PC (not at the work-group level) can sometimes help you to avoid divergence. It's only if some work-items within a warp take different paths that you'll have divergence (both branches are executed). Consider this (source):
if (threadIdx.x > 2) {...} else {...}
and this:
if (threadIdx.x / WARP_SIZE > 2) {...} else {...}
In the first case there will be divergence within the first warp (of 32 threads for NVIDIA), but not in the second case, where the condition evaluates identically for every work-item of a warp, whatever the size of the work-group. Obviously these two examples do not do the same thing, but in some cases you might be able to rearrange your data (or find another trick) to keep the philosophy of the second example.
This may seem remote from reality, but a real-life example is reduction. By ordering your operations in a "SIMD-friendly" structure you can, at each stage, drop whole warps (hence leaving room for warps from another work-group). See the "Taking Advantage Of Commutativity" section of this whitepaper for the full explanation and code.
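As a rough illustration only (a minimal sketch, not the whitepaper's code), a block-level reduction written with sequential addressing keeps the active threads packed into as few warps as possible, so whole warps go idle at each stage instead of diverging lane by lane (blockDim.x assumed to be a power of two no larger than 256):

__global__ void block_sum(const float *in, float *out)
{
    __shared__ float buf[256];
    int t = threadIdx.x;

    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();

    // At each step only threads 0..s-1 are active, i.e. they are packed into
    // the lowest warps; the remaining warps are entirely inactive rather
    // than half-divergent.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            buf[t] += buf[t + s];
        __syncthreads();
    }

    if (t == 0)
        out[blockIdx.x] = buf[0];
}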
