
My understanding is that a warp is a group of threads defined at runtime through the task scheduler. One performance-critical aspect of CUDA is the divergence of threads within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block?

For instance, if I start a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell that (or at least make a good guess) from the thread index?

Knowing this, one could minimize the divergence of threads within a given warp.
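For concreteness, here is a minimal sketch of what I would like to do, assuming (this is exactly the guess I want to confirm) that a warp is formed from 32 consecutive `threadIdx.x` values; `warpSize` is CUDA's built-in warp-size constant:

```
__global__ void warpAlignedBranch(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int warpInBlock = threadIdx.x / warpSize;   // assumes a 1D block

    // If the guess is right, every thread of a warp evaluates this
    // condition identically, so no warp is split across the branches.
    if (warpInBlock % 2 == 0) {
        data[tid] *= 2.0f;
    } else {
        data[tid] += 1.0f;
    }
}
```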

    You may be interested in my answer to [this question](http://stackoverflow.com/questions/15044671/how-is-the-2d-thread-blocks-padded-for-warp-scheduling/15044884#15044884). – Robert Crovella Mar 07 '13 at 16:14
  • This question is not actually specific to CUDA; it applies when using OpenCL as well. – einpoklum Nov 09 '16 at 09:26

2 Answers


The thread arrangement inside the warp is implementation dependent, but so far I have always observed the same behavior:

A warp is composed of 32 threads, but the warp scheduler will issue one instruction for half a warp (16 threads) at a time:

  • If you use 1D blocks (only the threadIdx.x dimension is valid), then the warp scheduler will issue one instruction for threadIdx.x = (0..15), (16..31), ... etc.

  • If you use 2D blocks (the threadIdx.x and threadIdx.y dimensions are valid), then the warp scheduler will try to issue in the following fashion:

threadIdx.y = 0, threadIdx.x = (0..15), (16..31), ... etc.

So threads with consecutive threadIdx.x components will execute the same instruction in groups of 16 (see the sketch below).
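As a sketch of that ordering in code (again, implementation dependent, so treat it as a good guess rather than a guarantee; device-side printf requires compute capability 2.0 or later):

```
#include <cstdio>

__global__ void whereAmI()
{
    // Linearize the 2D thread index: threadIdx.x varies fastest,
    // then threadIdx.y, matching the ordering described above.
    int linear   = threadIdx.y * blockDim.x + threadIdx.x;
    int warpId   = linear / warpSize;         // 32-thread warp
    int halfWarp = linear / (warpSize / 2);   // 16-thread half-warp

    printf("thread (%d,%d) -> linear %d, warp %d, half-warp %d\n",
           threadIdx.x, threadIdx.y, linear, warpId, halfWarp);
}
```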

  • This answer is correct only for devices with compute architecture 1.x. In devices with compute architecture 2.x and 3.x the warps are issued in groups of 32. – RoBiK Mar 07 '13 at 10:07
  • well yes, and no. One warp scheduler will issue for only half a warp anyway, so for example the two warp schedulers present in Fermi can issue one instruction for warp X and another one for warp Y, each of which will be executed by 16 threads ... – Dredok Mar 07 '13 at 10:21
  • yes and no indeed :) To be precise, the Fermi scheduler schedules the whole warp, but the warp is issued in two half-warps. Kepler schedules and issues the whole warp. But when talking about thread divergence, the important thing is how the warp is scheduled; whether it is afterwards issued at once or in two steps makes no difference. The important thing is that a warp's PC cannot have two different values for the two half-warps; they have to execute the same instruction. – RoBiK Mar 07 '13 at 10:53
  • Oh, and when talking about how parts of a warp are actually issued over multiple clock cycles: devices of compute capability 1.x actually issued a warp over 4 cycles. – RoBiK Mar 07 '13 at 10:58
  • yep you are right. The entire warp will have to "wait" due to any divergence in any half warp. – Dredok Mar 07 '13 at 11:14
  • In nVIDIA's Pascal cards (Compute Capability 6.x, GTX 1080 and Tesla P100), issues are of quarter-warps rather than half-warps. – einpoklum Nov 09 '16 at 09:28

A warp consists of 32 threads that are executed at the same time: at any given moment, a batch of 32 threads executing together on the GPU is called a warp.

I haven't found anywhere that states that you can control what warp is going to execute next; the only things you know are that it consists of 32 threads and that a thread block should always be a multiple of that number.

Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block.
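As a minimal sketch of that intra-block sharing and synchronization (the block size of 256 is just an assumption for illustration):

```
__global__ void reverseWithinBlock(int *out, const int *in)
{
    __shared__ int buf[256];                  // assumes blockDim.x == 256
    int t = threadIdx.x;

    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                          // every thread in this block waits here

    // Only after the barrier is it safe to read another thread's slot.
    out[blockIdx.x * blockDim.x + t] = buf[blockDim.x - 1 - t];
}
```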

There is also this, regarding memory operations and latency:

When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. A Tesla supports up to 32 active warps on each multiprocessor, and a Fermi supports up to 48. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.

source

With regard to dividing thread blocks into warps, I have found this:

if the block is 2D or 3D, the threads are ordered by first dimension, then second, then third – then split into warps of 32

source
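Putting that quoted ordering into code (a sketch of the general 3D case; the ordering itself is the quoted source's claim, not something CUDA guarantees):

```
// x varies fastest, then y, then z; warps are then
// consecutive runs of 32 linear IDs.
__device__ int linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ int warpIdInBlock()
{
    // e.g. in a 16x16 block, warp 0 covers rows y = 0 and y = 1
    return linearThreadId() / warpSize;
}
```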
