Questions tagged [gpu-warp]

A warp (NVIDIA) or wavefront (AMD) is the basic logical unit of GPU kernel scheduling: the largest set of threads within the grid that execute (logically) in lockstep and are therefore always synchronized with each other.

40 questions
19 votes, 2 answers

How are 2D / 3D CUDA blocks divided into warps?

If I start my kernel with a grid whose blocks have dimensions dim3 block_dims(16,16);, how are the grid's blocks split into warps? Do the first two rows of such a block form one warp, or the first two columns, or is the ordering arbitrary?…
Gabriel • 8,990 • 6 • 57 • 101
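For readers landing here, a minimal sketch of the linearization rule this question is about, assuming the documented CUDA behaviour that threadIdx.x varies fastest, then .y, then .z, and that consecutive groups of 32 linear IDs form a warp; the kernel name and the printing are illustrative only:

    #include <cstdio>

    // Illustrative kernel: each thread computes its linear index within the
    // block and the warp it belongs to. With dim3 block_dims(16,16), the rows
    // y = 0 and y = 1 (linear IDs 0..31) together form warp 0, i.e. warps are
    // built row-by-row because threadIdx.x varies fastest.
    __global__ void show_warp_id()
    {
        int linear_id = threadIdx.x
                      + threadIdx.y * blockDim.x
                      + threadIdx.z * blockDim.x * blockDim.y;
        int warp_id = linear_id / warpSize;   // warpSize is 32 on current NVIDIA GPUs
        int lane_id = linear_id % warpSize;

        if (lane_id == 0)
            printf("thread (%d,%d) starts warp %d\n", threadIdx.x, threadIdx.y, warp_id);
    }

    int main()
    {
        dim3 block_dims(16, 16);
        show_warp_id<<<1, block_dims>>>();
        cudaDeviceSynchronize();
        return 0;
    }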
16 votes, 2 answers

Why bother to know about CUDA Warps?

I have a GeForce GTX 460 SE, so that is 6 SMs x 48 CUDA cores = 288 CUDA cores. It is known that one warp contains 32 threads, and that only one warp at a time can be executed in a block. That is, a single multiprocessor (SM) can…
Alex • 12,578 • 15 • 99 • 195
15 votes, 1 answer

How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?

Note: This question is specific to nVIDIA Compute Capability 2.1 devices. The following information is obtained from the CUDA Programming Guide v4.1: In compute capability 2.1 devices, each SM has 48 SP (cores) for integer and floating point…
Ashwin Nanjappa • 76,204 • 83 • 211 • 292
7 votes, 1 answer

__activemask() vs __ballot_sync()

After reading this post on the CUDA Developer Blog, I am struggling to understand when it is safe/correct to use __activemask() in place of __ballot_sync(). In the section Active Mask Query, the authors wrote: This is incorrect, as it would result in partial sums…
Fabio T. • 109 • 1 • 6
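For context on the two intrinsics this question compares, here is a hedged sketch (kernel and buffer names are invented) of the distinction: __activemask() only reports which lanes happen to be converged at that instruction, while __ballot_sync() takes an explicit participation mask, synchronizes exactly those lanes, and collects their votes:

    #include <cstdio>

    // Illustrative kernel, launched below with a single warp of 32 threads.
    __global__ void vote_demo(unsigned *out)
    {
        int lane = threadIdx.x & 31;
        bool pred = (lane % 3 == 0);

        if (lane < 16) {                              // lanes 16..31 diverge away
            // Reports some subset of lanes 0..15 that happen to be converged
            // here; it does not force them to arrive, so using it as a
            // participation mask can silently drop lanes.
            unsigned observed = __activemask();

            // Waits for exactly the lanes named in the mask (0..15) and
            // returns their votes on the predicate.
            unsigned votes = __ballot_sync(0x0000ffffu, pred);

            if (lane == 0) {
                out[0] = observed;
                out[1] = votes;
            }
        }
    }

    int main()
    {
        unsigned *out;
        cudaMallocManaged(&out, 2 * sizeof(unsigned));
        vote_demo<<<1, 32>>>(out);
        cudaDeviceSynchronize();
        printf("activemask = 0x%08x, ballot = 0x%08x\n", out[0], out[1]);
        cudaFree(out);
        return 0;
    }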
5 votes, 2 answers

CUDA Warp Synchronization Problem

In generalizing a kernel that shifts the values of a 2D array one space to the right (wrapping around the row boundaries), I have come across a warp synchronization problem. The full code is attached and included below. The code is meant to work…
dmc • 63 • 1 • 5
5 votes, 1 answer

Is CUDA warp scheduling deterministic?

I am wondering if the warp scheduling order of a CUDA application is deterministic. Specifically I am wondering if the ordering of warp execution will stay the same with multiple runs of the same kernel with the same input data on the same device.…
NothingMore • 1,211 • 9 • 19
5 votes, 2 answers

Removing __syncthreads() in CUDA warp-level reduction

The following code sums every 32 elements in an array into the very first element of each 32-element group: int i = threadIdx.x; int warpid = i & 31; if(warpid < 16){ s_buf[i] += s_buf[i+16]; __syncthreads(); s_buf[i] +=…
small_potato • 3,127 • 5 • 39 • 45
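As background for this question (and the next one about shuffle sums), a hedged sketch of the shuffle-based alternative: a warp-level sum built on __shfl_down_sync, which needs neither shared memory nor __syncthreads(); the function and kernel names are invented for illustration, and the launch is assumed to use a multiple of 32 threads:

    // Illustrative warp-level sum: after the loop, lane 0 of each warp holds
    // the sum of the 32 values the warp's lanes started with. Each
    // __shfl_down_sync already synchronizes the lanes named by the mask.
    __device__ int warp_reduce_sum(int val)
    {
        const unsigned full_mask = 0xffffffffu;
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(full_mask, val, offset);
        return val;   // complete sum in lane 0; other lanes hold partial sums
    }

    __global__ void per_warp_sums(const int *in, int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int sum = warp_reduce_sum(in[i]);
        if ((threadIdx.x & 31) == 0)
            out[i / 32] = sum;   // one result per warp
    }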
4 votes, 1 answer

Why is my CUDA warp shuffle sum using the wrong offset for one shuffle step?

Edit: I've filed this as a bug at https://developer.nvidia.com/nvidia_bug/3711214. I'm writing a numerical simulation program that is giving subtly-incorrect results in Release mode, but seemingly correct results in Debug mode. The original program…
nanofarad • 40,330 • 4 • 86 • 117
4 votes, 1 answer

What's the alternative for __match_any_sync on compute capability 6?

In the CUDA examples, e.g. here, __match_all_sync / __match_any_sync is used. Here is an example where a warp is split into multiple (one or more) groups that each keep track of their own atomic counter. // increment the value at ptr by 1 and return…
Johan • 74,508 • 24 • 191 • 319
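One workaround often suggested for devices of compute capability below 7.0 (where __match_any_sync is unavailable) is to emulate the match with a loop of shuffles; the helper below is a hedged sketch of that idea and assumes all 32 lanes of the warp call it together:

    // Emulated __match_any_sync for a full warp on pre-Volta hardware:
    // each lane receives a bitmask of the lanes whose 'val' equals its own.
    // Assumption: all 32 lanes of the warp reach this call (mask 0xffffffff).
    __device__ unsigned match_any_emulated(int val)
    {
        const unsigned full_mask = 0xffffffffu;
        unsigned peers = 0;
        for (int lane = 0; lane < 32; ++lane) {
            int other = __shfl_sync(full_mask, val, lane);  // broadcast lane's value
            if (other == val)
                peers |= 1u << lane;                        // that lane holds my value
        }
        return peers;
    }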
3 votes, 1 answer

Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

In CUDA 9, nVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (= CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias or have the semantics changed? ... similar…
einpoklum • 118,144 • 57 • 340 • 684
3 votes, 1 answer

CUDA coalesced access of FP64 data

I am a bit confused about how memory accesses issued by a warp are affected by FP64 data. A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right? I have read that each time a thread in a warp tries…
AstrOne • 3,569 • 7 • 32 • 54
2 votes, 1 answer

What is warp shuffling in CUDA and why is it useful?

From the CUDA Programming Guide: [Warp shuffle functions] exchange a variable between threads within a warp. I understand that this is an alternative to shared memory, and thus it is used for threads within a warp to "exchange" or share values.…
gonidelis • 885 • 10 • 32
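A hedged sketch of the kind of register-to-register exchange the question asks about, with no shared memory involved; the kernel and buffer names are invented, and the launch is assumed to use a multiple of 32 threads:

    // Each lane reads one element, then every lane obtains lane 0's element
    // directly from lane 0's register via __shfl_sync: no shared memory and
    // no __syncthreads() are needed for the exchange.
    __global__ void broadcast_from_lane0(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float mine = in[i];

        // All 32 lanes of the warp participate; lane 0 is the source.
        float lane0_value = __shfl_sync(0xffffffffu, mine, 0);

        out[i] = mine + lane0_value;   // use both values so nothing is optimized away
    }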
2 votes, 2 answers

Compute per-warp histogram without shared memory

Problem: Compute a per-warp histogram of a sorted sequence of numbers held by individual threads in a warp. Example: lane: 0123456789... 31 val: 222244455777799999 .. The result must be held by the N lower threads in a warp (where N is the…
pem • 365 • 2 • 12
2 votes, 2 answers

OpenGL compute shader mapping to nVidia warps

Let's say I have an OpenGL compute shader with local_size=8*8*8. How do the invocations map to nVidia GPU warps? Would invocations with the same gl_LocalInvocationID.x be in the same warp? Or y? Or z? I don't mean all invocations, I just mean…
Danol • 368 • 1 • 15
2 votes, 2 answers

Do the threads in a CUDA warp execute in parallel on a multiprocessor?

A warp is 32 threads. Do the 32 threads execute in parallel on a multiprocessor? If the 32 threads are not executing in parallel, then there is no race condition within the warp. I got this doubt after going through some examples.
kar • 2,505 • 9 • 30 • 32