A warp or wavefront is a logical unit in GPU kernel scheduling: the largest set of threads within the grid which are (logically) instruction-locked and always synchronized with each other.
If I start my kernel with a grid whose blocks have dimensions:
dim3 block_dims(16,16);
How are the grid blocks now split into warps? Do the first two rows of such a block form one warp, or the first two columns, or is the ordering arbitrary?…
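For reference, threads of a block are linearized with threadIdx.x varying fastest and consecutive groups of 32 linear IDs forming the warps; a minimal sketch (variable names are mine) of computing a thread's warp and lane for the 16x16 block above:

#include <cstdio>

__global__ void which_warp()
{
    // Linear thread index within the block: x varies fastest, then y, then z.
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;

    int warp = linear / 32;   // warp index within the block
    int lane = linear % 32;   // lane index within the warp

    // For a 16x16 block, rows y = 0 and y = 1 give linear IDs 0..31,
    // i.e. the first two rows of the block form warp 0.
    printf("thread (%d,%d) -> warp %d, lane %d\n",
           threadIdx.x, threadIdx.y, warp, lane);
}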
I have a GeForce GTX 460 SE, so it has: 6 SMs x 48 CUDA cores = 288 CUDA cores.
It is known that one warp contains 32 threads, and that only one warp of a block can be executed at a time.
That is, a single multiprocessor (SM) can…
Note: This question is specific to nVIDIA Compute Capability 2.1 devices. The following information is obtained from the CUDA Programming Guide v4.1:
In compute capability 2.1 devices, each SM has 48 SP (cores)
for integer and floating point…
After reading this post on the CUDA Developer Blog, I am struggling to understand when it is safe/correct to use __activemask() in place of __ballot_sync().
In the section Active Mask Query, the authors wrote:
This is incorrect, as it would result in partial sums…
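For context, the pattern the blog warns about looks roughly like the following (my own sketch, not the blog's exact code): taking __activemask() inside a data-dependent branch only reports which lanes happen to be converged there; it does not force all eligible lanes to participate.

#define FULL_MASK 0xffffffffu

// Unsafe sketch: if the warp is diverged when __activemask() is called, the
// returned mask can cover only a subset of the lanes that satisfy the branch,
// so the shuffle reduction produces partial sums.
__device__ float unsafe_sum(const float *input, int n)
{
    float val = 0.0f;
    if (threadIdx.x < n) {
        unsigned mask = __activemask();            // may be a partial mask
        val = input[threadIdx.x];
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(mask, val, offset);
    }
    return val;   // only the lowest lane of the mask holds a meaningful sum
}

// Safer sketch: __ballot_sync() with the full mask is executed by every lane,
// reconverges the warp, and returns exactly the set of lanes whose predicate is true.
__device__ float safe_sum(const float *input, int n)
{
    unsigned mask = __ballot_sync(FULL_MASK, threadIdx.x < n);
    float val = 0.0f;
    if (threadIdx.x < n) {
        val = input[threadIdx.x];
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(mask, val, offset);
    }
    return val;
}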
While generalizing a kernel that shifts the values of a 2D array one space to the right (wrapping around at the row boundaries), I have come across a warp synchronization problem. The full code is attached and included below.
The code is meant to work…
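Since the full code is elided here, a minimal sketch of the operation being described (my own names and signature, assuming the array lives in global memory) would be:

__global__ void shift_right(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int src = (x - 1 + width) % width;        // left neighbour, wrapping within the row
        out[y * width + x] = in[y * width + src];
    }
}

Writing into a separate output buffer sidesteps the hazard of an in-place shift, where a thread may read an element that another thread in the same (or a different) warp has already overwritten.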
I am wondering if the warp scheduling order of a CUDA application is deterministic.
Specifically I am wondering if the ordering of warp execution will stay the same with multiple runs of the same kernel with the same input data on the same device.…
The following code sums every 32 elements in an array to the very first element of each 32 element group:
int i = threadIdx.x;
int warpid = i & 31;          // actually the lane index within the warp
if (warpid < 16) {
    s_buf[i] += s_buf[i + 16];
    __syncthreads();
    s_buf[i] +=…
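The snippet is cut off above; for comparison, a sketch of the same per-32-element sum written with warp shuffles (CUDA 9+), which avoids both shared memory and implicit warp-synchronous assumptions, could look like this (assuming the buffer length matches the grid and blockDim.x is a multiple of 32):

__global__ void sum_groups_of_32(float *buf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = buf[i];

    // Tree reduction within the warp: after the loop, lane 0 holds the warp's sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if ((threadIdx.x & 31) == 0)
        buf[i] = v;            // first element of each 32-element group
}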
Edit: I've filed this as a bug at https://developer.nvidia.com/nvidia_bug/3711214.
I'm writing a numerical simulation program that is giving subtly-incorrect results in Release mode, but seemingly correct results in Debug mode. The original program…
In the CUDA examples, e.g. here, __match_all_sync and __match_any_sync are used.
Here is an example where a warp is split into multiple (one or more) groups that each keep track of their own atomic counter.
// increment the value at ptr by 1 and return…
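The code is truncated here; a sketch of the general pattern (my reconstruction, not necessarily the linked example verbatim, and it assumes compute capability 7.0+ for __match_any_sync) looks like this:

// Lanes holding the same pointer form a group; the group's lowest lane performs
// one atomicAdd for everyone, and each lane reconstructs the value it would have
// seen from its rank within the group.
__device__ int atomic_agg_inc(int *ptr)
{
    unsigned active = __activemask();
    unsigned group  = __match_any_sync(active, (unsigned long long)ptr);
    int lane   = threadIdx.x & 31;
    int leader = __ffs(group) - 1;                    // lowest lane in my group
    int rank   = __popc(group & ((1u << lane) - 1));  // my position within the group

    int prev = 0;
    if (lane == leader)
        prev = atomicAdd(ptr, __popc(group));         // one atomic per group
    prev = __shfl_sync(group, prev, leader);          // broadcast leader's old value
    return prev + rank;                               // the value this lane "incremented from"
}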
In CUDA 9, nVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (as of CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias, or have the semantics changed?
... similar…
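Not an answer, but a sketch showing the signature change (names are mine): the _sync variant takes an explicit mask of lanes expected to participate, and every lane named in the mask must reach the call.

__global__ void count_positive(const int *data, int *per_warp_count)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    bool pred = data[i] > 0;

    // Legacy (deprecated since CUDA 9): unsigned bits = __ballot(pred);
    unsigned bits = __ballot_sync(0xffffffffu, pred);  // bit k set iff lane k's pred is true

    if (lane == 0)
        per_warp_count[i / 32] = __popc(bits);         // one result per warp
}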
I am a bit confused with how memory access issued by a warp is affected by FP64 data.
A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?
I have read that each time a thread in a warp tries…
From the CUDA Programming Guide:
[Warp shuffle functions] exchange a variable between threads within a warp.
I understand that this is an alternative to shared memory, so it is used by threads within a warp to "exchange" or share values.…
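A minimal sketch of such an exchange (my own example): lane 0's value is broadcast to every other lane of its warp without any shared memory.

__global__ void broadcast_lane0(const int *in, int *out)
{
    int v = in[threadIdx.x];

    // Every lane of the warp receives the value held by lane 0.
    int first = __shfl_sync(0xffffffffu, v, 0);

    out[threadIdx.x] = first;
}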
Problem
Compute a per-warp histogram of sorted sequence of numbers held by individual threads in a warp.
Example:
lane: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 ... 31
val:  2  2  2  2  4  4  4  5  5  7  7  7  7  9  9  9  9  9  ..
The result must be held by N lower threads in a warp (where N is the…
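One possible approach (a sketch, not the asker's solution; it assumes compute capability 7.0+ for __match_any_sync and writes the compacted result to memory rather than directly into the registers of the low lanes): group lanes by value, let each group's lowest lane emit (value, count), and index the output by the number of group leaders below it.

// Returns N, the number of distinct values in the warp; out_vals[0..N-1] and
// out_counts[0..N-1] receive the histogram, ordered by value (input is sorted).
__device__ int warp_histogram(int val, int *out_vals, int *out_counts)
{
    const unsigned FULL = 0xffffffffu;
    int lane = threadIdx.x & 31;

    unsigned same  = __match_any_sync(FULL, val);               // lanes holding my value
    unsigned heads = __ballot_sync(FULL, lane == __ffs(same) - 1);

    if (lane == __ffs(same) - 1) {                              // I am my group's leader
        int slot = __popc(heads & ((1u << lane) - 1));          // number of leaders below me
        out_vals[slot]   = val;
        out_counts[slot] = __popc(same);                        // run length of my value
    }
    return __popc(heads);                                       // N = number of distinct values
}

The output arrays would typically be small per-warp buffers in shared memory; the N lower lanes can then read back (out_vals[lane], out_counts[lane]).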
Let's say I have an OpenGL compute shader with local_size=8*8*8. How do the invocations map to nVidia GPU warps? Would invocations with the same gl_LocalInvocationID.x be in the same warp? Or y? Or z? I don't mean all invocations, I just mean…
A warp is 32 threads. Do the 32 threads execute in parallel on a multiprocessor?
If the 32 threads are not executing in parallel, then there is no race condition within the warp.
I got this doubt after going through some examples.
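A small sketch of why lockstep execution does not make intra-warp read-modify-writes safe (and since Volta, threads of a warp can be scheduled independently anyway):

__global__ void count_threads(int *counter)
{
    // Racy even between lanes of the same warp: each lane reads, increments,
    // and writes back, so updates can be lost.
    // (*counter)++;

    // Well-defined regardless of which warp or lane the thread belongs to.
    atomicAdd(counter, 1);
}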