Questions tagged [gpu-warp]

A warp (or wavefront, in AMD terminology) is a logical unit in GPU kernel scheduling: the largest set of threads within the grid which (logically) execute in instruction lockstep and are always synchronized with each other.

40 questions
2 votes · 2 answers

How do I do the converse of shfl.idx (i.e. warp scatter instead of warp gather)?

With CUDA's shfl.idx instruction, we perform what is essentially an intra-warp gather: Each lane provides a datum and an origin lane, and gets the datum of the origin lane. What about the converse operation, scatter? I mean, not scattering to…
einpoklum
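
Not an answer to the scatter direction, but a minimal sketch of the gather half the question describes, assuming a full 32-lane warp (block size a multiple of 32, all lanes active):

    // Each lane contributes a value and pulls the value held by its chosen
    // origin lane: an intra-warp gather via __shfl_sync.
    __global__ void warp_gather(const int* data, const int* src_lane, int* out)
    {
        int value = data[threadIdx.x];      // datum this lane contributes
        int from  = src_lane[threadIdx.x];  // origin lane this lane reads from

        // Mask 0xffffffff assumes all 32 lanes of the warp participate.
        int got = __shfl_sync(0xffffffffu, value, from);
        out[threadIdx.x] = got;
    }
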
2 votes · 1 answer

CUDA warp size and control divergence

I have a query about the following question: Suppose we have a 9*7 picture (7 pixels in the x direction and 9 pixels in the y direction); how many warps will have control divergence, assuming blocks of 4*4 threads and 8 threads per warp? How will the…
user915783
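
Not the divergence count itself, but a sketch of the launch arithmetic the question starts from, assuming the usual ceiling-division grid (compile as a .cu file so dim3 is available):

    #include <cstdio>

    int main()
    {
        // Cover a 7 (x) by 9 (y) pixel image with 4x4 thread blocks.
        dim3 block(4, 4);                           // 16 threads per block
        dim3 grid((7 + block.x - 1) / block.x,      // ceil(7/4) = 2 blocks in x
                  (9 + block.y - 1) / block.y);     // ceil(9/4) = 3 blocks in y

        // 2 * 3 = 6 blocks of 16 threads each; with the question's hypothetical
        // 8-thread warps, each block splits into 16 / 8 = 2 warps.
        std::printf("grid %u x %u, %u threads per block\n",
                    grid.x, grid.y, block.x * block.y);
        return 0;
    }
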
2 votes · 2 answers

How does a GPU group threads into warps/wavefronts?

My understanding is that a warp is a group of threads defined at runtime through the task scheduler. One performance-critical part of CUDA is the divergence of threads within a warp. Is there a way to make a good guess of how the hardware will…
user0002128
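
A sketch of the documented mapping: threads are linearized within a block with x varying fastest, and consecutive groups of warpSize linear IDs form a warp. The kernel below only illustrates the index arithmetic:

    #include <cstdio>

    __global__ void show_warp_mapping()
    {
        // Linear thread index within the block (x fastest, then y, then z).
        unsigned linear = threadIdx.x
                        + threadIdx.y * blockDim.x
                        + threadIdx.z * blockDim.x * blockDim.y;

        unsigned warp_id = linear / warpSize;   // which warp within the block
        unsigned lane_id = linear % warpSize;   // which lane within that warp

        if (lane_id == 0)
            printf("block-local warp %u starts at linear thread %u\n",
                   warp_id, linear);
    }
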
2 votes · 1 answer

CUDA warp / block finalization

When a warp finishes a kernel, but another warp of the same block is still running, will the finished warp be blocked until the other warps of the same block finish, or will the finished warp be available for immediate reuse by another block while…
bubnikv
1 vote · 1 answer

Are threads in multi-dimensional CUDA kernel blocks packed to fill warps?

NVIDIA GPUs schedule complete warps to execute instructions together (well, sort of; see also this question). Thus, if we have a "linear" block of, say, 90 threads (or X x Y x Z = 90 x 1 x 1), a GPU core will have three warps to schedule…
einpoklum
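
A sketch of the arithmetic behind the premise, assuming the 32-thread warps of current NVIDIA hardware:

    #include <cstdio>

    int main()
    {
        const unsigned threads_per_block = 90;   // e.g. a 90 x 1 x 1 block
        const unsigned warp_size = 32;

        // Warps are allocated whole, so a partially filled warp still counts.
        unsigned warps = (threads_per_block + warp_size - 1) / warp_size;  // = 3
        unsigned idle  = warps * warp_size - threads_per_block;           // = 6

        std::printf("%u warps, %u idle lanes in the last warp\n", warps, idle);
        return 0;
    }
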
1 vote · 1 answer

Pre 8.x equivalent of __reduce_max_sync() in CUDA

cuda-memcheck has detected a race condition in the code that does the following: condition = /*different in each thread*/; shared int owner[nWarps]; /* ... owner[i] is initialized to blockDim.x+1 */ if(condition) { owner[threadIdx.x/32] =…
Serge Rogatch
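
A commonly used pre-8.x substitute, sketched under the assumption of a full, converged 32-lane warp (the question's conditional, masked case needs the participation mask handled more carefully):

    // Warp-wide max reduction via butterfly shuffles; every lane ends up
    // holding the warp maximum.
    __device__ int warp_reduce_max(int v)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
        {
            int other = __shfl_xor_sync(0xffffffffu, v, offset);
            v = (other > v) ? other : v;
        }
        return v;   // identical in all 32 lanes
    }
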
1 vote · 3 answers

Warp shuffling for CUDA

I need to make a warp shuffle that looks like this: in this picture, the number of threads is limited to 8 to make it readable. If I read the Nvidia SDK and PTX manual, the shuffle instruction should do the job, especially the shfl.idx.b32 d[|p],…
Timocafé
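
The shfl.idx PTX instruction is exposed in CUDA C++ as __shfl_sync; a sketch of one concrete pattern (rotating values by one lane), assuming a full 32-lane warp. The exact pattern from the question's picture is not reproduced here:

    __global__ void rotate_within_warp(const float* in, float* out)
    {
        int   lane = threadIdx.x % 32;
        float v    = in[threadIdx.x];

        // Each lane reads from the next lane, wrapping around: the C++-level
        // counterpart of shfl.idx with a computed source lane.
        float rotated = __shfl_sync(0xffffffffu, v, (lane + 1) % 32);
        out[threadIdx.x] = rotated;
    }
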
1 vote · 1 answer

Do modern nVIDIA GPUs perform sub-warp scheduling of work?

In recent nVIDIA GPU microarchitectures, a single streaming multiprocessor seems to be broken up into 4 sub-units, each of them having horizontal or vertical 'bars' of 8 'squares' corresponding to different functional units: integer ops, 32-bit…
einpoklum
1 vote · 1 answer

Questions about resident warps in CUDA

I have been using CUDA for a month; now I'm trying to work out how many warps/blocks are needed to hide the latency of memory accesses. I think it is related to the maximum number of resident warps on a multiprocessor. According to Table 13 in…
Falofter
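
One way to get at the resident-warp number programmatically: a sketch using the runtime occupancy API, where my_kernel and the 256-thread block size are placeholder assumptions:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* hypothetical kernel */ }

    int main()
    {
        int max_blocks = 0;
        // Resident blocks per SM for this kernel at a 256-thread block size
        // and no dynamic shared memory.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel, 256, 0);

        // 256 threads per block = 8 warps per block.
        std::printf("%d resident blocks/SM -> %d resident warps/SM\n",
                    max_blocks, max_blocks * 256 / 32);
        return 0;
    }
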
1 vote · 2 answers

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array…
einpoklum
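
A sketch of the distinction the question hinges on: in device code warpSize is a runtime value, not a constant expression, so it cannot size a statically allocated array, whereas our own constant can:

    constexpr unsigned kWarpSize = 32;   // our own compile-time constant

    __global__ void scratch_per_warp(int* out)
    {
        // warpSize is not a constant expression, so it cannot size this array:
        // __shared__ int buf[warpSize];  // does not compile
        __shared__ int buf[kWarpSize];    // fine; assumes a one-warp (32-thread) block

        unsigned lane = threadIdx.x % warpSize;   // runtime use of warpSize is fine
        buf[lane] = (int)lane;
        __syncwarp();
        out[threadIdx.x] = buf[(kWarpSize - 1) - lane];   // reversed copy
    }
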
1 vote · 2 answers

CUDA: warp and running time

I have a question about warps on the GPU. I used the following configuration: GeForce 210, CUDA capability major/minor: 1.2, 2 multiprocessors with 8 CUDA cores/MP (16 CUDA cores total), warp size: 32. Below are the running times (I used…
proxiajd
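
Not specific to this card, but a sketch of the usual way to time a kernel with CUDA events; my_kernel and the launch configuration are placeholders:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* hypothetical kernel under test */ }

    int main()
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        my_kernel<<<2, 32>>>();          // e.g. 2 blocks of one warp each
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);      // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("kernel took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }
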
0 votes · 1 answer

In CUDA, how can I get this warp's thread mask in conditionally executed code (in order to execute, e.g., __shfl_sync or .shfl)?

I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl). Basically the relevant part of the kernel might be something like this: int f = d[threadIdx.x]; int warpLeader =…
sg_man
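
A sketch of the two usual options since CUDA 9, assuming full 32-lane warps; which one is appropriate depends on whether lane convergence is guaranteed at that point:

    __global__ void masked_shuffle(const int* d, int* out, int threshold)
    {
        int  f    = d[threadIdx.x];
        bool pred = (f > threshold);

        // Option 1: build the mask *before* diverging; every lane of the warp votes.
        unsigned voted = __ballot_sync(0xffffffffu, pred);

        if (pred)                             // divergent branch
        {
            // Option 2: __activemask() reports the lanes that happen to be
            // executing here (it does not guarantee they are converged).
            unsigned active = __activemask();
            (void)active;

            // Broadcast from the lowest participating lane, using the voted mask.
            int leader = __ffs((int)voted) - 1;
            out[threadIdx.x] = __shfl_sync(voted, f, leader);
        }
    }
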
0 votes · 2 answers

Monitor active warps and threads during a divergent CUDA run

I implemented some CUDA code. It runs fine, but the algorithm inherently produces strong thread divergence. This is expected. I will later try to reduce divergence, but for the moment I would be happy to be able to measure it. Is there an easy way…
Silicomancer
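
Besides the profiler's divergence metrics, a quick in-kernel sketch: count how many lanes of the warp are still active at a point of interest. The kernel and counter names here are illustrative, not the question's code:

    __global__ void divergent_kernel(const int* work,
                                     unsigned long long* active_lane_sum,
                                     unsigned long long* sample_count)
    {
        int x = work[threadIdx.x + blockIdx.x * blockDim.x];

        if (x % 3 == 0)   // stand-in for the algorithm's divergent branch
        {
            unsigned mask   = __activemask();          // lanes that took this path
            unsigned active = (unsigned)__popc(mask);  // how many of them
            unsigned lane   = threadIdx.x % 32;

            // The lowest active lane records one sample per warp per visit;
            // afterwards, average occupancy = active_lane_sum / sample_count.
            if (lane == (unsigned)(__ffs((int)mask) - 1))
            {
                atomicAdd(active_lane_sum, (unsigned long long)active);
                atomicAdd(sample_count, 1ull);
            }
        }
    }
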
0 votes · 1 answer

Why use thread blocks larger than the number of cores per multiprocessor

I have an NVIDIA GeForce GTX 960M graphics card, which has the following specs: multiprocessors: 5, cores per multiprocessor: 128 (i.e. 5 x 128 = 640 cores in total), max threads per multiprocessor: 2048, max block size (x, y, z): (1024, 1024,…
Numaerius
0 votes · 2 answers

CUDA shared memory and warp synchronization

The following host code (test.c) and device code (test0.cu) are intended to give the same result. test.c: $ cat test.c #include #include int main() { int data[32]; int dummy[32]; for (int i = 0; i < 32; i++) …
nglee
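
A sketch of the usual remedy when warp-synchronous shared-memory code stops behaving like the host version on newer GPUs: make the intra-warp exchange explicit with __syncwarp(). The kernel below is illustrative, not the question's test0.cu:

    // Each lane writes its value to shared memory and then reads its
    // neighbour's; the __syncwarp() calls make the exchange well-defined
    // even though all 32 threads belong to one warp. Assumes a 32-thread block.
    __global__ void neighbour_exchange(const int* in, int* out)
    {
        __shared__ int buf[32];
        unsigned lane = threadIdx.x % 32;

        buf[lane] = in[threadIdx.x];
        __syncwarp();                   // writes visible to the whole warp

        int neighbour = buf[(lane + 1) % 32];
        __syncwarp();                   // don't reuse buf before all reads finish

        out[threadIdx.x] = neighbour;
    }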