2

With CUDA's shfl.idx instruction, we perform what is essentially an intra-warp gather: Each lane provides a datum and an origin lane, and gets the datum of the origin lane.
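For concreteness, a minimal sketch of that gather (illustrative only; it assumes a fully converged warp and uses the `__shfl_sync()` intrinsic, which compiles down to `shfl.sync.idx`):

```
// Intra-warp gather: every lane reads the datum held by origin_lane
// (origin_lane may differ per lane). Assumes all 32 lanes participate.
__device__ int lane_gather(int datum, int origin_lane)
{
    return __shfl_sync(0xFFFFFFFFu, datum, origin_lane);
}
```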

What about the converse operation, scatter? I mean, not scattering to memory, but to lanes. That is, each lane provides a datum and a destination lane; a lane targeted by exactly one other lane ends up with that targeting lane's value, while every other lane ends up with an undefined/arbitrary value.

I'm pretty sure PTX doesn't have something like this. Does it perhaps exist in SASS somehow? If not, is there a better way of implementing this than, say, scattering to shared memory and loading from shared memory, both by lane index?

Vadim Kotov
einpoklum

2 Answers

2

The shuffle operations are all defined in terms of the lane to read from. The CUDA functions map almost directly to the PTX instructions, which themselves map almost directly to the SASS. They're all variations on the operation "Make this value available for others to read, and read the value from the given target lane", with various convenient ways to specify the target lane.
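For illustration (a sketch of mine, not part of the original answer), here are the four CUDA-level variants and the lane each one reads from; note that every one of them names a lane to read, never a lane to write to:

```
// All four intrinsics specify which lane to READ from; none lets a lane
// pick a destination to write to. (Illustrative only; full-warp mask assumed.)
__device__ void shuffle_variants(int v, int src_lane, unsigned delta, int lane_mask)
{
    const unsigned full = 0xFFFFFFFFu;
    int a = __shfl_sync(full, v, src_lane);       // explicit lane index  -> shfl.sync.idx
    int b = __shfl_up_sync(full, v, delta);       // own lane id - delta  -> shfl.sync.up
    int c = __shfl_down_sync(full, v, delta);     // own lane id + delta  -> shfl.sync.down
    int d = __shfl_xor_sync(full, v, lane_mask);  // own lane id ^ mask   -> shfl.sync.bfly
    (void)a; (void)b; (void)c; (void)d;           // results unused in this sketch
}
```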

In general, you should attempt to rejig your function so you don't need the "scatter" operation. There isn't an instruction that does what you want.

Implementing this using the existing warp intrinsics is probably possible, but not obvious. You could use a sequence of shuffles similar to what you'd use for a warp reduction to transmit source lane IDs, and follow up with a final shuffle to fetch the payloads into the needed lanes.
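One possible reading of that suggestion - a sketch of mine, not the answer's code, assuming the destinations form a permutation of the warp and the whole warp is converged - is to first invert the destination mapping so each lane learns which lane is targeting it, then finish with one ordinary gather. Note that the inversion below costs a full warpSize shuffles, not log(warp size), which is exactly the cost concern raised in the comments:

```
// Scatter emulated with shuffles. Lanes that nobody targets end up with an
// arbitrary value, matching the semantics asked for in the question.
__device__ int lane_scatter_shfl(int datum, int dst_lane)
{
    const unsigned full = 0xFFFFFFFFu;
    const int my_lane = threadIdx.x % warpSize;

    // Step 1: invert the mapping - find the lane whose dst_lane equals my_lane.
    // This takes warpSize shuffles, i.e. O(32), not O(log 32).
    int src_lane = 0;
    for (int k = 0; k < warpSize; ++k) {
        if (__shfl_sync(full, dst_lane, k) == my_lane) {
            src_lane = k;
        }
    }

    // Step 2: one final shuffle fetches the payload from the discovered source.
    return __shfl_sync(full, datum, src_lane);
}
```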

Chris Kitching
  • You haven't quite spelled out a concrete way to do this... about using multiple shuffles - that would be prohibitively expensive even if we could limit their number to log(warp size), which doesn't seem to be the case. Also, scatter is an operation which can certainly come up in all sorts of workloads, and is not quite "rejiggable away"; thus, for example, Intel's AVX512 seems to offer some sort of scatter. – einpoklum Feb 08 '18 at 09:27
  • The shared memory approach you outlined in your question might be quickest. Therefore expecting someone to provide a more contorted method using only shuffle ops seems silly. I'm reasonably confident that there is no native scatter of the type you are supposing. If you want to see a new feature in a future NVIDIA GPU, you could request it via the bug reporting portal at http://developer.nvidia.com, simply mark your bug with the RFE keyword in the description. In most respects I view this answer to be spot-on. – Robert Crovella Feb 08 '18 at 23:51
1

As things stand in today's GPUs (Hopper and earlier) - you just don't. There's no hardware support for inter-lane scattering.

So, maybe just do this the straightforward way - via shared memory:

  1. Get a shared memory scratch buffer of 32 elements
  2. Have each lane write its datum to the buffer, at an offset equal to its intended destination
  3. Synchronize (`__syncthreads()`, I guess)
  4. Have each lane read its corresponding buffer element

Assuming no two lanes write to the same place (otherwise the scatter itself would have an undefined result) - this will work, requiring two shared memory operations and the sync.
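A minimal sketch of the four steps above (mine, not necessarily the fastest formulation; it assumes one warp per block so the 32-element buffer isn't contended between warps, and it uses `__syncthreads()` as in step 3, though the comments below debate whether `__syncwarp()` would do):

```
// Lane-to-lane scatter via a shared-memory bounce. Assumes one warp per block
// and that no two lanes name the same destination.
__device__ int lane_scatter_smem(int datum, int dst_lane)
{
    __shared__ int scratch[32];                // step 1: 32-element scratch buffer
    const int my_lane = threadIdx.x % warpSize;

    scratch[dst_lane] = datum;                 // step 2: write to the destination's slot
    __syncthreads();                           // step 3: synchronize
    int result = scratch[my_lane];             // step 4: read back by own lane index
    __syncthreads();                           // keep the buffer safe for reuse
    return result;
}
```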

einpoklum
  • I would expect this to be very significantly more expensive than the "contorted shuffle" I suggested. Note: you don't need `__syncthreads()` between an smem write and an smem read if the reader and writer are both the same warp (you may need one of the memory barrier instructions: check the manual :D. Conceptually you're just bouncing the values off of the L2 (or 3?) cache). This fact is hidden somewhere deep in the manual (although I haven't checked if they changed it for very recent architectures). – Chris Kitching Jan 23 '23 at 20:47
  • "you don't need __syncthreads() between an smem write and an smem read if the reader and writer are both the same warp " <- are you sure that's true? AFAICR, NVIDIA doesn't guarantee this. I'm reminded of [this discussion](https://forums.developer.nvidia.com/t/is-syncthreads-required-within-a-warp/31311/10). – einpoklum Jan 23 '23 at 20:58
  • You'd need a `__syncwarp()` first, certainly: but I'm pretty sure this is true if your warp is converged. I've written kernels that depend on this and `cuda-memcheck` hasn't screamed at me yet :D. The compiler wouldn't be able to optimise out the load, and if you insert a memory barrier then it will be prevented from reordering the load and store in a way that breaks the program. – Chris Kitching Jan 24 '23 at 14:15
  • @ChrisKitching: 1. Wouldn't I also need at least another `__syncwarp()` afterwards? 2. "Pretty sure" is not as good as "NVIDIA guarantees"... – einpoklum Jan 24 '23 at 14:18
  • I'm "pretty sure" based on having read the documentation years ago and written kernels (and a compiler) that exploit these properties. As usual when rules-lawyering in software, having another check is always a good idea ;). The concern being raised in the thread you linked is that the compiler would optimise out the memory operations, which cannot occur if a memory barrier is used. The only remaining issues are divergence (trivial) and memory coherency (you may not even been a memory barrier if you use the correct cache coherency flag). – Chris Kitching Jan 25 '23 at 23:00