With CUDA's shfl.idx instruction, we can perform what is essentially an intra-warp gather: each lane provides a datum and an origin lane, and receives the datum of the origin lane.
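For concreteness, here is a minimal sketch of the gather I mean, written with the `__shfl_sync` intrinsic. The names `warp_gather`, `gather_demo` and `run_gather_demo` are made up for illustration, and the sketch assumes a 1D block with all 32 lanes of the warp participating (hence the full mask).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Intra-warp gather: each lane returns the value held by lane `origin_lane`.
// Assumption: all 32 lanes of the warp call this together (full mask).
__device__ int warp_gather(int value, int origin_lane)
{
    return __shfl_sync(0xffffffffu, value, origin_lane);
}

// Illustrative kernel: every lane reads the datum of its right-hand neighbour.
__global__ void gather_demo(int* out)
{
    int lane   = threadIdx.x % warpSize;   // assumes a 1D block
    int value  = 100 + lane;               // this lane's datum
    int origin = (lane + 1) % warpSize;    // lane to read from
    out[lane]  = warp_gather(value, origin);
}

// Host-side check: launches one warp and verifies the gathered values.
bool run_gather_demo()
{
    int* d_out = nullptr;
    int  h_out[32];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    gather_demo<<<1, 32>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int lane = 0; lane < 32; ++lane)
        if (h_out[lane] != 100 + (lane + 1) % 32) return false;
    return true;
}

int main()
{
    std::printf("%s\n", run_gather_demo() ? "gather ok" : "gather failed");
    return 0;
}
```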
What about the converse operation, scatter? I don't mean scattering to memory, but to lanes. That is, each lane provides a datum and a destination lane; a lane targeted by exactly one other lane ends up with the targeting lane's value, and all other lanes end up with an undefined/arbitrary value.
I'm pretty sure PTX doesn't have anything like this. Does it perhaps exist in SASS somehow? If not, is there a better way of implementing it than, say, having each lane write its datum to shared memory at its destination lane's index and then read back from shared memory at its own lane index?
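To make the shared-memory baseline explicit, this is roughly what I have in mind. It is only a sketch: `warp_scatter_smem`, `scatter_demo` and `run_scatter_demo` are hypothetical names, the block is assumed to be a single 1D warp, and every destination lane is assumed to lie in [0, 32). Lanes that nobody targets simply read whatever was in their scratch slot, which matches the "undefined/arbitrary value" semantics above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Intra-warp scatter through shared memory.
// `warp_scratch` must point to kWarpSize ints reserved for this warp.
// Assumption: all 32 lanes call this together and 0 <= dest_lane < 32.
__device__ int warp_scatter_smem(int value, int dest_lane, int* warp_scratch)
{
    int lane = threadIdx.x % kWarpSize;    // assumes a 1D block
    warp_scratch[dest_lane] = value;       // write into the destination's slot
    __syncwarp();                          // order the writes before the reads
    int result = warp_scratch[lane];       // read this lane's own slot
    __syncwarp();                          // allow the scratch to be reused
    return result;
}

// Illustrative kernel: lane s sends 100 + s to lane (5 * s + 3) % 32,
// which is a permutation of the lanes because 5 is coprime with 32.
__global__ void scatter_demo(int* out)
{
    __shared__ int scratch[kWarpSize];     // one warp per block in this demo
    int lane  = threadIdx.x % kWarpSize;
    int value = 100 + lane;
    int dest  = (5 * lane + 3) % kWarpSize;
    out[lane] = warp_scatter_smem(value, dest, scratch);
}

// Host-side check: each destination lane must hold its sender's datum.
bool run_scatter_demo()
{
    int* d_out = nullptr;
    int  h_out[kWarpSize];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    scatter_demo<<<1, kWarpSize>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int src = 0; src < kWarpSize; ++src)
        if (h_out[(5 * src + 3) % kWarpSize] != 100 + src) return false;
    return true;
}

int main()
{
    std::printf("%s\n", run_scatter_demo() ? "scatter ok" : "scatter failed");
    return 0;
}
```

This costs one shared-memory store, one load and two warp-level barriers per scatter, plus 32 words of shared memory per warp, which is the overhead I'd like to avoid if a register-only alternative exists.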