I need to implement a warp shuffle that looks like this:
In this picture, the number of threads is limited to 8 to make it readable.
According to the NVIDIA SDK and the PTX manual, the shuffle instruction should do the job, specifically the PTX instruction shfl.idx.b32 d[|p], a, b, c;
The manual says:
Each thread in the currently executing warp will compute a source lane
index j based on input operands b and c and the mode. If the computed
source lane index j is in range, the thread will copy the input operand
a from lane j into its own destination register d;
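To check my understanding of that paragraph, here is a small host-side sketch (plain C++, so it should also compile as CUDA host code) of how I read the source-lane computation for the .idx mode. The names ShflSource, packC and shflIdxSource are my own hypothetical helpers, and the layout of c (clamp value in bits 4:0, segment mask in bits 12:8) is my reading of the manual, so treat it as an assumption rather than a reference implementation.

```cpp
#include <cassert>
#include <cstdint>

// Result of the emulated lookup (hypothetical helper type).
struct ShflSource {
    int  lane;     // lane the value would be read from
    bool inRange;  // predicate p: false means the thread keeps its own value
};

// Pack operand c. Assumption: clamp in bits 4:0, segment mask in bits 12:8.
inline std::uint32_t packC(int segMask, int clamp) {
    return (static_cast<std::uint32_t>(segMask & 0x1f) << 8) |
            static_cast<std::uint32_t>(clamp & 0x1f);
}

// Emulate, on the host, which lane shfl.idx would read from for a given
// lane id and operands b and c (my interpretation of the PTX manual).
inline ShflSource shflIdxSource(int laneId, std::uint32_t b, std::uint32_t c) {
    const int lane = laneId & 0x1f;
    const int bval = static_cast<int>(b & 0x1f);
    const int cval = static_cast<int>(c & 0x1f);
    const int mask = static_cast<int>((c >> 8) & 0x1f);

    const int maxLane = (lane & mask) | (cval & ~mask);  // last lane of the segment
    const int minLane = lane & mask;                     // first lane of the segment
    const int j       = minLane | (bval & ~mask);        // requested lane inside the segment
    const bool p      = (j <= maxLane);

    // Out-of-range requests fall back to the thread's own lane.
    return ShflSource{ p ? j : lane, p };
}
```

Under this model, with c = packC(32 - 8, 0x1f) (which is how I understand a segment width of 8), lane 13 with b = 2 would read from lane 10, and with c = packC(0, 0x1f) every lane simply reads from lane b.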
So, by providing proper values of b and c, I should be able to do it by writing a function like this (inspired by the implementation of the __shfl primitive in the CUDA SDK).
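__forceinline__ __device__ float shuffle(float var){
    float ret;
    int srcLane = ???  // operand b: source lane index (value to be determined)
    int c = ???        // operand c: clamp / segment mask (value to be determined)
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
    return ret;
}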
If this is possible, what are the correct values for srcLane and c? I have not been able to determine them (I am using CUDA 8.0).
Best,
Timocafe