CUDA 9 shfl vs. shfl_sync

Question

Since CUDA 9 the shfl instructions are deprecated and should be replaced by shfl_sync.

But how should i replace them, when they behave differently?

Code Example:

__global__
static void shflTest(){
    int tid = threadIdx.x;
    float value = tid + 0.1f;
    int* ivalue = reinterpret_cast<int*>(&value);

    //use the integer shfl
    int ix = __shfl(ivalue[0],5,32);
    int iy = __shfl_sync(ivalue[0],5,32);

    float x = reinterpret_cast<float*>(&ix)[0];
    float y = reinterpret_cast<float*>(&iy)[0];

    if(tid == 0){
        printf("shfl tmp %d %d\n",ix,iy);
        printf("shfl final %f %f\n",x,y);
    }
}

int main()
{
    shflTest<<<1,32>>>();
    cudaDeviceSynchronize();
    return 0;
}

Output:

shfl tmp 1084437299 5
shfl final 5.100000 0.000000

score 13 · Accepted Answer · answered Sep 21 '17 at 14:50

13

If you read the CUDA 9RC programming guide (section B.15), installed with your copy of CUDA 9RC, you will see that the new __shfl_sync() function has an additional mask parameter which you are not accounting for:

CUDA 8:

int __shfl(int var, int srcLane, int width=warpSize);

CUDA 9:

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
              ^^^^^^^^^^^^^

The expectation for this mask parameter is also indicated:

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread's lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

Therefore, if we modify your code to conform with this, we get the expected result:

$ cat t419.cu
#include <stdio.h>

__global__
static void shflTest(int lid){
    int tid = threadIdx.x;
    float value = tid + 0.1f;
    int* ivalue = reinterpret_cast<int*>(&value);

    //use the integer shfl
    int ix = __shfl(ivalue[0],5,32);
    int iy = __shfl_sync(0xFFFFFFFF, ivalue[0],5,32);

    float x = reinterpret_cast<float*>(&ix)[0];
    float y = reinterpret_cast<float*>(&iy)[0];

    if(tid == lid){
        printf("shfl tmp %d %d\n",ix,iy);
        printf("shfl final %f %f\n",x,y);
    }
}

int main()
{
    shflTest<<<1,32>>>(0);
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -arch=sm_61 -o t419 t419.cu
t419.cu(10): warning: function "__shfl(int, int, int)"
/usr/local/cuda/bin/..//include/sm_30_intrinsics.hpp(152): here was declared deprecated ("__shfl() is deprecated in favor of __shfl_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

$ cuda-memcheck ./t419
========= CUDA-MEMCHECK
shfl tmp 1084437299 1084437299
shfl final 5.100000 5.100000
========= ERROR SUMMARY: 0 errors
$

answered Sep 21 '17 at 14:50

Robert Crovella

143,785
11
213
257

4

Is it save to replace all `shfl_xx(...)` with `shfl_xx_sync(0xFFFFFFFF,...)` even when the width is not 32? – dari Sep 21 '17 at 15:04
Yes, that makes sense to me. Do you know how the width parameter works? – Robert Crovella Sep 21 '17 at 15:08
The width parameter does some sort of 'modulo' operation on the target lane id of a shuffle. So a `shfl(tid,0,16)` should give 0 for the first 16 threads and 16 for the last 16 threads. So my guess is that it does something like this: `actualTargetLaneId = targetLaneId % width + targetLaneId / width * width` – dari Sep 21 '17 at 15:23
Yes, the width parameter breaks the full warp width (32 lanes) into a set of equal-sized power-of-2 subgroups, e.g. 4 groups of 8, or 8 groups of 4. Within each subgroup, the same shuffle operation takes place, modulo the group width. So if you have 4 groups of 8, but you want the shuffle operation to be performed in each of the 4 groups of 8, then you should still pass a mask parameter of `0xFFFFFFFF`. In short, the mask parameter and the width parameter are mostly orthogonal to each other. – Robert Crovella Sep 21 '17 at 15:37
Ok thanks, I think I understand everything now. I replaced all shfl instructions with shfl_sync and my code still works. By the way the compiler seems to ignore the `mask` parameter completely for non volta architectures. I tested it by using the mask `0x0` for all shfl instructions and it still works :). I mean this kinda makes sense because all threads in a warp are guaranteed to execute the same instruction. – dari Sep 21 '17 at 16:01
Any idea why you get a warning from sm_30_intrinsics.hpp when you are compiling for sm_61? – Simon Huckett Sep 18 '18 at 15:57
Because I have both `__shfl` and `__shfl_sync` in the code, and `__shfl` is deprecated, for all architectures. Since `__shfl` is available from sm_30 forward, the `sm_30_intrinsics.hpp` is included (by `nvcc`) to provide the prototype for all architectures. – Robert Crovella Sep 18 '18 at 16:00

CUDA 9 shfl vs. shfl_sync

1 Answers1

Linked