Warp shuffling for CUDA

Question

I need to make a warp shuffling that look like this:

On this picture, the number of threads is limited to 8 to make it readable. If I read the Nvidia SDK and ptx manual, the shuffle instruction should do the job, specially the shfl.idx.b32 d[|p], a, b, c; ptx instruction.

From the manual I read:

Each thread in the currently executing warp will compute a source lane
index j based on input operands b and c and the mode. If the computed
source lane index j is in range, the thread will copy the input operand
a from lane j into its own destination register d;

So, providing proper values of b and c, I should be able to do it by writing a function like this (inspired from CUDA SDK __shufl primitive implementation).

  __forceinline__ __device __ float shuffle(float var){
   float ret;
   int srcLane = ???
   int c = ???
   asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
  return ret;

}

If it is possible, what is the constant for srcLane and c? I am not able to determine them (I am using CUDA 8.0) .

Best,

Timocafe

What is the point in writing your own `shuffle` function that calls the ptx `shfl`? If i see this correctly your problem can be solved by calling the normal `__shufl` primitive with the correct parameters. — dari, Mar 09 '18 at 22:39
I was confuse by the __shufl reading documentation I just found instruction to make a a broadcast and not a shuffle. But you are write it is really better to us it directly, — Timocafé, Mar 10 '18 at 06:49

score 4 · Accepted Answer · edited Jul 10 '22 at 13:34

I would recommend doing this with the CUDA intrinsic rather than with PTX (or inline ASM). However the following code demonstrates both methods:

// cat t54.cu
#include <stdio.h>

__global__ void k(){
    int i = threadIdx.x;
    int j = i;
    if (i<4) j*=2;
    if ((i>3) && (i<8)) j-=(7-i);
    int k = __shfl_sync(0x0FFU, i+100, j);
    printf("lane: %d, result: %d\n", i, k);
}

__forceinline__ __device__ float shuffle(float var, int lane){
    float ret;
    int srcLane = lane;
    int c = 0x1F;
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
    return ret;
}

__global__ void k1(){
    int i = threadIdx.x;
    int j = i;
    if (i<4) j*=2;
    if ((i>3) && (i<8)) j-=(7-i);
    float k = shuffle((float)(i+100), j);
    printf("lane: %d, result: %f\n", i, k);
}


int main(){
    k<<<1,8>>>();
    cudaDeviceSynchronize();
    k1<<<1,8>>>();
    cudaDeviceSynchronize();
}

$ nvcc -arch=sm_35 -o t54 t54.cu
$ cuda-memcheck ./t54
========= CUDA-MEMCHECK
lane: 0, result: 100
lane: 1, result: 102
lane: 2, result: 104
lane: 3, result: 106
lane: 4, result: 101
lane: 5, result: 103
lane: 6, result: 105
lane: 7, result: 107
lane: 0, result: 100.000000
lane: 1, result: 102.000000
lane: 2, result: 104.000000
lane: 3, result: 106.000000
lane: 4, result: 101.000000
lane: 5, result: 103.000000
lane: 6, result: 105.000000
lane: 7, result: 107.000000
========= ERROR SUMMARY: 0 errors
$

Using the CUDA intrinsic (the first method) the only real task is to compute the source lane index. Based on your pattern I wrote some code to do that and put it in the variable j.

score 1 · Answer 2 · answered Mar 15 '18 at 18:03

Robert has already and satisfactorily answered this question. I had implemented the code below, showing permutation of a full warp.

#include <stdio.h>

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getchar(); exit(code); }
    }
}

__global__ void shufflingKernel(double *d_data, double *d_result, int *d_perm){

    unsigned mask = __activemask(); 
    int tid = threadIdx.x;
    int srcLane = d_perm[tid];
    double var = d_data[tid];
    //d_result[tid] = __shfl_sync(0xFFFFFFFF, var, srcLane);
    d_result[tid] = __shfl_sync(mask, var, srcLane);
}

int main(){

    const int N = 32;

    double h_data[32] = { 3.4, 42.2, 2., -1., 10., 11., 2., -1., 10., 33., 2.3, 11., 44., 0., -33., -21.,
        4.4, 43.2, 3., -2., 13., 15., 222., -90., 17., 30., 11.3, 7., 22., 100., -30., -91. };
    double *h_result = (double *)malloc(N * sizeof(double));
    int h_perm[32] = { 6, 11, 9, 2, 5, 23, 31, 0, 3, 27, 29, 1, 28, 30, 17, 13, 10, 8, 4, 22, 7, 18, 24, 12, 20,
        19, 16, 26, 21, 15, 25, 14 };

    int *d_perm; gpuErrchk(cudaMalloc(&d_perm, N * sizeof(int)));
    double *d_data; gpuErrchk(cudaMalloc(&d_data, N * sizeof(double)));
    double *d_result; gpuErrchk(cudaMalloc(&d_result, N * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_perm, &h_perm[0], N * sizeof(int), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_data, &h_data[0], N * sizeof(double), cudaMemcpyHostToDevice));

    shufflingKernel << <1, 32>> >(d_data, d_result, d_perm);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_result, d_result, N * sizeof(double), cudaMemcpyDeviceToHost));

    for (int k = 0; k < N; k++) {
        printf("k = %d; Original = %f; New = %f; Check = %f\n", k, h_data[k], h_result[k], h_data[h_perm[k]]);
    }

}

Notice that, instead of using 0xFFFFFFFF for the mask of active threads, it is safer using the warp-level primitive __activemask() in the sense of Shuffle instruction in CUDA not working.

score -2 · Answer 3 · answered Mar 11 '18 at 17:08

What you are trying to do in your shuffle operation is to be able to have dynamically index source lanes on which shuffle operates. One needs to understand that any variation of shuffle command (__shfl, __shfl_up, __shfl_down, __shfl_xor) needs a constant value for its second parameter and this parameter is the same for all lanes within a warp. You can play with grouping of threads within a warp by specifying width. Thus, for example, by specifying

float var = ...
__shfl_xor(var, 3, 4);

the lane permutation will look like:

0 1 2 3
   |
3 2 1 0

So, to answer your question, it's not possible to do it with a single __shuffle operation of any kind. But you can implement it by combining several __shuffle commands with different second parameters.

This is not correct. The source lane parameter need not be the same for all lanes within the warp. Indexed shuffle for example is an **any-to-any** operation, see slide 4 [here](http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf). My answer demonstrates how the operation may be achieved with a single shuffle. — Robert Crovella, Mar 12 '18 at 01:10

Warp shuffling for CUDA

3 Answers3