Overloading the CUDA shuffle function makes the original ones invisible

Question

I'm trying to implement my own 64-bit shuffle function in CUDA. However, if I do it like this:

static __inline__ __device__ double __shfl_xor(double var, int laneMask, int width=warpSize)
{
    int hi, lo;
    asm volatile( "mov.b64 { %0, %1 }, %2;" : "=r"(lo), "=r"(hi) : "d"(var) );
    hi = __shfl_xor( hi, laneMask, width );
    lo = __shfl_xor( lo, laneMask, width );
    return __hiloint2double( hi, lo );
}

After this definition, every subsequent call to __shfl_xor resolves to this 64-bit version, no matter what the type of the argument is. For example, if I write

int a;
a = __shfl_xor( a, 16 );

it still uses the double version. A workaround might be to use a different function name. But since I'm calling this shuffle function from a template function, a different name means I would have to write a separate version of the template for 64-bit floating-point values, which is not very neat.

So how can I overload __shfl_xor(double,...) while at the same time still making sure that __shfl_xor(int,...) is called where appropriate?

score 2 · Answer 1 · answered Apr 12 '13 at 06:31

All integral types and float can be implicitly converted to double. When given a choice between the built-in function and your specialized double function, the compiler here might be picking yours for all types.
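To make the conversion argument concrete, here is a minimal host-side C++ sketch (no GPU needed). The names fake_shfl, call_same_scope, inner and call_nested_scope are hypothetical stand-ins, not CUDA APIs. It assumes two scenarios: the double overload sits in the same scope as an int overload, or it sits in a nested scope. The question does not say where the overload was declared, so the nested case is only one possible way the described behavior could arise.

```cpp
// Hypothetical stand-in for a built-in int overload declared at global scope.
inline char fake_shfl(int) { return 'i'; }

// Case 1: a double overload in the same scope joins the overload set.
inline char fake_shfl(double) { return 'd'; }

// An int argument matches fake_shfl(int) exactly, so that overload is chosen.
inline char call_same_scope() { return fake_shfl(1); }

namespace inner {
    // Case 2 (an assumption): a double overload declared in a nested scope
    // hides the global overloads for unqualified calls made from that scope.
    inline char fake_shfl(double) { return 'd'; }

    // Only inner::fake_shfl(double) is found, so the int is converted to double.
    inline char call_nested_scope() { return fake_shfl(1); }
}
```

If both overloads are visible in one scope, the exact int match should win. The double version can only take over for every type when it is the sole candidate the compiler sees.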

Have you tried creating a function with a different name, and overloading it both for your specialized double variant and as pass-through wrappers for the other types?

For example:

static __inline__ __device__ double foo_shfl_xor(double var, int laneMask, int width=warpSize)
{
    // Your double shuffle implementation
}

static __inline__ __device__ int foo_shfl_xor(int var, int laneMask, int width=warpSize)
{
    // For every non-double data type you use
    // Just call the original shuffle function
    return __shfl_xor(var, laneMask, width);
}

// Your code that uses shuffle
double d;
int a;
foo_shfl_xor(d, ...); // Calls your custom shuffle
foo_shfl_xor(a, ...); // Calls default shuffle
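Since the question mentions calling the shuffle from a template function, here is a rough host-side C++ model of how the overloaded wrappers could be used from a single template. It is only a sketch: fake_shfl_xor is a hypothetical stand-in for the built-in (it models a warp where every lane holds the same value, so the exchange is an identity), and butterfly_step, int_calls and double_calls are illustrative names that do not appear in the original code.

```cpp
#include <cstring>

// Hypothetical host stand-in for the built-in 32-bit shuffle. It models a
// warp in which every lane holds the same value, so it returns its input.
inline int fake_shfl_xor(int var, int /*laneMask*/) { return var; }

// Counters used only to observe which overload was selected.
inline int& int_calls()    { static int n = 0; return n; }
inline int& double_calls() { static int n = 0; return n; }

// Wrapper for int: forwards to the 32-bit shuffle.
inline int foo_shfl_xor(int var, int laneMask) {
    ++int_calls();
    return fake_shfl_xor(var, laneMask);
}

// Wrapper for double: split into two 32-bit halves, shuffle each, recombine.
inline double foo_shfl_xor(double var, int laneMask) {
    static_assert(sizeof(double) == 2 * sizeof(int), "assumes 32-bit int");
    ++double_calls();
    int parts[2];
    std::memcpy(parts, &var, sizeof var);   // copy out the two halves
    parts[0] = fake_shfl_xor(parts[0], laneMask);
    parts[1] = fake_shfl_xor(parts[1], laneMask);
    std::memcpy(&var, parts, sizeof var);   // reassemble the double
    return var;
}

// One butterfly step of a reduction, written once for every wrapped type.
template <typename T>
T butterfly_step(T x, int laneMask) {
    return x + foo_shfl_xor(x, laneMask);   // overload chosen from the type T
}
```

With this layout the template body is written once, and overload resolution on foo_shfl_xor picks the int or the double path from the deduced type T.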

It works. Thanks a lot. Yet I am still wondering if there is any simpler solution. — Rainn, Apr 12 '13 at 15:16
