I'm trying to implement my own 64-bit shuffle function in CUDA. However, if I do it like this:
    static __inline__ __device__ double __shfl_xor(double var, int laneMask, int width=warpSize)
    {
        int hi, lo;
        asm volatile( "mov.b64 { %0, %1 }, %2;" : "=r"(lo), "=r"(hi) : "d"(var) );
        hi = __shfl_xor( hi, laneMask, width );
        lo = __shfl_xor( lo, laneMask, width );
        return __hiloint2double( hi, lo );
    }
Every subsequent call to __shfl_xor then resolves to this 64-bit version, no matter what the type of the argument is. For example, if I write:
    int a;
    a = __shfl_xor( a, 16 );
it still uses the double version. One workaround would be to give the functions different names. But since I call this shuffle function from a template function, different names would mean writing a separate version of the template for 64-bit floating-point types, which is not very neat.
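For context, the template call site looks roughly like the sketch below. The name warp_sum is made up for illustration, and the sketch assumes the legacy non-sync __shfl_xor intrinsic used in the code above:

```cuda
// Hypothetical call site (warp_sum is a made-up name for illustration).
// Butterfly reduction across a warp, templated on the element type T.
// Assumes the legacy non-sync __shfl_xor, as in the code above.
template <typename T>
__device__ T warp_sum(T val)
{
    // XOR masks 16, 8, 4, 2, 1: each step adds the value held by the partner lane.
    for (int mask = warpSize / 2; mask > 0; mask >>= 1)
        val += __shfl_xor(val, mask);
    return val;
}
```

What I am after is that T = int or T = float picks the built-in shuffle, while T = double picks the 64-bit version, all through the single name __shfl_xor.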
So how can I overload __shfl_xor(double, ...) while at the same time making sure that __shfl_xor(int, ...) is still called for int arguments?