CUDA's warp shuffle for double-precision data

Question

A CUDA program should do reduction for double-precision data, I use Julien Demouth's slides named "Shuffle: Tips and Tricks"

the shuffle function is below:

/*for shuffle of double-precision point */
__device__ __inline__ double shfl(double x, int lane)
{
    int warpSize = 32;
    // Split the double number into 2 32b registers.
    int lo, hi;
    asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
    // Shuffle the two 32b registers.
    lo = __shfl_xor(lo,lane,warpSize);
    hi = __shfl_xor(hi,lane,warpSize);
    // Recreate the 64b number.
    asm volatile("mov.b64 %0,{%1,%2};":"=d"(x):"r"(lo),"r"(hi));
    return x;
}

At present, I got the errors below while compiling the program.

ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 71; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 271; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 287; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 302; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 317; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 332; error   : Arguments mismatch for instruction 'mov'
ptxas fatal   : Ptx assembly aborted due to errors
make: *** [csr_double] error 255

Could someone give some advice?

I believe your problem may be the braces. I believe they have a specific meaning to gcc (at least on the i386) and get consumed before being passed on to the assembler. It would be interesting to see what the -S output looks like. Specifically the APP section for your inline asm. — David Wohlferd, Jun 07 '14 at 08:43
@DavidWohlferd: The braces are perfectly correct, and it isn't gcc which compile the code, it is NVIDIA's lvmm based GPU compiler. — talonmies, Jun 07 '14 at 08:53
@talonmies That makes more sense. It sure looks like gcc, but if it was, it could never have worked. — David Wohlferd, Jun 07 '14 at 09:00
@DavidWohlferd: The CUDA inline asembler started out life as being "unofficial" and "gcc like" because the first support came via the vestigial ASM supported by the gcc 2.95 front end in the open64 derived compiler NVIDIA orginally used. Over time it evolved quite a lot, now even nested braces for register scoping is supported. PTX is an SSA language, so the requirements are a bit different to a more conventional instruction set. — talonmies, Jun 07 '14 at 09:08

score 4 · Accepted Answer · answered Jun 07 '14 at 08:52

4

There is a syntax error in the inline assembly instruction for the load of the double argument to 32 bit registers. This:

asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));

should be:

asm volatile("mov.b64 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));

Using a "d" (ie 64 bit floating point register) as the source in a 32 bit load is illegal (and a mov.b32 makes no sense here, the code must load 64 bits to two 32 bit registers).

answered Jun 07 '14 at 08:52

talonmies

70,661
34
192
269

@user3201449: Also worth pointing out that the local variable `warpSize` in that function aliases the built-in variable `warpSize`. That line should be removed. It won't break anything today, but it might one day, depending on what future architectures look like. – talonmies Jun 07 '14 at 10:23
@tolonmies:I removed it upon your direction.Thanks anyway! – taoyuan Jun 07 '14 at 12:13

score 4 · Answer 2 · answered Mar 08 '18 at 07:37

As of CUDA 9.0, __shfl, __shfl_up, __shfl_down and __shfl_xor have been deprecated.

The newly introduced functions __shfl_sync, __shfl_up_sync, __shfl_down_sync and __shfl_xor_sync have the following prototypes:

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int
width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);

where T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double.

You no longer need to create your own shuffle instructions for double-precision arithmetics.

CUDA's warp shuffle for double-precision data

2 Answers2