3

I have some kernel code which calls memcpy(my_dst, my_src, my_num_bytes) - and sometimes I have my_num_bytes equal to 0. Strangely enough, some sporadic experimentation (with Titan X, CUDA 7.5, driver 358.16) suggests data does get written into the destination when I use such a call.

  1. Have you encountered this behavior in CUDA?
  2. Is this specified anywhere? The programming guide entry for memcpy() doesn't say.
Alexis Wilke
  • 19,179
  • 10
  • 84
  • 156
einpoklum
  • 118,144
  • 57
  • 340
  • 684

1 Answers1

5

This would appear to be a bug in the (current, i.e. CUDA 7.5) implementation of device-side memcpy().

A kernel like this:

__global__ void kernel(char* source, char* dst, int len, int sz)
{

    int i = threadIdx.x * len;

    memcpy(source+i, dst+i, sz);
}

leads the toolchain to emit PTX like this:

        // .globl       _Z6kernelPcS_ii
.visible .entry _Z6kernelPcS_ii(
        .param .u64 _Z6kernelPcS_ii_param_0,
        .param .u64 _Z6kernelPcS_ii_param_1,
        .param .u32 _Z6kernelPcS_ii_param_2,
        .param .u32 _Z6kernelPcS_ii_param_3
)
{
        .reg .pred      %p<2>;
        .reg .b16       %rs<2>;
        .reg .b32       %r<4>;
        .reg .b64       %rd<15>;


        ld.param.u64    %rd7, [_Z6kernelPcS_ii_param_0];
        ld.param.u64    %rd8, [_Z6kernelPcS_ii_param_1];
        ld.param.u32    %r1, [_Z6kernelPcS_ii_param_2];
        cvta.to.global.u64      %rd9, %rd8;
        cvta.to.global.u64      %rd10, %rd7;
        mov.u32         %r2, %tid.x;
        mul.lo.s32      %r3, %r2, %r1;
        cvt.s64.s32     %rd11, %r3;
        add.s64         %rd1, %rd10, %rd11;
        add.s64         %rd2, %rd9, %rd11;
        mov.u64         %rd14, 0;
        ld.param.s32    %rd3, [_Z6kernelPcS_ii_param_3];

BB6_1:
        add.s64         %rd12, %rd2, %rd14;
        ld.global.u8    %rs1, [%rd12];
        add.s64         %rd13, %rd1, %rd14;
        st.global.u8    [%rd13], %rs1;
        add.s64         %rd14, %rd14, 1;
        setp.lt.u64     %p1, %rd14, %rd3;
        @%p1 bra        BB6_1;

        ret;
}

My reading is that this code will always copy at least one byte because the value of the length argument isn't tested until after the byte copy. Something like this:

BB6_1:
        setp.ge.u64     %p1, %rd14, %rd3;
        @%p1 bra        Done;
        add.s64         %rd12, %rd2, %rd14;
        ld.global.u8    %rs1, [%rd12];
        add.s64         %rd13, %rd1, %rd14;
        st.global.u8    [%rd13], %rs1;
        add.s64         %rd14, %rd14, 1;
        bra             BB6_1;
Done:

would probably work as expected.

einpoklum
  • 118,144
  • 57
  • 340
  • 684
talonmies
  • 70,661
  • 34
  • 192
  • 269