
(There are similar questions about device (global) memory arrays, e.g. my own question.)

Suppose I have a CUDA kernel code going like this:

float my_arr[MyCompileTimeConstant];

/* ... */

for(unsigned i = 0; i < foo(); i++) {
   my_arr[bar(i)] += some_value;
}

Now, I want to initialize `my_arr` to all-zeros before I start adding to its entries. Can I do any better than the trivial loop below?

for(unsigned i = 0; i < MyCompileTimeConstant; i++) {
   my_arr[i] = 0;
}


Note: I specifically made the loop range and array size constants known at compile time. The question would have been slightly different had they been passed at run time. Of course, it may not change the answer for CUDA the way it does for code running on the CPU.
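For concreteness, a minimal complete version of the fragments above might look like the following. The element type, the kernel signature and the bound of 100 are all assumptions made purely for illustration:

const unsigned MyCompileTimeConstant = 100;

__global__ void my_kernel(const unsigned *indices, const float *values,
                          unsigned n, float *out)
{
    float my_arr[MyCompileTimeConstant];

    // The zero-initialization this question asks about
    for(unsigned i = 0; i < MyCompileTimeConstant; i++) {
        my_arr[i] = 0;
    }

    for(unsigned i = 0; i < n; i++) {
        my_arr[indices[i]] += values[i];
    }

    // Write something out so the array isn't optimized away
    out[threadIdx.x] = my_arr[n % MyCompileTimeConstant];
}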

einpoklum
  • I believe `memset` will also work, although it's not [officially documented](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations) AFAIK. Whether it's "better" or not, I don't know. – Robert Crovella May 17 '14 at 14:48
  • @RobertCrovella: Like `memcpy`, it will be worse because the compiler emits simple loops doing byte-sized transfers. – talonmies May 17 '14 at 14:58
  • @talonmies I don't know about GPU code generation, but at least for CPU code that's not true in general. Compilers recognize `memcpy` and the like. For small constant sizes they are inlined, unrolled and optimized as well as any ordinary assignment. For large dynamic sizes the implementation in the run-time library is usually vectorized with only minimal overhead (it need only check alignment and perhaps copy a few bytes manually to fix it). – May 17 '14 at 15:41
  • @delnan: I was specifically referring to the NVIDIA GPU compiler, as witnessed here: http://stackoverflow.com/a/10468720/681865 . This is a CUDA question and I am only referring to that. – talonmies May 17 '14 at 15:58

1 Answer


A simple loop should be the "best" approach (but see final comment below). Using the following kernel as an example:

template<int version>
__global__
void tkernel(int *A, int *B, int *C, int n)
{
    int biglocal[100];

    switch(version) {
        case 1:
            // Simple scalar loop: one 32 bit store per element
            for(int i=0; i<100; i++) {
                biglocal[i] = 0;
            }
            break;

        case 2:
            // Device-side memset
            memset(&biglocal[0], 0, 100*sizeof(int));
            break;

        case 3:
            // Hand-vectorised version using 128 bit int4 stores
            const int4 zero = {0, 0, 0, 0};
            int4 *p = reinterpret_cast<int4*>(&biglocal[0]);
#pragma unroll
            for(int i=0; i<100/4; i++) {
                p[i] = zero;
            }
            break;
    }

    // Consume the array so the compiler cannot optimise the
    // initialisation (or the array itself) away
    if (n>0) {
        for(int i=0; i<100; i++) {
            biglocal[A[threadIdx.x*i]] += B[threadIdx.x*i];
        }
        C[threadIdx.x] = biglocal[n];
    }
}

template __global__ void tkernel<1>(int *, int *, int *, int);
template __global__ void tkernel<2>(int *, int *, int *, int);
template __global__ void tkernel<3>(int *, int *, int *, int);

Here we have three different ways to zero a large local memory array, plus some code to convince the compiler that the whole initialisation sequence and local array shouldn't be optimised away.
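(To reproduce this, the PTX can be dumped with something like `nvcc -arch=sm_21 -ptx tkernel.cu`; the source file name here is an assumption.)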

Looking at the PTX emitted for compute 2.1 targets with the CUDA 6 release compiler, both versions 1 & 3 look like this:

.local .align 4 .b8     __local_depot0[400];
.reg .b64   %SP;
.reg .b64   %SPL;
.reg .pred  %p<3>;
.reg .s32   %r<67>;
.reg .s64   %rd<73>;


mov.u64     %SPL, __local_depot0;
ld.param.u64    %rd4, [_Z7tkernelILi1EEvPiS0_S0_i_param_0];
ld.param.u64    %rd5, [_Z7tkernelILi1EEvPiS0_S0_i_param_1];
ld.param.u64    %rd6, [_Z7tkernelILi1EEvPiS0_S0_i_param_2];
ld.param.u32    %r21, [_Z7tkernelILi1EEvPiS0_S0_i_param_3];
add.u64     %rd7, %SPL, 0;
mov.u32     %r66, 0;
st.local.u32    [%rd7], %r66;
st.local.u32    [%rd7+4], %r66;
st.local.u32    [%rd7+8], %r66;
st.local.u32    [%rd7+12], %r66;
st.local.u32    [%rd7+16], %r66;
st.local.u32    [%rd7+20], %r66; 

    // etc

i.e. the compiler unrolled the loop and emitted a string of 32 bit store instructions. The int4 trick in version 3 produced code identical to that of the simple loop, which is a little surprising. Version 2, however, gets this:

.local .align 4 .b8     __local_depot1[400];
.reg .b64   %SP;
.reg .b64   %SPL;
.reg .pred  %p<4>;
.reg .s16   %rs<2>;
.reg .s32   %r<66>;
.reg .s64   %rd<79>;


mov.u64     %SPL, __local_depot1;
ld.param.u64    %rd7, [_Z7tkernelILi2EEvPiS0_S0_i_param_0];
ld.param.u64    %rd8, [_Z7tkernelILi2EEvPiS0_S0_i_param_1];
ld.param.u64    %rd9, [_Z7tkernelILi2EEvPiS0_S0_i_param_2];
ld.param.u32    %r21, [_Z7tkernelILi2EEvPiS0_S0_i_param_3];
add.u64     %rd11, %SPL, 0;
mov.u64     %rd78, 0;

BB1_1:
add.s64     %rd12, %rd11, %rd78;
mov.u16     %rs1, 0;
st.local.u8     [%rd12], %rs1;
add.s64     %rd78, %rd78, 1;
setp.lt.u64 %p1, %rd78, 400;
@%p1 bra    BB1_1;

i.e. a loop performing 8 bit writes (the comments indicate that simple list initialisation will also yield this type of copy loop). The latter will be a lot slower than the former. Apart from the size difference of the stores, the stores in the unrolled stream are fully independent and can be issued in whatever order keeps the instruction pipeline full, which should lead to higher instruction throughput. I don't believe it is possible to beat the compiler in the unrolled case, and a simple loop yields the same code as a simple attempt at vectorisation. If you were really keen, I guess you could try inline PTX to generate wider stores. I don't know whether there would be any performance advantage in doing so.
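For illustration, that inline PTX idea might look something like the sketch below. This is an untested assumption, not part of the benchmarked code above: the helper name zero4 is made up, and it uses a generic st.v4.u32 store (PTX st without a state space suffix uses generic addressing, so the hardware resolves the address into local memory) because the C pointer available in the kernel is a generic address. Note the array would also need 16 byte alignment for the 128 bit stores to be legal.

// Hypothetical helper: one 128 bit generic store zeroing four
// consecutive ints. Requires a 16 byte aligned address, e.g.
// __align__(16) int biglocal[100];
__device__ __forceinline__ void zero4(int *p)
{
    asm volatile("st.v4.u32 [%0], {%1, %1, %1, %1};"
                 :
                 : "l"(p), "r"(0)
                 : "memory");
}

Usage inside the kernel (100 is divisible by 4):

#pragma unroll
for(int i=0; i<100; i+=4) {
    zero4(&biglocal[i]);
}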

talonmies
  • For completeness I also tried initialization lists like `int biglocal[100] = {0};`, which gives the same PTX as the memset variant... – kunzmi May 18 '14 at 13:55
  • @kunzmi: Thanks for the additional data point, that is useful to know. I edited a note into the answer reflecting this. Presumably the memset-style code is using 32 bit rather than 8 bit word transfers? – talonmies May 18 '14 at 14:44
  • I re-checked this, and it depends on what you actually do. The important point is that the list provided is not complete. The array members given explicitly are set as in the loop method (with 32 bit words), whereas the remaining array members are set to zero "memset-style" using 8 bit words. Since in my previous example the value given was also zero, the first step is optimized away by the compiler and only the set-to-zero part remains. – kunzmi May 18 '14 at 15:13
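To illustrate kunzmi's last point with a hypothetical example (the values here are made up):

int biglocal[100] = {1, 2, 3};
// elements 0..2: explicit initialisers, stored with 32 bit words
// elements 3..99: implicitly zero, filled by a memset-style 8 bit loop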