0

Is it possible to unroll a loop on a triangular region, such as:

for (int i = 0; i < ROW_LENGTH; i++)
{
    for (int j = 0; j < i; j++)
    {
        // Some array operation here
    }
}

where ROW_LENGTH is a constant defined at compile time? As it stands now, I don't think this is possible because i is changing as the program executes (and more importantly, it's not a constant at compile time). I suppose you could treat the 2D array as a 1D array, iterate from 0 to (ROW_LENGTH^2)/2, and then try a couple math tricks to get the indices, but the extra operations defeat the purpose of the loop unrolling in the first place.

AnimatedRNG
  • 1,859
  • 3
  • 26
  • 39
  • 2
    Is the purpose of the loop unrolling for code optimisation inside a kernel, or are you trying to turn the loop into an execution grid and are really looking for indexing rules? – talonmies Apr 28 '15 at 07:22
  • Is it possible? Yes. For every iteration of the loop the compiler knows the value of i and j. That doesn't mean the compiler will do it though. It's worth also noting that the additional operations may not defeat the point of unrolling. In a lot of CUDA applications arithmetic is not a meaningful contributor to runtime. – Jez Apr 28 '15 at 13:06
  • @talonmies I'm curious if it's possible to optimize this. ROW_LENGTH is going to be in the 100-300 range -- so it's small enough that memory isn't an issue, but large enough that I can't manually unroll the loops. – AnimatedRNG Apr 28 '15 at 22:12
  • @Jez How does the compiler know the values of i and j prior to runtime? While ROW_LENGTH is a constant, i (limit of the inner loop) is not, hence it can't optimize that loop. Unless the compiler uses a more complex method of predicting the values of i and j which allows variable bounds by assuming that the rules which determine bounds are determinable at compile time? – AnimatedRNG Apr 28 '15 at 22:31
  • 1
    If the first loop is unrolled then explicit values of `i` are used for each unrolled iteration. Therefore the compiler knows the bounds of `j` in all of these unrolled iterations. This allows it to unroll the second loop within the unrolled first loop. – Jez Apr 29 '15 at 09:38

1 Answers1

2

The CUDA 7.0 compiler will unroll this in my test. The loop indices are all known at compile time so there's no reason why it shouldn't be able to.

Consider the following code, which sets a triangular portion of a to be 1.

#define ROW_LENGTH 4
__global__ void triUnrollTest1(float* a) {
   #pragma unroll
   for (int i = 0; i < ROW_LENGTH; i++)
   {
      #pragma unroll
      for (int j = 0; j < i; j++)
      {
         a[i * ROW_LENGTH + j] = 1.f;
      }
   }
}

As ROW_LENGTH only 4 we can unroll this ourselves:

__global__ void triUnrollTest2(float* a) {
   a[1 * ROW_LENGTH + 0] = 1.f;
   a[2 * ROW_LENGTH + 0] = 1.f;
   a[2 * ROW_LENGTH + 1] = 1.f;
   a[3 * ROW_LENGTH + 0] = 1.f;
   a[3 * ROW_LENGTH + 1] = 1.f;
   a[3 * ROW_LENGTH + 2] = 1.f;
}

Compiling for SM 35 using CUDA 7.0: nvcc -arch=sm_35 -c triUnroll.cu

Then dumping the SASS assembler: cuobjdump --dump-sass triUnroll.o

We get:

code for sm_35
        Function : _Z14triUnrollTest1Pf
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                          /* 0x08b8b8a0b010a000 */
/*0008*/                   MOV R1, c[0x0][0x44];          /* 0x64c03c00089c0006 */
/*0010*/                   MOV R0, c[0x0][0x140];         /* 0x64c03c00281c0002 */
/*0018*/                   IADD R2.CC, R0, 0x10;          /* 0xc0840000081c0009 */
/*0020*/                   MOV32I R0, 0x3f800000;         /* 0x741fc000001fc002 */
/*0028*/                   IADD.X R3, RZ, c[0x0][0x144];  /* 0x60804000289ffc0e */
/*0030*/                   ST.E [R2], R0;                 /* 0xe4800000001c0800 */
/*0038*/                   ST.E [R2+0x10], R0;            /* 0xe4800000081c0800 */
                                                          /* 0x080000b810b8b8b8 */
/*0048*/                   ST.E [R2+0x14], R0;            /* 0xe48000000a1c0800 */
/*0050*/                   ST.E [R2+0x20], R0;            /* 0xe4800000101c0800 */
/*0058*/                   ST.E [R2+0x24], R0;            /* 0xe4800000121c0800 */
/*0060*/                   ST.E [R2+0x28], R0;            /* 0xe4800000141c0800 */
/*0068*/                   EXIT;                          /* 0x18000000001c003c */
/*0070*/                   BRA 0x70;                      /* 0x12007ffffc1c003c */
/*0078*/                   NOP;                           /* 0x85800000001c3c02 */
        .....................................


        Function : _Z14triUnrollTest2Pf
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                          /* 0x08b8b8a0b010a000 */
/*0008*/                   MOV R1, c[0x0][0x44];          /* 0x64c03c00089c0006 */
/*0010*/                   MOV R0, c[0x0][0x140];         /* 0x64c03c00281c0002 */
/*0018*/                   IADD R2.CC, R0, 0x10;          /* 0xc0840000081c0009 */
/*0020*/                   MOV32I R0, 0x3f800000;         /* 0x741fc000001fc002 */
/*0028*/                   IADD.X R3, RZ, c[0x0][0x144];  /* 0x60804000289ffc0e */
/*0030*/                   ST.E [R2], R0;                 /* 0xe4800000001c0800 */
/*0038*/                   ST.E [R2+0x10], R0;            /* 0xe4800000081c0800 */
                                                          /* 0x080000b810b8b8b8 */
/*0048*/                   ST.E [R2+0x14], R0;            /* 0xe48000000a1c0800 */
/*0050*/                   ST.E [R2+0x20], R0;            /* 0xe4800000101c0800 */
/*0058*/                   ST.E [R2+0x24], R0;            /* 0xe4800000121c0800 */
/*0060*/                   ST.E [R2+0x28], R0;            /* 0xe4800000141c0800 */
/*0068*/                   EXIT;                          /* 0x18000000001c003c */
/*0070*/                   BRA 0x70;                      /* 0x12007ffffc1c003c */
/*0078*/                   NOP;                           /* 0x85800000001c3c02 */
        .....................................

Obviously both are the same and nicely unrolled. Interestingly when I accidentally compiled with 6.5 for my first answer the compiler did not unroll, so I guess it pays to be up to date in this case!

Jez
  • 1,761
  • 11
  • 14