The CUDA 7.0 compiler will unroll this in my test. The loop indices are all known at compile time so there's no reason why it shouldn't be able to.
Consider the following code, which sets a triangular portion of a to be 1.
#define ROW_LENGTH 4
__global__ void triUnrollTest1(float* a) {
#pragma unroll
for (int i = 0; i < ROW_LENGTH; i++)
{
#pragma unroll
for (int j = 0; j < i; j++)
{
a[i * ROW_LENGTH + j] = 1.f;
}
}
}
As ROW_LENGTH
only 4 we can unroll this ourselves:
__global__ void triUnrollTest2(float* a) {
a[1 * ROW_LENGTH + 0] = 1.f;
a[2 * ROW_LENGTH + 0] = 1.f;
a[2 * ROW_LENGTH + 1] = 1.f;
a[3 * ROW_LENGTH + 0] = 1.f;
a[3 * ROW_LENGTH + 1] = 1.f;
a[3 * ROW_LENGTH + 2] = 1.f;
}
Compiling for SM 35 using CUDA 7.0:
nvcc -arch=sm_35 -c triUnroll.cu
Then dumping the SASS assembler:
cuobjdump --dump-sass triUnroll.o
We get:
code for sm_35
Function : _Z14triUnrollTest1Pf
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/* 0x08b8b8a0b010a000 */
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x64c03c00089c0006 */
/*0010*/ MOV R0, c[0x0][0x140]; /* 0x64c03c00281c0002 */
/*0018*/ IADD R2.CC, R0, 0x10; /* 0xc0840000081c0009 */
/*0020*/ MOV32I R0, 0x3f800000; /* 0x741fc000001fc002 */
/*0028*/ IADD.X R3, RZ, c[0x0][0x144]; /* 0x60804000289ffc0e */
/*0030*/ ST.E [R2], R0; /* 0xe4800000001c0800 */
/*0038*/ ST.E [R2+0x10], R0; /* 0xe4800000081c0800 */
/* 0x080000b810b8b8b8 */
/*0048*/ ST.E [R2+0x14], R0; /* 0xe48000000a1c0800 */
/*0050*/ ST.E [R2+0x20], R0; /* 0xe4800000101c0800 */
/*0058*/ ST.E [R2+0x24], R0; /* 0xe4800000121c0800 */
/*0060*/ ST.E [R2+0x28], R0; /* 0xe4800000141c0800 */
/*0068*/ EXIT; /* 0x18000000001c003c */
/*0070*/ BRA 0x70; /* 0x12007ffffc1c003c */
/*0078*/ NOP; /* 0x85800000001c3c02 */
.....................................
Function : _Z14triUnrollTest2Pf
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/* 0x08b8b8a0b010a000 */
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x64c03c00089c0006 */
/*0010*/ MOV R0, c[0x0][0x140]; /* 0x64c03c00281c0002 */
/*0018*/ IADD R2.CC, R0, 0x10; /* 0xc0840000081c0009 */
/*0020*/ MOV32I R0, 0x3f800000; /* 0x741fc000001fc002 */
/*0028*/ IADD.X R3, RZ, c[0x0][0x144]; /* 0x60804000289ffc0e */
/*0030*/ ST.E [R2], R0; /* 0xe4800000001c0800 */
/*0038*/ ST.E [R2+0x10], R0; /* 0xe4800000081c0800 */
/* 0x080000b810b8b8b8 */
/*0048*/ ST.E [R2+0x14], R0; /* 0xe48000000a1c0800 */
/*0050*/ ST.E [R2+0x20], R0; /* 0xe4800000101c0800 */
/*0058*/ ST.E [R2+0x24], R0; /* 0xe4800000121c0800 */
/*0060*/ ST.E [R2+0x28], R0; /* 0xe4800000141c0800 */
/*0068*/ EXIT; /* 0x18000000001c003c */
/*0070*/ BRA 0x70; /* 0x12007ffffc1c003c */
/*0078*/ NOP; /* 0x85800000001c3c02 */
.....................................
Obviously both are the same and nicely unrolled. Interestingly when I accidentally compiled with 6.5 for my first answer the compiler did not unroll, so I guess it pays to be up to date in this case!