Why does the following loop unrolling lead to a wrong result?

Question

I am currently trying to optimize some MIPS assembler that I've written for a program that triangulates a 24x24 matrix. My current goal is to utilize delayed branching and manual loop unrolling to try and cut down on the cycles. Note: I am using 32-bit single precision for all the matrix arithmetic.

Part of the algorithm involves the following loop that I'm trying to unroll (N will always be 24)

...
    float inv = 1/A[k][k];
    for (j = k + 1; j < N; j++) {
        /* divide by pivot element */
        A[k][j] = A[k][j] * inv;
    }
...

I want

...
    float inv = 1/A[k][k];
    for (j = k + 1; j < N; j +=2) {
        /* divide by pivot element */
        A[k][j]     = A[k][j]     * inv;
        A[k][j + 1] = A[k][j + 1] * inv;
    }
...

but it generates an incorrect result and I don't know why. Interestingly, the unrolled version produces the first row of the matrix correctly, but the remaining rows are wrong. The version without loop unrolling triangulates the matrix correctly.

Here is my attempt at doing it.

...

# No loop unrolling
loop_2:
    move    $a3, $t2          # column number b = j (getelem A[k][j])
    jal     getelem           # Addr of A[k][j] in $v0 and val in $f0
    addiu   $t2, $t2, 1       ## j += 1
    mul.s   $f0, $f0, $f2     # Perform A[k][j] * inv
    bltu    $t2, 24, loop_2   # if j < N, jump to loop_2
    swc1    $f0, 0($v0)       ## Perform A[k][j] := A[k][j] * inv

    # The matrix triangulates without problem with this original code.

...

...

# One loop unrolling
loop_2:
    move    $a3, $t2         # column number b = j (getelem A[k][j])
    jal     getelem          # Addr of A[k][j] in $v0 and val in $f0
    addiu   $t2, $t2, 2      ## j += 2
    lwc1    $f1, 4($v0)      # $f1 <- A[k][j + 1]
    mul.s   $f0, $f0, $f2    # Perform A[k][j] * inv
    mul.s   $f1, $f1, $f2    # Perform A[k][j+1] * inv
    swc1    $f0, 0($v0)      # Perform A[k][j] := A[k][j] * inv
    bltu    $t2, 24, loop_2  # if j < N, jump to loop_2
    swc1    $f1, 4($v0)      ## Perform A[k][j + 1] := A[k][j + 1] * inv

    # With this once-unrolled loop, the first row of the resulting matrix is correct, but the remaining rows are not.

...

In your unrolled version, are you sure that `k+1` and `N` are of the same parity (both even or both odd)? — Nate Eldredge, Apr 28 '22 at 14:38
The original C is potentially buggy. `j < N; j +=2` can start the loop body with `j = N-1`, accessing the array at `A[k][N-1]` (fine) and `A[k][N]` (not fine). The usual method is `j < N-1`, or in general `j < N-(unroll-1)`. But for unsigned N, you also have to separately check `N >= unroll` because `N-1` could wrap to a huge unsigned value. If `j` was starting at `0`, it would be always fine if N was a multiple of the unroll factor. But here it's different, as Nate points out. — Peter Cordes, Apr 28 '22 at 14:38
Aaaaaaaaaah, of course!!! I completely forgot about this aspect. Feel free any of you to post an actual answer and I'll accept. — NoName123, Apr 28 '22 at 14:40
BTW, added a section about your MIPS asm which should just use a pointer increment, not a function call inside the unrolled loop. — Peter Cordes, Apr 28 '22 at 15:05

Peter Cordes · Accepted Answer · 2022-04-28T15:04:32.517

The unrolled C loop condition is buggy.

`j < N; j += 2` can start the loop body with `j = N-1`, accessing the array at `A[k][N-1]` (fine) and `A[k][N]` (not fine).

One common method is `j < N-1`, or in general `j < N-(unroll-1)`. But for unsigned `N`, you also have to separately check `N >= unroll` before starting the loop, because `N-1` could wrap to a huge unsigned value.
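A minimal C sketch of that fix, assuming a fixed `N` of 24 as in the question. The function name `scale_row_unrolled` is made up for illustration. The main loop stops one element early, and a cleanup step handles the leftover element when the number of elements is odd.

```c
#include <assert.h>
#include <stddef.h>

#define N 24  /* matrix dimension, as in the question */

/* Scale the part of row k to the right of the pivot by inv, unrolled by 2.
   (Hypothetical helper, not the asker's actual code.) */
void scale_row_unrolled(float A[N][N], size_t k, float inv)
{
    assert(k < N);
    size_t j = k + 1;

    /* N is a compile-time constant >= 2 here, so N - 1 cannot wrap.
       With a runtime unsigned N, check N >= 2 before this loop. */
    for (; j < N - 1; j += 2) {
        A[k][j]     *= inv;
        A[k][j + 1] *= inv;
    }

    /* Cleanup: at most one element is left over when the count is odd. */
    if (j < N) {
        A[k][j] *= inv;
    }
}
```

For `k = 0` the main loop handles columns 1 through 22 in pairs and the cleanup handles column 23, so nothing past the end of the row is touched.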

Keeping the condition in the form `j < limit` is generally good for C compilers, versus `j + 1 < N`, which is a separate thing they'd have to calculate. The `j + 1 < N` form can also stop a compiler from proving that the loop isn't infinite for unsigned counters (like `size_t`), because unsigned overflow is well-defined as wrapping around, so `N = UINT_MAX` could lead to the condition always being true, depending on the starting point. (E.g. `j = UINT_MAX-2` makes `UINT_MAX-1 < UINT_MAX`, and `j += 2` then makes `0 < UINT_MAX`, also true.) So it's a similar problem to using `j <= limit` for unsigned counters. Compilers really like to know when a loop is potentially infinite; for some, auto-vectorization is disabled if the trip count isn't calculable ahead of the first iteration.
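To see the wraparound concretely, here is a small sketch with a hypothetical helper, `iterations_with_j_plus_1`, that runs the `j + 1 < n` form with `unsigned` counters and gives up after `max_iters` iterations so that the demonstration itself terminates.

```c
#include <assert.h>
#include <limits.h>

/* Run  for (j = start; j + 1 < n; j += 2)  and count iterations,
   stopping early once max_iters is reached.
   (Hypothetical helper, for illustration only.) */
unsigned iterations_with_j_plus_1(unsigned start, unsigned n,
                                  unsigned max_iters)
{
    assert(max_iters > 0);
    unsigned count = 0;

    for (unsigned j = start; j + 1 < n; j += 2) {
        /* j + 1 wraps to 0 when j == UINT_MAX, so the test stays true. */
        if (++count == max_iters) {
            break;  /* cap reached: the loop would otherwise keep going */
        }
    }
    return count;
}
```

Starting at `UINT_MAX - 2` with `n = UINT_MAX`, the condition never becomes false and only the cap stops the loop. Starting at 0 with `n = 24`, it runs 12 times and exits normally.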

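If `j` started at 0, you could get away with a sloppy condition as long as `N` was guaranteed to be a multiple of the unroll factor. But here it's different, as Nate points out: the loop starts at `j = k + 1`, so with `N = 24` the number of elements `N - (k + 1)` is odd for every even `k`.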

Efficiency of your MIPS asm

Generally, the point of loop unrolling is performance. A non-inline call to a helper function inside the loop kind of defeats the purpose.

I assume `jal getelem` does a bunch of multiplies and stuff to redo the indexing from a pointer and two integers? Notice that you're scanning along contiguous memory in one row, so you can just increment a pointer.

Calculate an end-pointer to compare against, so your MIPS loop can look like

 # some checking outside the loop, maybe with a bxx to the end of it.
 looptop:                  # do{

    lwc1   $f2, 0($t0)
    lwc1   $f3, 4($t0)
    addiu  $t0, $t0, 4*2      # p+=2     advance by 8 bytes, 2 floats
    ...
    swc1   something, -8($t0)     # negative offsets: $t0 was already advanced
    swc1   something, -4($t0)
    bne    $t0, $t1, looptop      # }while(p!=endp)

   # maybe another condition to check if you should run one last iteration.

MIPS `bltu` is only a pseudo-instruction (`sltu`/`bnez`); that's why it's better to calculate an exact end-pointer so you can use a single machine instruction as the loop branch.

And yes, this might mean rounding the iteration count down to a multiple of 2 to ensure correctness, or doing one scalar iteration first and then rounding up to a multiple of 2, e.g. `x++` / `x &= -2;`
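In C terms, the pointer-increment loop with an exact end-pointer, plus one scalar iteration when the count is odd, could look roughly like this. It is a sketch under assumptions: `scale_span` is a made-up name, and it takes a pointer to the first element and a count rather than the asker's `k` and `j`.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply count contiguous floats starting at p by inv, unrolled by 2.
   (Hypothetical helper; names are not from the question.) */
void scale_span(float *p, size_t count, float inv)
{
    assert(p != NULL);
    float *endp = p + count;   /* exact end-pointer, computed once */

    if (count & 1) {           /* odd count: peel one scalar iteration */
        *p++ *= inv;
    }

    /* The remaining count is even, so p lands exactly on endp and a
       plain != comparison is enough for the loop branch. */
    while (p != endp) {
        p[0] *= inv;
        p[1] *= inv;
        p += 2;                /* advance by 2 floats (8 bytes) */
    }
}
```

For the loop in the question, the call would be something like `scale_span(&A[k][k + 1], N - (k + 1), inv)`, replacing the per-element `getelem` call with a single address calculation per row.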

With software pipelining, e.g. doing a load and multiply but not a store yet, you could maybe let the rounding-up have the loop redo that element if the count is odd. (That makes sense if the chance of a branch mispredict costs more than an FP multiply and a redundant store.) I haven't fully thought this through, but it's a similar idea to SIMD doing a first unaligned vector, then a potentially-partially-overlapping aligned vector. (SIMD vectorization is like unrolling, but then you roll back up into a single instruction that does 4 elements, for example.)
