I am currently trying to optimize some MIPS assembler that I've written for a program that triangulates a 24x24 matrix. My current goal is to utilize delayed branching and manual loop unrolling to try and cut down on the cycles. Note: I am using 32-bit single precision for all the matrix arithmetic.
Part of the algorithm involves the following loop that I'm trying to unroll (N will always be 24)
...
float inv = 1/A[k][k]
for (j = k + 1; j < N; j++) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
}
...
I want
...
float inv = 1/A[k][k]
for (j = k + 1; j < N; j +=2) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
A[k][j + 1] = A[k][j + 1] * inv;
}
...
but it generates the incorrect result and I don't know why. The interesting thing is that the version with loop unrolling generates the first row of matrix correctly but the remaining ones incorrect. The version without loop unrolling correctly triangulates the matrix.
Here is my attempt at doing it.
...
# No loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 1 ## j += 2
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f0, 0($v0) ## Perform A[k][j] := A[k][j] * inv
# The matrix triangulates without problem with this original code.
...
...
# One loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 2 ## j += 2
lwc1 $f1, 4($v0) # $f1 <- A[k][j + 1]
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
mul.s $f1, $f1, $f2 # Perform A[k][j+1] * inv
swc1 $f0, 0($v0) # Perform A[k][j] := A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f1, 4($v0) ## Perform A[k][j + 1] := A[k][j + 1] * inv
# The first row in the resulting matrix is correct, but the remaining ones not when using this once unrolled loop code.
...