
What problems arise in the following assembly loop if branches are predicted not taken by default? Optimize the example for Predict Not Taken.

    addi $s1, $zero, 1024        # s1 := 1024
    loop: addi $s1, $s1, -1      # s1--
          jal  subroutine        # call subroutine()
          bne  $s1, $zero, loop  # if (s1 != 0) jump to loop

To me the most obvious answer is to copy the subroutine call 1024 times so that there are no conditional branches at all. Problem solved. But this seems too simple. Any ideas?
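
As a rough sketch of a middle ground, the loop could be unrolled by a small factor instead of all 1024 times. The factor of 4 below is an arbitrary choice for illustration, and the sketch assumes, as the original code appears to, that there are no branch delay slots and that `subroutine` preserves `$s1`:

```mips
        addi $s1, $zero, 1024     # s1 := 1024 (a multiple of the unroll factor 4)
loop:   jal  subroutine           # four calls per pass through the loop body
        jal  subroutine
        jal  subroutine
        jal  subroutine
        addi $s1, $s1, -4         # four iterations done
        bne  $s1, $zero, loop     # executed 256 times: taken 255 times, falls through once
```

With this shape the `bne` is taken 255 times, so a Predict Not Taken machine mispredicts it 255 times rather than the 1023 times of the original loop.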

Iwan5050
  • In general, you have two choices when branching on a condition: branch if the condition is true, or branch if it is false, right? Consider whether the code stays "in the loop" or "exits the loop" under Predict Not Taken in each of those cases, and look at the total amount of work done in each case... – BadZen May 12 '22 at 20:25
  • Unrolling the loop 100% is an appropriate thought, and will get rid of branch prediction (incorrect and otherwise). There's probably a trade-off with I-cache misses, though, so some intermediate degree of unrolling might be best: maybe unroll 64 times and suffer the mispredicted backward branch ~16 times. YMMV depending on cache sizes. – Erik Eidt May 12 '22 at 21:55
  • If you maintain some looping, it should be able to handle a variable count vs. constant count as well. (See, for example, [Duff's device](https://en.wikipedia.org/wiki/Duff's_device), but other approaches can work to handle the leftover after modulo 64 iterations.) – Erik Eidt May 12 '22 at 22:44
  • @ErikEidt: I assume they want you to use an unconditional `j` or `b` at the bottom of the loop, and an `if()break` inside the loop. That is, like a `while(){}` loop translated naively, rather than the form optimized for normal CPUs, with the conditional branch at the bottom, that this code has. Of course, unrolling will also help mitigate the overhead of using a less-efficient loop structure to accommodate a horrible hypothetical CPU that doesn't even statically predict backward jumps as taken. – Peter Cordes May 12 '22 at 23:52
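
The restructuring described in the last comment can be sketched as follows. This is only one possible reading of it, under the same assumptions the original code appears to make (no branch delay slots, `subroutine` preserves `$s1`); the label `done` is made up for illustration:

```mips
        addi $s1, $zero, 1024     # s1 := 1024
loop:   addi $s1, $s1, -1         # s1--
        jal  subroutine           # call subroutine()
        beq  $s1, $zero, done     # exit test: falls through 1023 times, taken once
        j    loop                 # unconditional jump back, no direction to predict
done:                             # execution continues here after 1024 calls
```

Here the only conditional branch, `beq`, is not taken on 1023 of its 1024 executions, so Predict Not Taken is wrong only on the final loop exit. The price is one extra `j` per iteration, which on some pipelines may still cost a fetch bubble even though there is no branch direction to predict.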
