Does a processor stall even with (theoretically) perfect branch prediction, irrespective of whether the branch is taken or not taken?

Question

I am going through the textbook Computer Organization and Design and I am a bit confused about branch prediction and how it works in a 5-stage pipeline: IF, ID, EX, MEM, WB.

Consider the following sequence of instructions:

TOP: SUB X2, X2, X3
.
.

B.NE TOP
ADD X1, X1, X2

Take the first case: no branch prediction, with all possible forwarding paths. As per the textbook, when the branch to TOP is taken, the processor incurs a penalty of 1 stall cycle. This is because the instruction fetched after B.NE is the ADD, when it should really have been the SUB. The processor realizes that it fetched the wrong instruction only at the end of the ID stage of B.NE, and hence has to turn the remaining stages of ADD into a nop (by the end of the ID stage it has also calculated the correct address to fetch the next instruction from). So the pipeline in this case looks something like this:

However, if the branch is not taken, there are no stalls, because the next instruction is correctly the ADD and execution proceeds normally.
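The two no-prediction cases can be sketched as a small timing model. This is only an illustration under the assumptions stated in the question (the branch outcome and target are known at the end of ID, and the fall-through instruction is always fetched). The names `fetch_sequence`, `bubbles`, and `render` are made up for this sketch and do not come from the textbook.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def fetch_sequence(taken):
    """Instructions fetched after B.NE when the fall-through path is always fetched.

    Assumption: the branch outcome and target are known at the end of ID,
    as described in the question. Returns (name, IF cycle, squashed) tuples.
    """
    rows = [("B.NE", 1, False)]
    if taken:
        rows.append(("ADD", 2, True))    # wrong-path fetch, squashed into a bubble
        rows.append(("SUB", 3, False))   # branch target, fetched one cycle late
    else:
        rows.append(("ADD", 2, False))   # fall-through was the right path
    return rows

def bubbles(rows):
    """Number of squashed (wasted) fetch slots."""
    return sum(1 for _, _, squashed in rows if squashed)

def render(rows):
    """Draw one text row per instruction, one column per clock cycle."""
    lines = []
    for name, start, squashed in rows:
        cells = (["IF"] + ["--"] * 4) if squashed else STAGES
        pad = ["   "] * (start - 1)
        lines.append(f"{name:<5}" + " ".join(pad + [f"{c:<3}" for c in cells]))
    return "\n".join(lines)

print(render(fetch_sequence(taken=True)))
```

In this model the taken case wastes one fetch slot (the squashed ADD) and the not-taken case wastes none, which matches the one-cycle penalty described above.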

Now consider the same instructions and the same processor, but with perfect branch prediction. Assume the branch is taken. The processor would know that the instruction is a branch only during the ID stage of B.NE, and branch prediction would kick in only after that. By that time, the ADD instruction is already in the pipeline, so there would still be a penalty of 1 stall cycle. So what is the advantage of having branch prediction at all? I am clearly missing something.

So I think my confusion is about where exactly in the pipeline branch prediction kicks in.

Fun fact: In real CPUs, branch prediction is used in multiple stages, including fetch itself. Given that you just fetched block X, what block do you fetch next? So you need a prediction before even decoding any of the instructions in that block to see if any of them are branches. And then separately you need a branch-target or taken/not-taken prediction for every indirect or conditional branch instruction. — Peter Cordes, Dec 11 '19 at 16:44
But anyway, your question seems to be about branch latency. If there's only 1 cycle of branch latency, the `add` won't reach EX before `B.NE` decides if it's taken or not so the CPU shouldn't need to stall in the not-taken case. It can start decoding it and discard it if it turns out it shouldn't run it, even if it's "expensive" like a multiply. Fun fact: MIPS uses a branch-delay slot to hide branch latency, avoiding the need for prediction on classic MIPS pipelines. [Why does MIPS use one delay slot instead of two?](//stackoverflow.com/a/58425156) — Peter Cordes, Dec 11 '19 at 16:48
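To make the "prediction during fetch" point from the comments concrete, here is a minimal sketch of a branch target buffer that is looked up with the fetch address alone, before the instruction is decoded. Everything here is a hypothetical toy model: the class `FetchPredictor`, the function `run_loop`, the fixed 4-byte instruction size, and the addresses `0x100` and `0x110` are all assumptions for illustration, and real predictors are considerably more elaborate.

```python
INSTR_BYTES = 4  # assumption: fixed 4-byte instructions

class FetchPredictor:
    """Toy branch target buffer indexed by the fetch PC (hypothetical model)."""

    def __init__(self):
        self.btb = {}  # PC of a branch -> predicted target address

    def next_pc(self, pc):
        # Consulted during IF, before the instruction at `pc` is decoded.
        return self.btb.get(pc, pc + INSTR_BYTES)

    def update(self, pc, taken, target):
        # Called once the branch resolves (end of ID in the question's pipeline).
        if taken:
            self.btb[pc] = target
        else:
            self.btb.pop(pc, None)

def run_loop(iterations, top=0x100, branch_pc=0x110):
    """Count wrong-path fetches for a loop branch at branch_pc that jumps to top."""
    pred = FetchPredictor()
    wrong_fetches = 0
    for i in range(iterations):
        taken = i < iterations - 1              # the loop exits on the last iteration
        actual = top if taken else branch_pc + INSTR_BYTES
        if pred.next_pc(branch_pc) != actual:
            wrong_fetches += 1                  # one squashed fetch per misprediction
        pred.update(branch_pc, taken, top)
    return wrong_fetches
```

In this sketch a correctly predicted taken branch costs no bubble, because the target address is already available in the cycle after the branch is fetched. A theoretically perfect predictor would never take the mispredict path at all; this simple last-outcome table still misses on the first and last iterations of the loop.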
