I think the answer gives the right total (13 cycles) but put the stall in the wrong instruction.
I5 doesn't need to stall; I4 (ADD R5, R1, R2
) produces R5 in time to forward it to the next instruction's EX for address calculation (LOAD R1, 0(R5)
). (Your 5-stage classic RISC pipeline has bypass forwarding).
But I6 reads the result of a load instruction, and loads produce their result a cycle later than the ALU in EX. So like I3, I6 needs to stall, not I5.
(I7 depends on I6, but I6 is an ALU instruction so it can forward without stalling.)
They stalls in the D stage because the ID stage can't fetch registers that the I2 / I5 load hasn't produced yet.
Separately from that, your diagram shows I4 (and what should be I7) not even being fetched when the previous instruction stalls. That doesn't make sense to me. At the start of that cycle, the pipeline doesn't even know that it needs to stall because it hasn't yet decoded I3 (and I6) and detected that it reads a not-ready register so an interlock is needed.
Fetch doesn't wait until after decoding the previous instruction to see if it stalled or not; that would defeat the entire purpose of pipelining. It should look like
I3 IF D D EX MEM WB
I4 IF IF D EX MEM WB
BTW, load latency is the reason that classic MIPS has a load-delay slot (unpredictable behaviour if you try to use a register in the next instruction after loading into it). Later MIPS added interlocks to stall if you do that, instead of making it an error, so you can keep static code-size smaller (no NOP filler) in cases where you can't find any other instruction to put in that slot. (And some even later MIPS did out-of-order exec which can hide latency.)