When does the pipeline take 2 decode stages when there is a RAW dependency in 2 successive instructions

Question

Consider a RISC pipeline with 5 stages: Instruction Fetch, Decode, Execute, Memory and Write Back. Find how many cycles are required for the instructions given below. Assume operand forwarding is used, and that branch prediction is used and the branch is not taken. ACS is the branch instruction.

I1: ACS R0, R1, X
I2: LOAD R2, 0(R3)
I3: SUB R4, R2, R2
I4: X: ADD R5, R1, R2
I5: LOAD R1, 0(R5)
I6: SUB R1, R1, R4
I7: ADD R1, R1, R5

A. 11
B. 12
C. 13
D. 14

Solution:

In the solution, I couldn't understand why they have neglected the 2 DECODE cycles in I6 and I7, although they have a RAW dependency.

Source of the question: Question 41 of https://practice.geeksforgeeks.org/contest-quiz/sudo-gate-2020-mock-iii

Ah, GeeksForGeeks. That site is well known for having mistakes. There's some good stuff on there, but without some form of quality control (like Stack Overflow's upvote / downvote) or other peer review / editors, you can't tell which stuff is good, or good with confusing minor mistakes, or just plain misleading. Sometimes different parts of the same article fall into different categories of quality. — Peter Cordes, Feb 02 '20 at 08:49

score 2 · Answer 1 · answered Feb 02 '20 at 05:53

I think the answer gives the right total (13 cycles) but puts the stall in the wrong instruction.

I5 doesn't need to stall; I4 (ADD R5, R1, R2) produces R5 in time to forward it to the next instruction's EX for address calculation (LOAD R1, 0(R5)). (Your 5-stage classic RISC pipeline has bypass forwarding).

But I6 reads the result of a load instruction, and loads produce their result a cycle later than the ALU in EX. So like I3, I6 needs to stall, not I5.

(I7 depends on I6, but I6 is an ALU instruction so it can forward without stalling.)

I3 and I6 stall in the D stage because the ID stage can't fetch a register that the preceding load (I2 / I5) hasn't produced yet.
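The reasoning above can be checked with a short Python sketch. It is not taken from the quiz solution: the tuple encoding of the instructions and the helper names `load_use_stalls` and `total_cycles` are made up for illustration. It assumes full forwarding, a correctly predicted not-taken branch with no penalty, and that the only stall is one cycle when an instruction reads the destination of the load right before it.

```python
# Hypothetical encoding for this sketch: (name, opcode, destination or None, sources).
# ACS (the branch) is assumed to read R0 and R1 and write no register.
PROGRAM = [
    ("I1", "ACS",  None, ("R0", "R1")),
    ("I2", "LOAD", "R2", ("R3",)),
    ("I3", "SUB",  "R4", ("R2", "R2")),
    ("I4", "ADD",  "R5", ("R1", "R2")),
    ("I5", "LOAD", "R1", ("R5",)),
    ("I6", "SUB",  "R1", ("R1", "R4")),
    ("I7", "ADD",  "R1", ("R1", "R5")),
]

PIPELINE_DEPTH = 5  # IF, D, EX, MEM, WB


def load_use_stalls(program):
    """Names of instructions that read the result of the load immediately before them."""
    stalled = []
    for prev, cur in zip(program, program[1:]):
        _, prev_op, prev_dest, _ = prev
        name, _, _, srcs = cur
        if prev_op == "LOAD" and prev_dest in srcs:
            stalled.append(name)
    return stalled


def total_cycles(program):
    """Cycles to drain the pipeline: one per instruction, plus fill, plus stalls."""
    return len(program) + (PIPELINE_DEPTH - 1) + len(load_use_stalls(program))


print(load_use_stalls(PROGRAM))  # ['I3', 'I6']
print(total_cycles(PROGRAM))     # 13
```

I5 and I7 don't show up in the list because their producers (I4 and I6) are ALU instructions, whose results can be forwarded from EX without a bubble.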

Separately from that, your diagram shows I4 (and what should be I7) not even being fetched when the previous instruction stalls. That doesn't make sense to me. At the start of that cycle, the pipeline doesn't even know that it needs to stall because it hasn't yet decoded I3 (and I6) and detected that it reads a not-ready register so an interlock is needed.

Fetch doesn't wait until after decoding the previous instruction to see if it stalled or not; that would defeat the entire purpose of pipelining. It should look like

I3     IF   D   D  EX  MEM   WB
I4         IF  IF   D   EX  MEM  WB
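To see the whole diagram rather than just those two rows, here is a rough Python sketch of an in-order scheduler. The function names `schedule` and `row` and the instruction encoding are hypothetical, and the model is deliberately simple: an ALU result can be forwarded into EX one cycle after the producer's EX, a load result one cycle after the producer's MEM, and a stage is held (repeated) while the older instruction still occupies the next one.

```python
# Hypothetical encoding for this sketch: (name, opcode, destination or None, sources).
PROGRAM = [
    ("I1", "ACS",  None, ("R0", "R1")),   # branch, assumed predicted not taken
    ("I2", "LOAD", "R2", ("R3",)),
    ("I3", "SUB",  "R4", ("R2", "R2")),
    ("I4", "ADD",  "R5", ("R1", "R2")),
    ("I5", "LOAD", "R1", ("R5",)),
    ("I6", "SUB",  "R1", ("R1", "R4")),
    ("I7", "ADD",  "R1", ("R1", "R5")),
]


def schedule(program, load_ops=("LOAD",)):
    """Return {name: (if_start, d_start, ex)} with 1-based cycle numbers."""
    ready = {}    # register -> first cycle its value can be forwarded into EX
    timing = {}
    prev = None   # (if_start, d_start, ex) of the previous instruction
    for name, op, dest, srcs in program:
        if prev is None:
            if_start, d_start, earliest_ex = 1, 2, 3
        else:
            p_if, p_d, p_ex = prev
            if_start = max(p_if + 1, p_d)       # IF is free once the older instr enters D
            d_start = max(if_start + 1, p_ex)   # D is free once the older instr enters EX
            earliest_ex = max(d_start + 1, p_ex + 1)
        # Stay in D until every source operand can be forwarded into EX.
        ex = max([earliest_ex] + [ready.get(r, 0) for r in srcs])
        if dest is not None:
            ready[dest] = ex + 2 if op in load_ops else ex + 1
        timing[name] = prev = (if_start, d_start, ex)
    return timing


def row(name, if_start, d_start, ex):
    """Format one instruction's stages as a fixed-width line of the diagram."""
    cells = [""] * (if_start - 1)
    cells += ["IF"] * (d_start - if_start)
    cells += ["D"] * (ex - d_start)
    cells += ["EX", "MEM", "WB"]
    return f"{name:<4}" + "".join(f"{c:<5}" for c in cells).rstrip()


timing = schedule(PROGRAM)
for name, stages in timing.items():
    print(row(name, *stages))

total = timing["I7"][2] + 2   # WB of the last instruction is two cycles after its EX
print("total cycles:", total)
```

Under these assumptions I3 and I6 each spend two cycles in D, I4 and I7 each spend two cycles in IF behind them, and the last WB lands in cycle 13.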

BTW, load latency is the reason that classic MIPS has a load-delay slot (unpredictable behaviour if you try to use a register in the next instruction after loading into it). Later MIPS added interlocks to stall if you do that, instead of making it an error, so you can keep static code-size smaller (no NOP filler) in cases where you can't find any other instruction to put in that slot. (And some even later MIPS did out-of-order exec which can hide latency.)

I had overlooked that 4th and 7th instruction fetch should start in the same cycle as previous instruction's decode. Thanks for the explanation — Olivia Pearls, Feb 02 '20 at 07:05
@SohamChatterjee: It might help people find this question if you edit it to mention where you got that incorrect diagram. e.g. if it's from a book, people might be searching on the book title and problem number. — Peter Cordes, Feb 02 '20 at 07:14
