The diagram is for a classic 5-stage MIPS pipelined architecture. Modern chips use superscalar designs, but let's ignore that [at least for the moment].
The problem here is that the diagram shows the times for the various types of instructions [for each T-state T1-T5], but there is no sample program to execute, unless the diagram is also an example of the loop. If that's the case, continue on ...
The other problem is pipeline "hazards". That is, a particular stage (T-state) for a particular instruction must "stall" because it depends on the output of a prior instruction. For example:
L1: add $t1,$t2,$t3
L2: add $t6,$t4,$t1
The second instruction must stall its "register read" stage (T2) because it must wait for the prior instruction's "register write" stage (T5) to complete [it needs the final value of $t1].
So, instead of a nicely behaved pipeline like:
1: L1:T1
2: L1:T2 L2:T1
3: L1:T3 L2:T2
4: L1:T4 L2:T3
5: L1:T5 L2:T4
6: L2:T5
We end up with:
1: L1:T1
2: L1:T2 L2:T1
3: L1:T3 L2:stall
4: L1:T4 L2:stall
5: L1:T5 L2:stall
6: L2:T2
7: L2:T3
8: L2:T4
9: L2:T5
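The schedule above can be reproduced with a small scheduler sketch. This is a minimal model under the same assumptions as the diagram: in-order issue of one instruction per cycle, no forwarding, and a dependent instruction's register read (T2) must wait until the cycle after the producer's register write (T5). The function name and the `(dest, srcs)` tuple encoding are my own invention for illustration, not anything from the original question.

```python
# Hypothetical sketch: schedule instructions through a 5-stage pipeline
# (T1=fetch, T2=reg read, T3=ALU, T4=mem, T5=reg write), NO forwarding.
def schedule(instructions):
    """instructions: list of (dest_reg, [src_regs]).
    Returns, per instruction, the cycle (1-based) of each stage T1..T5."""
    write_cycle = {}                  # reg -> cycle its T5 (write-back) runs
    schedules = []
    for i, (dest, srcs) in enumerate(instructions):
        t1 = 1 if i == 0 else schedules[i - 1][0] + 1   # fetch one per cycle
        t2 = t1 + 1
        for r in srcs:                # stall reg-read until after producer's T5
            if r in write_cycle:
                t2 = max(t2, write_cycle[r] + 1)
        stages = [t1, t2, t2 + 1, t2 + 2, t2 + 3]       # T3..T5 follow T2
        if dest:
            write_cycle[dest] = stages[4]
        schedules.append(stages)
    return schedules

# add $t1,$t2,$t3  then  add $t6,$t4,$t1  (RAW hazard on $t1)
sched = schedule([("$t1", ["$t2", "$t3"]),
                  ("$t6", ["$t4", "$t1"])])
# sched[1] puts L2's T2 at cycle 6 and T5 at cycle 9, matching the
# stalled pipeline shown above.
```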
In modern implementations, there are architectural techniques to avoid this (e.g. "forwarding", out-of-order execution), but we have to know the particular architectural implementation to know what tools it has to ameliorate hazards.
My best guess is as follows ...
Once again, if we ignore hazards, we need a particular program/sequence to do the calculations on.
If we assume the program is the diagram, then for 1,000,000 instructions the number of loop iterations is 1,000,000 / 4, or 250,000. And ... we're ignoring the branch delay slot as well.
The timing diagram for one loop iteration looks like:
label  inst  start  exec  end
             time   time  time
-----  ----  -----  ----  ----
L1:    lw        0   800   800
L2:    sw      200   700   900
L3:    R       400   600  1000
L4:    beq     600   500  1100
Notice that all instructions complete before L4 does, so the dominant time is L4's end time. Thus, 250,000 * 1100 ps, or 275 us, more or less.
UPDATE:
But my professor is telling me the answer is 1,000,000 * 200 ps + 1400 ps
Well, you should [obviously ;-)] believe your prof not me [I did emphasize "guess"].
But, again, we have to know the implementation: branch prediction, etc. Mine assumes that L1 of the 2nd loop iteration can't start until L4 of the 1st iteration completes.
If the loop/sequence were unrolled completely [and there was no branch], such as lw, sw, R, R repeated 250,000 times, it would be 1,000,000 * 200 ps, IMO.
I think the prof's analysis assumes that L1's T1 for loop 2 can start concurrently with L4's T2 for loop 1.
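For comparison, here is one reading of the professor's formula. I'm guessing at the interpretation: with perfect overlap, one instruction completes per 200 ps cycle in steady state, plus a fixed 1400 ps of pipeline fill/drain overhead. That's an assumption on my part, not something the professor stated.

```python
# Guessed reading of the professor's formula (all times in ps):
# steady-state throughput of one instruction per 200 ps cycle,
# plus a fixed 1400 ps pipeline fill/drain cost.
per_instruction = 1_000_000 * 200   # one completed instruction per cycle
fill = 1400
prof_total_ps = per_instruction + fill   # ~200 us, vs. my ~275 us estimate
```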
An example useful sequence could be a memmove-style sequence with overlapping source/destination [the registers are already preset]:
L1: lw $t0,4($t1)
L2: sw $t0,0($t1)
L3: addu $t1,$t1,$t2
L4: bne $t1,$t3,L1
Again, this assumes no branch delay slots. To make it work with them, and not just append a nop, the sequence would be L1, L2, L4, L3.
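What that four-instruction loop actually computes can be expressed in Python. This is a sketch of the semantics only (not the timing): each iteration copies the word one slot above down into the current slot, then advances the pointer, i.e. an overlapping word-by-word move. The function name and parameters are mine; `step` is in list indices here where the MIPS code steps $t1 in bytes.

```python
# Semantics of the loop above, in Python: an overlapping memmove that
# shifts words down by one slot, stepping until the pointer hits the limit.
def word_shift_down(mem, start, step, end):
    """mem: list of words. Copies mem[i+1] into mem[i] at each step."""
    i = start
    while i != end:
        mem[i] = mem[i + 1]   # lw $t0,4($t1) ; sw $t0,0($t1)
        i += step             # addu $t1,$t1,$t2
    return mem                # bne $t1,$t3,L1 is the while-loop test

words = word_shift_down([10, 20, 30, 40], start=0, step=1, end=3)
# words is now [20, 30, 40, 40]: each word shifted down one slot.
```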
However, I just reread the fine print: This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.
So, that may be the key as to why there is/was a discrepancy. Once again, when in doubt, believe your prof.