The diagram is for a classic 5-stage MIPS pipelined architecture. Modern chips use superscalar designs, but let's ignore that [at least for the moment].
The problem here is that the diagram shows the times for the various types of instructions [for each T-state T1-T5], but there is no sample program to execute, unless the diagram is also an example of the loop. If that's the case, continue on ...
The other problem is pipeline "hazards". That is, a particular stage (T-state) for a particular instruction must "stall" because it depends on the output of a prior instruction. For example:
L1: add $t1,$t2,$t3
L2: add $t6,$t4,$t1
The second instruction must stall its "register read" stage (T2) because it must wait for the prior instruction's "register write" stage (T5) to complete [it needs the final value of $t1].
So, instead of a nicely behaved pipeline like:
1: L1:T1
2: L1:T2 L2:T1
3: L1:T3 L2:T2
4: L1:T4 L2:T3
5: L1:T5 L2:T4
6: L2:T5
We end up with:
1: L1:T1
2: L1:T2 L2:T1
3: L1:T3 L2:stall
4: L1:T4 L2:stall
5: L1:T5 L2:stall
6: L2:T2
7: L2:T3
8: L2:T4
9: L2:T5
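The schedule above can be reproduced with a small scheduler sketch. This is a minimal model under the same assumptions as the diagram: in-order issue of one instruction per cycle, no forwarding, and a dependent instruction's register read (T2) must wait until the cycle after the producer's register write (T5). The function name and the `(dest, srcs)` tuple encoding are my own invention for illustration, not anything from the original question.

```python
# Hypothetical sketch: schedule instructions through a 5-stage pipeline
# (T1=fetch, T2=reg read, T3=ALU, T4=mem, T5=reg write), NO forwarding.
def schedule(instructions):
    """instructions: list of (dest_reg, [src_regs]).
    Returns, per instruction, the cycle (1-based) of each stage T1..T5."""
    write_cycle = {}                  # reg -> cycle its T5 (write-back) runs
    schedules = []
    for i, (dest, srcs) in enumerate(instructions):
        t1 = 1 if i == 0 else schedules[i - 1][0] + 1   # fetch one per cycle
        t2 = t1 + 1
        for r in srcs:                # stall reg-read until after producer's T5
            if r in write_cycle:
                t2 = max(t2, write_cycle[r] + 1)
        stages = [t1, t2, t2 + 1, t2 + 2, t2 + 3]       # T3..T5 follow T2
        if dest:
            write_cycle[dest] = stages[4]
        schedules.append(stages)
    return schedules

# add $t1,$t2,$t3  then  add $t6,$t4,$t1  (RAW hazard on $t1)
sched = schedule([("$t1", ["$t2", "$t3"]),
                  ("$t6", ["$t4", "$t1"])])
# sched[1] puts L2's T2 at cycle 6 and T5 at cycle 9, matching the
# stalled pipeline shown above.
```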
In modern implementations, there are architectural techniques to avoid this (e.g. "forwarding", out-of-order execution), but we have to know the particular architectural implementation to know what tools it has to ameliorate hazards.
My best guess is as follows ...
Once again, if we ignore hazards, we need a particular program/sequence to do the calculations on.
If we assume the program is the diagram, then for 1,000,000 instructions the number of loop iterations is 1,000,000 / 4, or 250,000. And ... we're ignoring the branch delay slot as well.
The timing diagram for one loop iteration looks like:
label  inst  start  exec  end
             time   time  time
-----  ----  -----  ----  ----
L1:    lw        0   800   800
L2:    sw      200   700   900
L3:    R       400   600  1000
L4:    beq     600   500  1100
Notice that all instructions complete before L4 does, so the dominant time is L4's end time. Thus, 250,000 * 1100 ps, or 275 us, more or less.
UPDATE:
But my professor is telling me the answer is 1,000,000 * 200 ps + 1400 ps
Well, you should [obviously ;-)] believe your prof not me [I did emphasize "guess"].
But, again, we have to know the implementation: branch prediction, etc. Mine assumes that L1 of the 2nd loop iteration can't start until L4 of the 1st iteration completes.
If the loop/sequence were unrolled completely [and there was no branch], such as lw, sw, R, R repeated 250,000 times, it would be 1,000,000 * 200 ps, IMO.
I think the prof's analysis assumes that L1's T1 for loop 2 can start concurrently with L4's T2 for loop 1.
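For comparison, here is one reading of the professor's formula. I'm guessing at the interpretation: with perfect overlap, one instruction completes per 200 ps cycle in steady state, plus a fixed 1400 ps of pipeline fill/drain overhead. That's an assumption on my part, not something the professor stated.

```python
# Guessed reading of the professor's formula (all times in ps):
# steady-state throughput of one instruction per 200 ps cycle,
# plus a fixed 1400 ps pipeline fill/drain cost.
per_instruction = 1_000_000 * 200   # one completed instruction per cycle
fill = 1400
prof_total_ps = per_instruction + fill   # ~200 us, vs. my ~275 us estimate
```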
An example useful sequence could be a memmove-style sequence with overlapping source/destination [the registers are already preset]:
L1: lw $t0,4($t1)
L2: sw $t0,0($t1)
L3: addu $t1,$t1,$t2
L4: bne $t1,$t3,L1
Again, this assumes no branch delay slots. To make it work with them, and not just append a nop, the sequence would be L1, L2, L4, L3.
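What that four-instruction loop actually computes can be expressed in Python. This is a sketch of the semantics only (not the timing): each iteration copies the word one slot above down into the current slot, then advances the pointer, i.e. an overlapping word-by-word move. The function name and parameters are mine; `step` is in list indices here where the MIPS code steps $t1 in bytes.

```python
# Semantics of the loop above, in Python: an overlapping memmove that
# shifts words down by one slot, stepping until the pointer hits the limit.
def word_shift_down(mem, start, step, end):
    """mem: list of words. Copies mem[i+1] into mem[i] at each step."""
    i = start
    while i != end:
        mem[i] = mem[i + 1]   # lw $t0,4($t1) ; sw $t0,0($t1)
        i += step             # addu $t1,$t1,$t2
    return mem                # bne $t1,$t3,L1 is the while-loop test

words = word_shift_down([10, 20, 30, 40], start=0, step=1, end=3)
# words is now [20, 30, 40, 40]: each word shifted down one slot.
```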
However, I just reread the fine print: This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.
So, that may be the key as to why there is/was a discrepancy. Once again, when in doubt, believe your prof.