Six-stage pipelining on a superscalar processor with two execution units

Question

I need help with a six-stage pipeline on a superscalar processor with two execution units. The six stages are Instruction Fetch (IF), Instruction Decode (ID), Register Read (RR), a 2-cycle Execution stage (EX, counted as two stages), and Write Back (WB). Instructions cannot be reordered. In any cycle, at most one memory instruction (load or store) and at most one non-memory instruction (register-to-register arithmetic) can issue. Latency is 3 cycles for loads and 2 cycles for all other instructions, where latency is the delay, in cycles, between the issue cycle of an instruction and the issue cycle of an instruction that depends on it. Now, we have the following instruction sequence:

(1) LD R21, (R20)
(2) LD R18, (R17)
(3) ADD R16, R21, R18
(4) LD R15, (R14)
(5) ADD R13, R12, R11
(6) SUB R23, R22, R24
(7) ST (R23), R10
(8) ADD R4, R21, R18
(9) ST (R3), R2
(10) ST (R1), R4

How many cycles does the program take to issue, given that an instruction is said to have issued when it passes from the RR stage to the EX stage?
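The issue rules above can be sketched as a small scheduler. This is only an illustration under stated assumptions, not a definitive model of the pipeline: the front end never starves the issue stage (no fetch bubbles), only RAW dependencies are tracked, a store reads both its address and data registers, and two instructions may issue in the same cycle if one is a memory operation and the other is not. The names `Instr`, `PROGRAM`, and `issue_cycles` are made up for this sketch.

```python
from typing import NamedTuple, Optional, Tuple

LOAD_LATENCY = 3   # issue-to-issue delay for a consumer of a load
OTHER_LATENCY = 2  # issue-to-issue delay for a consumer of anything else


class Instr(NamedTuple):
    op: str                    # "LD", "ST", "ADD" or "SUB"
    dst: Optional[str]         # destination register, None for stores
    srcs: Tuple[str, ...]      # registers read by the instruction


PROGRAM = [
    Instr("LD", "R21", ("R20",)),
    Instr("LD", "R18", ("R17",)),
    Instr("ADD", "R16", ("R21", "R18")),
    Instr("LD", "R15", ("R14",)),
    Instr("ADD", "R13", ("R12", "R11")),
    Instr("SUB", "R23", ("R22", "R24")),
    Instr("ST", None, ("R23", "R10")),   # assumed to read address and data
    Instr("ADD", "R4", ("R21", "R18")),
    Instr("ST", None, ("R3", "R2")),
    Instr("ST", None, ("R1", "R4")),
]


def issue_cycles(program, first_issue=1):
    """Return the cycle in which each instruction issues (enters EX)."""
    ready = {}     # register -> earliest cycle a consumer may issue
    used = set()   # (cycle, unit) issue slots already taken
    cycles = []
    prev = first_issue
    for ins in program:
        unit = "mem" if ins.op in ("LD", "ST") else "alu"
        # In-order issue: never earlier than the previous instruction.
        # RAW: never earlier than the cycle each source becomes available.
        cycle = max([prev] + [ready.get(r, first_issue) for r in ins.srcs])
        # Structural limit: one memory and one non-memory issue per cycle.
        while (cycle, unit) in used:
            cycle += 1
        used.add((cycle, unit))
        if ins.dst is not None:
            latency = LOAD_LATENCY if ins.op == "LD" else OTHER_LATENCY
            ready[ins.dst] = cycle + latency
        cycles.append(cycle)
        prev = cycle
    return cycles
```

Under these assumptions, `issue_cycles(PROGRAM)` gives `[1, 2, 5, 5, 6, 7, 9, 9, 10, 11]` when cycles are counted from the first issue. Passing `first_issue=4` (the first instruction spends cycles 1 to 3 in IF, ID and RR) shifts every entry by three, so the last instruction issues in cycle 14. A different reading of the problem statement, for example a single-instruction fetch width, would change these numbers.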

My working is as follows:

RAW dependencies exist between (1) and (3), (2) and (3), (1) and (8), (2) and (8), (6) and (7), and (8) and (10). Thus, the timing diagram is:

     01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 
I01: IF ID RR EX EX WB
I02: -- IF ID RR EX EX WB
I03: -- -- -- -- IF ID RR EX EX WB
I04: -- -- -- -- IF ID RR EX EX WB
I05: -- -- -- -- -- IF ID RR EX EX WB
I06: -- -- -- -- -- -- IF ID RR EX EX WB
I07: -- -- -- -- -- -- -- -- IF ID RR EX EX WB
I08: -- -- -- -- -- -- -- -- IF ID RR EX EX WB
I09: -- -- -- -- -- -- -- -- -- IF ID RR EX EX WB
I10: -- -- -- -- -- -- -- -- -- -- IF ID RR EX EX WB
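One way to double-check the RAW list is to track the most recent writer of each register while walking the sequence in program order. The snippet below is a minimal sketch of that idea. The `(dst, srcs)` encoding and the name `raw_pairs` are illustrative, and it assumes a store reads both its address register and its data register.

```python
# Each entry is (destination register or None, source registers),
# listed in program order; stores have no destination.
PROGRAM = [
    ("R21", ("R20",)),         # (1)  LD  R21, (R20)
    ("R18", ("R17",)),         # (2)  LD  R18, (R17)
    ("R16", ("R21", "R18")),   # (3)  ADD R16, R21, R18
    ("R15", ("R14",)),         # (4)  LD  R15, (R14)
    ("R13", ("R12", "R11")),   # (5)  ADD R13, R12, R11
    ("R23", ("R22", "R24")),   # (6)  SUB R23, R22, R24
    (None, ("R23", "R10")),    # (7)  ST  (R23), R10
    ("R4", ("R21", "R18")),    # (8)  ADD R4, R21, R18
    (None, ("R3", "R2")),      # (9)  ST  (R3), R2
    (None, ("R1", "R4")),      # (10) ST  (R1), R4
]


def raw_pairs(program):
    """Return (producer, consumer) pairs, numbered from 1, for RAW dependencies."""
    last_writer = {}   # register -> number of the latest instruction writing it
    pairs = []
    for number, (dst, srcs) in enumerate(program, start=1):
        for reg in srcs:
            if reg in last_writer:
                pairs.append((last_writer[reg], number))
        if dst is not None:
            last_writer[dst] = number
    return pairs
```

For this sequence, `raw_pairs(PROGRAM)` returns `[(1, 3), (2, 3), (6, 7), (1, 8), (2, 8), (8, 10)]`. Only the pairs whose producer issued recently enough actually force a stall; the rest are already satisfied by the time the consumer reaches RR.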

Please comment on the solution approach.

Fixed 2-cycle ALU latency for a CPU with a `mul` instruction? Sounds like a poor design (or overly simplified for example purposes). Single-cycle latency for simple ALU ops like `add` / `or` helps a lot. If you're spending enough transistors to do a `mul` in 2 cycles, surely you can include bypass forwarding logic and handle two different ALU latencies. — Peter Cordes, May 13 '18 at 05:02
That's showing the cycle when instructions enter the first EX stage? I meant a complete diagram showing the progress of every instruction through the pipe horizontally, with time horizontally and instructions vertically ([like the diagrams in this question](https://stackoverflow.com/questions/33941907/understanding-mips-assembly-with-pipelining)). But this works too if there are never any fetch bubbles, and you can keep the data dependencies + latencies in your head or write them down separately (e.g. as comments showing the most recent instruction each one depends on.) — Peter Cordes, May 13 '18 at 06:11
How does instruction (1) start executing in cycle (1)? Doesn't it have to go through IF / ID / RR first? If you're showing the IF stage, then your diagram doesn't make sense because the CPU can't know the data dependencies / resource conflicts until after decode (and probably not until RR). — Peter Cordes, May 13 '18 at 06:16
My previous comment still applies: the CPU can't predict future instructions, so it doesn't stall for resource conflicts / hazards until after ID or RR, after it knows what the next instruction is and discovers it can't issue the instruction into the EX stage, because the input operands aren't ready to forward. — Peter Cordes, May 13 '18 at 16:54
@PeterCordes you are correct, but, supposing cpu knows it beforehand, as an academic exercise, can this be put as is? — Dr. Debasish Jana, May 13 '18 at 17:06
If you want people to check your diagram carefully for you, write it in a way that makes sense. There's no plausible reason it wouldn't decode the next instruction, and it's not simpler than having stalls happen in the stage before EX like normal. — Peter Cordes, May 13 '18 at 17:10
The latency of load is 3 cycles. So for example the first instruction issues the load in cycle 04 and will get the value in cycle 07. So WB can only be done in cycle 07. The second instruction cannot issue the load until cycle 07 because there can be at most one outstanding load. The third instruction's IF and ID happen at cycles 03 and 04, respectively, then it gets stalled until the first two instructions complete WB. — Hadi Brais, May 13 '18 at 18:03
You should remove the `cpu` and `assembly` tags and use the `cpu-architecture` tag instead. — Hadi Brais, May 13 '18 at 18:04
@HadiBrais latency between an instruction I1 and an instruction I2 *dependent* on I1 is the time delay between their issue cycles. In this example, I2 is independent of I1. So why should I2's execution wait until I1's WB is over? — Dr. Debasish Jana, May 14 '18 at 02:10
Maybe I was not clear enough. The second load can be issued in the same cycle as the WB of the first instruction. It's not necessary to wait until I1 WB gets over. You have a restriction of a single outstanding load, right? The dependency between I1 and I2 is only structural. — Hadi Brais, May 14 '18 at 02:28
@HadiBrais: my reading of the description is that memory ops are pipelined at 1 per clock: at most one can issue per clock, but I don't see a restriction on the number in flight at once. *at most one instruction* issued *could be memory related...* — Peter Cordes, May 14 '18 at 02:31
@PeterCordes Oh, you're right. I thought there can only be one outstanding load at any time. Thanks. — Hadi Brais, May 14 '18 at 02:33
@HadiBrais: But good point that memory latency delays WB for loads. I guess we need somewhere to track a memory instruction for an extra cycle, so that may imply an in-flight limit of 2. At least if they're both loads? Stores don't need to WB. — Peter Cordes, May 14 '18 at 02:34
@PeterCordes Yes. Or in-flight limit of 3 depending on how you count. So I think there should be three EX stages for each load instruction. The pipeline should be adjusted like that. — Hadi Brais, May 14 '18 at 02:38
@Dr.DebasishJana: Are you trying to design a good assignment question for students here? Or is the pipeline spec fixed? If so, where did you get it from? See the previous comments between Hadi and myself; it appears it might need to stall itself on 3 independent loads in a row because WB comes too soon for 3c load latency. Unless memory ops start counting their latency from the RR stage or something, instead of EX. (i.e. MEM unit earlier in the pipeline, like in a [classic RISC pipeline](https://en.wikipedia.org/wiki/Classic_RISC_pipeline) for the same reason.) — Peter Cordes, May 14 '18 at 02:41
Oops, MEM stage *later* in the pipeline, after EX. It just forwards ALU results, but does part of the work for memory access. (In a classic RISC, it does actual cache access after the EX stage does address-generation. But in this design, it's just the 3rd cycle of a memory op, whatever that is). — Peter Cordes, May 14 '18 at 02:56
@PeterCordes, this is a student assignment problem that I am trying to assist with, as well as for my own understanding. It is more of an academic exercise, with the assumptions and restrictions as given in the problem statement. — Dr. Debasish Jana, May 14 '18 at 04:40
