RISC-V Assembly - number of bubbles needed to make the code operational - [Hypothetical]

Question

I want to know why, when executing this assembly code on a pipelined RISC-V that does not stall automatically, with forwarding (except for internal register-file WB->DEC forwarding), we need to place two NOP instructions immediately after the third instruction. Wouldn't one NOP suffice?

addi t0, x0, 0
addi t1, x0, 5
addi s1, x0, 0x200   # why are two NOPs required after this instruction?
beq t1, t0, finish

Here's my line of thinking: after one NOP the first instruction has completed, and we can forward t1 from the second instruction's WB into the EXE stage of the beq. Where am I wrong?

In a valid processor implementation, `nop`s are never needed; they do not need to be inserted by the programmer or compiler. — Erik Eidt, Nov 20 '21 at 19:07
Maybe you are referring to a hypothetical processor without forwarding, which may require inserted `nop`s, though that would be some custom, non-standard implementation. — Erik Eidt, Nov 20 '21 at 19:09
Further, the third instruction targets `s1`, which is not even used by the `beq`. — Erik Eidt, Nov 20 '21 at 19:10
The CPU detects hazards and inserts bubbles for you if any are necessary. RISC-V doesn't allow the CPU to misbehave if you try to use a result "too early", so the only effect would be performance (stalls), not correctness. But anyway, yes on a classic RISC pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline), this code would need bypass forwarding from MEM to EX, rather than EX->EX, since the `beq` is 2 instructions after the last one that generated input for it. Or one stall cycle would be sufficient for WB->decode on a normal design where register file writes happen in the first half of the cycle. — Peter Cordes, Nov 20 '21 at 23:26
This is a hypothetical for my uni course. As I've explained above, this is a RISC-V pipelined unit with forwarding, EXCEPT for WB->decode. I've been tasked with adding NOPs so that the code is operational. I know I won't need to, and that the system does it automatically; the question's aim is to explain WHERE and WHY the system stalls. So why does it use two NOPs after the third instruction? Hope I've made myself clear. — kal_elk122, Nov 21 '21 at 11:08
You keep saying "with forwarding". Do you mean "**without** forwarding" (except for WB->decode which isn't really *forwarding*, at least not bypass forwarding because the data still goes through the register file, just writing in the first half-cycle, reading in the 2nd.) — Peter Cordes, Nov 21 '21 at 12:04
Anyway, this CPU should only stall for 1 cycle, because the `addi s1, x0, 0x200` fills one of the slots of latency from the `addi` writing `t1` to the `beq` reading it. If you're claiming it stalls for 2 cycles, you'll need to cite a source for that surprising claim. — Peter Cordes, Nov 21 '21 at 12:07
I meant *with* forwarding, except for register file bypassing (which we count as WB->DEC forwarding). — kal_elk122, Nov 21 '21 at 15:37
As for a source, it's from my university's test: https://moodle.technion.ac.il/pluginfile.php/1000286/course/section/94590/%D7%91%D7%97%D7%99%D7%A0%D7%94%20%D7%A1%D7%95%D7%A4%D7%99%D7%AA%20%D7%AA%D7%A9%D7%A2%D7%98%20%D7%9E%D7%95%D7%A2%D7%93%20%D7%90%20-%D7%A4%D7%AA%D7%A8%D7%95%D7%9F.pdf -> the 7th question. It's in Hebrew, sadly. — kal_elk122, Nov 21 '21 at 15:39
If your CPU does have forwarding like you say ("with forwarding"), it won't need to stall at all for this. It can just forward. IDK how you could have a CPU that can't do WB->DEC, if other forwarding is possible. Do you mean it has to forward some other way instead of reading the reg-file in the same cycle it wrote it? Unfortunately your URL doesn't work for people not enrolled in your course. — Peter Cordes, Nov 23 '21 at 10:52
It's a hypothetical scenario; it doesn't make sense and it isn't practical. Welcome to my uni. No, there are no other ways to forward it beyond what is specified. You're trying to defy the concept of the question, which is: when this program runs on a pipelined RISC-V that only has forwarding of a certain kind, what augmentations to the program (in the form of inserted NOPs), not to the RISC-V, are needed to make the program operable? — kal_elk122, Nov 23 '21 at 17:55

score 0 · Answer 1 · answered Nov 23 '21 at 10:49

As Erik said, there should not be a need for a NOP instruction. The CPU implementation should handle the dependencies and stall the pipeline when needed. If for some reason the implementation doesn't do it (I would refer to this as a BUG), there are workarounds to fix it at a later stage, such as a compiler that injects NOPs when it detects dependencies.

If the CPU supports forwarding, as you said, on a traditional 5-stage pipelined CPU, then there is no need for a NOP. When the BEQ instruction reaches the CPU's decode stage, t0 is already written to the register file, while t1 can be forwarded.

maku lulaj

It's a hypothetical scenario because my uni hates the concept of practicality and usefulness. I know this won't ever be useful, but this is on the test and meant to teach how the controller handles exceptions: instead of automated stalls, we're to add artificial NOPs. This is the question. As impractical as it is, that's what I get graded on. – kal_elk122 Nov 23 '21 at 17:52
Solved below; I figured it out. – kal_elk122 Nov 23 '21 at 18:25

score 0 · Accepted Answer · answered Nov 23 '21 at 18:13

So after working on this for a few hours, here's the solution. Two key facts are needed:

1. Beq can only be forwarded to from WB, since its branch condition is calculated in the branch comparator and forwarding only exists to the ALU.
2. As per the question's instructions, we can't forward from WB->DEC, so essentially we can't forward to beq.

Let's write the stages and "run the program":

IF   DEC  EXE  MEM  WB
1
2    1
3    2    1
4    3    2    1
     4    3    2    1

Notice that we can't execute 4 (beq t1, t0, finish), since it depends on t1's value from instruction 2, so we have to wait for that value. MEM->DEC forwarding doesn't exist. We can only pick up a new t1 at the DEC stage, since all the forwarding to EXE feeds the ALU, while the branch condition is calculated in the comparator, which we can't affect. Hence we must wait and place a single NOP. Let's continue.

IF   DEC  EXE  MEM  WB
     4    NOP  3    2

Notice that we STILL can't do anything: we're waiting for t1, but we don't have WB->DEC forwarding (as stated in the question), so 4 must wait in DEC until 2 has finished its WB stage before it can read t1's updated value. Hence we must place another NOP. Let's continue.

IF   DEC  EXE  MEM  WB
     4    NOP  NOP  3     <- notice 2 has finished; we can now continue with the correct t1
          4    NOP  NOP
               4    NOP
                    4
DONE.
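
The counting above can be sanity-checked with a small model. This is only a sketch in Python under stated assumptions, not anything from the course material: one instruction enters IF per cycle, the hardware never stalls, results are written in WB, and beq reads its operands in DEC. The names `min_distance`, `nops_before` and `pad` are made up for illustration.

```python
# Hypothetical timing model (an assumption, not a real RISC-V tool):
# five stages, one instruction enters IF per cycle, no hardware stalls.
STAGES = ["IF", "DEC", "EXE", "MEM", "WB"]


def min_distance(read_stage, write_stage="WB", same_cycle_ok=False):
    """Fewest instruction slots between producer and consumer so that the
    consumer's read stage comes after the producer's write stage."""
    gap = STAGES.index(write_stage) - STAGES.index(read_stage)
    # Without same-cycle write-then-read, the read must be one cycle later.
    return gap if same_cycle_ok else gap + 1


def nops_before(program, index, needed):
    """NOPs to insert before program[index] so that every source register
    was written at least `needed` slots earlier."""
    _, _, sources = program[index]
    worst = 0
    for slot in range(index):
        _, dest, _ = program[slot]
        if dest is not None and dest != "x0" and dest in sources:
            worst = max(worst, needed - (index - slot))
    return worst


def pad(program, index, needed):
    """Return a copy of the program with the required NOPs inserted."""
    count = nops_before(program, index, needed)
    return program[:index] + [("nop", None, ())] * count + program[index:]


# (text, destination register, source registers)
PROGRAM = [
    ("addi t0, x0, 0", "t0", ("x0",)),
    ("addi t1, x0, 5", "t1", ("x0",)),
    ("addi s1, x0, 0x200", "s1", ("x0",)),
    ("beq t1, t0, finish", None, ("t1", "t0")),
]
BEQ = 3  # index of the branch in PROGRAM

strict = min_distance("DEC")                         # no WB->DEC forwarding
same_cycle = min_distance("DEC", same_cycle_ok=True)  # write first half, read second

print("NOPs, no WB->DEC:", nops_before(PROGRAM, BEQ, strict))
print("NOPs, same-cycle write/read:", nops_before(PROGRAM, BEQ, same_cycle))
for text, _, _ in pad(PROGRAM, BEQ, strict):
    print(text)
```

Under the strict rule the model asks for two NOPs before the beq, which is the result worked out above. If the register file could be written and read in the same cycle, it asks for one, which lines up with the single stall cycle mentioned in the comments on the question.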
