How can I optimize this power-up program so that I don't get so much RAW?

Question

I have a problem with this code. I'm running it on WinMIPS64 and I'm getting a lot of RAW errors.

I'm new to this kind of coding and I'm still trying to learn it.

        .data
n:      .word   8
x:      .double 0.5

        .text
        LD      R1, n(R0)
        L.D     F0, x(R0)
        DADDI   R2, R0, 1       ; R2 = 1
        MTC1    R2, F11         ; F11 = 1
        CVT.D.L F2, F11         ; F2 = 1.0 (integer-to-double needs CVT.D.L, not CVT.L.D)

loop:   MUL.D   F2, F2, F0      ; F2 = F2*F0
        DADDI   R1, R1, -1      ; decrement R1 by 1
        BNEZ    R1, loop        ; if R1 != 0 continue

        ; result in F2
        HALT

I've added a line of code, `DADD R2, R2, R0`, after `DADDI R1, R1, -1 ; decrement R1 by 1` and enabled the delay slot. Is this a correct approach? Now I don't get any RAW errors. — mikeyMike, Dec 01 '21 at 09:52
RAW hazards aren't errors, they're just slowdowns. And no, adding an extra useless instruction that fills that slot in the pipeline is not better than letting the CPU stall for that cycle; in fact it's worse (for code size / L1 I-cache hit rate). Also, enabling branch delay slots means that HALT will run in the delay slot on the first iteration, so the high-latency instruction, the FP multiply, will only execute once. — Peter Cordes, Dec 01 '21 at 10:37
Is `x` always 0.5? If so, you can construct a `double` with value `0.5 ^ n` by integer-subtracting `n` from the exponent field of the bit-pattern for `1.0` with one `dsll` and one `dsub`, no looping. (Look at how the bits work for 32-bit floats in https://www.h-schmidt.net/FloatConverter/IEEE754.html). — Peter Cordes, Dec 01 '21 at 10:38
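A minimal sketch of the exponent-field trick described in the comment above, assuming `x` is exactly 0.5 and 0 <= n <= 1022 so that the result stays a normal double. The labels `one` and `result` are hypothetical additions. The shift count goes in a register with `DSLLV` on the assumption that `DSLL` may not accept an immediate of 52, and the value is passed through memory instead of `MTC1` so the sketch does not depend on how the simulator copies bits between register files.

```asm
        .data
n:      .word   8                   ; exponent (.word is 64-bit in WinMIPS64)
one:    .double 1.0                 ; bit pattern 0x3FF0000000000000
result: .double 0.0                 ; hypothetical output slot

        .text
        LD      R1, n(R0)           ; R1 = n
        LD      R2, one(R0)         ; R2 = raw bits of 1.0 (integer load of a double)
        DADDI   R3, R0, 52          ; the exponent field starts at bit 52
        DSLLV   R1, R1, R3          ; R1 = n << 52
        DSUB    R2, R2, R1          ; exponent field -= n, so the value becomes 2^-n
        SD      R2, result(R0)      ; store the bit pattern
        L.D     F2, result(R0)      ; F2 = 0.5^n
        HALT
```

There is no loop and no FP multiply here, so the long multiply latency never enters the dependency chain. This only works because 0.5 is a power of two; any other `x` needs real multiplications.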
Otherwise, to reduce the amount of stalls, shorten the dependency chain or create more instruction-level parallelism. Looks like you can't really do that without a different algorithm, though, because your computation is inherently serial. For doing powers in fewer steps, see Alexander Stepanov's lecture where he shows how multiplication by shift and add (to itself or adding the original thing) is the same structure as power by shift and multiply: https://youtu.be/etZgaSjzqlU?t=2276. — Peter Cordes, Dec 01 '21 at 10:47
That will take more integer work and branching, but should handle your `n=8` by doing `MUL.D F2, F2, F2` 3 times, rather than 8x `MUL.D F2, F2, F0`. On a simple CPU without out-of-order exec to hide latency, that's likely a win even for small `n` like this. — Peter Cordes, Dec 01 '21 at 10:49
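The shift-and-multiply idea from the comments above could look roughly like this. It is a sketch of general exponentiation by squaring, not code from the thread: it assumes n >= 0, branch delay slots disabled, and a hypothetical `one` constant in place of the `DADDI`/`MTC1`/convert sequence.

```asm
        .data
n:      .word   8
x:      .double 0.5
one:    .double 1.0                 ; hypothetical constant for the initial result

        .text
        LD      R1, n(R0)           ; R1 = remaining exponent bits
        L.D     F0, x(R0)           ; F0 = current power of x (x, x^2, x^4, ...)
        L.D     F2, one(R0)         ; F2 = result = 1.0
        BEQZ    R1, done            ; x^0 = 1.0
loop:   ANDI    R3, R1, 1           ; test the low bit of the exponent
        BEQZ    R3, skip
        MUL.D   F2, F2, F0          ; bit set: multiply the result by the current power
skip:   DSRL    R1, R1, 1           ; move on to the next exponent bit
        BEQZ    R1, done            ; no bits left: skip the unneeded final squaring
        MUL.D   F0, F0, F0          ; square the current power
        J       loop
done:   HALT                        ; result in F2
```

For `n = 8` this executes `MUL.D` four times (three squarings plus one multiply into the result) instead of eight. The squarings still depend on each other, so each one waits for the previous multiply, but there are far fewer of them.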
