How does arithmetic floating point rounding in RISC-V work?

Question

I am currently working on my own RISC-V (rv64gc) emulator. Everything has gone smoothly so far; however, the floating-point rounding modes are giving me a headache.

The RISC-V ISA defines the following five floating-point rounding modes:

RNE (Round to Nearest, ties to Even)
RTZ (Round towards Zero)
RDN (Round down / towards negative infinity)
RUP (Round up / towards positive infinity)
RMM (Round to Nearest, ties to Max Magnitude)

When thinking about the instructions that convert floats to integers (e.g. FCVT.W.S), these modes all make sense. However, these aren't the only instructions with an encoded rounding mode. The instructions converting integers to floats also have a 3-bit field for the rounding mode, as do the floating-point arithmetic instructions.
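For an emulator, the 3-bit `rm` field mentioned above could be decoded roughly as in the following sketch. The encodings are those of the RISC-V F extension; the helper name `effective_rounding_mode` and the use of `ValueError` as a stand-in for an illegal-instruction trap are assumptions made for illustration.

```python
from enum import IntEnum


class RoundingMode(IntEnum):
    RNE = 0b000  # round to nearest, ties to even
    RTZ = 0b001  # round towards zero
    RDN = 0b010  # round down (towards -infinity)
    RUP = 0b011  # round up (towards +infinity)
    RMM = 0b100  # round to nearest, ties to max magnitude


DYN = 0b111  # in an instruction's rm field: use the mode stored in frm
_VALID = {m.value for m in RoundingMode}


def effective_rounding_mode(rm_field: int, frm: int) -> RoundingMode:
    """Resolve the rounding mode for one instruction (hypothetical helper)."""
    selected = frm if rm_field == DYN else rm_field
    if selected not in _VALID:
        # 0b101 and 0b110 are reserved, and DYN inside frm itself is invalid.
        # A real emulator would raise an illegal-instruction exception here.
        raise ValueError(f"illegal rounding mode {selected:#05b}")
    return RoundingMode(selected)
```

A static mode in the instruction wins; only `rm = 111` defers to the `frm` field of `fcsr`.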

Now let's say we have two floats and want to add them. If one is a large number and the other is a small number with many digits after the binary point, the exact sum might need more bits than a float can store. Whenever this happens, are the lowest bits/digits just discarded? If so, why is a rounding mode specified at all? Otherwise, how do the different modes work, and what do they round to?

Generally, rounding after discarding (which seems unavoidable without any extra bits available) makes no sense to me: once the least significant bits have been discarded, there is no need to reduce precision further by rounding, because what is left of the original number already fits in the storage. So does the rounding happen before the last bits are cut off, with the resulting zeros then being discarded?

Example:

Imagine we have a mantissa of 011010111 after adding two numbers, but a mantissa can hold at most 8 bits (so we have to get rid of 1 bit).

RNE: Option 1 is 011010110 (down), Option 2 is 011011000 (up)

This is a tie: Which option would it choose?

With either option, no further data is lost because only a 0 is discarded.

RTZ: Only option is 011010110 (towards Zero / down)

The last zero can now be discarded without any further data lost.

RDN and RUP: Depending on the sign bit, there is always only one way to go, and the last bit becomes 0, so no further data is lost when that bit is discarded.

RMM: This always has only one option too (away from 0 / up in this example).

When looking at another example with a 0 currently set as least significant bit, does it simply not round because incrementing/decrementing the number would actually increase precision here?

In case there is rounding happening before bits are discarded, does the CPU just temporarily hold a bigger result when the instructions are executed which is then used to get the rounded result of the correct size?

If I got something wrong fundamentally please correct me, likewise any help is appreciated!!

There is no capricious discarding of bits. Conceptually, floating-point rounding is specified as a function of the exact result (the result one would obtain by doing actual real-number arithmetic on the operands, also called the “infinitely precise” result). If the operands to `+` are *x* and *y* and *x* is hugely larger than *y*, the exact result is *x* + *y*, and it is rounded according to the chosen rounding method. If that is to nearest with ties to even, the result is *x*, because the fact that *y* is small means there is no representable number closer to *x* + *y* than *x* is… — Eric Postpischil, Aug 22 '21 at 13:39
… If the rounding method is toward zero, the result is *x* if both *x* and *y* are positive or both are negative. Otherwise, it is the next representable value from *x* toward zero. If the rounding method is up, the result is the next representable value greater than *x* if *y* is positive. Otherwise it is *x*. Round down is symmetric, and round to nearest with ties to max magnitude is the same as to even because there are no ties when *y* is so small. — Eric Postpischil, Aug 22 '21 at 13:42
IEEE implementations use three extra bits for arithmetic: guard, round, sticky. See https://stackoverflow.com/questions/19146131/rounding-floating-point-numbers-after-addition-guard-sticky-and-round-bits, for example. — Erik Eidt, Aug 22 '21 at 14:58
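The view in the comments above, rounding as a function of the exact result, can be sketched with exact rational arithmetic. This is only an illustration under stated assumptions: the function name `round_exact` is made up, the exponent range is treated as unbounded (no overflow, underflow, or subnormals), and `p` is the number of significand bits (24 for single precision).

```python
from fractions import Fraction
import math


def round_exact(exact: Fraction, p: int, mode: str) -> Fraction:
    """Round an exact rational to p significant bits (hypothetical helper)."""
    if exact == 0:
        return exact
    negative = exact < 0
    mag = abs(exact)
    # Choose e so that 2**(p-1) <= mag / 2**e < 2**p.
    e = mag.numerator.bit_length() - mag.denominator.bit_length() - p
    if mag / Fraction(2) ** e >= 2 ** p:
        e += 1
    scaled = mag / Fraction(2) ** e
    lo = math.floor(scaled)   # candidate with the smaller magnitude
    rem = scaled - lo         # exact fraction of an ULP that would be lost
    half = Fraction(1, 2)
    if rem == 0:
        up = False            # exactly representable, nothing to round
    elif mode == "RNE":
        up = rem > half or (rem == half and lo % 2 == 1)
    elif mode == "RMM":
        up = rem >= half
    elif mode == "RTZ":
        up = False
    elif mode == "RDN":
        up = negative         # towards -infinity grows a negative magnitude
    elif mode == "RUP":
        up = not negative
    else:
        raise ValueError(mode)
    result = (lo + up) * Fraction(2) ** e
    return -result if negative else result
```

With `p = 24` and `x = Fraction(2**30)`, `round_exact(x + 1, 24, "RUP")` gives `x + 128` while the other four modes give `x`; `round_exact(x - 1, 24, "RTZ")` gives `x - 64`, the next representable value towards zero.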

score 0 · Answer 1 · answered Aug 23 '21 at 00:12

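For RNE, 011010111 is a tie only if no other 1 bits were shifted out below it. In that case it rounds up to 011011000, because the kept 8 bits 01101100 are even (the LSb is 0), whereas 01101011 is odd. If any lower shifted-out bit is 1, the discarded part is more than half an ULP and it rounds up as well. RNE is the most common and default mode (per IEEE):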

       LGRS
011010111??
        ^^^
        bits shifted out

L = LSb
G = "Guard bit"
R = "Rounding bit"
S = "Sticky bit" (is "latched", once set to 1 it stays at 1)

LGRS
x0xx below half, round down (truncate)
x11x round up
x101 round up
0100 tie, round to even which in this case is down
1100 tie, round to even which in this case is up
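The LGRS table above covers RNE. A minimal sketch of how an emulator might extend the same bits to all five modes follows; the function names `split_grs`, `round_increment`, and `round_significand` are hypothetical, `sign` is 1 for negative values, and the caller is assumed to handle a carry out of the significand (renormalizing and bumping the exponent).

```python
def split_grs(wide: int, drop: int):
    """Split off `drop` >= 1 low bits into guard, round and sticky bits."""
    kept = wide >> drop
    guard = (wide >> (drop - 1)) & 1
    rnd = (wide >> (drop - 2)) & 1 if drop >= 2 else 0
    # Sticky is the OR of every bit below the round bit.
    sticky = int(drop >= 3 and (wide & ((1 << (drop - 2)) - 1)) != 0)
    return kept, guard, rnd, sticky


def round_increment(sign: int, lsb: int, guard: int, rnd: int,
                    sticky: int, mode: str) -> bool:
    """Return True if the kept magnitude must be incremented by one."""
    inexact = guard | rnd | sticky
    if mode == "RNE":
        # Above half, or exactly half with an odd kept value.
        return bool(guard and (rnd or sticky or lsb))
    if mode == "RMM":
        return bool(guard)
    if mode == "RTZ":
        return False
    if mode == "RDN":
        return bool(sign and inexact)
    if mode == "RUP":
        return bool(not sign and inexact)
    raise ValueError(mode)


def round_significand(sign: int, wide: int, drop: int, mode: str) -> int:
    """Round a wide significand magnitude down to its upper bits."""
    kept, guard, rnd, sticky = split_grs(wide, drop)
    return kept + round_increment(sign, kept & 1, guard, rnd, sticky, mode)
```

For the question's example, `round_significand(0, 0b011010111, 1, mode)` yields `0b01101100` for RNE, RUP and RMM, and `0b01101011` for RTZ and RDN.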
