31

With regard to IEEE-754 single-precision floating point, how do you perform round-to-nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)?

Basically I have the guard bit, round bit, and sticky bit. So if we form those into a vector and call it GRS, then the following rules apply:

  1. If G = 0, round down (do nothing)
  2. If G = 1, and RS == 10 or RS == 01, round up (add one to mantissa)
  3. If GRS = 111, round to even

So I am not sure how to perform the round to nearest. Any help is greatly appreciated.

Aaron
Veridian

4 Answers

46

Just to make sure we're on the same page, G is the most significant bit of the three, R comes next and S can be thought of as the least significant bit because its value partially represents the even less significant bits that have been truncated in the calculations. These three bits are only used while doing calculations and aren't stored in the floating-point variable before or after the calculations.

This is what you should do in order to round the result to the nearest even number using G, R and S:

GRS - Action
0xx - round down = do nothing (x means any bit value, 0 or 1)
100 - this is a tie: round up if the mantissa's bit just before G is 1, else round down = do nothing
101 - round up
110 - round up
111 - round up

Rounding up is done by adding 1 to the mantissa in the mantissa's least significant bit position just before G. If the mantissa overflows (its 23 least significant bits that you will store become zeroes), you have to add 1 to the exponent. If the exponent overflows, you set the number to +infinity or -infinity depending on the number's sign.

In the case of a tie, you add 1 to the mantissa if the mantissa is odd and you add nothing if it's even. That's what makes the result rounded to the nearest even value.
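The table and the carry handling above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function name `round_nearest_even` and the parameter `mant_bits` are hypothetical, the mantissa is assumed to be an integer that includes the hidden leading 1 (24 bits for single precision), and exponent overflow to infinity is left out.

```python
def round_nearest_even(mantissa, exponent, grs, mant_bits=24):
    """Round a significand to nearest, ties to even, from its G/R/S bits.

    Assumptions (not from the original post): `mantissa` is an integer that
    includes the hidden bit, `grs` packs G in bit 2, R in bit 1, S in bit 0,
    and overflow of the exponent to infinity is not handled here.
    """
    guard = (grs >> 2) & 1
    round_or_sticky = grs & 0b011
    lsb = mantissa & 1  # the mantissa bit just before G

    # 0xx: round down. 101/110/111: round up. 100: tie, round up only if odd.
    round_up = guard == 1 and (round_or_sticky != 0 or lsb == 1)
    if round_up:
        mantissa += 1
        if mantissa >> mant_bits:  # carry out of the significand
            mantissa >>= 1         # stored fraction bits are now all zero
            exponent += 1
    return mantissa, exponent
```

For example, with a 4-bit mantissa, `round_nearest_even(0b1011, 0, 0b100, mant_bits=4)` is a tie on an odd mantissa and gives `(0b1100, 0)`, while `round_nearest_even(0b1010, 0, 0b100, mant_bits=4)` leaves the even mantissa unchanged.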

Alexey Frunze
  • Do you even need the round bit? Can't you just look at the guard bit and the sticky bit becomes the OR of the round and old sticky bit? – Veridian Feb 14 '13 at 21:08
  • You may use none of these bits if you aren't very concerned about rounding. They just help you get a more accurate result. – Alexey Frunze Feb 14 '13 at 21:33
  • well I am concerned about rounding, so does my comment "Can't you just look at the guard bit and the sticky bit becomes the OR of the round and old sticky bit?" make sense? – Veridian Feb 18 '13 at 16:09
  • It does not. The sticky bit does not depend on what's to the left of it, it depends on what gets into it or past it to the right. – Alexey Frunze Feb 18 '13 at 16:56
  • Could you explain how that table(GRS action) is derived? Or any material and link would be useful. – inherithandle Apr 21 '13 at 05:39
  • @inherithandle See, for example, the `On Rounding` section of this [page](http://pages.cs.wisc.edu/~cs354-1/cs354/karen.notes/flpt.apprec.html). – Alexey Frunze Apr 21 '13 at 05:59
  • @AlexeyFrunze, that link is broken. Do you have another reference? Thanks – Veridian Feb 16 '17 at 16:29
  • @Veridian [wayback machine](https://web.archive.org/web/20130124022505/http://pages.cs.wisc.edu/~cs354-1/cs354/karen.notes/flpt.apprec.html) ? – Alexey Frunze Feb 17 '17 at 06:24
  • @Veridian your observation “can't you just look at the guard bit and the sticky bit becomes the OR of the round and old sticky bit” is absolutely correct in the case you presented, and indeed the round bit is not necessary. *However*, it is required in the case of subtractions, when the result underflows and needs to be shifted left. When that happens, a third bit is needed to decide what to do in the tie case. – sam hocevar Jul 26 '17 at 10:22
  • In case it is 011, I think you also need to round up? – Aaron Jan 12 '23 at 21:12
  • @Aaron No. What you're suggesting is like rounding 0.375 to an integer, that is, up to 1 (or 1.375 up to 2). – Alexey Frunze Jan 13 '23 at 04:34
  • @AlexeyFrunze I think you are wrong. If we have 011, that means we want to place the "point" between the 0 and 11 (0.11). The first one means that the value behind the point is >= 0.5 and the second 1 means that it is actually > 0.5, so we round up. (the first bit after the point is 2^(-1) = 0.5) – Aaron Jan 16 '23 at 11:12
  • @Aaron The GRS bits are used to round the result and then they are discarded. So, the "point" is here: result_mantissa.GRS and not here: result_mantissaG.RS. – Alexey Frunze Jan 17 '23 at 06:05
  • @AlexeyFrunze it seems that there are two ways of doing this... I clarified my answer accordingly. Thanks for pointing it out! – Aaron Jan 17 '23 at 18:33
  • @Aaron But conceptually there's no difference. You look at the bits you're going to discard (3 or 2) and the tie case and all other cases are still there and they are handled the same way. – Alexey Frunze Jan 18 '23 at 04:51
  • @AlexeyFrunze of course, the outcome and the underlying mathematical reasoning is the same, but it is a different way of getting to the result. So maybe some prefer one over the other way :) – Aaron Jan 18 '23 at 12:36
10

Just wanted to add that the S bit is not simply the bit that follows the G and R bits. If there are more bits after the GRS positions, S is the logical OR of all of them, including the bit in the S position itself.
In other words, if any bit after the G and R bits is 1, then the S bit is 1.
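As a rough sketch of how the sticky bit collects everything below R, the hypothetical helper `shift_right_grs` below drops the low `shift` bits of an integer and reports the resulting G, R and S. The name and the packing of GRS into a 3-bit integer are assumptions made for illustration only.

```python
def shift_right_grs(value, shift):
    """Drop the low `shift` bits of `value`; return (kept_bits, grs).

    G is the first discarded bit, R the second, and S is the OR of every
    discarded bit below R. Bits that do not exist are treated as 0.
    """
    kept = value >> shift
    guard = (value >> (shift - 1)) & 1 if shift >= 1 else 0
    round_bit = (value >> (shift - 2)) & 1 if shift >= 2 else 0
    # Everything below R is collapsed into a single sticky bit.
    below_round = value & ((1 << (shift - 2)) - 1) if shift > 2 else 0
    sticky = 1 if below_round else 0
    return kept, (guard << 2) | (round_bit << 1) | sticky
```

For instance, `shift_right_grs(0b101110001, 5)` discards the bits `10001`, so G = 1, R = 0 and S = 1 (because of the trailing 1), giving `(0b1011, 0b101)`.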

nsh
1

Consider the following for rounding when you have a set of bits lower than the precision you are keeping:

  1. If the least significant bit you are keeping is 0, add 0x7ff....f to the rounding bits.
  2. If the least significant bit you are keeping is 1, add 0x800....0 to the rounding bits.

Any carry out of the rounding bits then increments the bits you are keeping. I think this implements the desired behavior with only one test.
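A minimal sketch of this add-and-truncate trick, assuming the value is held in a plain integer and `discard` (a hypothetical parameter name) is the number of low rounding bits to drop, with `discard >= 1`:

```python
def round_by_adding(value, discard):
    """Drop the low `discard` bits of `value`, rounding to nearest even."""
    half = 1 << (discard - 1)          # the 0x800...0 pattern
    lsb = (value >> discard) & 1       # least significant bit being kept
    # Even kept part: add 0x7ff...f, so only values above the tie carry.
    # Odd kept part: add 0x800...0, so the tie carries as well.
    bias = half if lsb else half - 1
    return (value + bias) >> discard
```

With three rounding bits, `round_by_adding(0b1010100, 3)` is a tie on an even kept part and returns `0b1010`, while `round_by_adding(0b1011100, 3)` is a tie on an odd kept part and returns `0b1100`.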

Stephen Rauch
0

These are the rules for rounding to even based on the guard, round and sticky bit (INCR Yes/No signifies whether to add or not add a 1).

GRS INCR
x0x N
010 N
110 Y
x11 Y

The two interesting cases are the ones with x10, because then we have the situation where we need to round to even. An even binary number has a 0 in the least significant bit, and we can only increment the number or leave it as it is. So to make it even (get a zero at the least significant bit) in case of x10, if the guard is 0 we leave it and if it is 1 we increment.

EDIT: Apparently, there are two variations to this. Either the guard bit is the least significant bit that is kept, or it is the most significant bit that is cut off. My answer uses the guard bit as the least significant bit that is kept, so it is: result_mantissaG.RS. My source for this is the course Systems Programming at ETH Zurich by Prof. Roscoe:

lecture slide
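The INCR table of this answer can be written as a small predicate. This is a sketch under the convention used in this answer only (G is the least significant bit that is kept, R is the first discarded bit, S is the sticky bit); the function name `increment_needed` is hypothetical.

```python
def increment_needed(grs):
    """Return True if 1 must be added, with G as the kept LSB (G.RS)."""
    g = (grs >> 2) & 1  # least significant kept bit in this convention
    r = (grs >> 1) & 1  # first discarded bit
    s = grs & 1         # OR of all remaining discarded bits
    # R = 0: below half, never increment.
    # R = 1, S = 1: above half, always increment.
    # R = 1, S = 0: tie, increment only if the kept LSB (G) is odd.
    return bool(r and (s or g))
```

Checking all eight inputs reproduces the table: only `110`, `011` and `111` return `True`.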

Aaron