So, I am trying to understand floating point operations better, and I understand that when arithmetic operations are performed (I'm looking primarily at RISC) a rounding, guard and sticky bit are used for rounding of the result during normalization.

My question is, when we first "create" a binary floating point value, are rounding, guard and sticky bits utilized at all, or is the mantissa simply truncated? Could you not potentially loop forever trying to populate the sticky bit with a fractional value?

For example, if I am using a half-precision float (10-bit mantissa) with the value 2775.0, the binary value is 1010 1101 0111b. The mantissa would therefore be 0101 1010 111. If the last bit is truncated, the mantissa becomes 0101 1010 11 (2774.). If rounding occurs, the mantissa would become 0101 1011 00 (2776.).

Which is it?

I'd also be really interested in understanding how the FPU knows when to "stop" looking for the sticky bit when processing a decimal input value.

I've tried reading up on this and I don't find much on rounding as it relates to the initial conversion from decimal to binary (as far as rounding goes).

  • Converting from a decimal string to a binary floating-point number is an operation, similar to multiplication, taking a square root, or computing a cosine. The same rounding rules apply for it as for other operations. A rounding attribute specified by the programmer should be used. Any of the rounding attributes (to nearest with ties to even, toward +∞, toward zero, and so on) may be chosen. The most common default is round to nearest with ties to even. – Eric Postpischil Feb 13 '23 at 17:38
  • “Significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Mantissas are logarithmic. Significands are linear. – Eric Postpischil Feb 13 '23 at 17:40

1 Answer

When converting from decimal to binary floating point values, is rounding used, or truncation?

Potentially both; the rules depend heavily on the specification.

IEEE 754 offers several rounding modes. When decimal text or decimal floating-point values are converted to binary floating point, round-to-nearest with ties-to-even is the usual choice. How the rounding, guard and sticky bits are used depends on the rounding mode selected. Even when some of these bits do not affect the value, they still affect flags such as inexact.
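
As a rough sketch of how this plays out for the 2775 example in the question, the Python below rounds a positive integer to 11 significant bits (the implicit leading 1 plus the 10 stored bits of half precision). `round_to_bits` is a hypothetical helper written for illustration, not an FPU or library API; it handles only round-to-nearest-ties-to-even and round-toward-zero, and it folds the round bit into the sticky bit. The cross-check uses the standard `struct` module's `"e"` format (IEEE 754 binary16), which CPython packs with round-half-to-even.

```python
import struct


def round_to_bits(n: int, keep: int = 11, mode: str = "nearest_even") -> int:
    """Round a positive integer to `keep` significant bits (illustrative only)."""
    drop = n.bit_length() - keep            # how many low bits do not fit
    if drop <= 0:
        return n                            # already exactly representable
    kept = n >> drop                        # the bits that fit
    guard = (n >> (drop - 1)) & 1           # first discarded bit
    sticky = (n & ((1 << (drop - 1)) - 1)) != 0   # OR of all later discarded bits
    if mode == "nearest_even":
        # Round up when above the halfway point, or exactly halfway with odd `kept`.
        if guard and (sticky or (kept & 1)):
            kept += 1
    elif mode != "toward_zero":
        raise ValueError(f"unsupported mode: {mode}")
    return kept << drop


def to_half_and_back(x: float) -> float:
    """Convert to IEEE 754 binary16 and back using the struct module."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (2773, 2775):
    print(value,
          round_to_bits(value, mode="toward_zero"),
          round_to_bits(value, mode="nearest_even"),
          to_half_and_back(float(value)))
```

Under these assumptions, 2775 is an exact tie between 2774 and 2776, so ties-to-even picks 2776 (the even significand), while truncation gives 2774. The neighbouring tie, 2773, goes down to 2772 for the same reason.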


IEEE 754 also allows a decimal-text-to-binary conversion to ignore (treat as zeros) significant digits past a certain point. For double, that point must be at least 3 digits beyond the 17 needed for double-text-double round-tripping, so 20 or more significant digits.
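
To illustrate why such a digit limit can change the result, here is a small Python sketch. It assumes CPython's `float()` converts decimal text with correct rounding regardless of input length (true of mainstream builds as far as I know), and it simulates a digit-limited converter simply by slicing the text to 20 significant digits. The variable names are made up for the example.

```python
from decimal import Decimal, getcontext

getcontext().prec = 100                       # enough precision to keep the sum exact

# Exactly halfway between 1.0 and the next double, written out in full.
halfway = Decimal(1) + Decimal(2.0 ** -53)
tie_text = str(halfway)                       # 54 significant digits
above_text = tie_text + "1"                   # one extra nonzero digit at the far end

tie_value = float(tie_text)                   # exact tie: ties-to-even keeps 1.0
above_value = float(above_text)               # just above the tie: rounds up

# Simulated digit limit: "1." plus 19 fraction digits = 20 significant digits.
truncated_value = float(above_text[:21])

print(tie_value, above_value, truncated_value)
```

Here the 55th digit decides the correctly rounded result, yet a converter that stops at 20 digits never sees it and returns 1.0.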


Could you not potentially loop forever trying to populate the sticky bit with a fractional value?

A sticky bit needs, at most, to look as far as the last digit of the input encoding. That might involve many digits, but not forever. Often the search can quit early, as soon as a non-zero digit is found. Modern FPUs rarely use a loop at all; they use one wide, simultaneous OR instead.
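
A minimal sketch of that idea in Python, with hypothetical helper names: the digit scan is bounded by the input length and exits on the first non-zero digit, and a single comparison against zero stands in for the hardware's wide OR over all discarded bits.

```python
def sticky_from_digits(tail: str) -> tuple:
    """Return (sticky, digits_examined) for the decimal digits left over
    after the digits that already decided the rounded result."""
    for examined, ch in enumerate(tail, start=1):
        if ch != "0":
            return True, examined       # early exit: one non-zero digit is enough
    return False, len(tail)             # reached the end of the input: all zeros


def sticky_from_bits(discarded: int) -> bool:
    """Software analogue of OR-ing every discarded bit together in one step."""
    return discarded != 0


print(sticky_from_digits("0003" + "9" * 1000))   # stops at the fourth digit
print(sticky_from_digits("0" * 50))              # scans all 50 digits, then stops
print(sticky_from_bits(0b0000100))
```

Either way the work is finite: the text input has a last digit, and the discarded bits have a fixed width.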

chux - Reinstate Monica