What happens to the bits of a float during division?

Question

I have a homework assignment that requires me to divide a 32-bit single-precision floating-point number by 2 in C using bitwise operations (if statements and for loops can also be used). The float's bits are represented as unsigned integers so we can modify them with bitwise operators. I'm having trouble understanding what exactly happens to the bits during division. My initial plan was simply to right-shift the exponent bits by 1 while keeping the sign and mantissa bits the same, but that has not worked. For instance, when my function is given the bits 0x00800000, it returns 0x00000000, since right-shifting the exponent causes all bits to be 0. However, according to the test driver for the homework, the correct answer in this case is 0x00400000. This confuses me, because I'm not sure how or why the exponent bits would seemingly shift over into the mantissa bits.

unsigned divideFloatBy2(unsigned uf){
//copy the sign bit
unsigned signBit = (1 << 31) & uf;

//copy mantissa
unsigned mantissa = ~0;
mantissa >>= 9;
mantissa &= uf;

//copy exponent
unsigned mask = 0xFF;
mask <<= 23;
unsigned exponent = (uf & mask);
exponent >>= 23;

exponent >>= 1; //right shift to divide by 2;

exponent <<= 24;

//combine all again
unsigned ret = signBit | exponent | mantissa;
return ret; //will be interpreted as float later
}

This function works correctly for some inputs but not all, such as the input given above. Keep in mind that I'm more asking about what happens to a float's bits during division than I am simply asking for the code to make this work.

John Bollinger · Answer 1 · 2017-10-05T04:49:56.537

You have a good insight that scaling normalized, radix-two, floating-point numbers by powers of two affects only the exponent (supposing that you neither overflow nor underflow), but you are performing the wrong manipulation. Right-shifting the exponent by 1 is equivalent to dividing it -- the exponent -- by two. The result is of the same magnitude as the square root of the original number. That's not at all what you're after, unless the original number is around 4.

It might help you to write out an example in binary scientific notation, since that corresponds closely to the machine representation. Suppose, then, that your original number, N, is 1.01010 x 2^110, with both the significand and the exponent written in binary.

N / 2 = N * 2^-1
      = 1.01010 x 2^110 * 2^-1
      = 1.01010 x 2^(110 - 1)
      = 1.01010 x 2^101

So yes, the mantissa and sign don't change, but the effect on the exponent is simply to reduce it by 1.
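
As a minimal sketch of that rule, the following assumes `float` is IEEE-754 binary32 and uses two hypothetical helper names, `float_to_bits` and `halve_normal_bits`. It only handles the case described above (a normalized input whose result is also normalized) and deliberately returns every other input unchanged.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as a 32-bit unsigned integer.
   Assumption: float is IEEE-754 binary32. */
static uint32_t float_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Halve a normalized value by subtracting 1 from its biased exponent field.
   Inputs where this shortcut does not apply are returned unchanged. */
static uint32_t halve_normal_bits(uint32_t uf) {
    uint32_t expo = (uf >> 23) & 0xFFu;
    if (expo < 2u || expo == 0xFFu) {
        return uf;  /* zero, subnormal, exponent field 1, infinity, or NaN */
    }
    /* expo >= 2, so the subtraction cannot borrow into the sign bit */
    return uf - (UINT32_C(1) << 23);
}
```

For example, the bits of 6.0f map to the bits of 3.0f, and the sign and mantissa fields are untouched.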

With respect to your original program, do note that it does not, in fact, correctly implement the approach you describe. It shifts the exponent bits right by 23 to bring the least significant to the units place, then right by one more to implement your operation, but then it shifts back left by 24 bits. It ought to shift back left by only 23, reversing the original right shift, to bring the result bits back to the correct position.

The effect of the operation you actually perform is to clear the least-significant exponent bit, which happens to be equivalent to subtracting 1 when the biased exponent is odd. That's why it produces the right answer half the time.
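
To make that equivalence concrete, here is a small sketch. The function names are made up for illustration: the first reproduces the exponent handling from the question, the second just clears bit 23 of the exponent field.

```c
#include <stdint.h>

/* Exponent handling as written in the question:
   extract the field, shift right by 1, then shift back left by 24. */
static uint32_t question_exponent_field(uint32_t uf) {
    uint32_t exponent = (uf & (UINT32_C(0xFF) << 23)) >> 23;
    exponent >>= 1;
    exponent <<= 24;
    return exponent;
}

/* The exponent field with only its least-significant bit (bit 23) cleared. */
static uint32_t cleared_low_exponent_bit(uint32_t uf) {
    return uf & (UINT32_C(0xFE) << 23);
}
```

Both functions return the same value for every input. For 1.0f (0x3F800000, biased exponent 127, odd) the result is 0x3F000000, which is 0.5f. For 2.0f (0x40000000, biased exponent 128, even) the field comes back unchanged, so the function returns 2.0f instead of 1.0f.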

Probably worth pointing out that division-by-2 of an IEEE-754 `binary32` value via simple decrement of the exponent only works correctly for normalized floating-point numbers. Additional scaling is necessary to handle denormals (subnormals) correctly. — njuffa, Oct 05 '17 at 04:18
Point taken, @njuffa. I've qualified my comments to clarify that they apply to normalized inputs. — John Bollinger, Oct 05 '17 at 04:52

chux - Reinstate Monica · Answer 2 · 2017-10-05T20:27:41.613

when ... given 0x800000, my function returns 0 ...., the correct answer ... is 0x00400000.

This is dividing the minimum normal float value by 2 and is detailed in #3 below.

There are many issues with the code.

1. For most finite numbers, decrementing rather than shifting the exponent is correct, as pointed out in @John Bollinger's good answer. This applies when the exponent is > 1.
2. When the exponent == 0, the number is sub-normal (or denormal) and needs to have its mantissa field shifted right (/2). The exponent remains 0. If the bit shifted out is 1, then the divide-by-2 is not exact. Depending on the rounding mode, the mantissa is then adjusted, perhaps by adding 1.
3. When the exponent == 1, the result will be sub-normal: the implied bit of normal numbers needs to be created in the mantissa field, and the mantissa shifted right (/2). This shift may incur rounding as discussed above. The exponent becomes 0. Note that a rounded mant may exceed the mant max value of 0x7FFFFF and then require adjustments to the fields.
4. When the exponent == MAX (255), the number is not finite (it is infinity or Not-a-Number) and should be left alone.
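
One way to see why 0x00800000 halves to 0x00400000 is to decode both bit patterns into values. The sketch below is illustrative only: `decode_magnitude` is a hypothetical helper, it assumes the IEEE-754 binary32 layout, and it returns -1.0 as an ad hoc marker for infinities and NaNs.

```c
#include <stdint.h>

/* Decode the magnitude of a finite binary32 bit pattern into a double.
   Returns -1.0 for infinities and NaNs (an arbitrary marker for this sketch). */
static double decode_magnitude(uint32_t uf) {
    uint32_t expo = (uf >> 23) & 0xFFu;
    uint32_t mant = uf & UINT32_C(0x007FFFFF);
    double value;
    int power;

    if (expo == 0xFFu) {
        return -1.0;
    }
    if (expo == 0u) {
        /* Subnormal: no implied leading 1, exponent fixed at -126. */
        value = (double)mant;
        power = -126 - 23;
    } else {
        /* Normal: implied leading 1 sits just above the 23 mantissa bits. */
        value = (double)(mant | UINT32_C(0x00800000));
        power = (int)expo - 127 - 23;
    }
    /* Scale by 2^power using exact multiplications. */
    while (power > 0) { value *= 2.0; power--; }
    while (power < 0) { value *= 0.5; power++; }
    return value;
}
```

Under these assumptions, 0x00800000 decodes to 2^-126 and 0x00400000 decodes to 2^-127. The exponent field cannot go below 0, so the halving has to show up in the mantissa field instead, with the formerly implied leading 1 now stored explicitly as bit 22.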

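Code like 1 << 31 shifts a 1 into the sign bit of an int, which is undefined behavior when int is 32 bits wide. It is better written as:

// unsigned signBit = (1 << 31) & uf;    // original: signed shift
// unsigned signBit = (1u << 31) & uf;   // use an unsigned mask
// unsigned signBit = (1LU << 31) & uf;  // unsigned may be 16 bit
// or better yet
unsigned signBit = uf & 0x80000000;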

The mantissa derivation has a corner-case weakness: it relies on the (overwhelmingly common) 2's complement representation. A portable alternative:

// unsigned mantissa = ~0;  Incorrect mask in `mantissa` when `int` is not 2's comp.
// unsigned mantissa = -1;  correct all bits set.
// mantissa >>= 9;
// mantissa &= uf;
// or simply use
unsigned mantissa = 0x7FFFFF & uf;

unsigned may be 16, 32, or 64 bits wide, etc. It is better to use minimum-width or exact-width types.

#include <stdint.h>  // for uint32_t

#define SIGN_MASK 0x80000000
#define EXPO_MASK 0x7F800000
#define MANT_MASK 0x007FFFFF

#define EXPO_SHIFT 23
#define EXPO_MAX         (EXPO_MASK >> EXPO_SHIFT)
#define MANT_IMPLIED_BIT (MANT_MASK + 1u)

uint32_t divideFloatBy2(uint32_t uf){
  uint32_t sign = uf & SIGN_MASK;
  uint32_t expo = uf & EXPO_MASK;
  uint32_t mant = uf & MANT_MASK;

  expo >>= EXPO_SHIFT;
  // when the number is not an infinity nor NaN
  if (expo != EXPO_MAX) {
    if (expo > 1) {
      expo--;  // this is the usual case
    } else {
      if (expo == 1) {
        mant |= MANT_IMPLIED_BIT;
      }
      expo = 0;
      unsigned round_bit = mant & 1;
      mant /= 2;

      if (round_bit) {
        TBD_CODE_Handle_Rounding(round_mode, sign, &expo, &mant);
      }
    }
    expo <<= EXPO_SHIFT;
    uf = sign | expo | mant;
  }
  return uf;
}

OP later commented that with the exponent and sign bits all 0 and mantissa == 0x3, the expected result is 0x2, but their function returns 0x1. So the rounding mode is likely FE_TONEAREST or possibly FE_UPWARD.
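
A second input would tell the two candidate modes apart. The following is a minimal sketch using plain integer arithmetic on the mantissa field of a positive subnormal. The helper names are hypothetical and are not part of the homework or of any library.

```c
/* Halve a subnormal mantissa field; an exact tie goes to the even neighbor,
   which is what FE_TONEAREST does. */
static unsigned half_ties_to_even(unsigned mant) {
    unsigned half = mant >> 1;
    if ((mant & 1u) && (half & 1u)) {
        half++;  /* truncated result is odd, so the even neighbor is one above */
    }
    return half;
}

/* Halve a positive subnormal mantissa field; any inexact result rounds up,
   which is what FE_UPWARD does for positive values. */
static unsigned half_round_up(unsigned mant) {
    return (mant >> 1) + (mant & 1u);
}
```

Both helpers turn 0x3 into 0x2, which matches the reported expectation. They differ for 0x1: ties-to-even gives 0x0 while rounding up gives 0x1, so a test case with mantissa 0x1 would settle which mode the test driver expects.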

A rewrite of the case when expo <= 1 follows. It is tested code, exercised against many of the 2³² combinations and with 4 rounding modes. It needs #include <fenv.h> for fegetround() and the FE_... macros.

Note that when some_float/2.0f is computed, it may affect the floating-point environment status bits. I initially did likewise but have since eliminated that code from this post; contact me if interested.

    } else {
      if (expo == 1) {
        expo = 0;
        mant |= MANT_IMPLIED_BIT;
      }
      // Divided by 2 result inexact?
      if (mant % 2) {
        mant /= 2;
        // Determine how to round
        switch (fegetround()) {
          case FE_DOWNWARD:
            if (sign) mant++;
            break;
          case FE_TOWARDZERO:
            break;
          case FE_UPWARD:
            if (!sign) mant++;
            break;
          default: // When mode is not known, act like FE_TONEAREST
            // fall through
          case FE_TONEAREST:
            if (mant & 1) mant++;
            break;
        }
        if (mant >= MANT_IMPLIED_BIT) {
          mant = 0;
          expo++;
        }
      } else {
        mant /= 2;
      }
    }
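
For readers who only need one rounding mode, the cases can be assembled into a single function. This is a sketch under stated assumptions, not the code tested above: it hard-codes round-to-nearest, ties-to-even (the behavior of FE_TONEAREST), assumes `float` is IEEE-754 binary32, and uses the hypothetical names `halve_bits_nearest_even` and `halve_bits_hardware`.

```c
#include <stdint.h>
#include <string.h>

/* Halve a binary32 bit pattern, rounding ties to even (assumed mode). */
static uint32_t halve_bits_nearest_even(uint32_t uf) {
    uint32_t sign = uf & UINT32_C(0x80000000);
    uint32_t expo = (uf >> 23) & 0xFFu;
    uint32_t mant = uf & UINT32_C(0x007FFFFF);

    if (expo == 0xFFu) {
        return uf;                        /* infinity or NaN: leave alone */
    }
    if (expo > 1u) {
        return uf - (UINT32_C(1) << 23);  /* usual case: exponent minus 1 */
    }
    if (expo == 1u) {
        mant |= UINT32_C(0x00800000);     /* make the implied bit explicit */
    }
    uint32_t lost = mant & 1u;            /* bit shifted out by the halving */
    mant >>= 1;
    if (lost && (mant & 1u)) {
        mant++;                           /* exact tie: round to even */
    }
    /* If rounding carried into bit 23, that bit is exponent field 1 with a
       zero mantissa, which is the smallest normal number, as required. */
    return sign | mant;
}

/* Reference result from the floating-point unit, for comparison.
   Assumes the default rounding mode and no flush-to-zero setting. */
static uint32_t halve_bits_hardware(uint32_t uf) {
    float f;
    memcpy(&f, &uf, sizeof f);
    f = f * 0.5f;
    memcpy(&uf, &f, sizeof uf);
    return uf;
}
```

Comparing the two functions over a range of non-NaN inputs is a quick way to check the bitwise version. NaN inputs are best excluded from such a comparison, because hardware may change a NaN's payload while the bitwise version returns it untouched.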

For details on the rounding modes, search on the FE_... macros or here.

What exactly does rounding mean in this scenario? i.e., what does TBD_CODE_Handle_Rounding() do if round_bit is true? — jburn7, Oct 05 '17 at 17:27
@jburn7 It's code left for you to write. As answered here "perhaps by adding 1" to `mant`. Your post does not specify how rounding should be handled. To handle all rounding modes (there are at [least 4](https://www.gnu.org/software/libc/manual/html_node/Rounding.html)) is a fair amount more code. Better if the post clearly specified the rounding goals. — chux - Reinstate Monica, Oct 05 '17 at 17:34
Oh I see now. We weren't told how to round, other than that we need to be performing the bitwise equivalent of 0.5*f where f is a 32 bit single precision float. For reference, I have another example where when the exponent and sign bits are all 0, and the mantissa == 0x3, the expected result is 0x2, but my function is returning 0x1. However, I'm still not entirely sure what the mantissa bits mean in terms of decimal numbers, so I'm not sure which way this example is rounding. — jburn7, Oct 05 '17 at 17:44
@jburn7 In your [example](https://stackoverflow.com/questions/46577317/what-happens-to-the-bits-of-a-float-during-division/46590003?noredirect=1#comment80136618_46590003), the value is rounding up. +0x3 divide by 2 would be +1.5, but with whole number math, that becomes +2 when rounded up. Research `FE_TONEAREST` for added details. — chux - Reinstate Monica, Oct 05 '17 at 18:22
