I am struggling to convert a 32-bit floating point value to a 16-bit floating point value in C.

I understand the concepts of normalizing, denormalizing, etc.

But I fail to understand the result below.

This conversion complies with the IEEE 754 standard (using round-to-even mode).

32bit floating point
00110011 01000000 00000000 00000000 

converted 16bit floating point
00000000 00000001

These are the steps I've taken.

The given 32-bit value's sign bit is 0, its exp field is 102, and the rest is the fraction field.

Subtracting the bias of 127 from the exp field of 102 gives -25, so the value is as below.

// since exp field is not zero, there will be leading 1.
1.1000000 00000000 00000000 * 2^(-25)
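
The field split described above can be checked mechanically. The following is a minimal sketch, assuming the 32-bit pattern is already available as a `uint32_t`; the names `f32_fields`, `split_f32`, and `f32_unbiased_exponent` are hypothetical and only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of an IEEE 754 binary32 pattern. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exp_field; /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits, hidden leading bit not included */
} f32_fields;

/* Split a raw 32-bit pattern into sign, exponent field, and fraction. */
f32_fields split_f32(uint32_t bits) {
    f32_fields f;
    f.sign = bits >> 31;
    f.exp_field = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}

/* Remove the bias. Only meaningful for normalized numbers, i.e. an
   exponent field that is neither all 0s nor all 1s. */
int f32_unbiased_exponent(f32_fields f) {
    assert(f.exp_field != 0 && f.exp_field != 0xFFu);
    return (int)f.exp_field - 127;
}
```

For the pattern in the question, `split_f32(0x33400000u)` yields a sign of 0, an exp field of 102, and a fraction of 0x400000, and `f32_unbiased_exponent` returns -25.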

When converting the above number to half precision, we have to add the bias (15) to the exponent to encode the exp field.

So the exp field is -10.

Since the encoded exp field is smaller than 0, the given 32-bit value cannot be represented in half precision.

So I thought the half-precision bit pattern would be as below.

00000000 00000000

But why is it 00000000 00000001?

I have read many posts on Stack Overflow, but they are just code samples that don't actually deal with the internal behavior.

Can someone please correct my misconception?

jwkoo

1 Answer


With a biased exponent of -10, you need to create a denormalized number (with 0 in the exponent field) by shifting the mantissa bits, including the hidden leading 1, right by 11. That gives you 00000 00000 11000... for the mantissa bits, which you then round up to 00000 00001 -- the smallest possible denorm number.
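
As a rough check of that rounding step, the same result can be reached by comparing magnitudes directly (a worked sketch, taking the half-precision bias of 15 and 10 mantissa bits as given in the question):

$$x = 1.1_2 \times 2^{-25} = 1.5 \times 2^{-25} = 0.75 \times 2^{-24}$$

$$\text{smallest positive denorm} = 2^{-10} \times 2^{1-15} = 2^{-24}$$

The two nearest representable half-precision values are therefore $0$ and $2^{-24}$. Since $x$ is $0.75$ of the way from $0$ to $2^{-24}$, it lies above the halfway point $2^{-25}$ and rounds to $2^{-24}$, which is the bit pattern 00000000 00000001. The round-to-even rule would only decide the outcome if $x$ were exactly $2^{-25}$, and in that case the result would be $0$.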


An IEEE fp number has a 1 bit sign, an n bit exponent field, and an m bit mantissa field. For the n bit exponent field, an all 1s value represents Inf or NaN and an all 0s value represents a denorm or zero (which depends on the mantissa bits). So only exponents in the range 1..$2^n - 2$ are valid for normalized numbers.

So when you calculate your "Normalized and biased" exponent, if it is ≤ 0, you need to generate a denorm (or zero) instead. The value for a normalized number is

$$(-1)^S \times (1.0 + 2^{-m} M) \times 2^{E - \text{bias}}$$

(where M is the value in the mantissa field treated as an unsigned integer and m is the number of mantissa bits -- some descriptions write this as 1.M). The value for a denorm is

$$(-1)^S \times (0.0 + 2^{-m} M) \times 2^{1 - \text{bias}}$$

That is, the exponent is the same as for a biased exponent value of 1, but the "hidden bit" (the extra bit added to the top of the mantissa) is treated as 0 instead of 1. So to convert your normalized number with the (biased) exponent of -10 to a denorm, you need to shift the mantissa (including the hidden 1 bit that is normally not stored) right by 1 - (-10) bits (that is, 11 bits) to get the mantissa value you want for the denorm. Since this will always shift by at least one bit (for any biased exponent ≤ 0), it will shift a 0 into the hidden bit position, matching the denorm meaning of the mantissa. If the exponent is small enough, the value will shift completely out of the mantissa, giving you a 0 mantissa (which is a zero). But in your specific case, even though it shifts entirely out of the 10 bits representable in fp16 format, the guard bits are still 1s, so it rounds up to 1.
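
The procedure described above can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `f32_bits_to_f16_bits` is hypothetical, the input is assumed to be the raw 32-bit pattern (as written in the question), and rounding is round-to-nearest, ties-to-even. It works on the full 24-bit significand, so the 13 bits a normal conversion drops anyway (23 down to 10) are added to the denorm shift of 1 - e.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name. Converts a binary32 bit pattern to a binary16 bit
   pattern using round-to-nearest, ties-to-even. */
uint16_t f32_bits_to_f16_bits(uint32_t bits) {
    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp_field = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    if (exp_field == 0xFFu) {
        /* Inf or NaN: force a nonzero mantissa so a NaN stays a NaN. */
        return (uint16_t)(sign | 0x7C00u | (frac ? 0x0200u : 0u));
    }

    /* Remove the binary32 bias (127), add the binary16 bias (15). */
    int32_t e = (int32_t)exp_field - 127 + 15;

    if (e >= 31) {
        return (uint16_t)(sign | 0x7C00u);  /* too large: overflow to Inf */
    }

    if (e <= 0) {
        /* No normalized fp16 encoding exists: build a denorm or zero. */
        if (e < -11) {
            /* Below half of the smallest denorm (also covers binary32
               zeros and denorms): rounds to signed zero. */
            return sign;
        }
        uint32_t sig = frac | 0x800000u;      /* restore the hidden 1 bit */
        uint32_t shift = (uint32_t)(14 - e);  /* 13 + (1 - e) */
        assert(shift >= 14 && shift <= 25);
        uint32_t mant = sig >> shift;
        uint32_t rem = sig & ((1u << shift) - 1u);  /* bits shifted out */
        uint32_t half = 1u << (shift - 1);
        if (rem > half || (rem == half && (mant & 1u))) {
            mant++;  /* may reach 0x400, which encodes the smallest normal */
        }
        return (uint16_t)(sign | mant);
    }

    /* Normalized result: drop 13 fraction bits and round. */
    uint32_t mant = frac >> 13;
    uint32_t rem = frac & 0x1FFFu;
    uint16_t h = (uint16_t)(sign | ((uint32_t)e << 10) | mant);
    if (rem > 0x1000u || (rem == 0x1000u && (mant & 1u))) {
        h++;  /* a carry out of the mantissa correctly bumps the exponent */
    }
    return h;
}
```

With the pattern from the question, `f32_bits_to_f16_bits(0x33400000u)` takes the denorm branch with e = -10 and a shift of 24, so every significand bit lands in the remainder; the remainder 0xC00000 exceeds the halfway value 0x800000, and the result rounds up to 0x0001. An input of exactly 2^-25 (`0x33000000u`) is a tie and rounds to the even value 0x0000.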

Chris Dodd
  • I appreciate your help. Actually, I don't understand "by shifting the mantissa bits right by 11". Could you please explain in more detail? Furthermore, do we need to create a denormalized number because the biased exponent of -10 is smaller than zero? – jwkoo Oct 06 '19 at 06:43
  • A normalized number has a (biased) exponent in the range 1..$2^n - 2$ -- the 0 code is reserved for denormalized numbers (and 0). So if the biased exponent you calculate is <= 0, you can't make a normalized number, and you have to make a denorm instead. – Chris Dodd Oct 06 '19 at 18:11
  • Thank you sir, you are more than a professor to me. – jwkoo Oct 07 '19 at 05:58
  • The reason I'm asking is that I am writing a C program that converts a single-precision floating point value to half precision. My professor advised me to use a union, but I hardly understand how a union helps to solve this problem. So now I am trying to write an algorithm that follows every step you've described, one by one. I may not be able to ask this on Stack Overflow directly, because it might be treated as too broad a question. May I ask you for an idea for an efficient algorithm for this problem? Thank you. – jwkoo Oct 07 '19 at 06:05