
I have been researching a portable way to store a float in a binary format (in a uint64_t), so that it can be shared over a network with various microcontrollers. It should be independent of the float's memory layout and the endianness of the system.

I came across this answer. However, I am unable to understand a few lines in the code, which are shown below:

while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
fnorm = fnorm - 1.0;

// calculate the binary form (non-float) of the significand data
significand = fnorm * ((1LL<<significandbits) + 0.5f);

I am aware that the code above tries to normalize the significand. The first line in the fragment is trying to get the exponent of the float. I am not sure why the second, third and fourth lines are necessary. I can see that the second and third lines try to make the fnorm variable lie between 0.0 and 1.0, but why is that necessary? Does having fnorm (in decimal format) between 0.0 and 1.0 make sure its binary representation will be 1.xxxxxx...?

Please help me understand what each step is trying to achieve and how it achieves it. I want to understand how it changes the bit pattern of the float variable to get the normalized significand (left-most bit set to 1).

abhiarora
  • OT: My preference would be to use the ubiquitous text format and not invent another one, which needs thorough documentation and understanding by users. A `float` can be stored textually in about 10 bytes and can be read by humans too. – Weather Vane Jul 23 '19 at 16:37
  • I need to communicate it over a stream which can accept 20 bytes in a message. I can't use text format. – abhiarora Jul 23 '19 at 16:46
  • The first two lines are ranging the (absolute) `float` value as `1.0 ≤ fnorm < 2.0` with an adjustment to the exponent. From there on is the "new" representation under rules that every single user will have to learn and I'll pass on that, but presumably `1.0` is subtracted so that the significand can be in the range `0.0 ≤ fnorm < 1.0` It's going to be hardly any quicker than converting to text, when the 20 byte message size is adequate. – Weather Vane Jul 23 '19 at 17:00
  • The code in the question was not written merely for extracting the significand from a `float` or `double`. It was written to convert a floating-point value to a type with a narrower significand. The `+ 0.5f` rounds the value (with ties toward infinity). When merely extracting a significand, there would be nothing to round. – Eric Postpischil Jul 23 '19 at 17:13
  • In any case, the first two lines are intended to adjust the scale of the number so it is in the half-open interval [1, 2). Once it is known to be in that interval, multiplying by 2**`significandbits` and converting to an integer produces a copy of the significand bits. Subtracting 1 prior to that multiplication removes the leading 1 bit that is implicit in the significands of normal numbers. That code is not suitable for handling subnormal numbers. Also, the loop is unnecessary; C provides the `frexp` function for extracting the fraction (significand) and exponent of a floating-point number. – Eric Postpischil Jul 23 '19 at 17:16
  • Most commonly a `float` is 32-bits in C implementations. Why do you want to store it in a `uint64_t`? Do you want to store a `double` in a `uint64_t` or `float` in a `uint32_t`? Would it suffice to support big endian and little endian or must you support any memory layout of a `float` even with some mixed-order storage of its bits? – Eric Postpischil Jul 23 '19 at 17:18
  • IEEE-754 is already completely specified to the bit level. Just choose an endianness for the transmission and make sure each end converts if needed. – Lee Daniel Crocker Jul 23 '19 at 17:18
  • It's certainly going to be less work if you use a common standard, otherwise you'll force *every* user to convert your home-brew format; not just those systems that use a different format. – Weather Vane Jul 23 '19 at 17:36
  • So just making sure `fnorm` lies between `0.0` and `1.0` (in decimal) makes it convert to the form `1.xxxxxx...` in binary? – abhiarora Jul 23 '19 at 17:50
  • I am considering to use `frexp` in my implementation. @EricPostpischil – abhiarora Jul 23 '19 at 17:52
  • The whole thing is a thoroughly bad idea. All you are doing is creating *more* complication by inventing another format. Those who inherit your code will sigh and mutter "Oh, the abhiarora format". – Weather Vane Jul 23 '19 at 17:54
  • @WeatherVane I am just trying to understand that code. The code above tries to pack the float in IEEE 754 format. I am not trying to invent something of my own. I have only a 20-byte message size and it will carry other parameters as well (like `int`, `char`). I already have a microcontroller platform which accepts data in binary messages. I need to write a `Desktop` application which can communicate with these microcontroller units using my own libraries, and I have to use IEEE 754 only to send floats across the network. – abhiarora Jul 23 '19 at 17:59
  • @abhiarora: If you have a number in an IEEE-754 format, such as in a `float` or `double` where your implementation uses IEEE-754 for those types, then the easiest way to prepare it for network transmission is to read its bytes using an `unsigned char *`, and the only issue is what order to send the bytes in. You should not write code to manipulate the number mathematically in order to encode it into a format for transmission—that is already done for you by the C `float` or `double` type. – Eric Postpischil Jul 23 '19 at 18:02
  • You are right, but what if the code has to be used on a platform which doesn't use IEEE 754 and doesn't use the binary32 format? – abhiarora Jul 23 '19 at 18:04
  • @abhiarora: If you do not have a number in an IEEE-754 format, then code somewhat like the above can be used to transform the number into parts that you could use to construct an IEEE-754 format, which you would then send over the network. (But it is unlikely your C implementation is not using an IEEE-754 format for `float` and `double`. Or, if you are writing for great portability, so you do not know the formats of the target C implementation and must accommodate any C floating-point type generally, then you have considerably more work to do.) – Eric Postpischil Jul 23 '19 at 18:04
  • Because the code may be used on another weird microcontroller to communicate with the modules. So the problem is that those systems may be using some other implementation, and they need library support to communicate with those modules. I hope you understand the situation I am in. – abhiarora Jul 23 '19 at 18:08
  • To support floating-point types in any C implementation, including those that do not use IEEE-754, you should refer to the characteristics of C floating types in C 2018 5.2.4.2.2. Each floating-point type has a sign, a base/radix *b*, an exponent, a precision *p*, and *p* digits in the base. So you need a format that can accommodate a bit for the sign, a positive integer for the base (which can theoretically be any size), an integer for the exponent (which can range between implementation-dependent values), an integer for the precision, and an arbitrary number *p* of base-*b* digits. – Eric Postpischil Jul 23 '19 at 18:08
  • So writing code to convert *any* C-conforming floating-point type to a single format (like IEEE-754 binary32) for transmission involves a considerable amount of code—it would need to perform conversions from one base to another, possibly for very large or small numbers, requiring thousands of digits. You really do not want that. Most likely, the solution you want uses the preprocessor to test values defined in `<float.h>` to ensure an IEEE-754 format is being used and then proceeds to send the bytes of a `float` or `double` without any transformation needed, just adjustments for byte order. – Eric Postpischil Jul 23 '19 at 18:12
  • Sure. I think you are right. I will code that way. – abhiarora Jul 23 '19 at 18:16
  • @abhiarora: Don't forget the best way is to avoid/forbid the use of floating point in the first place. It's almost never needed and almost always annoying, and some microcontrollers don't support floating point (and need to emulate it in software, which is slow). Typically you can just use integers with an implied scale everywhere instead (e.g. rather than have "floating point height of person in meters" use "integer height of person in millimeters" or ....), and avoid a lot of problems (caused by the "precision depends on magnitude" nature of floating point) and avoid a lot of buggy bloat. – Brendan Jul 23 '19 at 19:58

1 Answer


The while loops adjust the exponent in order to place the first binary 1 of fnorm just before the dot (in base 2).
So at most fnorm is 1.1111111... in base 2, which is almost 2.0 in base 10.
At least fnorm is 1.000000... in base 2, which is 1.0 in base 10.

In IEEE754, the significand of a normalised number (not subnormal) has the form 1.xxxxxx... (base 2), which conforms to the previous loops.
The first bit, before the dot, is always 1; that's why it is not necessary to store it.
(Maybe this last remark is the main point of your question.)

After normalisation, your algorithm subtracts 1.0, which leads to 0.xxxxx... as you saw.
Subtracting 1.0 does not lose any information as long as we remember this subtraction is systematic.
Multiplying this float value (strictly less than 1.0, but not negative) by the integer 1LL<<significandbits gives a float which is strictly less than this big integer.
Thus, converting it into an integer will give a value that does not overflow the significand bits.
(I guess the 0.5 increment helps round the last bit.)

This integer contains all the significant bits that were originally in the significand of the floating point value.
Knowing it, the shift, and the sign makes it possible to reconstitute the original floating-point value.

But, as suggested in the comments, since the IEEE-754 bit pattern is well defined, all of this may not be necessary.

prog-fh
  • The significand does not always have the form “1.xxxxxx…” It has a leading 1 for normal numbers and a leading 0 for subnormal numbers. Code to serialize and deserialize floating-point data must take this into account. (“Significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction part of a logarithm.) – Eric Postpischil Jul 23 '19 at 17:29
  • @EricPostpischil yes, thanks, and since I don't know the exact motivation behind this, maybe I should delete... I'm afraid this answer is more confusing than helpful. It depends on what the real point of the question was; I guess it was about the -1.0, but out of this context I am not sure the answer is useful. – prog-fh Jul 23 '19 at 17:36
  • So just making sure fnorm lies between 0.0 and 1.0 (in decimal) makes it convert to the form 1.xxxxxx... in binary? – abhiarora Jul 23 '19 at 18:00