
I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused about how to determine how many logical left shifts are needed when calculating the exponent.

Given an int, say 15, we have:

Binary: 1111

-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.

E = Exp - Bias. Therefore, Exp = 130 = 10000010

And the significand (23 bits) will be: 11100000000000000000000

However, I knew that the 'e' value would be three because I was able to see that there are three bits after placing the decimal after the first bit. Is there a more generic way to code for this as a general case?

Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.

Also, could someone explain why rounding is needed for values greater than 23 bits? Thanks in advance!

Andrew T
  • I guess it depends on whether you want to do this completely manually, or do something fancy and hackish, and whether you want it to work with negative values, or values larger than fit in the mantissa. – Joe Z Dec 01 '13 at 00:24
  • Oh, and more importantly, it seems you're interested in converting an _integer_ into a floating point number (possibly already in an `int`), as opposed to a number written in decimal, stored in, say, `std::string`. Is that correct? – Joe Z Dec 01 '13 at 00:26
  • That is correct, I'm assuming that the value to be converted will be in an integer format. And we are assuming that the values are non-negative, non-zero, and fit into the mantissa. Thanks for bringing that up; I will clarify that in my question. – Andrew T Dec 01 '13 at 00:31

1 Answer


First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf

And now to some meat.

The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.

IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:

[Figure: IEEE-754 single-precision floating point format (layout diagram from Wikipedia)]
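
If it helps to see that layout in code, here is a minimal sketch (the constant names are mine, not standard identifiers) that pulls the three fields out of a raw bit pattern, assuming unsigned int is 32 bits:

#include <cstdio>

int main()
{
    // Field layout of an IEEE-754 binary32 value.
    const unsigned int EXPONENT_MASK    = 0xFFu << 23;   // bits 30 .. 23
    const unsigned int SIGNIFICAND_MASK = 0x7FFFFFu;     // bits 22 .. 0

    unsigned int bits = 0x40000000u;                     // the example decoded below
    unsigned int sign        = bits >> 31;               // bit 31
    unsigned int exponent    = (bits & EXPONENT_MASK) >> 23;
    unsigned int significand = bits & SIGNIFICAND_MASK;

    std::printf("sign=%u exponent=%u significand=0x%06X\n",
                sign, exponent, significand);            // sign=0 exponent=128 significand=0x000000
    return 0;
}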

The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.

(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )

Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:

  • Bit 31 = 0: Positive value
  • Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e. 2^1)
  • Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).

So the value is 1.0 x 2^1 = 2.0.
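
As a quick sanity check (my own snippet, not part of the conversion below), you can copy that bit pattern into a float with memcpy, assuming both types are 32 bits wide, and confirm that it prints 2.0:

#include <cstdio>
#include <cstring>

int main()
{
    unsigned int bits = 0x40000000u;            // the pattern decoded above
    float value;
    std::memcpy(&value, &bits, sizeof value);   // reinterpret the bits as a float
    std::printf("%f\n", value);                 // prints 2.000000
    return 0;
}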

To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:

  • Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
  • While aligning the integer, records the total number of shifts made.
  • Masks away the hidden 1.
  • Using the number of shifts made, computes the exponent and appends it to the number.
  • Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.

There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:

float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    //  Align the leading 1 of the significand to the hidden-1 
    //  position.  Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    //  The number 1.0 has an exponent of 0, and would need to be
    //  shifted left 23 times.  The number 2.0, however, has an
    //  exponent of 1 and needs to be shifted left only 22 times.
    //  Therefore, the exponent should be (23 - shifts).  IEEE-754
    //  format requires a bias of 127, though, so the exponent field
    //  is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    //  Now merge significand and exponent.  Be sure to strip away
    //  the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);


    //  Reinterpret as a float and return.  This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}
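
For example, a small test harness of my own (assuming uint_to_float above is in scope) that compares the result against a plain cast:

#include <cstdio>

float uint_to_float(unsigned int significand);   // defined above

int main()
{
    unsigned int tests[] = { 1, 2, 15, 16777215 };   // 16777215 == 2^24 - 1
    for (unsigned int v : tests)
        std::printf("%8u -> %f (plain cast gives %f)\n",
                    v, uint_to_float(v), (float)v);
    return 0;
}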

You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
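For instance, with a GCC/Clang-style __builtin_clz (an assumption about your compiler; MSVC spells it _BitScanReverse), the whole shift loop collapses to one subtraction. A sketch with the same preconditions as above, assuming a 32-bit unsigned int:

#include <cstring>

float uint_to_float_clz(unsigned int value)
{
    // Same restriction as before: 0 < value < 2^24.
    if (value == 0 || value >= 1u << 24)
        return -1.0f;

    // __builtin_clz counts leading zero bits in a 32-bit value, so the
    // distance from the leading 1 to bit 23 is (clz - 8).
    int shifts = __builtin_clz(value) - 8;

    unsigned int exponent = 127 + 23 - shifts;
    unsigned int merged   = (exponent << 23) | ((value << shifts) & 0x7FFFFF);

    float result;
    std::memcpy(&result, &merged, sizeof result);   // avoids the type-punned pointer
    return result;
}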

You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
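A minimal sketch of that signed extension, assuming 32-bit int and float and reusing uint_to_float (so the same magnitude limit of 2^24 still applies):

#include <cstring>

float uint_to_float(unsigned int significand);   // defined above

float int_to_float(int value)
{
    // Record the sign, then work with the magnitude.
    unsigned int sign_bit  = (value < 0) ? 1u << 31 : 0u;
    unsigned int magnitude = (unsigned int)((value < 0) ? -(long long)value : value);

    float f = uint_to_float(magnitude);          // build the positive result

    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits);         // grab its bit pattern,
    bits |= sign_bit;                            // drop the sign into bit 31,
    std::memcpy(&f, &bits, sizeof f);            // and reinterpret again
    return f;
}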

For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round to nearest even). But the fact of the matter is you can't shove more than 24 significant bits into a 24-bit significand without some loss.

You can see this in terms of the code above. It works by aligning the leading 1 to the hidden-1 position. If a value were >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits that get shifted away.
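
To make that concrete, here is a sketch of my own (not part of the answer above) that handles the full 32-bit unsigned range by shifting right and rounding to nearest, ties to even, which is the IEEE-754 default mode:

#include <cstring>

float uint32_to_float(unsigned int value)
{
    if (value == 0)
        return 0.0f;                       // all-zero bits encode 0.0

    // Find how far the leading 1 is from the hidden-1 position (bit 23).
    int msb = 31;
    while ((value & (1u << msb)) == 0)
        msb--;

    unsigned int significand;
    int exponent = msb;                    // unbiased exponent

    if (msb <= 23)
    {
        significand = value << (23 - msb); // fits: just align, no rounding
    }
    else
    {
        int shifts = msb - 23;             // how many LSBs get shifted away
        unsigned int kept      = value >> shifts;
        unsigned int discarded = value & ((1u << shifts) - 1);
        unsigned int half      = 1u << (shifts - 1);

        // Round to nearest; on an exact tie, round so the last kept bit is 0.
        if (discarded > half || (discarded == half && (kept & 1u)))
            kept++;

        if (kept == (1u << 24))            // rounding carried out of 24 bits
        {
            kept >>= 1;
            exponent++;
        }
        significand = kept;
    }

    unsigned int bits = ((unsigned int)(exponent + 127) << 23)
                      | (significand & 0x7FFFFF);
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}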

Joe Z
  • Thank you very much Joe Z for your help! :) I have a couple more questions for you though, that will hopefully solve the matter! When asking my question above, I used the example of '15'. To find the exponent, I used repeated division by 2 in a loop and a counter. However, for the significand, if I use a logical shift left, that will simply give me 0000000000001110 stored. However, the value that I want to store in the significand is 1110000000000000. Is that what is done when you '&' the significand with 0x7FFFFF? And as for the rounding, is there a way to check how many times we must – Andrew T Dec 01 '13 at 15:58
  • shift right? Again, thanks very much for the help, I really appreciate it! You have definitely increased my understanding of the process :) – Andrew T Dec 01 '13 at 15:59
  • The `while` loop above keeps shifting the significand _left_ until its left-most 1 aligns with where the hidden 1 _would_ be. If you look at the diagram for the floating point format, the hidden 1 would be in bit 23 if it weren't hidden. When the loop ends, the leading 1 still isn't "hidden" yet. That's why the loop keeps shifting until there's a 1 in bit 23. The `& 0x7FFFFF` strips off that leading 1, making it "hidden". That is, it clears bit 23. (It also clears bits 24 .. 31, but they should already be clear.) Make sense? – Joe Z Dec 01 '13 at 16:11
  • Ahhh, I understand the purpose of the loop now. It will keep shifting until the hidden '1' is reached, and then the significand will already be in the desired format. The entirety of the syntax is still questionable at this point, however. So the condition of the while loop states that if the 23rd bit is zero, the shift counter will increase. But what does the "significand <<= 1" do exactly? I haven't seen that sort of operator used before... – Andrew T Dec 01 '13 at 16:39
  • In C and C++, there's an entire family of `op=` operators. When you see `x op= y`, it's roughly equivalent to `x = x op y`. So, `significand <<= 1` is roughly equivalent to `significand = significand << 1`. ie. it left shifts `significand` by 1 bit. (I say "roughly equivalent" because there's big differences when you get into overloaded operators in C++. But for the base types like `unsigned int` you can think of them as equivalent.) – Joe Z Dec 01 '13 at 16:43
  • I apologize if I seem like an idiot since I'm having difficulties understanding this! But the code given here is assuming that we already have the value in the IEEE format, correct? Otherwise, trying to gain access to the 23rd bit of an inputted integer will yield much different results. Going from the binary value for an int to a floating point will require different steps than this I suppose – Andrew T Dec 01 '13 at 17:03
  • No, the function above accepts an integer value as an `unsigned int`. Try it! It's manipulating the integer with plain logical operations (left shift, bitwise-and) to align the integer value as a significand in a floating point representation, and only at the end does it do some compiler magic to make it re-interpret the bit pattern as a floating point number. – Joe Z Dec 01 '13 at 17:06
  • You might try adding print (cout) statements throughout the above code so you can see each of the individual steps in action. Also, if you need to return just the formatted bit-pattern as an `unsigned int`, rather than a `float`, change the return type to `unsigned int`, and just do `return merged;` at the end. – Joe Z Dec 01 '13 at 17:11
  • Oh, wait a minute! I think I'm starting to get what you are trying to say. So an unsigned int is brought into the function, and then the value is checked so that it will be within the range of allowable bits. Then the unsigned int value, dubbed "significand", will be bitwise-anded with a '1' at the 23rd position. If it finds a zero, it will continue to shift the unsigned int left until it reaches the point at which it finds the hidden '1'? – Andrew T Dec 01 '13 at 17:40
  • @AndrewT : More or less correct, yes. Try adding a `cout` statement to that loop to watch it happen. Call it with a few different integers (1, 2, 16777215), and see what happens! – Joe Z Dec 01 '13 at 17:57
  • Epiphany. The 23rd bit of the significand will always be compared to '1', and since it is being 'anded', it will ONLY return a 1 if the significand has a value of one in the 23rd bit as well. Otherwise, it will always result in a zero, in which case the loop continues to iterate, and the significand is continually shifted left until that case is true. Is it possible when comparing, that the 23rd bit will be a '1' before it reaches the hidden one? I will definitely try this code out for sure! I just wanted to make sure I fully understand why the results are the way they are :) – Andrew T Dec 01 '13 at 18:10
  • The number is always smaller than 2^24. That means bits 31 .. 24 are all zero. The worst that happens is that bit 23 is already 1 when you start, in which case it won't shift and it won't loop, and `shifts` will be zero. (Try 16777215 to see what it does.) – Joe Z Dec 01 '13 at 18:20
  • Brilliant. Thank you for all of your help Joe, your patience with me has been a true virtue. – Andrew T Dec 01 '13 at 18:34