Converting given mantissa, exponent, and sign to float?

Question

I am given the mantissa, exponent, and sign, and I have to convert them into the corresponding float. I am using 22 bits for the mantissa, 9 bits for the exponent, and 1 bit for the sign.

I understand conceptually how to convert them into a float: first adjust the exponent back to its place, then convert the resulting number into a float. But I'm having trouble implementing this in C. I saw this thread, but I couldn't understand the code, and I'm not sure the answer is even right. Can anyone point me in the right direction? I need to code it in C.

Edit: I've made some progress by first converting the mantissa into binary, then adjusting the binary point, then converting the result back into the actual float. I based my conversion functions on these two GeeksforGeeks pages (one, two). But doing all these binary conversions seems like the long and hard way. The link above apparently does it in very few steps by using the >> operator, but I don't understand exactly how that results in a float.

Do you need to code it in C or C++? Also what exactly don't you understand from the link you provided, and what have you tried to implement so far? — Cerenia, Apr 19 '20 at 07:30
@Cerenia, Sorry I didn't mention I need to code in C. What I didn't understand in the link was what the Converted[ ] variable was and what it did exactly. Convert has four elements and I don't understand the author uses it to produce the final float. — curiouscoder, Apr 19 '20 at 07:41
you just need an enough long int for 32b and to use arithmetic << and | to fill your IEEE 32b float — bruno, Apr 19 '20 at 07:46
I think all the op does is write down the input he has into this datatype for ease of use. Unfortunately I am not comfortable enough in pure C to help. But maybe this resource will give you an idea: https://indepth.dev/the-simple-math-behind-decimal-binary-conversion-algorithms/ Relevant for you would be the *Converting fraction integer to decimal* and to get to fractions in the first place also *Base-q expansion of a number*. — Cerenia, Apr 19 '20 at 07:48
In the link *Converted* is an array of byte containing the 4 bytes making the float, as you can see each part is shift by 24 and 16 and 8 and (not done of course) 0. In your case this is different because you do not have the 4 bytes but the mantissa, exponent and sign separately. But the process is similar. What blocks you ? what you tried ? — bruno, Apr 19 '20 at 07:53
@bruno, what blocked me was that I didn't understand what exactly Converted did. I sort of understand what you explained though. How could I adapt what the link has to what I need to do? I've made some progress by first converting the mantissa into binary, then adjusting the decimal point of the binary, then converting the decimal-point binary back into the actual float. But it seems like doing all these binary conversions is doing it the long and hard way. Could you explain the process I need to do from the link? — curiouscoder, Apr 19 '20 at 08:05
"long and hard way" : how is possible ? Edit your question showing what you did. "Could you explain the process I need to do from the link" : in both case you have bit fields and have to group them placing each at the right place in the float — bruno, Apr 19 '20 at 08:16
Are you sure you need to use `22:9:1` for mantissa, biased exponent and sign-bit. IEEE-754 Single-Precision Floating-Point uses `23:8:1`? — David C. Rankin, Apr 19 '20 at 08:17

score 2 · Accepted Answer · answered Apr 19 '20 at 11:36

Here is a program with comments explaining the decoding:

#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>


//  Define constants describing the floating-point encoding.
enum
{
    SignificandBits = 22,   //  Number of bits in significand field.
    ExponentBits    =  9,   //  Number of bits in exponent field.

    ExponentMaximum = (1 << ExponentBits) - 1,
    ExponentBias    = (1 << (ExponentBits-1)) - 1,
};


/*  Given the contents of the sign, exponent, and significand fields that
    encode a floating-point number following IEEE-754 patterns for binary
    floating-point, return the encoded number.

    "double" is used for the return type as not all values represented by the
    sample format (9 exponent bits, 22 significand bits) will fit in a "float"
    when it is the commonly used IEEE-754 binary32 format.
*/
double DecodeCustomFloat(
    unsigned SignField, uint32_t ExponentField, uint32_t SignificandField)
{
    /*  We are given a significand field as an integer, but it is used as the
        value of a binary numeral consisting of “.” followed by the significand
        bits.  That value equals the integer divided by 2 to the power of the
        number of significand bits.  Define a constant with that value to be
        used for converting the significand field to represented value.
    */
    static const double SignificandRatio = (uint32_t) 1 << SignificandBits;

    /*  Decode the sign field:

            If the sign bit is 0, the sign is +, for which we use +1.
            If the sign bit is 1, the sign is -, for which we use -1.
    */
    double Sign = SignField ? -1. : +1.;

    //  Dispatch to handle the different categories of exponent field.
    switch (ExponentField)
    {
        /*  When the exponent field is all ones, the value represented is a
            NaN or infinity:

                If the significand field is zero, it is an infinity.
                Otherwise, it is a NaN.  In either case, the sign should be
                preserved.

            Note this is a simple demonstration implementation that does not
            preserve the bits in the significand field of a NaN -- we just
            return the generic NAN without attempting to set its significand
            bits.
        */
        case ExponentMaximum:
        {
            return Sign * (SignificandField ? NAN : INFINITY);
        }

        /*  When the exponent field is not all zeros or all ones, the value
            represented is a normal number:

                The exponent represented is ExponentField - ExponentBias, and
                the significand represented is the value given by the binary
                numeral “1.” followed by the significand bits.
        */
        default:
        {
            int    Exponent = (int) ExponentField - ExponentBias;
            double Significand = 1 + SignificandField / SignificandRatio;
            return Sign * ldexp(Significand, Exponent);
        }

        /*  When the exponent field is zero, the value represented is subnormal:

                The exponent represented is 1 - ExponentBias, and the
                significand represented is the value given by the binary
                numeral “0.” followed by the significand bits.
        */
        case 0:
        {
            int    Exponent = 1 - ExponentBias;
            double Significand = 0 + SignificandField / SignificandRatio;
            return Sign * ldexp(Significand, Exponent);
        }
    }
}


/*  Test that a given set of fields decodes to the expected value and
    print the fields and the decoded value.
*/
static void Demonstrate(
    unsigned SignField, uint32_t ExponentField, uint32_t SignificandField,
    double Expected)
{
    double Observed
        = DecodeCustomFloat(SignField, ExponentField, SignificandField);

    if (! (Observed == Expected) && ! (isnan(Observed) && isnan(Expected)))
    {
        fprintf(stderr,
            "Error, expected (%u, %" PRIu32 ", %" PRIu32 ") to represent "
            "%g (hexadecimal %a) but got %g (hexadecimal %a).\n",
            SignField, ExponentField, SignificandField,
            Expected, Expected,
            Observed, Observed);
        exit(EXIT_FAILURE);
    }

    printf(
        "(%u, %" PRIu32 ", %" PRIu32 ") represents %g (hexadecimal %a).\n",
        SignField, ExponentField, SignificandField, Observed, Observed);
}


int main(void)
{
    Demonstrate(0, 0, 0, +0.);
    Demonstrate(1, 0, 0, -0.);
    Demonstrate(0, 255, 0, +1.);
    Demonstrate(1, 255, 0, -1.);
    Demonstrate(0, 511, 0, +INFINITY);
    Demonstrate(1, 511, 0, -INFINITY);
    Demonstrate(0, 511, 1, +NAN);
    Demonstrate(1, 511, 1, -NAN);
    Demonstrate(0, 0, 1, +0x1p-276);
    Demonstrate(1, 0, 1, -0x1p-276);
    Demonstrate(0, 255, 1, +1. + 0x1p-22);
    Demonstrate(1, 255, 1, -1. - 0x1p-22);
    Demonstrate(0, 1, 0, +0x1p-254);
    Demonstrate(1, 1, 0, -0x1p-254);
    Demonstrate(0, 510, 0x3fffff, +0x1p256 - 0x1p233);
    Demonstrate(1, 510, 0x3fffff, -0x1p256 + 0x1p233);
}

Some notes:

ldexp is a standard C library function. ldexp(x, e) returns x multiplied by 2 to the power of e.
uint32_t is an unsigned 32-bit integer type. It is defined in stdint.h.
"%" PRIu32 provides a printf conversion specification for formatting a uint32_t.

score 0 · Answer 2 · answered Apr 19 '20 at 14:27

Here is a simple program to illustrate how to break a float into its components and how to compose a float value from a (sign, exponent, mantissa) triplet:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

void dumpbits(uint32_t bits, int n) {
    while (n--)
        printf("%d%c", (int)((bits >> n) & 1), ".|"[!n]);
}

int main(int argc, char *argv[]) {
    unsigned sign = 0;
    unsigned exponent = 127;
    unsigned long mantissa = 0;
    union {
        float f32;
        uint32_t u32;
    } u;

    if (argc == 2) {
        u.f32 = strtof(argv[1], NULL);
        sign = u.u32 >> 31;
        exponent = (u.u32 >> 23) & 0xff;
        mantissa = (u.u32) & 0x7fffff;
        printf("%.8g -> sign:%u, exponent:%u, mantissa:0x%06lx\n",
               (double)u.f32, sign, exponent, mantissa);
        printf("+s+----exponent---+------------------mantissa-------------------+\n");
        printf("|");
        dumpbits(sign, 1);
        dumpbits(exponent, 8);
        dumpbits(mantissa, 23);
        printf("\n");
        printf("+-+---------------+---------------------------------------------+\n");
    } else {
        if (argc > 1) sign = strtol(argv[1], NULL, 0);
        if (argc > 2) exponent = strtol(argv[2], NULL, 0);
        if (argc > 3) mantissa = strtol(argv[3], NULL, 0);
        u.u32 = (sign << 31) | (exponent << 23) | mantissa;
        printf("sign:%u, exponent:%u, mantissa:0x%06lx -> %.8g\n",
               sign, exponent, mantissa, (double)u.f32);
    }
    return 0;
}

Note that, contrary to your assignment, this program uses a 23-bit mantissa and an 8-bit exponent, which corresponds to the IEEE 754 standard 32-bit (single-precision) float format. See the Wikipedia article on Single-precision floating-point format.

It is not uncommon for class assignments to use non-standard formats so that students have to write their own code for encoding and decoding, since the native format will not work. — Eric Postpischil, Apr 19 '20 at 20:43
@EricPostpischil: If indeed it is a requirement, I leave it up to the OP to adjust the code :) — chqrlie, Apr 20 '20 at 07:40

score -1 · Answer 3 · answered Apr 19 '20 at 08:19

The linked question is C++, not C. To convert between data types in C while preserving the bits, one tool is a union. Something like:

#include <stdint.h>

union float_or_int {
  uint32_t i;
  float f;
};

float to_float(uint32_t mantissa, uint32_t exponent, uint32_t sign)
{
  union float_or_int result;
  result.i = (sign << 31) | (exponent << 22) | mantissa;
  return result.f;
}

Sorry for typos, it's been a while since I've coded in C

edited Apr 19 '20 at 16:10 · answered Apr 19 '20 at 08:19 · Program man

Sorry, it keeps returning 0.000 floats, what could be the problem? – curiouscoder Apr 19 '20 at 08:47
The question describes a custom floating-point format. This answer simply puts the bits of the custom floating-point format into the bits of a native `float`. Unless the native `float` uses the same format as the custom format, this cannot work. And no common C implementations use the custom format described in the question. – Eric Postpischil Apr 19 '20 at 11:38
Additionally, if `int` is 32 bits or less, then `sign << 31` overflows when `sign` is 1, and the resulting behavior is not defined by the C standard. – Eric Postpischil Apr 19 '20 at 11:38
Eric is right about the overflow. Function can take unsigned int to fix this. – Program man Apr 19 '20 at 16:10
