cast 32bit-float to 64bit-double on system where sizeof double == sizeof float == 4

Question

I am trying to serialize a float according to the BSON spec, which only supports 64-bit doubles, so I need to convert my float to a double.

On a system where sizeof(double) == 8 I would just do

float f = 3.14;
serialize((double)f);

But since sizeof(double) == 4 on my target system, I have to do something like

float f = 3.14;
uint64_t d;
float32_to_float64(f, &d);
serialize(d);

I have written some test code (on a machine where sizeof(double) == 8) that tries to convert the float32 to float64 correctly and store the result in a uint64_t, but I am not getting the expected result.

#include <stdio.h>
#include <stdint.h>

#define FLOAT_FRACTION_MSK  0xFFFFFF

#define DOUBLE_FRACTION_S   52 // Fraction is 52 bits
#define DOUBLE_EXPONENT_S   11 // Exponent is 11 bits

#define FLOAT_FRACTION_S    23 // Fraction is 23 bits
#define FLOAT_EXPONENT_S    8  // Exponent is  8 bits

int main(void) {
    // float af = 3.14;
    float af = 0.15625;

    double bd = 0;
    //uint8_t buff[sizeof(int64_t)] = {0};

    *(uint64_t*)&bd |= (*(uint32_t*)&af & (1UL << 31)) << 32; // check sign bit


    uint8_t exponent32 = (*(uint32_t*)&af & 0x7F800000) >> (FLOAT_FRACTION_S+1);
    if (exponent32 == 0xFF) return 1; // Error (infinity if fraction is zero,
                                      // NaN otherwise)


    printf("exponent32=%.4x\n", exponent32);
    int64_t temp = *(uint64_t*)&bd;
    *(uint64_t*)&bd |= ((uint64_t)exponent32 << (DOUBLE_FRACTION_S+4)); //& 0x7FF0000000000000; // (33); // 28
    printf("exponent64=%llx, %d\n", *(uint64_t*)&bd, (DOUBLE_FRACTION_S+4));

// Do the fraction
{
    printf("fraction64=%#.8llx\n", (
        (uint64_t)(
            (*(uint32_t*)&af & FLOAT_FRACTION_MSK) // + ((exponent32 != 0) ? (1<<24) : 0)
        ) << (DOUBLE_FRACTION_S-FLOAT_FRACTION_S-4)//((52-22)-1) // 33
    ) );

    *(uint64_t*)&bd |= (
        (uint64_t)(
            (*(uint32_t*)&af & FLOAT_FRACTION_MSK) // + ((exponent32 != 0) ? (1<<24) : 0)
        ) << (DOUBLE_FRACTION_S-FLOAT_FRACTION_S)
    ) ;
}


    double expected = af;
    printf("Original float=%#.4x, converted double=%#.8llx expected=%.8llx,\n", *(uint32_t*)&af, *(uint64_t*)&bd, *(uint64_t*)&expected);
    printf("Original float=%f, converted double=%lf\n\n", af, bd);

    *(uint64_t*)&bd = temp;

    return 0;
}

This prints: Original float=0x3e200000, converted double=0x3e04000000000000 expected=3fc4000000000000,

So it seems I am missing something when converting the exponent, but I am at a loss as to what that is.

One clear mistake is that you are not taking care of the exponent bias: 127 for float, 1023 for double. Do keep in mind that this code doesn't necessarily perform correctly on a platform that ignores IEEE-754. — Hans Passant, Nov 18 '14 at 12:35
Why will it not perform correctly if the platform ignores IEEE-754? Also, I thought the C standard specified float and double according to IEEE-754, so a standard-compliant compiler would never ignore IEEE-754. — FnuGk, Nov 18 '14 at 12:45
So, is `CHAR_BIT == 16`? Otherwise, how can a ieee754 double-precision floating-point number fit into 32 bits? — EOF, Nov 18 '14 at 12:50
No. I am trying to convert a float (32 bits) into a double (64 bits), store it in a uint64_t, and then serialize the uint64_t LSB first into the serialized buffer. — FnuGk, Nov 18 '14 at 12:54
Let's assume `CHAR_BIT == 8`. Now, if `sizeof(double) == 4`, the size of the double *in bits* will be `CHAR_BIT * sizeof(double) == 32`. Not `64`. You see the problem? — EOF, Nov 18 '14 at 13:01
Yes, and therein lies the reason why I can't just cast my float to a double. I need to do the bit twiddling manually to perform the conversion and store the result in a 64-bit container. Otherwise I don't think I understand the problem case you are presenting. — FnuGk, Nov 18 '14 at 13:07
I'll put it in simple terms: If `sizeof(double) == 4` and `CHAR_BIT == 8`, your double is not a ieee754 double. **End of story.** If `sizeof(double) == sizeof(float)`, `double` may well be an alias for `float`. — EOF, Nov 18 '14 at 13:14
Exactly, that is why I need to convert it. (Anonymous answered with a working solution, by the way.) My float is a correct IEEE-754 float as per the C standard. My double, however, is just an alias for float. This is also fully standard compliant, as the C standard just states that double must have **at least** as much precision as a float. An alias to float does indeed have as much precision as a float. — FnuGk, Nov 18 '14 at 13:24
@FnuGk, 1) Anonymous' solution is a working solution for most, but not all, `float`. 2) The post title is confusing; the goal is more like "Convert IEEE-754 binary32 to binary64 on a system where double is not binary64". — chux - Reinstate Monica, Nov 18 '14 at 15:56
@EOF Agree with the confusion. OP's `double` is certainly not an [IEEE-754 binary64](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) — chux - Reinstate Monica, Nov 18 '14 at 15:58
@chux: And I agree that Anonymous' solution fails for denormals... — EOF, Nov 18 '14 at 16:11
@FnuGk: No, the C standard does not require IEEE-754 floating-point. It *permits* it, and allows an implementation to claim that it supports it (by predefining the macro `__STDC_IEC_559__`), but a non-IEEE-754 system can support a conforming C implementation. C's general requirements for floating-point are much looser than those imposed by IEEE-754. **However** ... (see next comment) — Keith Thompson, Nov 18 '14 at 20:15
I'm fairly sure that C's minimal requirements for type `double` cannot be met by a 32-bit type, which means that the C implementation you're using does not conform to the C standard. What compiler are you using, and for what target system? — Keith Thompson, Nov 18 '14 at 20:18
@Keith Thompson Agree that conforming `double` cannot fit in 32-bit. `DBL_DIG` alone implies at least 10 decimal (33+ binary) digits. I think `double` takes at least 33+8+1 bits. — chux - Reinstate Monica, Nov 18 '14 at 20:29
OK, it seems I was misinformed then. The platform is an Atmel AVR AT90CAN128 8-bit microcontroller. The compiler is avr-gcc 4.8.1. — FnuGk, Nov 19 '14 at 11:19
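
The exponent-bias point raised in the comments can be checked by hand on the question's test value. 0.15625 is 0x3e200000 as a binary32, so the exponent field is 0x7C (124). Removing the float bias of 127 and adding the double bias of 1023 gives 1020 (0x3FC), and the 23 fraction bits move up by 52 - 23 = 29 bits, which yields the expected 0x3fc4000000000000. A minimal sketch of just that step follows. The name `widen_normal_bits` is made up for illustration, and the helper deliberately covers normal numbers only (no zeros, subnormals, infinities or NaNs).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: widen the raw bits of a
   NORMAL binary32 value to binary64 bits. Zeros, subnormals, infinities
   and NaNs are rejected by the assert instead of being converted. */
uint64_t widen_normal_bits(uint32_t f)
{
    uint32_t sign = f >> 31;            /* 1 sign bit */
    uint32_t expo = (f >> 23) & 0xFFu;  /* 8-bit biased exponent */
    uint32_t frac = f & 0x7FFFFFu;      /* 23 fraction bits */

    assert(expo != 0 && expo != 0xFF);  /* normal numbers only */

    /* Rebias: remove the float bias (127), apply the double bias (1023). */
    uint64_t expo64 = (uint64_t)expo + (1023 - 127);

    return ((uint64_t)sign << 63)
         | (expo64 << 52)
         | ((uint64_t)frac << (52 - 23));
}
```

The special cases that this sketch rejects are exactly what the answers below spend most of their code on.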

Anonymous · Accepted Answer · 2014-11-18T20:11:30.117


Fixed denormals, infinities and NaNs.

unsigned __int64 Float2Double(float v)
{
    unsigned int f = *(unsigned int*)&v; // reinterpret 
    if ( !(f&0x7fffffff) )
        return (unsigned __int64)f<<32; // return +/-0.0

    unsigned int s = f>>31; // get sign
    unsigned int e = ((f&0x7f800000)>>23) -128; // get exponent and unbias from 128

    unsigned int m = f&0x007fffff; // get mantisa

    if (e==-128)
    {
        // handle denormals
        while ( !(m&0x00800000) )
        {
            m<<=1;
            e--;
        }
        m&=0x007fffff; // remove implicit 1
        e++;           //
    }
    else
    if (e==127)
    {
        // +/-infinity
        e = 1023;
    }

    unsigned __int64 d = s; // store sign (in lowest bit)

    d <<= 11; // make space for exponent
    d |= e +1024;   // store rebiased exponent

    d <<= 23; // add space for 23 most significant bits of mantisa
    d |= m;   // store 23 bits of mantisa

    d <<= 52-23; // trail zeros in place of lower significant bit of mantisa

    return d;
}

edited Nov 18 '14 at 20:11

answered Nov 18 '14 at 13:11


I think the problem with your code is that it lacks the exponent bias conversion. – Anonymous Nov 18 '14 at 13:26
Yes, it does. I had trouble figuring out how to unbias the float and rebias it as a double. Your solution seems much simpler than what I thought it would be. – FnuGk Nov 18 '14 at 13:29
Better. There is a portability problem when `unsigned` is 16-bit, and where is `__int64` defined? Maybe not a problem on OP's platform. Has an infinite loop: I tried all float combinations and it never completes. – chux - Reinstate Monica Nov 18 '14 at 19:17
The infinite loop was caused by -0.0f; it should be better now, thanks again! – Anonymous Nov 18 '14 at 19:54
passed all 2^32 tests – Anonymous Nov 18 '14 at 20:12
@Anonymous Agree, passes all `float`. Minor: mantissa vs. mantisa – chux - Reinstate Monica Nov 18 '14 at 20:21

The `// reinterpret` line at the beginning is undefined behaviour according to the C standard. Use `memcpy(&int, &float, min(sizeof(float),sizeof(int)))`. – EOF Nov 19 '14 at 10:08
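
The `memcpy` suggestion in the comment above could look roughly like this. It is only a sketch: `float_bits` is a made-up name, and it assumes `float` is a 32-bit IEEE-754 binary32, which is what the question states for the target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the object representation of a float as a
   uint32_t without casting pointers between unrelated types. */
uint32_t float_bits(float v)
{
    uint32_t u;

    /* Assumption: float occupies exactly 32 bits on this platform. */
    assert(sizeof v == sizeof u);

    memcpy(&u, &v, sizeof u);  /* copy the bytes instead of aliasing */
    return u;
}
```

Compilers generally turn a fixed-size `memcpy` like this into a plain register move, so it should not cost anything compared to the pointer cast.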

chux - Reinstate Monica · Answer 2 · 2014-11-18T19:21:31.733

Posted after the accepted answer: a version that works with all `float` values.

Tested successfully with all `float` values, including typical normal finite numbers, subnormals, +/- zero, +/- infinity and NaN.

#include <assert.h>
#include <math.h>
#include <stdint.h>

#define F_SIGN_SHIFT (31)
#define F_EXPO_MAX (0xFF)
#define F_EXPO_SHIFT (23)
#define F_EXPO_MASK ((uint32_t) F_EXPO_MAX << F_EXPO_SHIFT)
#define F_EXPO_BIAS (127)
#define F_SFCT_MASK (0x7FFFFF)
#define F_SFCT_IMPLIEDBIT (F_SFCT_MASK + 1)

#define D_SIGN_SHIFT (63)
#define D_EXPO_MAX (0x7FF)
#define D_EXPO_SHIFT (52)
#define D_EXPO_MASK ((uint64_t) D_EXPO_MAX << D_EXPO_SHIFT)
#define D_EXPO_BIAS (1023)

uint64_t IEEEbinary32float_to_IEEEbinary64int(float f) {
  assert(sizeof f == sizeof(uint32_t));
  union {
    float f;
    uint32_t u;
  } x = { f };
  uint64_t y;

  y = (uint64_t) (x.u >> F_SIGN_SHIFT) << D_SIGN_SHIFT;
  unsigned expo = (x.u & F_EXPO_MASK) >> F_EXPO_SHIFT;
  uint32_t significant = x.u & F_SFCT_MASK;
  if (expo > 0) {
    if (expo == F_EXPO_MAX) {    // Infinity NaN
      expo = D_EXPO_MAX;
    } else {                     // typical normal finite numbers
      expo += D_EXPO_BIAS - F_EXPO_BIAS;
    }
  } else {
    if (significant) {           // Subnormal
      expo += D_EXPO_BIAS - F_EXPO_BIAS + 1;
      while ((significant & F_SFCT_IMPLIEDBIT) == 0) {
        significant <<= 1;
        expo--;
      }
      significant &= F_SFCT_MASK;
    } else {                    // Zero
      expo = 0;
    }
  }
  y |= (uint64_t) expo << D_EXPO_SHIFT;
  y |= (uint64_t) significant << (D_EXPO_SHIFT - F_EXPO_SHIFT);
  return y;
}
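
Since the goal in the question is BSON, the returned `uint64_t` still has to be written out as bytes. BSON stores a double as 8 bytes in little-endian order, so one way to do it without depending on the host's byte order is to shift the bytes out one at a time. This is a sketch only: `put_u64_le` is a made-up name, and the caller is assumed to supply a buffer with at least 8 bytes of room.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 64-bit pattern into out[0..7], least
   significant byte first, independent of the host's endianness. */
void put_u64_le(uint8_t *out, uint64_t bits)
{
    assert(out != NULL);

    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(bits >> (8 * i));  /* byte i holds bits 8i..8i+7 */
    }
}
```

For the question's test value, `put_u64_le(buf, 0x3fc4000000000000)` leaves `buf[6] == 0xC4` and `buf[7] == 0x3F`, with all other bytes zero.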
