Bit shifting to pack a float into a 13-bit float in C?

Question

For example, let's say 18.xxx is passed to a function as a float value. It is truncated down to 18.0 and then encoded as 0 10011 0010000 (sign, 5-bit exponent, 7-bit fraction), which fits the desired 13-bit float, and it is returned as an int with decimal value 2448. Does anyone know how this can be accomplished using shifts?

Well, I suppose you might reinterpret the bit representation and extract the top 13 bits, assuming a 32-bit IEEE 754 representation, perhaps with a bit of rounding added in. Doing so spends a full 8 bits on the exponent range, leaving only an effective 5 bits of precision. With such limited space I would suggest considering a specialized representation tailored to your data set, perhaps plain fixed-point data. — doynax, Sep 29 '16 at 06:00
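A minimal sketch of the reinterpretation described in this comment, assuming float is a 32-bit IEEE 754 single-precision value. The helper names float32_top13 and float32_top13_rounded are made up for illustration. Note that this keeps a 1/8/4 split (sign, exponent, fraction), not the 1/5/7 split in the question, so 18.0 comes out as 2098 rather than 2448.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep the top 13 bits of a binary32 value,
   i.e. 1 sign bit, 8 exponent bits and the top 4 fraction bits. */
uint16_t float32_top13(float f) {
    assert(sizeof(float) == sizeof(uint32_t)); /* assumption: 32-bit float */

    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* copy the bits without aliasing issues */
    return (uint16_t)(bits >> 19);   /* drop the low 19 fraction bits */
}

/* Hypothetical helper: same as above, but rounds the magnitude to nearest
   (ties away from zero) instead of truncating. Assumes a finite input;
   some NaN bit patterns would wrap around in the addition. */
uint16_t float32_top13_rounded(float f) {
    assert(sizeof(float) == sizeof(uint32_t));

    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += UINT32_C(1) << 18;       /* half of the dropped range; a carry
                                        may propagate into the exponent */
    return (uint16_t)(bits >> 19);
}
```

With these definitions, float32_top13(18.75f) gives 2098 and float32_top13_rounded(18.75f) gives 2099, which shows how coarse four explicit fraction bits are.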

cdlane · Accepted Answer · 2016-09-29T07:55:41.810

This might do what you want if your floating point number is represented in 32-bit IEEE 754 single-precision binary format, where the exponent is stored as a biased, unsigned field:

#include <stdio.h>
#include <string.h>
#include <assert.h>

unsigned short float32to13(float f) {

    assert(sizeof(float) == sizeof(unsigned int));

    unsigned int g;
    memcpy(&g, &f, sizeof(float)); // allow us to examine a float bit by bit

    unsigned int sign32 = (g >> 0x1f) & 0x1; // one bit sign
    unsigned int exponent32 = ((g >> 0x17) & 0xff) - 0x7f; // unbias 8 bits of exponent (wraps for negative exponents; the rebias below undoes that)
    unsigned int fraction32 = g & 0x7fffff; // 23 bits of significand 

    assert(((exponent32 + 0xf) & ~ 0x1f) == 0); // don't overflow smaller exponent

    unsigned short sign13 = sign32;
    unsigned short exponent13 = exponent32 + 0xf; // rebias exponent by smaller amount
    unsigned short fraction13 = fraction32 >> 0x10; // drop lower 16 bits of significand precision

    return sign13 << 0xc | exponent13 << 0x7 | fraction13; // assemble a float13
}

int main() {

    float f = 18.0;

    printf("%u\n", (unsigned int)float32to13(f)); // cast so the argument matches %u

    return 0;
}

OUTPUT

> ./a.out
2448
>
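The run above starts from exactly 18.0, whereas the question describes an input such as 18.xxx that is truncated first and a result returned as an int. A rough sketch of that variant follows, assuming the same 1/5/7 layout with an exponent bias of 15. The name truncate_and_pack13 and the choice to return -1 for values that do not fit (including zero, for which no encoding has been specified here) are assumptions, not part of the original answer.

```c
#include <string.h>

/* Hypothetical variant: truncate toward zero, then pack into
   1 sign bit, 5 exponent bits (bias 15) and 7 fraction bits.
   Returns -1 when the truncated value cannot be represented. */
int truncate_and_pack13(float f) {
    if (sizeof(float) != sizeof(unsigned int))
        return -1;                           /* assumption: 32-bit float */
    if (!(f > -131072.0f && f < 131072.0f))  /* 2^17 limit; also rejects NaN */
        return -1;

    float whole = (float)(long)f;            /* 18.xxx becomes 18.0 */

    unsigned int g;
    memcpy(&g, &whole, sizeof g);            /* examine the float's bits */

    unsigned int sign = (g >> 31) & 0x1;
    int exponent = (int)((g >> 23) & 0xff) - 127 + 15; /* rebias 127 -> 15 */
    unsigned int fraction = (g >> 16) & 0x7f;          /* top 7 fraction bits */

    if (exponent < 0 || exponent > 31)       /* does not fit in 5 bits */
        return -1;

    return (int)(sign << 12 | (unsigned int)exponent << 7 | fraction);
}
```

Under these assumptions truncate_and_pack13(18.75f) returns 2448, the value from the question, and truncate_and_pack13(0.5f) returns -1 because the truncated value is zero.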

I leave any endian issues and additional error checking to the end user. This example is provided only to demonstrate to the OP the types of shifts necessary to convert between floating point formats. Any resemblance to actual floating point formats is purely coincidental.
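One way to sanity-check the shifts is to run them in reverse. The following is only a sketch of a possible inverse, assuming the same 1/5/7 layout and bias of 15 as float32to13; the name float13to32 is made up, and every bit pattern is treated as a normal number (no zero, subnormal, infinity or NaN handling).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical inverse of float32to13: unpack 1 sign bit,
   5 exponent bits (bias 15) and 7 fraction bits into a 32-bit float. */
float float13to32(unsigned short h) {
    assert(sizeof(float) == sizeof(unsigned int)); /* assumption: 32-bit float */
    assert((h & ~0x1fffu) == 0);                   /* only 13 bits may be set */

    unsigned int sign = (h >> 0xc) & 0x1;      /* one bit sign */
    unsigned int exponent = (h >> 0x7) & 0x1f; /* 5 bits of biased exponent */
    unsigned int fraction = h & 0x7f;          /* 7 bits of significand */

    unsigned int g = sign << 0x1f
                   | (exponent - 0xf + 0x7f) << 0x17 /* rebias 15 -> 127 */
                   | fraction << 0x10;               /* restore the dropped 16 bits as zeros */

    float f;
    memcpy(&f, &g, sizeof f); /* reinterpret the assembled bits as a float */
    return f;
}
```

For the example value, float13to32(2448) yields 18.0, so the packed result round-trips when the input has no more than 7 fraction bits.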

Undefined behavior. You break the strict aliasing rule. The rest of the code works only on specific platforms. — 2501, Sep 29 '16 at 07:25
@2501, I've revised the code to address the strict aliasing rule issue per [this post on float bits and strict aliasing](http://stackoverflow.com/questions/4328342/float-bits-and-strict-aliasing) — cdlane, Sep 29 '16 at 07:36
Great! The code still makes assumptions, specifically about implementation of real types. Please address them. — 2501, Sep 29 '16 at 07:39
I can see what you've done here, @cdlane. It looks great; I will have to tweak it and implement it slightly differently to fit my specifications. Thanks a lot! — Mau, Sep 29 '16 at 20:13
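On the assumptions about the implementation of real types raised in the comments above, one partial option is to check the float characteristics that <float.h> reports before relying on the bit layout. This is a sketch only: the function name float_looks_like_binary32 is hypothetical, and passing these checks does not prove that the sign, exponent and fraction fields sit where the code expects or that float and integer byte order agree.

```c
#include <float.h>
#include <limits.h>

/* Hypothetical guard: returns nonzero when float has the parameters of
   IEEE 754 binary32 (radix 2, 24-bit significand including the implicit
   bit, matching exponent range, 32 bits of storage). */
int float_looks_like_binary32(void) {
    return FLT_RADIX == 2
        && FLT_MANT_DIG == 24
        && FLT_MAX_EXP == 128
        && FLT_MIN_EXP == -125
        && sizeof(float) * CHAR_BIT == 32;
}
```

A caller could test this once at startup and refuse to use the shift-based conversion when it returns zero.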
