8bit floating point to decimal fraction

Question

I have a floating-point number stored in 1 byte (an 8-bit floating-point format). Is there a library function in Boost or C++11 (or C++14) that converts such a number into a decimal fraction?

I know how to convert the 8 bits (sign bit, exponent, mantissa) into a decimal fraction by hand. I would just like to use a library function instead of writing a new one.

A reference to an existing function would also be helpful.

If you make up your own floating-point format (which will be needed for anything but the standard `float`, `double` and `long double`) then you need to write all the functions needed for their use yourself. — Some programmer dude, Apr 26 '20 at 06:23
C++ has `float`, a single-precision floating-point type that is typically 32 bits (4 bytes), and `double`, a double-precision floating-point type that is typically 64 bits (8 *bytes*). Then there's the implementation-defined `long double`. [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) also defines a half-precision type with 16 bits (2 bytes). There's no standard C++ floating-point type using 8 bits. And an 8-bit floating-point type would make no sense unless you only need a very small range of values and very little precision. — Some programmer dude, Apr 26 '20 at 06:53
Perhaps it's time you [edit] your question to show us a [mcve] of what you have? I also recommend that you take some time to refresh [the help pages](http://stackoverflow.com/help), retake the SO [tour], and reread [ask] as well as [this question checklist](https://codeblog.jonskeet.uk/2012/11/24/stack-overflow-question-checklist/). — Some programmer dude, Apr 26 '20 at 06:59
What standard? There's no format specified in IEEE 754 smaller than 16 bits. Do you mean 8 *bytes*? — user207421, Apr 26 '20 at 07:32
@sakthivp there are some 8-bit floating-point formats for education purposes, but none are standardized because they are too useless in practice — phuclv, Apr 26 '20 at 08:47
@Someprogrammerdude I will rephrase and post again with clear cut information. Thanks for quick reply — sakthivp, Apr 26 '20 at 10:17

score 1 · Answer 1 · answered Apr 26 '20 at 15:48

The standard approach is not very efficient, but it exists:

friend std::ostream& operator<<(std::ostream& os, Num n) {
    return os << n.mantissa * pow(2.0f, n.exp) * (n.sign? -1:1);
}

Of course, this is cheating by using the built-in floating point serialization code. But that seems to be precisely what you are asking for.
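To make the one-liner above concrete, here is a minimal sketch of decoding a raw byte and printing it through the built-in `double` formatting. The bit layout (1 sign bit, 4 exponent bits biased by 8, 3 mantissa bits with no implicit leading 1) is an assumption made for illustration, since the question does not specify one, and the names `decode_minifloat` and `to_decimal_string` are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <sstream>
#include <string>

// Assumed layout (NOT a standard format): s eeee mmm
//   bit 7     : sign
//   bits 6..3 : exponent, biased by 8
//   bits 2..0 : mantissa, no implicit leading 1
// value = (-1)^sign * mantissa * 2^(exponent - 8)
double decode_minifloat(std::uint8_t byte) {
    const bool sign    = (byte & 0x80u) != 0;
    const int raw_exp  = (byte >> 3) & 0x0F;
    const int mantissa = byte & 0x07;
    // std::ldexp(x, e) computes x * 2^e; exact for values this small.
    const double magnitude =
        std::ldexp(static_cast<double>(mantissa), raw_exp - 8);
    return sign ? -magnitude : magnitude;
}

// Let the standard stream formatting produce the decimal fraction.
std::string to_decimal_string(std::uint8_t byte) {
    std::ostringstream os;
    os << decode_minifloat(byte);
    return os.str();
}
```

Under this assumed layout, the byte `0x37` (exponent field 6, mantissa 7) decodes to 7 * 2^-2, so `to_decimal_string(0x37)` yields `1.75`.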

For fun, I put together a very limited floating-point type. Note that the constructor is very flawed (it does not handle denormals or NaN, and it does not scale small mantissas well at all). But it demonstrates the conversions above well enough that I could check they work:


#include <iostream>
#include <limits>
#include <cmath>
#include <cstdint>   // std::uint8_t, uint16_t, uint32_t
#include <stdexcept> // std::range_error

template <typename Underlying = std::uint8_t, unsigned expbits = 4>
struct Num {
    constexpr Num() noexcept : sign{}, raw_exp{}, mantissa{} {} // NSMI is c++20 for bitfield

    template <typename F> Num(F d) {
        // This is a lame constructor, for demo only
        // DO NOT USE FOR PRODUCTION/SERIOUS CODE
        sign = std::signbit(d);

        int e=0;
        d = std::frexp(std::abs(d), &e);
        effective_exp(e - manbits);

        mantissa = std::ldexp(d, manbits);
    }

    explicit constexpr operator double() const { return mantissa * pow(2.0, effective_exp()) * (sign? -1:1); }
    explicit constexpr operator float() const { return mantissa * pow(2.0f, effective_exp()) * (sign? -1:1); }

  private:
    friend std::ostream& operator<<(std::ostream& os, Num n) {
        return os << static_cast<double>(n);
    }

    constexpr auto effective_exp() const { return raw_exp - (1<<(expbits - 1)); }
    void effective_exp(int e) {
        if (e>maxexp||e<minexp) throw std::range_error("overflow");
        raw_exp = e + (1<<(expbits - 1));
    }

    // storage and dimensioning
    static_assert(not std::numeric_limits<Underlying>::is_signed);
    static constexpr unsigned bits     = std::numeric_limits<Underlying>::digits;
    static constexpr unsigned signbits = 1;
    static constexpr unsigned manbits  = bits - expbits - signbits;
    static constexpr int maxexp        = (1<<(expbits-1)) - 1; // largest e whose biased value still fits in expbits
    static constexpr int minexp        = 1 - (1<<(expbits-1));

    Underlying sign: signbits, raw_exp: expbits, mantissa: manbits;
};


namespace { // just for demo, very inefficient because not essential
    template <typename U, unsigned s>
    static inline bool operator<(Num<U, s> const& lhs, double rhs) { return lhs.operator double() < rhs; }
    template <typename U, unsigned s>
    static inline bool operator<(Num<U, s> const& lhs, Num<U, s> const& rhs) { return lhs < rhs.operator double(); }
    template <typename U, unsigned s>
    static inline auto& operator+=(Num<U, s>& lhs, double rhs) {
        return lhs = lhs.operator float() + rhs;
    }
} // namespace

int main() {
    {
        static_assert(sizeof(Num<>) == sizeof(char));
        Num x = 1.8;
        std::cout << "Proof of pudding: " << x << "\n";
    }

    // just more paces
    std::cout << "----- 24 bits, 7 expbits: \n";
    for (Num<uint32_t, 7> n = -10.0; n < 10.0; n += 1.1)
        std::cout << n << "\n";
    std::cout << "----- 10 bits, 5 expbits: \n";
    for (Num<uint16_t, 5> n = -10.0; n < 10.0; n += 1.1)
        std::cout << n << "\n";
    // don't try with 8bit because the flawed ctor will underflow, oh well
}

Prints

Proof of pudding: 1.75
----- 24 bits, 7 expbits: 
-10
-8.9
-7.8
-6.7
-5.6
-4.5
-3.4
-2.3
-1.2
-0.0999978
1
2.1
3.2
4.3
5.4
6.5
7.6
8.7
9.8
----- 10 bits, 5 expbits: 
-10
-8.89062
-7.78906
-6.6875
-5.58594
-4.48438
-3.38281
-2.28125
-1.17969
-0.0795898
1.01953
2.11719
3.21484
4.3125
5.40625
6.5
7.59375
8.6875
9.78125
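The demo constructor truncates the mantissa when it is stored in the bit-field. As a hedged sketch of one way to improve on that, the following encoder rounds to nearest instead. It assumes the same convention as above (value = mantissa * 2^exp, no implicit leading bit) with 3 mantissa bits and an unbiased exponent range of -8..7. The names `MiniFields`, `encode_rounded` and `decode_fields` are hypothetical, and NaN and infinity are still not handled.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Assumed parameters for an 8-bit format: 1 sign, 4 exponent, 3 mantissa bits.
constexpr int kManBits = 3;
constexpr int kMinExp  = -8;
constexpr int kMaxExp  = 7;

struct MiniFields {
    bool sign;
    int exp;            // unbiased exponent, kMinExp..kMaxExp
    unsigned mantissa;  // 0..7, no implicit leading 1
};

MiniFields encode_rounded(double v) {
    assert(std::isfinite(v));  // NaN/infinity are out of scope for this sketch
    MiniFields f{std::signbit(v), kMinExp, 0u};
    if (v == 0.0) return f;

    int e = 0;
    const double frac = std::frexp(std::fabs(v), &e);  // frac in [0.5, 1)
    // Round to nearest (std::lround rounds halfway cases away from zero).
    long m = std::lround(std::ldexp(frac, kManBits));  // m in [4, 8]
    e -= kManBits;
    if (m == (1L << kManBits)) {  // rounding carried out of the mantissa
        m >>= 1;
        ++e;
    }
    if (e > kMaxExp) throw std::range_error("exponent overflow");
    if (e < kMinExp) {
        // Too small for a full mantissa: pin the exponent at its minimum
        // and let the mantissa shrink (possibly down to zero).
        m = std::lround(std::ldexp(std::fabs(v), -kMinExp));
        e = kMinExp;
    }
    f.exp = e;
    f.mantissa = static_cast<unsigned>(m);
    return f;
}

double decode_fields(const MiniFields& f) {
    const double magnitude = std::ldexp(static_cast<double>(f.mantissa), f.exp);
    return f.sign ? -magnitude : magnitude;
}
```

With this sketch, `decode_fields(encode_rounded(1.8))` still gives 1.75, while 9.1 rounds up to 10 where truncation would give 8.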
