Parsing fixed width numbers with boost spirit

Question

I'm using Spirit to parse a Fortran-like text file filled with fixed-width numbers:

1234 0.000000000000D+001234
1234 7.654321000000D+001234
1234                   1234
1234-7.654321000000D+001234

There are parsers for signed and unsigned integers, but I cannot find a parser for fixed-width real numbers. Can someone help with this?

Here's what I have (Live On Coliru):

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
#include <cstdint>  // uint16_t
#include <iomanip>  // std::quoted
#include <iostream> // std::cout
#include <string>
namespace qi = boost::spirit::qi;

struct RECORD {
    uint16_t a{};
    double   b{};
    uint16_t c{};
};

BOOST_FUSION_ADAPT_STRUCT(RECORD, a,b,c)

int main() {
    using It = std::string::const_iterator;
    using namespace qi::labels;

    qi::uint_parser<uint16_t, 10, 4, 4> i4;

    qi::rule<It, double()> X19 = qi::double_ //
        | qi::repeat(19)[' '] >> qi::attr(0.0);

    for (std::string const str : {
             "1234 0.000000000000D+001234",
             "1234 7.654321000000D+001234",
             "1234                   1234",
             "1234-7.654321000000D+001234",
         }) {

        It f = str.cbegin(), l = str.cend();

        RECORD rec;
        if (qi::parse(f, l, (i4 >> X19 >> i4), rec)) {
            std::cout << "{a:" << rec.a << ", b:" << rec.b << ", c:" << rec.c
                      << "}\n";
        } else {
            std::cout << "Parse fail (" << std::quoted(str) << ")\n";
        }
    }
}

Which obviously doesn't parse most records:

Parse fail ("1234 0.000000000000D+001234")
Parse fail ("1234 7.654321000000D+001234")
{a:1234, b:0, c:1234}
Parse fail ("1234-7.654321000000D+001234")

sehe · Accepted Answer · 2021-10-05T16:16:11.430

The mechanism exists, but it's hidden more deeply because there are many more details to parsing floating point numbers than integers.

qi::double_ (and float_) are actually instances of qi::real_parser<double, qi::real_policies<double> >.

The policies are the key. They govern all the details of what format is accepted.

Here are the RealPolicies Expression Requirements

Expression	Semantics
`RP::allow_leading_dot`	Allow leading dot.
`RP::allow_trailing_dot`	Allow trailing dot.
`RP::expect_dot`	Require a dot.
`RP::parse_sign(f, l)`	Parse the prefix sign (e.g. '-'). Return `true` if successful, otherwise `false`.
`RP::parse_n(f, l, n)`	Parse the integer at the left of the decimal point. Return `true` if successful, otherwise `false`. If successful, place the result into n.
`RP::parse_dot(f, l)`	Parse the decimal point. Return `true` if successful, otherwise `false`.
`RP::parse_frac_n(f, l, n, d)`	Parse the fraction after the decimal point. Return `true` if successful, otherwise `false`. If successful, place the result into n and the number of digits into d
`RP::parse_exp(f, l)`	Parse the exponent prefix (e.g. 'e'). Return `true` if successful, otherwise `false`.
`RP::parse_exp_n(f, l, n)`	Parse the actual exponent. Return `true` if successful, otherwise `false`. If successful, place the result into n.
`RP::parse_nan(f, l, n)`	Parse a NaN. Return `true` if successful, otherwise `false`. If successful, place the result into n.
`RP::parse_inf(f, l, n)`	Parse an Inf. Return `true` if successful, otherwise `false`. If successful, place the result into n.

Let's implement your policies:

namespace policies {
    /* mandatory sign (or space) fixed widths, 'D+' or 'D-' exponent leader */
    template <typename T, int IDigits, int FDigits, int EDigits = 2>
    struct fixed_widths_D : qi::strict_ureal_policies<T> {
        template <typename It> static bool parse_sign(It& f, It const& l);

        template <typename It, typename Attr>
        static bool parse_n(It& f, It const& l, Attr& a);

        template <typename It> static bool parse_exp(It& f, It const& l);

        template <typename It>
        static bool parse_exp_n(It& f, It const& l, int& a);

        template <typename It, typename Attr>
        static bool parse_frac_n(It& f, It const& l, Attr& a, int& n);
    };
} // namespace policies

Note:

I keep the attribute type generic.
I also base the implementation on strict_ureal_policies to reduce the effort. That base class doesn't support signs and requires a decimal separator ('.'), which is what makes it "strict": it rejects purely integral numbers.
The format in your question expects 1 digit for the integral part, 12 digits for the fraction and 2 for the exponent, but I don't hard-code these, so the policies can be reused for other fixed-width formats (IDigits, FDigits, EDigits).

Let's go through our overrides one-by-one:

`bool parse_sign(f, l)`

The format is fixed-width, so we want to accept

a leading space or '+' for positive
a leading '-' for negative

That way the sign always takes one input character:

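template <typename It> static bool parse_sign(It& f, It const& l)
{
    // The return value tells the real parser whether the number is negative.
    if (f != l) {
        switch (*f) {
        case '+':
        case ' ': ++f; break;       // positive: consume the sign column
        case '-': ++f; return true; // negative: consume the sign column
        }
    }
    return false;
}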

`bool parse_n(f, l, Attr& a)`

The simplest part: we allow only a single-digit (IDigits) unsigned integer part before the separator. Luckily, integer parsing is relatively common and trivial:

template <typename It, typename Attr>
static bool parse_n(It& f, It const& l, Attr& a)
{
    return qi::extract_uint<Attr, 10, IDigits, IDigits, false, true>::call(f, l, a);
}

`bool parse_exp(f, l)`

Also trivial: we require a 'D' always:

template <typename It> static bool parse_exp(It& f, It const& l)
{
    if (f == l || *f != 'D')
        return false;
    ++f;
    return true;
}

`bool parse_exp_n(f, l, int& a)`

As for the exponent, we want it to be fixed-width meaning that the sign is mandatory. So, before extracting the signed integer of width 2 (EDigits), we make sure a sign is present:

template <typename It>
static bool parse_exp_n(It& f, It const& l, int& a)
{
    if (f == l || !(*f == '+' || *f == '-'))
        return false;
    return qi::extract_int<int, 10, EDigits, EDigits>::call(f, l, a);
}

`bool parse_frac_n(f, l, Attr& a, int& n)`

The meat of the problem, and also the reason to build on the existing parsers. The fractional digits could be treated as an integer, but there are two issues: leading zeroes are significant, and the total number of digits might exceed the capacity of any integral type we choose.

So we use a "trick": we parse an unsigned integer but ignore any excess precision that doesn't fit; what we really care about is the number of digits. We then check that this number is as expected: FDigits.

Then, we hand off to the base class implementation to actually compute the resulting value correctly, for any generic number type T (that satisfies the minimum requirements).

template <typename It, typename Attr>
static bool parse_frac_n(It& f, It const& l, Attr& a, int& n)
{
    It savef = f;

    if (qi::extract_uint<Attr, 10, FDigits, FDigits, true, true>::call(f, l, a)) {
        n = static_cast<int>(std::distance(savef, f));
        return n == FDigits;
    }
    return false;
}

Summary

As you can see, by standing on the shoulders of existing, tested code we're already done and ready to parse our numbers:

template <typename T>
using X19_type = qi::real_parser<T, policies::fixed_widths_D<T, 1, 12, 2>>;

Now your code runs as expected: Live On Coliru

template <typename T>
using X19_type = qi::real_parser<T, policies::fixed_widths_D<T, 1, 12, 2>>;

int main() {
    using It = std::string::const_iterator;
    using namespace qi::labels;

    qi::uint_parser<uint16_t, 10, 4, 4> i4;
    X19_type<double>                    x19;

    qi::rule<It, double()> X19 = x19 //
        | qi::repeat(19)[' '] >> qi::attr(0.0);

    for (std::string const str : {
             "1234                   1234",
             "1234 0.000000000000D+001234",
             "1234 7.065432100000D+001234",
             "1234-7.006543210000D+001234",
             "1234 0.065432100000D+031234",
             "1234 0.065432100000D-301234",
         }) {

        It f = str.cbegin(), l = str.cend();

        RECORD rec;
        if (qi::parse(f, l, (i4 >> X19 >> i4), rec)) {
            std::cout << "{a:" << rec.a << ", b:" << std::setprecision(12)
                      << rec.b << ", c:" << rec.c << "}\n";
        } else {
            std::cout << "Parse fail (" << std::quoted(str) << ")\n";
        }
    }
}

Prints

{a:1234, b:0, c:1234}
{a:1234, b:0, c:1234}
{a:1234, b:7.0654321, c:1234}
{a:1234, b:-7.00654321, c:1234}
{a:1234, b:65.4321, c:1234}
{a:1234, b:6.54321e-32, c:1234}

Decimals

Now, it's possible to instantiate this parser with widths that exceed the precision of double. And there are always issues with the conversion from decimal numbers to an inexact binary representation. To showcase how the choice of a generic T already caters for this, let's instantiate with a decimal type that allows 64 significant decimal fractional digits:

Live On Coliru

using Decimal = boost::multiprecision::cpp_dec_float_100;

struct RECORD {
    uint16_t a{};
    Decimal  b{};
    uint16_t c{};
};

template <typename T>
using X71_type = qi::real_parser<T, policies::fixed_widths_D<T, 1, 64, 2>>;

int main() {
    using It = std::string::const_iterator;
    using namespace qi::labels;

    qi::uint_parser<uint16_t, 10, 4, 4> i4;
    X71_type<Decimal>                   x71;

    qi::rule<It, Decimal()> X71 = x71 //
        | qi::repeat(71)[' '] >> qi::attr(0.0);

    for (std::string const str : {
             "1234                                                                       6789",
             "2345 0.0000000000000000000000000000000000000000000000000000000000000000D+006789",
             "3456 7.0000000000000000000000000000000000000000000000000000000000654321D+006789",
             "4567-7.0000000000000000000000000000000000000000000000000000000000654321D+006789",
             "5678 0.0000000000000000000000000000000000000000000000000000000000654321D+036789",
             "6789 0.0000000000000000000000000000000000000000000000000000000000654321D-306789",
         }) {

        It f = str.cbegin(), l = str.cend();

        RECORD rec;
        if (qi::parse(f, l, (i4 >> X71 >> i4), rec)) {
            std::cout << "{a:" << rec.a << ", b:" << std::setprecision(65)
                      << rec.b << ", c:" << rec.c << "}\n";
        } else {
            std::cout << "Parse fail (" << std::quoted(str) << ")\n";
        }
    }
}

Prints

{a:2345, b:0, c:6789}
{a:3456, b:7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:4567, b:-7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:5678, b:6.54321e-56, c:6789}
{a:6789, b:6.54321e-89, c:6789}

Compare how using a binary long double representation would have lost accuracy here:

{a:2345, b:0, c:6789}
{a:3456, b:7, c:6789}
{a:4567, b:-7, c:6789}
{a:5678, b:6.5432100000000000002913506043764438647482181234694313277925965188e-56, c:6789}
{a:6789, b:6.5432100000000000000601529073044049029207066886931600941449474131e-89, c:6789}
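
The loss is inherent to the binary types, not to the parser. A minimal standard-library sketch (the helper name is_absorbed is hypothetical) shows that a contribution as small as the trailing digits above simply vanishes next to 7:

```cpp
#include <limits>

// Even the widest built-in binary type has far fewer than 64 decimal digits.
static_assert(std::numeric_limits<long double>::digits10 < 64,
              "long double cannot hold 64 significant decimal digits");

// True when adding `tiny` to `base` leaves `base` unchanged, meaning the
// contribution is below the resolution of long double near `base`.
inline bool is_absorbed(long double base, long double tiny) {
    long double const sum = base + tiny;
    return sum == base;
}
```

That is why the binary run prints a plain 7 where the decimal type keeps every digit.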

Bonus Take: Optionals

In the current RECORD, missing doubles are silently taken to be 0.0. That may not be the best choice:

struct RECORD {
    uint16_t          a{};
    optional<Decimal> b{};
    uint16_t          c{};
};

// ...

qi::rule<It, optional<Decimal>()> X71 = x71 //
    | qi::repeat(71)[' '];

Now the output is Live On Coliru:

{a:1234, b:--, c:6789}
{a:2345, b: 0, c:6789}
{a:3456, b: 7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:4567, b: -7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:5678, b: 6.54321e-56, c:6789}
{a:6789, b: 6.54321e-89, c:6789}

Summary / Add Unit Tests!

That's a lot, but possibly not all you need.

Keep in mind that you still need proper unit tests for e.g. X19_type. Think of all edge cases you may encounter/want to accept/want to reject:

I have not changed any of the base policies dealing with Inf or NaN so you might want to close those gaps
You might actually have wanted to accept " 3.141 ", " .999999999999D+0 " etc.?

All these are pretty simple changes to the policies, but, as you know, code without tests is broken.
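
For such tests it helps to have an independent reference value. One option, sketched here with the standard library only, is to rewrite 'D' to 'E' and let std::strtod do the conversion. The name oracle_value, the hard-coded width of 19 and the "blank means 0" rule are assumptions that mirror the example above; this is meant as a test oracle, not as a replacement for the Spirit parser.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// Hypothetical test oracle for a 19-column field such as
// " 7.654321000000D+00". Returns NaN when the field has the wrong width,
// lacks the 'D' marker, or is not fully consumed by std::strtod.
inline double oracle_value(std::string field) {
    if (field.size() != 19)
        return std::nan("");
    if (field.find_first_not_of(' ') == std::string::npos)
        return 0.0; // blank field, mirrors qi::attr(0.0)
    auto const d = field.find('D');
    if (d == std::string::npos)
        return std::nan("");
    field[d] = 'E'; // std::strtod only understands 'E'/'e' exponents
    char*        end = nullptr;
    double const value = std::strtod(field.c_str(), &end);
    if (end != field.c_str() + field.size())
        return std::nan("");
    return value;
}
```

The oracle is more lenient than the policies (it does not check the individual widths), so it is only useful for comparing values of inputs the parser accepts. I would compare with a small tolerance rather than assume both conversions round identically.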

In addition to the above comment: when parsing fixed-width values we usually check not only the width of the fractional part but also the whole width of the field, so I added this definition to do it. — Anton, Oct 05 '21 at 16:11
I think the extra check should be redundant if all the parts are required to be fixed-width and always present. — sehe, Oct 05 '21 at 16:12
Perhaps I forgot to check the "missing exponent" case? The simplest you can do is `qi::raw [ x18 [ _val = _1] ] [ _pass = (19 == boost::phoenix::stl::size(_1)) ]` — sehe, Oct 05 '21 at 16:15
the code snapshot is too large for the comment. the number shouldn't be normalized, in general, so the following representations are legal for D14.5: 1.12345D+02 112.34500D+00 112.34500 — Anton, Oct 05 '21 at 16:24
What is D14.5? Note how you didn't specify any of the requirements except "length 19" and "+0.000000000000D+00" — sehe, Oct 06 '21 at 07:27
So in short, you can try again, obviously, but please be sure to be specific/complete in the requirements. Or, you know, figure it out with the information you have :) — sehe, Oct 06 '21 at 07:27
@Anton I took the opportunity to demonstrate the `qi::raw [ x18 [ _val = _1] ] [ _pass = (19 == boost::phoenix::stl::size(_1)) ]` approach in my latest answer to you https://stackoverflow.com/a/69482794/85371 — sehe, Oct 07 '21 at 14:27
Thank you, I didn't think about it in this way. I've rewritten the parse method of any_real_parser and it has unwanted consequences — Anton, Oct 07 '21 at 19:06
Please do not roll your own floating point parser. I mean, "unwanted consequences" that you see will be the least of your worries [think arithmetic error -> data loss]. If you have time, feel free to post [as I asked](https://stackoverflow.com/questions/69442406/parsing-fixed-width-numbers-with-boost-spirit/69451028?noredirect=1#comment122774065_69451028) — sehe, Oct 08 '21 at 00:05
