I would like to use boost::spirit to extract the stoichiometry of compounds made of several elements from a molecular formula. Within a given compound, my parser should be able to distinguish three kinds of chemical element patterns:

  • natural element made of a mixture of isotopes in natural abundance
  • pure isotope
  • mixture of isotopes in non-natural abundance

These patterns are then used to parse compounds such as the following:

  • "C" --> natural carbon made of C[12] and C[13] in natural abundance
  • "CH4" --> methane made of natural carbon and hydrogen
  • "C2H{H[1](0.8)H[2](0.2)}6" --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
  • "U[235]" --> pure uranium 235

Obviously, the chemical element patterns can appear in any order (e.g. CH[1]4 and H[1]4C ...) and any number of times.

I wrote a parser that comes quite close to doing the job, but I still face one problem.

Here is my code:

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>>
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {

        namespace phx = boost::phoenix;

        // Semantic action for handling the case of pure isotope    
        phx::function<PureIsotopeBuilder> const build_pure_isotope = PureIsotopeBuilder();
        // Semantic action for handling the case of isotope mixture
        phx::function<IsotopesMixtureBuilder> const build_isotopes_mixture = IsotopesMixtureBuilder();
        // Semantic action for handling the case of natural element   
        phx::function<NaturalElementBuilder> const build_natural_element = NaturalElementBuilder();

        phx::function<UpdateElement> const update_element = UpdateElement();

        // XML database that stores all the isotopes of the periodic table
        ChemicalDatabaseManager<Isotope>* imgr=ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();
        // Loop over the database to fill the spirit symbols for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.getProperty<std::string>("symbol"),isotope.second.getProperty<std::string>("symbol"));
        }

        _mixtureToken = "{" >> +(_isotopeNames >> "(" >> qi::double_ >> ")") >> "}";
        _isotopesMixtureToken = (_elementSymbols[qi::_a=qi::_1] >> _mixtureToken[qi::_b=qi::_1])[qi::_pass=build_isotopes_mixture(qi::_val,qi::_a,qi::_b)];

        _pureIsotopeToken = (_isotopeNames[qi::_a=qi::_1])[qi::_pass=build_pure_isotope(qi::_val,qi::_a)];
        _naturalElementToken = (_elementSymbols[qi::_a=qi::_1])[qi::_pass=build_natural_element(qi::_val,qi::_a)];

        _start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[qi::_a=qi::_1] >>
                      (qi::double_|qi::attr(1.0))[qi::_b=qi::_1])[qi::_pass=update_element(qi::_val,qi::_a,qi::_b)] );

    }

    //! Defines the rule for matching a prefix
    qi::symbols<char,std::string> _isotopeNames;
    qi::symbols<char,std::string> _elementSymbols;

    qi::rule<Iterator,isotopesMixture()> _mixtureToken;
    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string,isotopesMixture>> _isotopesMixtureToken;

    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _pureIsotopeToken;
    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _naturalElementToken;

    qi::rule<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>> _start;
};

Basically, each separate element pattern is parsed properly, with its respective semantic action producing as output a map between the isotopes that build the compound and their corresponding stoichiometry. The problem starts when parsing the following compound:

CH{H[1](0.9)H[2](0.4)}

In this case the semantic action build_isotopes_mixture returns false, because 0.9+0.4 makes no sense as a sum of ratios. Hence I expected, and wanted, my parser to fail for this compound. However, because the _start rule uses the alternative operator for the three kinds of chemical element pattern, the parser manages to parse it by 1) throwing away the {H[1](0.9)H[2](0.4)} part, 2) keeping the preceding H and 3) parsing it using the _naturalElementToken. Is my grammar not clear enough to be expressed as a parser? How to use the alternative operator in such a way that, when an occurrence has been found but gave a false when running the semantic action, the parser stops ?

Eurydice

1 Answer

How to use the alternative operator in such a way that, when an occurrence has been found but gave a false when running the semantic action, the parser stops ?

In general, you achieve this by adding an expectation point to prevent backtracking.

In this case you are actually "conflating" several tasks:

  1. matching input
  2. interpreting matched input
  3. validating matched input

Spirit excels at matching input and has great facilities for interpreting it (mostly in the sense of AST creation). However, things get "nasty" when validating on the fly.

A piece of advice I often repeat is to separate these concerns whenever possible. I'd consider

  1. building a direct AST representation of the input first,
  2. transforming/normalizing/expanding/canonicalizing it to a more convenient or meaningful domain representation,
  3. doing final validations on the result.

This gives you the most expressive code while keeping it highly maintainable.
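As a rough sketch of what step 3 could look like once it is detached from the grammar, assume the parser only fills a plain AST. The `Term` and `Formula` types and the `validate` function below are hypothetical, invented for this illustration; they are not part of the question's code base.

```cpp
#include <cmath>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST: one term per element pattern, kept exactly as written in the input.
struct Term {
    std::string symbol;                                  // e.g. "H"
    std::vector<std::pair<std::string, double>> mixture; // e.g. {"H[1]", 0.8}, {"H[2]", 0.2}; empty if none
    double count;                                        // stoichiometric coefficient
};
using Formula = std::vector<Term>;

// Validation as a separate pass over the AST.
// Returns an empty string when the formula is consistent, otherwise a diagnostic.
std::string validate(Formula const& formula, double tolerance = 1e-5) {
    for (auto const& term : formula) {
        if (term.mixture.empty())
            continue; // natural element or pure isotope: no ratios to check
        double total = 0.0;
        for (auto const& component : term.mixture)
            total += component.second;
        if (std::abs(total - 1.0) > tolerance)
            return "isotope ratios of '" + term.symbol + "' do not add up to 1";
    }
    return {};
}
```

The grammar then never has to veto a match from inside a semantic action: it accepts anything that is syntactically a mixture, and the caller decides what to do with a non-empty diagnostic.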

Because I don't understand the problem domain well enough and the code sample is not nearly complete enough to infer it, I will not try to give a full sample of what I have in mind. Instead I'll try my best at sketching the expectation point approach I mentioned at the outset.

Mock Up Sample To Compile

This took the most time. (Consider doing the leg work for the people who are going to help you)

#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>

namespace qi = boost::spirit::qi;

struct DummyBuilder {
    using result_type = bool;

    template <typename... Ts>
    bool operator()(Ts&&...) const { return true; }
};

struct PureIsotopeBuilder     : DummyBuilder {  };
struct IsotopesMixtureBuilder : DummyBuilder {  };
struct NaturalElementBuilder  : DummyBuilder {  };
struct UpdateElement          : DummyBuilder {  };

struct Isotope {
    std::string getName() const { return _name; }

    Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }

    template <typename T> std::string getProperty(std::string const& name) const {
        if (name == "symbol")
            return _symbol;
        throw std::domain_error("no such property (" + name + ")");
    }

  private:
    std::string _name, _symbol;
};

using MixComponent    = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;

template <typename Isotope>
struct ChemicalDatabaseManager {
    static ChemicalDatabaseManager* Instance() {
        static ChemicalDatabaseManager s_instance;
        return &s_instance;
    }

    auto& getDatabase() { return _db; }
  private:
    std::map<int, Isotope> _db {
        { 1, { "H[1]",   "H" } },
        { 2, { "H[2]",   "H" } },
        { 3, { "Carbon", "C" } },
        { 4, { "U[235]", "U" } },
    };
};

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> >
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {
        using namespace qi;
        namespace phx = boost::phoenix;

        phx::function<PureIsotopeBuilder>     build_pure_isotope;     // Semantic action for handling the case of pure isotope
        phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of isotope mixture
        phx::function<NaturalElementBuilder>  build_natural_element;  // Semantic action for handling the case of natural element
        phx::function<UpdateElement>          update_element;

        // XML database that stores all the isotopes of the periodic table
        ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();

        // Loop over the database to fill the spirit symbols for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"),isotope.second.template getProperty<std::string>("symbol"));
        }

        _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
        _isotopesMixtureToken = (_elementSymbols[_a=_1] >> _mixtureToken[_b=_1])[_pass=build_isotopes_mixture(_val,_a,_b)];

        _pureIsotopeToken     = (_isotopeNames[_a=_1])[_pass=build_pure_isotope(_val,_a)];
        _naturalElementToken  = (_elementSymbols[_a=_1])[_pass=build_natural_element(_val,_a)];

        _start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[_a=_1] >>
                    (double_|attr(1.0))[_b=_1]) [_pass=update_element(_val,_a,_b)] );
    }

  private:
    //! Defines the rule for matching a prefix
    qi::symbols<char, std::string> _isotopeNames;
    qi::symbols<char, std::string> _elementSymbols;

    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string, isotopesMixture> > _isotopesMixtureToken;

    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _naturalElementToken;

    qi::rule<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> > _start;
};

int main() {
    using It = std::string::const_iterator;
    ChemicalFormulaParser<It> parser;
    for (std::string const input : {
            "C",                        // --> natural carbon made of C[12] and C[13] in natural abundance
            "CH4",                      // --> methane made of natural carbon and hydrogen
            "C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
            "C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
            "U[235]",                   // --> pure uranium 235
        })
    {
        std::cout << " ============= '" << input << "' ===========\n";
        It f = input.begin(), l = input.end();
        isotopesMixture mixture;
        bool ok = qi::parse(f, l, parser, mixture);

        if (ok)
            std::cout << "Parsed successfully\n";
        else
            std::cout << "Parse failure\n";

        if (f != l)
            std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
    }
}

Which, as given, just prints

 ============= 'C' ===========
Parsed successfully
 ============= 'CH4' ===========
Parsed successfully
 ============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'U[235]' ===========
Parsed successfully

General remarks:

  1. no need for the locals, just use the regular placeholders:

    _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
    _isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];
    
    _pureIsotopeToken     = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
    _naturalElementToken  = _elementSymbols [ _pass=build_natural_element(_val, _1) ];
    
    _start = +( 
            ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
              (double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ] 
        );
    
    // ....
    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
    qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
    qi::rule<Iterator, isotopesMixture()> _start;
    
  2. you will want to handle conflicts between names/symbols (possibly just by prioritizing one or the other)

  3. conforming compilers will require the template qualifier (unless I totally mis-guessed your datastructure, in which case I don't know what the template argument to ChemicalDatabaseManager was supposed to mean).

    Hint, MSVC is not a standards-conforming compiler

Expectation Point Sketch

Assuming that the "weights" need to add up to 100% inside the _mixtureToken rule, we could make build_isotopes_mixture "not dummy" and add the validation:

struct IsotopesMixtureBuilder {
    bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        // validate weights total only
        return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
    }
};

However, as you note, that gets thwarted by backtracking: the alternative simply falls through to the next branch. Instead you might /assert/ that any complete mixture adds up to 100%:

_mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));

With something like

struct ValidateWeightTotal {
    bool operator()(isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
        return ok;
        // or perhaps just (instead of the line above):
        // return ok? ok : throw InconsistentsWeights {};
    }

    struct InconsistentsWeights : virtual std::runtime_error {
        InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
    };
};

#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/numeric.hpp>
#include <cmath>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>

namespace qi = boost::spirit::qi;

struct DummyBuilder {
    using result_type = bool;

    template <typename... Ts>
    bool operator()(Ts&&...) const { return true; }
};

struct PureIsotopeBuilder     : DummyBuilder {  };
struct NaturalElementBuilder  : DummyBuilder {  };
struct UpdateElement          : DummyBuilder {  };

struct Isotope {
    std::string getName() const { return _name; }

    Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }

    template <typename T> std::string getProperty(std::string const& name) const {
        if (name == "symbol")
            return _symbol;
        throw std::domain_error("no such property (" + name + ")");
    }

  private:
    std::string _name, _symbol;
};

using MixComponent    = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;

struct IsotopesMixtureBuilder {
    bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        // validate weights total only
        return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
    }
};

struct ValidateWeightTotal {
    bool operator()(isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
        return ok;
        // or perhaps just (instead of the line above):
        // return ok? ok : throw InconsistentsWeights {};
    }

    struct InconsistentsWeights : virtual std::runtime_error {
        InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
    };
};

template <typename Isotope>
struct ChemicalDatabaseManager {
    static ChemicalDatabaseManager* Instance() {
        static ChemicalDatabaseManager s_instance;
        return &s_instance;
    }

    auto& getDatabase() { return _db; }
  private:
    std::map<int, Isotope> _db {
        { 1, { "H[1]",   "H" } },
        { 2, { "H[2]",   "H" } },
        { 3, { "Carbon", "C" } },
        { 4, { "U[235]", "U" } },
    };
};

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture()>
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {
        using namespace qi;
        namespace phx = boost::phoenix;

        phx::function<PureIsotopeBuilder>     build_pure_isotope;     // Semantic action for handling the case of pure isotope
        phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of isotope mixture
        phx::function<NaturalElementBuilder>  build_natural_element;  // Semantic action for handling the case of natural element
        phx::function<UpdateElement>          update_element;
        phx::function<ValidateWeightTotal>    validate_weight_total;

        // XML database that stores all the isotopes of the periodic table
        ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();

        // Loop over the database to fill the spirit symbols for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"), isotope.second.template getProperty<std::string>("symbol"));
        }

        _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));
        _isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];

        _pureIsotopeToken     = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
        _naturalElementToken  = _elementSymbols [ _pass=build_natural_element(_val, _1) ];

        _start = +( 
                ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
                  (double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ] 
            );
    }

  private:
    //! Defines the rule for matching a prefix
    qi::symbols<char, std::string> _isotopeNames;
    qi::symbols<char, std::string> _elementSymbols;

    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
    qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
    qi::rule<Iterator, isotopesMixture()> _start;
};

int main() {
    using It = std::string::const_iterator;
    ChemicalFormulaParser<It> parser;
    for (std::string const input : {
            "C",                        // --> natural carbon made of C[12] and C[13] in natural abundance
            "CH4",                      // --> methane made of natural carbon and hydrogen
            "C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
            "C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
            "U[235]",                   // --> pure uranium 235
        }) try 
    {
        std::cout << " ============= '" << input << "' ===========\n";
        It f = input.begin(), l = input.end();
        isotopesMixture mixture;
        bool ok = qi::parse(f, l, parser, mixture);

        if (ok)
            std::cout << "Parsed successfully\n";
        else
            std::cout << "Parse failure\n";

        if (f != l)
            std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
    } catch(std::exception const& e) {
        std::cout << "Caught exception '" << e.what() << "'\n";
    }
}

Prints

 ============= 'C' ===========
Parsed successfully
 ============= 'CH4' ===========
Parsed successfully
 ============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Caught exception 'boost::spirit::qi::expectation_failure'
 ============= 'U[235]' ===========
Parsed successfully
sehe
  • thank you very much for your help and sorry for not having provided a more complete example. I thought that the one I gave was already too complex to trigger any reply. – Eurydice Mar 22 '17 at 07:58
  • @Eurydice Well. In reality, these are "niche tags" and the people watching them are usually interested in the subject. As long as you make sure the question is clear, I don't think it usually hurts to include a SSCCE (at the bottom, so it doesn't interfere). Cheers :) – sehe Mar 22 '17 at 08:08
  • The parser works. Great! However, I have one additional question: if I give as input something wrong in the mixture such as "C2H{H[1](0.8);H[2](0.2)}6" (note the ";" introduced in the middle of the isotope mixture), the parser does not fail but instead considers the H as a natural element. Is there a way to avoid this and actually get a parsing failure? – Eurydice Mar 22 '17 at 14:15
  • Yes. You could make the mixture an expectation point as soon as an opening brace is seen. Note that you can already see the trailing input – sehe Mar 22 '17 at 14:27
  • I understand this for the expectation point you introduced in the corrected example because the validation of the mixture was a function that took the outcome of the mixture parsing. Here, my problem is that I want for my mixture (and for the whole parser in general) a kind of zero-tolerance policy regarding any character different from 1) an isotope name (e.g. H[1],C[14]) 2) an element symbol (e.g. H,C,Li) 3) a ratio definition (e.g. (0.7)). Could you please give me some hint to achieve this? – Eurydice Mar 22 '17 at 14:47
  • That was the hint. Did you read the link about expectation points? In short figure out which `>>` to replace with `>`. You might want/need to add explicit parentheses to group the expected items – sehe Mar 22 '17 at 16:21
  • (I could show you but then you'd still not understand it fully which would be a shame of the effort) – sehe Mar 22 '17 at 16:22
  • See, here is the [failing testcase demonstrating that it DOES report the unparsed trailing data](http://coliru.stacked-crooked.com/a/2d32c64933726eee) and here's the corrected rule [that rejects the whole input](http://coliru.stacked-crooked.com/a/307e6a1afb471be3). – sehe Mar 23 '17 at 00:31
  • So I changed `a >> b >> c > eps(postcondition)` into `a > (b >> c >> eps(postcondition))`. I guess it's subtle, but really it just changes /what/ is being expected (`>`). Nothing else. – sehe Mar 23 '17 at 00:33
  • thanks a lot. In the meantime, I also made some trials but not as successful as yours ;-). I will make further improvements (e.g. CH%4, which I expect to fail but which currently succeeds) and publish a final version of the parser asap. – Eurydice Mar 23 '17 at 07:53
  • Indeed, in the case of `CH%4` the parser succeeds because of the start rule which is defined in such a way that if no number is found after either a pure isotope, a natural element or an isotope mixture token, it will still consider the string to be OK by proposing a default value of 1.0. Clearly, this behavior is OK because I would like to avoid expressions such as `C1` but it triggers wrong expressions such as `CH%4` to be accepted also – Eurydice Mar 23 '17 at 08:49
  • Erm. Adding your newest requirements as a failing test case ***again*** shows you can already see that there's unparsed trailing input: [_`Remaining input unparsed: '%4'`_](https://wandbox.org/permlink/cXvq2IiJXs3VsHIE#result-container-tab-permlink). If you have requirements (like: "trailing input is not OK"), just add them, e.g. [`bool ok = qi::parse(f, l, parser >> qi::eoi, mixture);`](https://wandbox.org/permlink/41QeyvQvHCnVM4Td#result-container-tab-permlink) – sehe Mar 23 '17 at 09:18
  • everything is fine! I did not really get what you meant by the unparsed trailing input because I always (badly) use `qi::parse` without storing its return value in a boolean for further grammar correctness checking. Now I see ... thanks for your patience – Eurydice Mar 23 '17 at 09:53
  • I _always_ make my testbed in that way precisely so you will /see/ when partial input is parsed. It's a common pitfall to miss that because it's not immediately visible. Good habits go a long way :) – sehe Mar 23 '17 at 10:38