Parsing a tokenized free form grammar with Boost.Spirit

Question

I've got stuck trying to create a Boost.Spirit parser for the output of callgrind, a tool that is part of valgrind. Callgrind outputs a domain-specific embedded programming language (DSEL) which lets you do all sorts of cool stuff like custom expressions for synthetic counters, but it's not easy to parse.

I've placed some sample callgrind output at https://gist.github.com/ned14/5452719#file-sample-callgrind-output. I've placed my current best attempt at a Boost.Spirit lexer and parser at https://gist.github.com/ned14/5452719#file-callgrindparser-hpp and https://gist.github.com/ned14/5452719#file-callgrindparser-cxx. The Lexer part is straightforward: it tokenises tag-values, non-whitespace text, comments, end of lines, integers, hexadecimals, floats and operators (ignore the punctuators in the sample code, they're unused). White space is skipped.

So far so good. The problem is parsing the tokenised input stream. I haven't even attempted the main stanzas yet, I'm still trying to parse the tag-values which can occur at any point in the file. Tag values look like this:

tagtext: unknown series of tokens<eol>

It could be freeform text e.g.

desc: I1 cache: 32768 B, 64 B, 8-way associative, 157 picosec hit latency

In this situation you'd want to convert the set of tokens to a string i.e. to an iterator_range and extract.
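As a rough illustration of the freeform case, here is a minimal sketch that uses only the standard library rather than Spirit. The helper name `split_tag_value` is hypothetical and not part of the gist, and it assumes the tag ends at the first colon on the line, so any later colon (as in `I1 cache:`) stays in the value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the gist: splits "tag: value" at the
// first colon and trims whitespace around the value.
// Returns false when the line has no tag (no colon, or a leading colon).
inline bool split_tag_value(const std::string& line,
                            std::string& tag,
                            std::string& value) {
    const char* const blanks = " \t\r\n";
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos || colon == 0) return false;

    tag = line.substr(0, colon);

    // First non-blank character after the colon, if any.
    const std::string::size_type first = line.find_first_not_of(blanks, colon + 1);
    if (first == std::string::npos) {
        value.clear();  // tag with an empty value
    } else {
        // The character at 'first' is non-blank, so 'last' is at or after it.
        const std::string::size_type last = line.find_last_not_of(blanks);
        value = line.substr(first, last - first + 1);
    }
    assert(!tag.empty());
    return true;
}
```

Called on the `desc:` line above, this would give the tag `desc` and leave the rest of the line as one untouched string for later post-processing.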

It could however be an expression e.g.

event: EPpsec = 316 Ir + 1120 I1mr + 1120 D1mr + 1120 D1mw + 1362 ILmr + 1362 DLmr + 1362 DLmw

This says that from now on, event EPpsec is to be synthesised as Ir multiplied by 316 added to I1mr multiplied by 1120 added to ... etc.
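To make that arithmetic concrete, here is a small standard-library sketch (not Spirit, and not the gist's code). It assumes the value has exactly the shape shown above, `name = coeff counter + coeff counter ...`, with whitespace between all tokens. The names `SyntheticEvent`, `parse_event` and `evaluate` are hypothetical, and treating a missing counter as zero is an assumption made for the example.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct SyntheticEvent {
    std::string name;
    std::vector<std::pair<long long, std::string> > terms;  // (multiplier, counter)
};

// Parses e.g. "EPpsec = 316 Ir + 1120 I1mr" (the text after "event:").
// Assumes whitespace-separated tokens and '+' as the only operator.
inline bool parse_event(const std::string& value, SyntheticEvent& out) {
    std::istringstream in(value);
    SyntheticEvent ev;
    std::string eq;
    if (!(in >> ev.name >> eq) || eq != "=") return false;
    for (;;) {
        long long coeff = 0;
        std::string counter;
        if (!(in >> coeff >> counter)) return false;  // every term needs both parts
        ev.terms.push_back(std::make_pair(coeff, counter));
        std::string op;
        if (!(in >> op)) break;        // no more tokens: expression complete
        if (op != "+") return false;   // anything else is outside this sketch
    }
    out = ev;
    return true;
}

// Sum of multiplier * counter value; counters absent from the map count as 0.
inline long long evaluate(const SyntheticEvent& ev,
                          const std::map<std::string, long long>& counters) {
    assert(!ev.terms.empty());
    long long total = 0;
    for (std::size_t i = 0; i < ev.terms.size(); ++i) {
        std::map<std::string, long long>::const_iterator it =
            counters.find(ev.terms[i].second);
        if (it != counters.end()) total += ev.terms[i].first * it->second;
    }
    return total;
}
```

With counters `Ir = 10` and `I1mr = 1`, and all others absent, the `EPpsec` definition above evaluates to 316 * 10 + 1120 * 1 = 4280.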

The point I want to make here is that tag-value pairs need to be accumulated as arbitrary sets of tokens, and post-processed into whatever we turn them into later.

To that end, Boost.Spirit's utree() class looked like exactly what I wanted, and that's what the sample code uses. But on VS2012 using the November CTP compiler with variadic templates I'm currently seeing this compile error:

1>C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(56): error C2440: 'static_cast' : cannot convert from 'boost::spirit::detail::list::node_iterator<const boost::spirit::utree>' to 'base_iterator_type'
1>          No constructor could take the source type, or constructor overload resolution was ambiguous
1>          C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(186) : see reference to function template instantiation 'IteratorT boost::iterator_range_detail::iterator_range_impl<IteratorT>::adl_begin<const Range>(ForwardRange &)' being compiled
1>          with
1>          [
1>              IteratorT=base_iterator_type,
1>              Range=boost::spirit::utree,
1>              ForwardRange=boost::spirit::utree
1>          ]

... which suggests that my base_iterator_type, which is a Boost.Spirit multi_pass<> wrap of an istreambuf_iterator for forward iterator nature, is somehow not understood by Boost.Spirit's utree() implementation. Thing is, I'm not sure if this is my bad code or bad Boost.Spirit code seeing as line_pos_iterator<> was failing to correctly specify its forward_iterator concept tag.

Thanks to past Stack Overflow help I could write a pure non-tokenised grammar, but it would be brittle. The right solution is to tokenise and use a freeform grammar capable of handling fairly arbitrary input. Sadly, there are very few real-world (rather than toy) examples of getting Boost.Spirit's Lex and Qi grammars working together to achieve this. Therefore any help would be greatly appreciated.

Niall

Related: use of utree with lexer: http://stackoverflow.com/a/11514398/85371 (which I modified to use istream_iterators to verify that it _could work_) — sehe, Apr 24 '13 at 20:04
For general information, here's a fixed-up version of the OP's gist that compiles a test program on linux GCC 4.7+/Clang with debug output: https://gist.github.com/sehe/5455336 — sehe, Apr 24 '13 at 20:37

sehe · Accepted Answer · 2013-04-24T20:10:44.923

The token attribute exposes a variant, which in addition to the base-iterator range, can _assume_ the types declared in the token_type typedef:

    typedef lex::lexertl::token<base_iterator_type, mpl::vector<std::string, int, double>> token_type;

So: string, int and double. Note however that coercion into one of the possible types will only occur lazily, when the parser actually uses the value.

utrees are a very versatile container ^[1]. Hence, when you expose a spirit::utree attribute on a rule, and the token value variant contains an iterator_range, then it attempts to assign that into the utree object (this fails, because the iterators are ... 'funky').

The easiest way to get your desired behaviour is to force Qi to interpret the attribute of the tag token as a string, and have that assigned to the utree. Therefore the following line constitutes a fix that will make compilation succeed:

    unknowntagvalue = qi::as_string[tok.tag] >> restofline;

Notes

Having said all this, I would indeed suggest the following:

- Consider using the Nabialek Trick to dispatch different lazy rules depending on the tag matched - this makes it unnecessary to deal with raw utrees later on
- You might have had success specializing boost::spirit::traits::assign_to_XXXXXX traits (see documentation)
- Consider using a pure Qi parser. While I can "feel" your sentiment that "it is going to be brittle" ^[2], it seems you have already demonstrated that the lexer raises the complexity to such a degree that it might not have net merit:
  - the unexpected ways in which attributes materialize (this question)
  - the problem with line-pos iterators (this is a frequently asked question, and AFAIR it has mostly hard or inelegant solutions)
  - the inflexibility regarding e.g. ad-hoc debugging (access to source data in SA), switching/disabling skippers etc.
  - my personal experience was that looking at lexer states to drive these isn't helpful, because switching lexer state can only work reliably from lexer token semantic actions, whereas often, the disambiguation would happen in the Qi phase

but I'm diverging :)
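To show the shape of the tag-based dispatch suggested in the notes: in Qi, the Nabialek Trick keeps a qi::symbols table that maps each keyword to a pointer to a rule, and invokes the selected rule through qi::lazy. The sketch below is only a plain C++ analogue of that idea, not Spirit code. It maps each tag to a handler function so that every tag is processed by its own code instead of being accumulated as a generic token list. All names (`ParsedFile`, `tag_handlers`, `dispatch`) are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: one bucket per known tag, plus a catch-all.
struct ParsedFile {
    std::vector<std::string> descriptions;
    std::vector<std::string> events;
    std::vector<std::pair<std::string, std::string> > unknown;
};

typedef std::function<void(const std::string&, ParsedFile&)> Handler;

// Tag -> handler table. It plays the role the symbols table of rule
// pointers plays in the Qi version of the trick.
inline const std::map<std::string, Handler>& tag_handlers() {
    static const std::map<std::string, Handler> table = {
        {"desc",  [](const std::string& v, ParsedFile& f) { f.descriptions.push_back(v); }},
        {"event", [](const std::string& v, ParsedFile& f) { f.events.push_back(v); }},
    };
    return table;
}

// Look the tag up once, then run the handler chosen for it.
// Unrecognised tags are kept verbatim so nothing is silently dropped.
inline void dispatch(const std::string& tag, const std::string& value, ParsedFile& f) {
    const std::map<std::string, Handler>& table = tag_handlers();
    const std::map<std::string, Handler>::const_iterator it = table.find(tag);
    if (it == table.end()) {
        f.unknown.push_back(std::make_pair(tag, value));
        return;
    }
    assert(it->second);  // every table entry must hold a callable
    it->second(value, f);
}
```

In a real Qi grammar the handlers would be rules with their own attribute types, which is what removes the need to post-process raw utrees.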

^[1] e.g. they have facilities for very lightweight 'referencing' of iterator ranges (e.g. for symbols, or to avoid copying characters from a source buffer into the attribute unless wanted)

^[2] In effect, only because using a sequential lexer (scanner) vastly reduces the number of backtrack opportunities, so it simplifies the mental model of the parser. However, you can use expectation points to much the same effect.

Firstly, once again thank you sehe. Secondly, I had thought that utree stored its variants by storing iterator_ranges (it's how I would have done it, it's the easiest route to variant storage in this case), and therefore could convert lazily. If utree doesn't do this - and by your answer I'm guessing it doesn't - then it looks like the Nabialek Trick is the only remaining option. I had avoided it due to a lack of real world examples. — Niall Douglas, Apr 25 '13 at 18:01
BTW, I should explain why tokenisation. Callgrind output can easily break 100Mb of output. I also failed to get a pure Qi parser to completely grok MSVC's symbol mangling (https://github.com/ned14/NiallsCPP11Utilities/blob/master/SymbolManglerMSVC.cpp) because it turns out to **need** a multi-pass parse to premark nested template mangles much as you would brackets in expressions. Still tender from that failure, I worry callgrind needs the same. It's entirely possible I'm being insecure :) — Niall Douglas, Apr 25 '13 at 18:07
(a) utree is able to store opaque iterator_ranges (but that won't be of use as long as you actually use (adapted) input iterators for the input) (b) any grammar can be expressed using Spirit alone, since semantic actions make it "Turing complete"; also, note `pos/neg lookahead` (unary& and unary!). Granted, this might become slower in the face of a lot of backtracking, but I'd say the line-based grammar looks simple enough to avoid backtracking in all but the most exceptional cases... — sehe, Apr 26 '13 at 21:14
... , and the demangler, I haven't read it but it sounds like you should be able to fix it up with an _inherited attribute_ (look at the miniXML sample [here](http://www.boost.org/doc/libs/1_48_0/libs/spirit/doc/html/spirit/qi/tutorials/mini_xml___asts_.html) for matching the end tags) — sehe, Apr 26 '13 at 21:14
Agreed on (a) now I've played with it, so I'll give a pure Nabialek trick grammar approach a try. As for the demangler, that was related to a cancelled internal project, so sadly I'll not be allowed the development time to fix it. A real shame, because an MSVC demangler would be of huge use for clang <=> MSVC interop :) — Niall Douglas, Apr 29 '13 at 21:20
