
I'm wondering if there's a way in boost::spirit::lex to write a token's value back to the input stream (possibly after editing it) and rescan it. What I'm basically looking for is functionality like that offered by unput() in Flex.

Thanks!

ildjarn
Haitham Gad
  • What are you trying to achieve? I mean, in what context would you need to use unput()? If you show an example, I might be able to show you how I'd do it (possibly using Lexer states) – sehe May 16 '12 at 08:11
  • Basically, I need the lexer to match an identifier directly followed by an open paren "abc(" as one token, and put it back into the input stream with the paren being at the beginning of the string like "(abc ". The next step would be for the lexer to scan it again but as two separate tokens (a paren token then an identifier token). – Haitham Gad May 16 '12 at 18:10
  • Ok, I posted my take at this, let me know if I understood the _goal_ incorrectly. – sehe May 16 '12 at 20:10

2 Answers


Sounds like you just want to accept tokens in different orders but with the same meaning.

Without further ado, here is a complete sample that shows how this would be done, exposing the identifier regardless of input order. Output:

Input 'abc(' Parsed as: '(abc'
Input '(abc' Parsed as: '(abc'

Code

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <iostream>
#include <string>

using namespace boost::spirit;

///// LEXER
template <typename Lexer>
struct tokens : lex::lexer<Lexer>
{
    tokens()
    {
        identifier = "[a-zA-Z][a-zA-Z0-9]*";
        paren_open = '(';

        this->self.add
            (identifier)
            (paren_open)
            ;
    }

    lex::token_def<std::string> identifier;
    lex::token_def<lex::omit> paren_open;
};

///// GRAMMAR
template <typename Iterator>
struct grammar : qi::grammar<Iterator, std::string()>
{
    template <typename TokenDef>
        grammar(TokenDef const& tok) : grammar::base_type(ident_w_parenopen)
    {
        ident_w_parenopen = 
              (tok.identifier >> tok.paren_open)
            | (tok.paren_open >> tok.identifier) 
            ;
    }
  private:
    qi::rule<Iterator, std::string()> ident_w_parenopen;
};

///// DEMONSTRATION
typedef std::string::const_iterator It;

template <typename T, typename G>
void DoTest(std::string const& input, T const& tokens, G const& g)
{
    It first(input.begin()), last(input.end());

    std::string parsed;
    bool r = lex::tokenize_and_parse(first, last, tokens, g, parsed);

    if (r) {
        std::cout << "Input '" << input << "' Parsed as: '(" << parsed << "'\n";
    }
    else {
        std::string rest(first, last);
        std::cerr << "Parsing '" << input << "' failed\n" << "stopped at: \"" << rest << "\"\n";
    }
}

int main(int argc, char* argv[])
{
    typedef lex::lexertl::token<It, boost::mpl::vector<std::string> > token_type;
    typedef lex::lexertl::lexer<token_type> lexer_type;
    typedef tokens<lexer_type>::iterator_type iterator_type;

    tokens<lexer_type> tokens;
    grammar<iterator_type> g (tokens);

    DoTest("abc(", tokens, g);
    DoTest("(abc", tokens, g);
}
sehe
  • Thanks sehe. Unfortunately, exporting the problem to the parser is not a good option in my case for two reasons: 1) I need to distinguish the case where the open paren directly follows the identifier from that where they're separated by any whitespace character (in which case, the parser should match a different rule). 2) There are many other keywords that need to be handled the same way (not just identifiers), so this would double the number of productions needed to parse these keywords (one when the paren precedes the keyword and one when it succeeds it). – Haitham Gad May 16 '12 at 21:05
  • @HaithamGad Let me say this: writing a quality question is hard. _This_ is why. I will give your new constraints some thought – sehe May 16 '12 at 21:16
  • Yeah sorry for that :) and Thanks for your help! – Haitham Gad May 16 '12 at 21:39

I ended up implementing my own unput() functionality as follows:

   struct unputImpl
   {
      template <typename Iter1T, typename Iter2T, typename StrT>
      struct result {
         typedef void type;
      };

      template <typename Iter1T, typename Iter2T, typename StrT>
      typename result<Iter1T, Iter2T, StrT>::type operator()(Iter1T& start, Iter2T& end, StrT str) const {
         start -= (str.length() - std::distance(start, end));
         std::copy(str.begin(), str.end(), start);
         end = start;
      }
   };

   phoenix::function<unputImpl> const unput = unputImpl();

This can then be used like:

   this->self += lex::token_def<lex::omit>("{SYMBOL}\\(")
        [
           unput(_start, _end, "(" + construct<string>(_start, _end - 1) + " "),
           _pass = lex::pass_flags::pass_ignore
        ];

If the unput string is longer than the matched token, it will overwrite some of the previously scanned input. The thing you need to take care of is to make sure the input string has sufficient padding (e.g. whitespace) at its very beginning, to handle the case where unput() is called on the very first matched token.

Haitham Gad