C++ boost::spirit lexer regex

Question

I'm doing a simple lexer/parser with boost::spirit.

This is the lexer :

template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
  word_count_tokens()
  {
    this->self.add_pattern
      ("WORD", "[a-z]+")
      ("NAME_CONTENT", "[a-z]+")
      ;

    word = "{WORD}";
    name = ".name";
    name_content = "{NAME_CONTENT}";

    this->self.add
      (word)
      (name)
      (name_content)
      ('\n')
      (' ')
      ('"')
      (".", IDANY)
      ;
  }
  lex::token_def<std::string> word;
  lex::token_def<std::string> name;
  lex::token_def<std::string> name_content;
};

I defined two identical patterns : WORD and NAME_CONTENT.

This is the grammar :

template <typename Iterator>
struct word_count_grammar : qi::grammar<Iterator>
{
  template <typename TokenDef>
  word_count_grammar(TokenDef const& tok)
    : word_count_grammar::base_type(start)
  {
    using boost::phoenix::ref;
    using boost::phoenix::size;

    start = tok.name >> lit(' ') >> lit('"') >> tok.word >> lit('"');
  }

  qi::rule<Iterator> start;
};

This code works with tok.word in the grammar, but if I replace tok.word with tok.name_content it does not work, even though tok.word and tok.name_content are defined by identical patterns.

What is the issue with this code ?

PS : what I want to parse is something like : .name "this is my name"

score 3 · Answer 1 · answered Dec 12 '14 at 13:51

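Update: By the way, the problem is that only one token definition can match a given piece of input. Token definitions are matched in order, so of two identical patterns the second one never matches. You /can/ work around this by using lexer states, but I don't recommend that any more than I recommend using a lexer here in the first place.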
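To make the "matched in order" point concrete, here is a minimal sketch that uses std::regex rather than Boost.Spirit.Lex. It assumes the usual lexer rule (longest match wins, and on a tie the rule declared first wins), which is how I understand Spirit.Lex to behave. The names Rule, tokenize and sample_rules are made up for this illustration and are not part of any library.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical rule type for this sketch: a token name plus its pattern.
struct Rule {
    std::string name;
    std::regex pattern;
};

// Returns the names of the tokens found in `input`.
// Assumption: at each position the longest match wins, and on a tie
// the rule that was declared first wins.
std::vector<std::string> tokenize(const std::vector<Rule>& rules,
                                  const std::string& input) {
    std::vector<std::string> out;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        std::ptrdiff_t best_len = 0;
        const Rule* best = nullptr;
        for (const Rule& r : rules) {
            std::smatch m;
            // match_continuous: the match must start exactly at `pos`.
            if (std::regex_search(pos, input.cend(), m, r.pattern,
                                  std::regex_constants::match_continuous)) {
                // Strict '>' keeps the earlier rule when lengths are equal.
                if (m.length(0) > best_len) {
                    best_len = m.length(0);
                    best = &r;
                }
            }
        }
        if (!best) {            // no rule matched: skip one character
            out.push_back("unknown");
            ++pos;
            continue;
        }
        out.push_back(best->name);
        pos += best_len;
    }
    return out;
}

// Two identical patterns, declared in the same order as in the question.
std::vector<Rule> sample_rules() {
    return {
        {"word", std::regex("[a-z]+")},
        {"name_content", std::regex("[a-z]+")},
        {"space", std::regex(" ")},
    };
}
```

With these rules, tokenize(sample_rules(), "abc def") yields word, space, word. The name_content rule is never reported, because word always ties with it and was declared first. A grammar waiting for a name_content token would therefore never see one.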

My suggestion would be to use Qi directly:

    qi::lexeme[".name"] >> qi::lexeme['"' >> *~qi::char_('"') >> '"']

My recollection of Lexer token patterns is one of exceedingly confusing escape requirements.

I might try to figure it out later - out of curiosity only

Live On Coliru

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

int main() {
    std::string const input(".name \"this is my name\"");

    auto f(input.begin()), l(input.end());


    std::string parsed_name;
    if (qi::phrase_parse(f,l,
                qi::lexeme[".name"] >> qi::lexeme['"' >> *~qi::char_('"') >> '"'],
                qi::space,
                parsed_name))
    {
        std::cout << "Parsed: '" << parsed_name << "'\n";
    }
    else
    {
        std::cout << "Parse failed\n";
    }

    if (f!=l)
        std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
}

Prints

Parsed: 'this is my name'

sehe

What do you mean by "only one token match"? What if I want to use my regex with Qi directly? My regex is "[a-z]+"; can I? – Dec 12 '14 at 13:58
@ThibaultMartinez you can't (usefully) have two identical patterns, because one will never match. Conversely, you can't legally add the same token to multiple states, so when using multiple states this would stop being true. Feel free to browse [tag:boost-spirit-lex] answers for how to do this. But it won't be pretty, it won't be fast, and good luck if you needed source position information for error reporting etc. – sehe Dec 12 '14 at 13:58
I dunno, Thibault. Why do you have the patterns then...? And maybe this schematic helps http://i.imgur.com/cA8BaR5.png – sehe Dec 12 '14 at 14:04
Thanks for the picture. Indeed that was just an experiment but I was curious to know the issue, that's now clear. – Dec 12 '14 at 14:07
OK, I can't define two identical patterns. But now I have the same problem with two different patterns: "[a-zA-Z0-9_,.!?@#]+" and "[a-z0-9_]+" – Dec 12 '14 at 15:01
I suggest you post a new question, notably with a self contained example. Self contained means self contained, no need to require people to remember just which headers were required and what unholy underdocumented minimum token id to use in that enum etc. You can use one of my [lex answers](http://stackoverflow.com/search?tab=votes&q=user%3a85371%20%5bboost-spirit-lex%5d) for inspiration. 99% of my [spirit answers](http://stackoverflow.com/search?tab=relevance&q=user%3a85371%20%5bboost-spirit%5d%20OR%20%5bboost-spirit-qi%5d) should have self contained code samples because that's a lifestyle. – sehe Dec 12 '14 at 19:09
