I'm trying to parse a language in which a unary minus is distinguished from a binary minus by the whitespace around the sign. The pseudo-rules below define how the minus sign is interpreted in this language:

 -x       // unary
 x - y    // binary
 x-y      // binary
 x -y     // unary
 x- y     // binary
 (- y ... // unary

Note: The open paren in the last rule can be replaced by any token in the language except 'identifier', 'number' and 'close_paren'.

Note: In the 4th case, x is an identifier. An identifier can constitute a statement on its own, so x and -y are two separate statements.

Since the type of the minus sign depends on whitespace, I thought I'd have the lexer return two different tokens, one for unary minus and one for binary minus. Any ideas on how I can do this?
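To pin the rules down independently of Boost.Spirit, here is a minimal hand-written sketch of a scanner that emits two different minus tokens. It reads the rules as: a minus is binary when the previous token is an identifier, a number or a close paren, unless whitespace precedes the sign and none follows it; in every other case it is unary. All names (Tok, endsOperand, tokenize, minusKinds, selfCheck) are hypothetical and not part of any library. As simplifying assumptions, identifiers are plain alphanumerics and numbers never absorb a leading minus.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; only what the minus rules need.
enum class Tok { Identifier, Number, OpenParen, CloseParen, UnaryMinus, BinaryMinus, Other };

// Tokens after which a minus may be binary (identifier, number, close paren).
bool endsOperand(Tok t) {
    return t == Tok::Identifier || t == Tok::Number || t == Tok::CloseParen;
}

bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
bool isIdentStart(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_'; }
bool isIdentChar(char c) { return isIdentStart(c) || isDigit(c); }

// Assumption: simplified token shapes, no strings, floats or comments.
std::vector<Tok> tokenize(const std::string& in) {
    std::vector<Tok> out;
    bool spaceBefore = false;  // whitespace seen since the previous token
    std::size_t i = 0;
    while (i < in.size()) {
        const char c = in[i];
        if (isSpace(c)) { spaceBefore = true; ++i; continue; }
        if (c == '-') {
            const bool spaceAfter = i + 1 < in.size() && isSpace(in[i + 1]);
            const bool afterOperand = !out.empty() && endsOperand(out.back());
            // "x -y" is the only operand-preceded form that stays unary.
            const bool binary = afterOperand && !(spaceBefore && !spaceAfter);
            out.push_back(binary ? Tok::BinaryMinus : Tok::UnaryMinus);
            ++i;
        } else if (isIdentStart(c)) {
            while (i < in.size() && isIdentChar(in[i])) ++i;
            out.push_back(Tok::Identifier);
        } else if (isDigit(c)) {
            while (i < in.size() && isDigit(in[i])) ++i;
            out.push_back(Tok::Number);
        } else {
            out.push_back(c == '(' ? Tok::OpenParen : c == ')' ? Tok::CloseParen : Tok::Other);
            ++i;
        }
        spaceBefore = false;
    }
    return out;
}

// Returns one letter per minus sign in the input: 'U' for unary, 'B' for binary.
std::string minusKinds(const std::string& in) {
    std::string kinds;
    for (Tok t : tokenize(in)) {
        if (t == Tok::UnaryMinus) kinds += 'U';
        else if (t == Tok::BinaryMinus) kinds += 'B';
    }
    return kinds;
}

// The six pseudo-rules from the list above, one assertion each.
void selfCheck() {
    assert(minusKinds("-x") == "U");
    assert(minusKinds("x - y") == "B");
    assert(minusKinds("x-y") == "B");
    assert(minusKinds("x -y") == "U");
    assert(minusKinds("x- y") == "B");
    assert(minusKinds("(- y") == "U");
}
```

Under these assumptions the decision needs only the previous token and the whitespace on either side of the sign, so it might also be expressed in a lexer semantic action that remembers the last token id, rather than by rewriting the input.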

Code: Here's some code that works for me, but I'm not sure it's robust enough. I simplified it by removing all the irrelevant lexer rules:

#ifndef LEXER_H
#define LEXER_H

// must be defined before the Spirit.Lex headers are included to take effect
#define BOOST_SPIRIT_LEXERTL_DEBUG 1

#include <iostream>
#include <algorithm>
#include <string>
#include <boost/mpl/vector.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_function.hpp>
#include <boost/spirit/include/phoenix_algorithm.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_object.hpp>
#include <boost/spirit/include/phoenix_statement.hpp>

using std::string;
using std::cerr;

namespace skill {

   namespace lex = boost::spirit::lex;
   namespace phoenix = boost::phoenix;

   // base iterator type
   typedef string::iterator BaseIteratorT;

   // token type
   typedef lex::lexertl::token<BaseIteratorT, boost::mpl::vector<int, string> > TokenT;

   // lexer type
   typedef lex::lexertl::actor_lexer<TokenT> LexerT;

   template <typename LexerT>
   struct Tokens: public lex::lexer<LexerT>
   {
      Tokens(const string& input):
         lineNo_(1)
      {
         using lex::_start;
         using lex::_end;
         using lex::_pass;
         using lex::_state;
         using lex::_tokenid;
         using lex::_val;
         using lex::omit;
         using lex::pass_flags;
         using lex::token_def;
         using phoenix::ref;
         using phoenix::count;
         using phoenix::construct;

         // macros
         this->self.add_pattern
            ("EXP",     "(e|E)(\\+|-)?\\d+")
            ("SUFFIX",  "[yzafpnumkKMGTPEZY]")
            ("INTEGER", "-?\\d+")
            ("FLOAT",   "-?(((\\d+)|(\\d*\\.\\d+)|(\\d+\\.\\d*))({EXP}|{SUFFIX})?)")
            ("SYMBOL",  "[a-zA-Z_?@](\\w|\\?|@)*")
            ("STRING",  "\\\"([^\\\"]|\\\\\\\")*\\\"");

         // whitespaces and comments
         whitespaces_ = "\\s+";
         comments_    = "(;[^\\n]*\\n)|(\\/\\*[^*]*\\*+([^/*][^*]*\\*+)*\\/)";

         // literals
         float_   = "{FLOAT}";
         integer_ = "{INTEGER}";
         string_  = "{STRING}";
         symbol_  = "{SYMBOL}";

         // operators
         plus_          = '+';
         difference_    = '-';
         minus_         = "-({SYMBOL}|\\()";

         // ... more operators

         // whitespace
         this->self += whitespaces_
            [
               ref(lineNo_) += count(construct<string>(_start, _end), '\n'),
               _pass = pass_flags::pass_ignore
            ];

         // a minus between two identifiers, numbers or close-open parens is a binary minus, so add spaces around it
         this->self += token_def<omit>("[)a-zA-Z?_0-9]-[(a-zA-Z?_0-9]")
            [
               unput(_start, _end, *_start + construct<string>(" ") + *(_start + 1) + " " + *(_start + 2)),
               _pass = pass_flags::pass_ignore
            ];

         // operators (except for close-brackets) cannot be followed by a binary minus
         this->self += token_def<omit>("['`.+*<>/!~&|({\\[=,:@](\\s+-\\s*|\\s*-\\s+)")
            [
               unput(_start, _end, *_start + construct<string>("-")),
               _pass = pass_flags::pass_ignore
            ];

         // a minus directly preceding a symbol or an open paren is a unary minus
         this->self += minus_
            [
               unput(_start, _end, construct<string>(_start + 1, _end)),
               _val = construct<string>("-")
            ];

         // literal rules
         this->self += float_ | integer_ | string_ | symbol_;

         // ... other rules
      }

      ~Tokens() {}

      size_t lineNo() { return lineNo_; }

      // ignored tokens
      token_def<omit> whitespaces_, comments_;

      // literal tokens
      token_def<int> integer_;
      token_def<string>  float_, symbol_, string_;

      // operator tokens
      token_def<> plus_, difference_, minus_; // minus_ is a unary minus
      // ... other tokens

      // current line number
      size_t lineNo_;
   };
}

#endif // LEXER_H

Basically, I defined a binary minus (called difference_ in the code) as any minus sign that has whitespace on both sides, and I use unput to rewrite the input so that every binary minus ends up in that form. I defined a unary minus (minus_) as a minus sign that directly precedes a symbol or an open paren, and again use unput to push back everything after the sign so it is lexed separately. For numbers, the minus sign is part of the token.

Haitham Gad
  • This is such a strange grammar that I'd normally say: "Don't. Stay away from this." Also, I could show you how to achieve it just as written, **but** it depends heavily on the surrounding grammar. Also the specification is a bit incomplete. (What does "x - - y" mean? "- (x - - (y))"?) ... – sehe Jun 28 '12 at 21:02
  • So I'll make you a deal: you post your grammar ([SSCCE](http://meta.stackexchange.com/questions/22754/sscce-how-to-provide-examples-for-programming-questions/22762#22762)), preferably with test cases, and I'll see what I can do, if only to demonstrate techniques. – sehe Jun 28 '12 at 21:03
  • I added some code that seems to work for me, but I kinda feel it's gonna break some time in the future! – Haitham Gad Jun 29 '12 at 18:37
