I am writing a lex specification for a language in which operations are written in prefix notation:

(+ a b) --> (a + b)

An integer is defined as follows: it may have a negative sign (-) but not a positive sign, and there may be zero or more spaces between the sign and the digits. If the value is 0, only the single digit 0 is allowed. Otherwise, it follows the common integer definition (no leading 0's).

The rules appear in the lex file in the following order (first-match rule):

  • Regex for integer: [\-]?[ ]*((0)|([1-9][0-9]*))
  • Regex for subtraction operator: "-"

With these definitions, I would like to tokenize the string "- 5 3", i.e. (5 - 3).

Current output:

Integer : - 5

Integer : 3

Desired output:

Operator : '-'

Integer : 5

Integer : 3
Brian Tompsett - 汤莱恩
ronakshah725
  • Does your language also allow `-` as a unary operator? And is your language Lisp-like in that expressions must be parenthesized, or can you write `+ + 1 2 3`? – rici Feb 13 '16 at 23:52
  • In the language, `-` is not a unary operator. It is a simple language for now, so I don't think I will encounter expressions like `+ + 1 2 3`. – ronakshah725 Feb 13 '16 at 23:56
  • If the parentheses are required around operators and forbidden around integers, then you can distinguish between unary minus and the minus which is part of an integer, because `( - 5)` is not valid. In that case, the `-` which follows a `(` is always an operator and the `-` which follows an operator or a value must be part of an integer. If those assumptions are not correct, then it's going to be a bit tricky. – rici Feb 14 '16 at 00:02
  • I am afraid those assumptions cannot be made in my context. I think I will let the parser handle it. – ronakshah725 Feb 14 '16 at 00:05
  • OK. I suspect you'll run into ambiguities, so good luck. For example, `* - 5 - 3 1` could be `(* (-5) (- 3 1))` or `(* (- 5 (- 3)) 1)` – rici Feb 14 '16 at 00:09
  • @ronakshah725 It's a unary operator. Otherwise you don't have a problem. – user207421 Feb 14 '16 at 03:33

2 Answers


You don't. You return - and INTEGER separately to the parser, and let the parser handle unary minus.

user207421

The lexer does not have to do that: normally the parser (such as one written using yacc) gets tokens for the minus sign and the integer separately. The parser combines the two according to the rules which you provide.

For simple grammars, you can make the lexer do parsing, by using states (also known as start conditions). In your example, there would be states for the left/right parenthesis nesting. If your grammar allows a line-break between a minus sign and an integer, you would need a state to show that you have a minus sign.
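To make the idea of lexer states concrete, here is a minimal hand-written C sketch rather than an actual lex specification with start conditions. It assumes, loudly, that every operator directly follows a "(" (an assumption that may not hold for your language), and that blanks between a sign and its digits are plain spaces. The names `scan_state`, `EXPECT_OPERATOR`, `EXPECT_OPERAND` and `tokenize` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Scanner state, playing the role of a lex start condition. */
enum scan_state { EXPECT_OPERATOR, EXPECT_OPERAND };

/*
 * Tokenize src and write one line per token into out (capacity cap).
 * Assumption for this sketch: every operator directly follows "(".
 * A "-" seen in EXPECT_OPERATOR state is the subtraction operator;
 * a "-" seen anywhere else is the sign of an integer.
 * Returns the number of tokens, or -1 on bad input or a full buffer.
 */
int tokenize(const char *src, char *out, size_t cap)
{
    enum scan_state state = EXPECT_OPERAND;
    const char *p = src;
    size_t used = 0;
    int count = 0;

    assert(src != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while (*p != '\0') {
        int n; /* characters produced by snprintf for this token */

        if (isspace((unsigned char)*p)) {
            p++;
            continue;
        }
        if (*p == '(') {
            n = snprintf(out + used, cap - used, "LParen\n");
            state = EXPECT_OPERATOR;
            p++;
        } else if (*p == ')') {
            n = snprintf(out + used, cap - used, "RParen\n");
            state = EXPECT_OPERAND;
            p++;
        } else if (state == EXPECT_OPERATOR && strchr("+-*/", *p) != NULL) {
            n = snprintf(out + used, cap - used, "Operator: %c\n", *p);
            state = EXPECT_OPERAND;
            p++;
        } else {
            /* Integer: optional '-', optional spaces, then 0 or [1-9][0-9]* */
            const char *q = p;
            int negative = 0;
            size_t digits = 0;

            if (*q == '-') {
                negative = 1;
                q++;
                while (*q == ' ')
                    q++;
            }
            if (*q == '0') {
                digits = 1;
            } else if (*q >= '1' && *q <= '9') {
                while (isdigit((unsigned char)q[digits]))
                    digits++;
            }
            if (digits == 0)
                return -1; /* not an integer: reject the input */
            n = snprintf(out + used, cap - used, "Integer: %s%.*s\n",
                         negative ? "-" : "", (int)digits, q);
            p = q + digits;
            state = EXPECT_OPERAND;
        }
        if (n < 0 || (size_t)n >= cap - used)
            return -1; /* output buffer too small */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With this sketch, the input "(- 5 - 3)" yields an operator "-", the integer 5 and the integer -3, because only the first minus sign follows a parenthesis. Without the parenthesis assumption the state alone cannot decide, which is why the separation is usually left to the parser.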

Just to recognize an optionally-signed integer all on one line, you could do that with an expression like

[-]?[[:blank:]]*[[:digit:]]+

However, your desired output does not combine the sign and the integer. So you would have separate regular expressions for those, e.g.,

[-]              { printf ("Operator: %s\n", yytext); }
0|([1-9][0-9]*)  { printf ("Integer: %s\n", yytext); }
[ \t\n]+         { /* skip whitespace between tokens */ }
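For readers who want to see what the operator and integer rules above do without running lex, the following is a rough hand-written C equivalent. It is only a sketch: the names `tok_kind`, `TOK_OP`, `TOK_INT`, `TOK_END`, `TOK_ERROR` and `next_token` are hypothetical and are not part of lex or yacc.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds for this sketch (hypothetical names). */
enum tok_kind { TOK_END, TOK_OP, TOK_INT, TOK_ERROR };

/*
 * Scan one token starting at *cursor, copy its text into buf
 * (capacity cap, truncated if needed), and advance *cursor past it.
 * Mirrors the rules:
 *   "-"              -> TOK_OP
 *   0|[1-9][0-9]*    -> TOK_INT
 */
enum tok_kind next_token(const char **cursor, char *buf, size_t cap)
{
    const char *p = *cursor;
    size_t len = 0;
    size_t n;
    enum tok_kind kind;

    assert(cap > 0);

    while (isspace((unsigned char)*p)) /* skip whitespace between tokens */
        p++;

    if (*p == '\0') {
        kind = TOK_END;
    } else if (*p == '-') {
        len = 1;
        kind = TOK_OP;
    } else if (*p == '0') {
        len = 1; /* zero is always a single digit */
        kind = TOK_INT;
    } else if (*p >= '1' && *p <= '9') {
        while (isdigit((unsigned char)p[len]))
            len++;
        kind = TOK_INT;
    } else {
        len = 1; /* consume one unexpected character */
        kind = TOK_ERROR;
    }

    n = len < cap ? len : cap - 1;
    memcpy(buf, p, n);
    buf[n] = '\0';
    *cursor = p + len;
    return kind;
}
```

Calling `next_token` repeatedly on "- 5 3" returns an operator "-", then the integers "5" and "3", then `TOK_END`, which matches the desired output in the question. Deciding whether a minus sign belongs to the following integer is left to whatever consumes these tokens.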
Thomas Dickey
  • I figured that it is the parser's job to distinguish the different uses of the "-" symbol, but I was just curious. Also, thanks for the regex. However, it doesn't cover the leading 0's condition. @Thomas Dickey – ronakshah725 Feb 14 '16 at 00:00