In lex, how do I differentiate between '-' (subtraction) operator and an integer '-3'?

Question

I am writing lex for a specific language where operations are carried out in prefix notation :

(+ a b) --> (a + b)

An integer is defined as follows : An integer can have a negative sign (–) but no positive sign. It can be with or without space(s) between the sign and the digits. If the value is 0, only a single digit 0 is allowed. Otherwise, it is the same as common integer definitions (no leading 0’s).

Order of expressions in the lex is as follows ( first match rule) :

Regex for integer :[\-]?[ ]*((0)|([1-9][0-9]*))
Regex for subtraction operator : "-"

With these definitions, I would like to parse the string - 5 3 ie. (5-3)

Current output

Integer : - 5, 

Integer : 3

Desired output:

Operator : '-'

Integer : 5

Integer : 3

Does your language also allow `-` as a unary operator? And is your language Lisp-like in that expressions must be parenthesized, or can you write `+ + 1 2 3`? — rici, Feb 13 '16 at 23:52
In the language, `-` is not a unary operator. It is a simple language for now, so I dont think I will be encountering `+ + 1 2 3` like expressions. — ronakshah725, Feb 13 '16 at 23:56
If the parentheses are required around operators and forbidden around integers, then you can distinguish between unary minus and the minus which is part of an integer, because `( - 5)` is not valid. In that case, the `-` which follows a `(` is always an operator and the `-` which follows an operator or a value must be part of an integer. If those assumptions are not correct, then it's going to be a bit tricky. — rici, Feb 14 '16 at 00:02
I am afraid, those assumptions cannot be made in my context. I think, I will let the parser handle it — ronakshah725, Feb 14 '16 at 00:05
OK. I suspect you'll run into ambiguities, so good luck. For example, `* - 5 - 3 1` could be `(* (-5) (- 3 1))` or `(* (- 5 (- 3)) 1)` — rici, Feb 14 '16 at 00:09
@ronakshah725 It's a unary operator. Otherwise you don't have a problem. — user207421, Feb 14 '16 at 03:33

score 7 · Accepted Answer · answered Feb 14 '16 at 03:19

7

You don't. You return - and INTEGER separately to the parser, and let the parser handle unary minus.

answered Feb 14 '16 at 03:19

user207421

305,947
44
307
483

Thomas Dickey · Answer 2 · 2016-02-14T00:08:39.507

The lexer does not have to do that: normally the parser (such as one written using yacc) gets tokens for the minus sign and the integer separately. The parser combines the two according to the rules which you provide.

For simple grammars, you can make the lexer do parsing, by using states (also known as start conditions). In your example, there would be states for the left/right parenthesis nesting. If your grammar allows a line-break between a minus sign and an integer, you would need a state to show that you have a minus sign.

Just to recognize an optionally-signed integer all on one line, you could do that with an expression like

[-]?[[:space:]]*[[:digit:]]+

However, your desired output does not combine the sign and the integer. So you would have separate regular expressions for those, e.g.,

[-]           { printf ("Operator: %s\n", yytext); }
0|([1-9][0-9]*)  { printf ("Integer: %s\n", yytext); }

I figured that it is work of the parser, to differentiate the appropriate use of the "-" symbol. But, was just curious. Also, thanks for the regex. However it doesn't cover the leading 0's condition @Thomas Dickey — ronakshah725, Feb 14 '16 at 00:00

In lex, how do I differentiate between '-' (subtraction) operator and an integer '-3'?

2 Answers2

Linked