I am making a simple programming language. It has the following grammar:
program: declaration+;
declaration: varDeclaration
| statement
;
varDeclaration: 'var' IDENTIFIER ('=' expression)?';';
statement: exprStmt
| assertStmt
| printStmt
| block
;
exprStmt: expression';';
assertStmt: 'assert' expression';';
printStmt: 'print' expression';';
block: '{' declaration* '}';
//expression without left recursion
/*
expression: assignment
;
assignment: IDENTIFIER '=' assignment
| equality;
equality: comparison (op=('==' | '!=') comparison)*;
comparison: addition (op=('>' | '>=' | '<' | '<=') addition)* ;
addition: multiplication (op=('-' | '+') multiplication)* ;
multiplication: unary (op=( '/' | '*' ) unary )* ;
unary: op=( '!' | '-' ) unary
| primary
;
*/
//expression with left recursion
expression: IDENTIFIER '=' expression
| expression op=('==' | '!=') expression
| expression op=('>' | '>=' | '<' | '<=') expression
| expression op=('-' | '+') expression
| expression op=( '/' | '*' ) expression
| op=( '!' | '-' ) expression
| primary
;
primary: intLiteral
| booleanLiteral
| stringLiteral
| identifier
| group
;
intLiteral: NUMBER;
booleanLiteral: value=('True' | 'False');
stringLiteral: STRING;
identifier: IDENTIFIER;
group: '(' expression ')';
TRUE: 'True';
FALSE: 'False';
NUMBER: [0-9]+ ;
STRING: '"' ~('\n'|'"')* '"' ;
IDENTIFIER : [a-zA-Z]+ ;
This left recursive grammar is useful because it ensures every node in the parse tree has at most 2 children. For example,
var a = 1 + 2 + 3
will turn into two nested addition expressions, rather than one addition expression with three children. That behavior is useful because it makes writing an interpreter easy, since I can just do (highly simplified):
public Object visitAddition(AdditionContext ctx) {
return visit(ctx.addition(0)) + visit(ctx.addition(1));
}
instead of iterating through all the child nodes.
However, this left recursive grammar has one flaw, which is that it accepts invalid statements. For example:
var a = 3;
var b = 4;
a = b == b = a;
is valid under this grammar even though the expected behavior would be
b == b
is parsed first since==
has higher precedence than assignment (=
).- Because
b == b
is parsed first, the expression becomes incoherent. Parsing fails.
Instead, the following undesired behavior occurs: the final line is parsed as (a = b) == (b = a).
How can I prevent left recursion from parsing incoherent statements, such as a = b == b = a
?
The non-left-recursive grammar recognizes this input is correct and throws a parsing error, which is the desired behavior.