I have the following expression notation:
expr
: OpenParen expr (Comma expr)* Comma? CloseParen # parenExpr
| OpenParen simpleSelect CloseParen # subSelectExpr
Unfortunately, a simpleSelect
can also have a parenthetical around it, and so the following statement becomes ambiguous:
select ((select 1))
Here is the current grammar that I have, simplified down to only showing the issue:
grammar Subselect;
options { caseInsensitive=true; }
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
expr
: '(' expr ')' # parenExpr
| '(' query_expr ')' # subSelect
| Atom # identifier
;
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
And on the parse of select ((select 1))
, here is the output:
What would be a possible way to disambiguate this?
I suppose the main thing is here:
'(' query_statement ')'
Since that recursively calls itself -- is there a way to do indirection or something else such that a query_statement
called from within parens can never itself have parens?
Also, maybe this is a common thing? I get the same ambiguous output when running this on the official MySQL grammar here:
I would be curious whether any of the grammars can solve the issue here: https://github.com/antlr/grammars-v4/tree/master/sql. Maybe the best approach is just to remove duplicate parens before parsing the text? (If so, are there are good tools to do that, or do I need to write an additional antlr parser just to do that?)