How to disambiguate a subselect from a parenthesized expression?

Question

I have the following expression notation:

expr
    : OpenParen expr (Comma expr)* Comma? CloseParen           # parenExpr
    | OpenParen simpleSelect CloseParen                        # subSelectExpr

Unfortunately, a simpleSelect can also have a parenthetical around it, and so the following statement becomes ambiguous:

 select ((select 1))

Here is the current grammar that I have, simplified down to only showing the issue:

grammar Subselect;
options { caseInsensitive=true; }
statement: query_statement EOF;

query_statement
   : query_expr # simple
   | query_statement set_op query_statement # set
   ;

query_expr
    : with_clause?
    ( select | '(' query_statement ')' )
      limit_clause?
    ;

select
    : select_clause
     (from_clause
      where_clause?)?
    ;

with_clause: 'WITH' expr 'AS' '(' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';

expr
    : '(' expr ')'                      # parenExpr
    | '(' query_expr ')'                # subSelect
    | Atom                              # identifier
    ;

Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;

And on the parse of select ((select 1)), here is the output:

What would be a possible way to disambiguate this?

I suppose the main thing is here:

'(' query_statement ')'

Since query_statement recurses into itself through that alternative, is there a way to add indirection (or something else) so that a query_statement reached from inside parens can never itself be wrapped in parens?

Also, maybe this is a common thing? I get the same ambiguous output when running this on the official MySQL grammar here:

I would be curious whether any of the grammars here can solve the issue: https://github.com/antlr/grammars-v4/tree/master/sql. Maybe the best approach is just to remove duplicate parens before parsing the text? (If so, are there any good tools to do that, or do I need to write an additional ANTLR parser just for that?)
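For the simplified grammar above, stripping doubled parentheses before parsing should not need a second ANTLR parser; a small stack-based pass is probably enough. The following is a minimal sketch in Python, not a vetted tool. The function name collapse_redundant_parens is made up for illustration, and the sketch assumes the input has no string literals or comments containing parentheses. It also assumes doubled parentheses carry no meaning, which holds for this grammar but not for SQL dialects where f((a, b)) differs from f(a, b).

```python
def collapse_redundant_parens(sql: str) -> str:
    """Drop every parenthesis pair that wraps nothing but another pair."""
    match = {}   # index of each '(' -> index of its matching ')'
    stack = []   # indexes of '(' still waiting for a ')'
    for i, ch in enumerate(sql):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at offset {i}")
            match[stack.pop()] = i
    if stack:
        raise ValueError(f"unbalanced '(' at offset {stack[-1]}")

    drop = set()
    for open_idx, close_idx in match.items():
        # First non-whitespace position after the '('.
        inner_open = open_idx + 1
        while inner_open < close_idx and sql[inner_open].isspace():
            inner_open += 1
        # Last non-whitespace position before the ')'.
        inner_close = close_idx - 1
        while inner_close > open_idx and sql[inner_close].isspace():
            inner_close -= 1
        # The outer pair is redundant if its whole content is one inner pair.
        if sql[inner_open] == "(" and match.get(inner_open) == inner_close:
            drop.update((open_idx, close_idx))

    return "".join(ch for i, ch in enumerate(sql) if i not in drop)


print(collapse_redundant_parens("select ((select 1))"))
```

A pair such as the outer one in ((a), (b)) is left alone, because its content is two groups rather than a single inner pair.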

score 1 · Answer 1 · answered Aug 28 '22 at 22:52

Your input generates this parse tree:

That's a reasonable interpretation of your input and it is identified as a subSelect expr. It's a subSelect nested in a parenExpr (both of which are exprs).

If I swap the first two alternatives of your rule:

expr: '(' query_expr ')' # subSelect
    | '(' expr ')'       # parenExpr
    | Atom               # identifier
    ;

Now it's a subSelect that interprets the nested (select 1) as a query expression.

It's ambiguous because the outer parenthesized expression could match either of the first two alternatives, resulting in different parse trees.

In ANTLR, an ambiguity between alternatives is resolved by "using" the first (lowest-numbered) alternative that matches. This gives ANTLR deterministic behavior, and you control which interpretation is used through the order of the alternatives. It's not uncommon for ANTLR grammars to have ambiguities like this.
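To make the two readings concrete, here is a toy sketch in Python that enumerates every parse of the expression ((select 1)) under a cut-down version of the grammar. It is only an illustration under assumptions: it is an exhaustive enumerator, not ANTLR's actual prediction algorithm; query_expr is reduced to select | '(' query_expr ')' and select to 'SELECT' expr; the names parse_expr, parse_query_expr, all_parses and the parenQuery label are invented here. Taking the first listed tree is a stand-in for ANTLR's "first alternative wins" rule.

```python
def tokenize(text):
    # Pad parentheses with spaces so str.split() yields them as separate tokens.
    return text.lower().replace("(", " ( ").replace(")", " ) ").split()


def at(toks, i, tok):
    return i < len(toks) and toks[i] == tok


def parse_expr(toks, i, order):
    """Return every (tree, next_index) reading of an expr starting at toks[i]."""
    results = []
    for alt in order:  # alternatives are tried in the given order
        if alt == "parenExpr" and at(toks, i, "("):
            for tree, j in parse_expr(toks, i + 1, order):
                if at(toks, j, ")"):
                    results.append((("parenExpr", tree), j + 1))
        elif alt == "subSelect" and at(toks, i, "("):
            for tree, j in parse_query_expr(toks, i + 1, order):
                if at(toks, j, ")"):
                    results.append((("subSelect", tree), j + 1))
        elif (alt == "identifier" and i < len(toks)
              and toks[i] not in ("(", ")", "select")):
            results.append((("identifier", toks[i]), i + 1))
    return results


def parse_query_expr(toks, i, order):
    """query_expr reduced to: select | '(' query_expr ')'."""
    results = []
    if at(toks, i, "select"):
        for tree, j in parse_expr(toks, i + 1, order):
            results.append((("select", tree), j))
    if at(toks, i, "("):
        for tree, j in parse_query_expr(toks, i + 1, order):
            if at(toks, j, ")"):
                results.append((("parenQuery", tree), j + 1))
    return results


def all_parses(text, order):
    toks = tokenize(text)
    # Keep only the readings that consume the whole input.
    return [tree for tree, j in parse_expr(toks, 0, order) if j == len(toks)]


for order in (("parenExpr", "subSelect", "identifier"),
              ("subSelect", "parenExpr", "identifier")):
    trees = all_parses("((select 1))", order)
    print(len(trees), "trees; first listed:", trees[0])
```

Both orderings find the same set of trees; only the one listed first changes, which mirrors how reordering the alternatives changes the tree ANTLR reports.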

IMHO, the ambiguity report in the IntelliJ plugin has led many people to read this as an indication that something is "wrong" with the grammar. There's a reason that ANTLR itself does not report an error in this situation: it has defined, deterministic behavior.

As for "resolving" this ambiguity: the syntax uses parentheses to indicate two different "things", which makes it inherently ambiguous, so I don't believe you can "fix" the grammar ambiguity without modifying the syntax. (I might be wrong about this, and would find it interesting if someone provides a refactoring that manages to remove the ambiguity.)

thanks for the feedback on this! It seems that the parsing also runs about 100x slower on a statement of this type, so I think the 'ambiguity' (or whatever we want to call it) is somehow getting translated into the output program. — David542, Aug 28 '22 at 23:00
