How can semantic predicates use early information from listeners with ANTLR 4?

Question

I have a parser based on ANTLR 4 and using listeners, not visitors. It already recognizes and stores the declaration of functions, variables and so on.

I'm trying to resolve some grammar ambiguities with semantic predicates, for instance to separate a function call from an array/vector access when parsing VHDL source code. This is important in order to avoid further complications in the full grammar.

In the following example:

3 + j * f(i)

f(i) could be either a function f with parameter i or an array f accessed by index i. The simplified grammar below shows how predicates could help resolve that ambiguity:

expression:
    expression OPERATOR expression | simple_expression;
simple_expression:
    function_expression | array_expression | ID | NUMBER;
function_expression:
    {is_function()}? ID '(' expression_list ')';
array_expression:
    {is_array()}? ID '(' expression ')';
expression_list:
    expression ( ',' expression )*;

The listeners parse the declarations and store function and array identifiers in a database, which makes it possible to tell whether an identifier ID is a function, an array, or undeclared (I'm not showing the grammar for those declarations here, to keep it simple).

The predicates would be defined at the top of the grammar file, for example:

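@parser::members {
    // Database of declared identifiers, filled in by the declaration listeners.
    Definitions defs;

    // Each predicate sits just before ID in its rule, so the current token
    // is the identifier being tested.
    boolean is_function() {
        return defs.isFunction(getCurrentToken().getText());
    }

    boolean is_array() {
        return defs.isArray(getCurrentToken().getText());
    }
}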
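
The Definitions class itself is not shown above. As a rough sketch only, it could be as simple as two sets of names. The declareFunction and declareArray methods are hypothetical names for whatever the declaration listeners would call, and the lower-casing assumes plain VHDL basic identifiers, which are case-insensitive.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the identifier database queried by the predicates. */
class Definitions {
    private final Set<String> functions = new HashSet<>();
    private final Set<String> arrays = new HashSet<>();

    // Assumed entry points for the declaration listeners (names are illustrative).
    void declareFunction(String name) { functions.add(normalize(name)); }
    void declareArray(String name)    { arrays.add(normalize(name)); }

    // Lookups used by is_function() and is_array().
    boolean isFunction(String name) { return functions.contains(normalize(name)); }
    boolean isArray(String name)    { return arrays.contains(normalize(name)); }

    // Basic identifiers only; extended identifiers (\like this\) are
    // case-sensitive in VHDL and are not handled by this sketch.
    private static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT);
    }
}
```

An identifier that is in neither set is reported as neither a function nor an array, which corresponds to the "undeclared" case.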

However, I cannot use that information in the predicates because they are called too early, before the declaration listeners run and build the ID database. If I put a System.out.print in those functions and in the listeners, I see that

the expression predicates are first called on the entire file being parsed,
and only then are all the declaration listeners called, even though the declarations are before these expressions in the file.

I'm aware the parser is looking ahead, but is there a way to trigger the declaration listeners as early as possible, so that their information is ready for the predicates on expressions in the rest of the file?

Or is that the wrong way to use the predicates? I would like to avoid source code in the grammar as much as possible, like a work-around that stores preliminary information during the parsing of declarations with code embedded in the grammar file. And a 2-pass parser seems a bit awkward.

GRosenberg · Answer 1 · 2018-04-07T20:15:30.953

The problem, as implicitly recognized, is that the statement

3 + j * f(i)

is ambiguous.

Given that the parser runs to completion before a tree-walk is performed, a tree-walker has no way to inform semantic predicates of semantic decisions made in the walker.

A better approach is to recognize that the parser can only distinguish syntax. Consequently, the grammar might be written:

expression
    : expression OPERATOR expression                       #op
    | ID LPAREN expression ( COMMA expression )* RPAREN    #simple
    | ID                                                   #id
    | NUMBER                                               #num
    ;

Now walk the parse tree to annotate the existing nodes with deduced semantic information, e.g., whether a given SimpleContext node (the context class ANTLR generates for the #simple alternative) represents a function or an array. Annotation can be done using ParseTreeProperty.

Preferably, use multiple walks, each focused on one distinct aspect of semantic analysis, either independent of or building on the results of prior walks. (Each walk is relatively cheap to execute, permits separation of concerns, and improves maintainability.)

It is not uncommon to have several preparatory walks, a symbol-table build walk, an evaluation walk, and an output walk.

And keep little or no native code or complicated predicates in the parser grammar.
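
As a rough, stand-alone sketch of the two-walk idea: the first walk builds the symbol table, and the second flags each ID '(' ... ')' node as a function call or an array access. It deliberately does not use the ANTLR runtime. Node, Kind, the rule names and classify are hypothetical stand-ins for the generated context classes and listeners, and the IdentityHashMap plays the role that ParseTreeProperty would play with real parse-tree nodes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class CallOrIndexDemo {
    enum Kind { FUNCTION, ARRAY, UNDECLARED }

    /** Hypothetical stand-in for a parse-tree node. */
    static final class Node {
        final String rule;   // "file", "functionDecl", "arrayDecl" or "simple"
        final String id;     // identifier text, or null for nodes without one
        final List<Node> children;

        Node(String rule, String id, Node... children) {
            this.rule = rule;
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    // Walk 1: record every declared function and array name.
    static void collectDeclarations(Node n, Map<String, Kind> symbols) {
        if (n.rule.equals("functionDecl")) symbols.put(n.id, Kind.FUNCTION);
        else if (n.rule.equals("arrayDecl")) symbols.put(n.id, Kind.ARRAY);
        for (Node c : n.children) collectDeclarations(c, symbols);
    }

    // Walk 2: annotate each ambiguous ID '(' ... ')' node with what it turned out to be.
    static void annotate(Node n, Map<String, Kind> symbols, Map<Node, Kind> kinds) {
        if (n.rule.equals("simple")) {
            kinds.put(n, symbols.getOrDefault(n.id, Kind.UNDECLARED));
        }
        for (Node c : n.children) annotate(c, symbols, kinds);
    }

    // Runs both walks in order and returns the per-node annotations.
    static Map<Node, Kind> classify(Node root) {
        Map<String, Kind> symbols = new HashMap<>();
        collectDeclarations(root, symbols);
        Map<Node, Kind> kinds = new IdentityHashMap<>(); // keyed by node identity
        annotate(root, symbols, kinds);
        return kinds;
    }

    public static void main(String[] args) {
        Node callF = new Node("simple", "f");
        Node indexV = new Node("simple", "v");
        Node unknown = new Node("simple", "g");
        Node file = new Node("file", null,
                new Node("functionDecl", "f"),
                new Node("arrayDecl", "v"),
                callF, indexV, unknown);

        Map<Node, Kind> kinds = classify(file);
        System.out.println(kinds.get(callF) + " " + kinds.get(indexV) + " " + kinds.get(unknown));
        // prints "FUNCTION ARRAY UNDECLARED"
    }
}
```

Because the first walk finishes before the second starts, the classification does not depend on where in the file a declaration appears relative to its use.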

Thanks, but that does not solve the problem. I'm already annotating the tree and extracting declarations among other things, but keeping the ambiguity and resolving it *after* parsing is not a good option. It forces me to change and merge two entire subtrees of the grammar that are normally distinct (for function calls and array subrange selections), which ends up in an approximate and obscure grammar. It also makes the listener's source code much more complicated and confusing, though this last part may perhaps be addressed with different walkers (not sure how easy that would be). — RedGlyph, Apr 08 '18 at 12:19
'Fixing' an ambiguous language is always problematic. Of course, creating separate function and array CST nodes would be ideal. Absent some sufficiently simple & efficient way to distinguish the types by way of semantic predicates, the next best way is to flag a common function/array node as discretely one or the other. In either case, the walker's logic will be nearly the same -- the latter just requiring a bit of additional code structuring. I wish there were a more elegant solution, too. — GRosenberg, Apr 08 '18 at 22:39
