I have an ANTLR 3 parser rule that looks like this:

ruleA:
   (TOKEN_1) => TOKEN_1 ruleToken1
   | (TOKEN_2) => TOKEN_2 ruleToken2
   ....
   ....
   <many more such rules>
   | genericRuleA
;

Assume for a moment that the tokens and rules are appropriately defined. Also, genericRuleA is defined in a way that it behaves like a "catch-all" for anything that falls through the rule elements above it.

For example, the subrules above genericRuleA correspond to named functions, and ruleToken1 and ruleToken2 capture how those named functions are called. genericRuleA would capture any other function that is not a named function (e.g. a user-defined function).

The net effect of this grammar is that if ruleA finds a TOKEN_1, it takes the ruleToken1 path and reports an error if the rest of the input does not satisfy ruleToken1.

In ANTLR 4, I would end up with the following parser rule after taking out the syntactic predicates:

ruleA:
    TOKEN_1 ruleToken1
    | TOKEN_2 ruleToken2
    ....
    ....
    <many more such rules>
    | genericRuleA
;

This works well, except that if the rest of ruleToken1 fails to match, the parser automatically picks genericRuleA as the preferred path. The side effect is that named functions can now be "overloaded". That may be useful in some situations, but in my situation the requirement is expressly to disallow this overloading: named functions must conform to the specific structure laid out in ruleToken1, and the parser must report an error if that structure is violated. The system for which this grammar is being written does not support overloaded functions.

genericRuleA must cater to anything other than named functions.

My first question: Is there a standard way to implement this conversion?

One approach I have seen is to create a list of the tokens that correspond to the named functions (TOKEN_1, TOKEN_2, etc.) and add a function in @parser::members that returns true if the next input token is in this list. Assuming isNamedFunction() is appropriately defined, the rule would look something like:

ruleA:
    TOKEN_1 ruleToken1
    | TOKEN_2 ruleToken2
    | {!isNamedFunction()}? genericRuleA
;
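For concreteness, here is a minimal sketch of what such a membership check could look like in Java. Everything in it is hypothetical: the class name NamedFunctionTokens, the token type numbers, and the IDENTIFIER token are made up for illustration, and in a real grammar the set would live in @parser::members rather than in a standalone class.

```java
import java.util.BitSet;

/** Hypothetical helper; in a real grammar this logic would sit in @parser::members. */
final class NamedFunctionTokens {
    // Made-up token type numbers. A generated parser would expose its own
    // constants for TOKEN_1, TOKEN_2, and so on.
    static final int TOKEN_1 = 1;
    static final int TOKEN_2 = 2;
    static final int IDENTIFIER = 3;

    // One bit per token type that starts a named function.
    private final BitSet named = new BitSet();

    NamedFunctionTokens(int... tokenTypes) {
        for (int t : tokenTypes) {
            named.set(t);
        }
    }

    /** Returns true if the given lookahead token type starts a named function. */
    boolean isNamedFunction(int lookaheadType) {
        // EOF is a negative token type in ANTLR, and BitSet rejects negative indexes.
        return lookaheadType >= 0 && named.get(lookaheadType);
    }
}
```

Inside the parser, the predicate would pass the type of the next token, which in the Java target should be available as _input.LA(1), so the alternative would read something like {!isNamedFunction(_input.LA(1))}? genericRuleA.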

This might work when you have only a few named functions, but if you have potentially hundreds (think of the built-in functions in T-SQL, for example), that list would be pretty cumbersome to build. Not to mention that as new named functions are added, I would have to keep updating that list.

So my follow-up question is: Assuming the semantic predicate approach outlined above is the right way to do this, is there a way to programmatically assemble the list?

One approach that appears to hold promise is to build this from getExpectedTokens(). Here I would re-structure the rule a little bit so all named functions fall under one rule (let's say we call it namedRuleA) like so:

ruleA:
    namedRuleA
    | genericRuleA
;

namedRuleA:
   TOKEN_1 ruleToken1
   | TOKEN_2 ruleToken2
   ....
   ....
   <many more such rules>
;

Then, starting from ruleA's ATNState [recognizer.getATN().states.get(recognizer.getState())], I would walk down the transitions until I arrive at namedRuleA's ATNState, and then call getATN().getExpectedTokens(<stateNumber of namedRuleA's ATNState>, null) to get an IntervalSet that corresponds to the token types of TOKEN_1, TOKEN_2, etc.

This appears to yield the right set of tokens, and it seems to make sense because (as I understand it) the ATN states and their transitions are known and fixed at transpile time, i.e. when the .g4 grammar is translated to .java (or whatever your target language is).
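To make the idea of the walk concrete, here is a toy model in plain Java. It does not use the ANTLR runtime: FirstTokenWalk, Edge, and the EPSILON marker are all hypothetical stand-ins, and the graph is a hand-built map rather than a real ATN. It only shows the core step, which is to follow transitions that consume no input and collect the first token type seen on each path.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for an ATN; none of these types come from the ANTLR runtime. */
final class FirstTokenWalk {
    /** Marks an edge that consumes no input (a convention made up for this model). */
    static final int EPSILON = -2;

    /** One outgoing transition: a token type (or EPSILON) and the target state. */
    static final class Edge {
        final int label;
        final int target;

        Edge(int label, int target) {
            this.label = label;
            this.target = target;
        }
    }

    /** Collects the token types reachable from start through epsilon edges only. */
    static Set<Integer> firstTokens(Map<Integer, List<Edge>> graph, int start) {
        Set<Integer> tokens = new TreeSet<>();
        Set<Integer> visited = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            int state = work.pop();
            if (!visited.add(state)) {
                continue; // state already expanded
            }
            for (Edge e : graph.getOrDefault(state, Collections.<Edge>emptyList())) {
                if (e.label == EPSILON) {
                    work.push(e.target); // no token consumed, keep walking
                } else {
                    tokens.add(e.label); // first consumable token on this path
                }
            }
        }
        return tokens;
    }
}
```

One thing this model leaves out is rule invocation: if any alternative of namedRuleA began with a rule reference, or with something that can match empty input, a hand-written walk would have to descend into that rule or continue past it. As far as I know, the Java runtime's ATN.nextTokens(ATNState), applied to the entry of atn.ruleToStartState for namedRuleA, performs this computation on the real ATN, which may avoid the manual walk entirely. Since the set depends only on the grammar, it could be computed once and cached.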

I know the transitions are sometimes determined dynamically through closure() calls, e.g. when there are semantic predicates involved, but let's assume I can ensure that no semantic predicates are used in namedRuleA.

I just wanted to get a sense of whether anybody else has tried this approach and whether there might be a gotcha that I'm completely missing.

Cod.ie