Antlr4: How to pass current token's value to lexer's predicate?

Question

Is there a way to provide a lexer predicate with the current token's value? For instance, in my lexer grammar FlowLexer, the set of valid tokens is only known at run time.

Before I parse, I load the tokens dynamically:

var lexer = new FlowLexer(new AntlrInputStream(flowContent)) {
    TokenExists = tokenValue => tokensDictionary.ContainsKey(tokenValue)
};

And then during parsing/lexing, the TokenExists predicate is called:

@lexer::members{
    public Func<string,bool> TokenExists = null;
}

/* ... stuff ... */

TOK : [-_.0-9a-zA-Z]+ 
    {!TokenExists(/*WHAT GOES HERE?*/);}? 
    -> mode(IN_TOKEN);

/* ... stuff ... */

But how do I pass the token value to the TokenExists predicate?

(This is an attempt to create a context-aware lexer: I have several modes, and each one has different rules.)

@Zinov Because I load my tokens from an external source (file/DB/...). – Tar, Jul 09 '19 at 18:00
Normally, when you create the grammar, you should know all the tokens in advance. Are you saying that you load them dynamically because you don't know the structure of the grammar? – Zinov, Jul 09 '19 at 18:14
@Zinov When the grammar is designed, it is possible to define, say, employees : 'peter' | 'paul' ....; At run time, when you want to validate an input phrase, there could be more employees, and maybe 'peter' has left the company. You have defined the 'structure' of the grammar, but its 'content' may change at runtime. IMHO. – peter.cyc, Jul 27 '20 at 17:17
@peter.cyc I agree with you, but that should happen on the parser side, with your AST, while you are visiting the values of your nodes. If you designed your grammar correctly, you shouldn't have to worry about a new employee; your production should take care of it, for example Employees -> name, where name is a terminal. Design the grammar well and you will have every value in your AST. There is no need to go down to the lexer level to address this issue. – Zinov, Jul 27 '20 at 18:47

Mike Lischke · Accepted Answer · 2019-07-10T07:38:54.990

Accessing token values in ANTLR4 predicates and actions is possible with a special syntax. For details see the Actions and Attributes doc.

In general, you access a parsed token in a parser rule by using a dollar sign followed by the token's label, like

a: x = INT {$x.text == "0"}?;

or, without a label, by using the token name (only if that token is referenced just once in the parser rule):

a: INT {$INT.text == "0"}?;

ANTLR4 translates such pseudo code into target language code to allow accessing token properties (e.g. in C++ this becomes: INT->getText() == "0").

In lexer rules, however, this special access is not possible (ANTLR3 supported it, but ANTLR4 does not). You can still reach the token's properties with native code. Strictly speaking there is no token yet at that point, only the values from which it will be created once the lexer rule has finished. Such code is usually not portable to other target languages, which doesn't matter if you generate for a single target only.

The code in a lexer action (which includes predicates) is executed in the context of the lexer. The lexer keeps the values from which the new token will be created after the rule has ended, which lets you read the currently matched text:

TOK : [-_.0-9a-zA-Z]+ {!TokenExists(Text)}? -> mode(IN_TOKEN);

Text is a property of the C# lexer.
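
One caveat: a predicate placed after a `+` loop can be evaluated at every point where the loop could stop, so `Text` may hold only a prefix of the final lexeme (`p`, `pr`, `pri`, ...). If the check only needs to decide the mode, and not whether the rule matches at all, a possible alternative is to move it into a plain action, which as far as I know runs once after the whole token has been matched. The following is a sketch under that assumption, not a drop-in replacement: unlike the predicate version, TOK always matches here and only the mode switch is conditional. The `ANY` rule is a hypothetical placeholder for whatever IN_TOKEN really contains.

```antlr
lexer grammar FlowLexer;

@lexer::members {
    // Same delegate as in the question; the caller assigns it before lexing.
    public Func<string, bool> TokenExists = null;
}

// Plain action instead of a predicate: it is expected to run after the
// complete token text has been matched, so Text holds the whole lexeme.
TOK : [-_.0-9a-zA-Z]+
    { if (!TokenExists(Text)) Mode(IN_TOKEN); }
    ;

mode IN_TOKEN;

// Hypothetical placeholder; the real rules of this mode are not shown above.
ANY : . ;
```

`Mode(...)` is the C# lexer's counterpart of the `-> mode(...)` command. If the rule must not match at all for known tokens, the predicate is still needed, and it then has to tolerate being called with partial text.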

edited Jul 10 '19 at 07:38

answered Jul 10 '19 at 07:14

If I change it to `{!TokenExists($TOK.text)}?`, I get `error AC0128: attribute references not allowed in lexer actions: $TOK.text` – Tar Jul 10 '19 at 07:22
Sorry, I forgot that in ANTLR4 it is no longer allowed to use the attributes syntax. I edited my answer accordingly. – Mike Lischke Jul 10 '19 at 07:39
Thanks! What I ended up doing is: `{!TokenExists(this)}?`, defining `TokenExists` as `public Func` (then using the lexer object as `lexer.Text`). The problem is that it gives me char by char now, rather than as a whole. For instance instead of `print` token string, I get `p`, then `pr`, then `pri`, `prin` and `print`, and I can't distinguish between e.g. `print` and `print_line`. Is there a way around it? – Tar Jul 10 '19 at 10:16
This is by design, as you could also use the predicate to match only a specific length of the text. – Mike Lischke Jul 10 '19 at 12:48
