How can I keep concatenated tokens separate during lexing when a more general token is available?

Question

The language I'm working on allows certain tokens to be stuck together (e.g. "intfloat"), and I'm looking for a way to keep the lexer from turning them into an ID so that they're available separately at parse time. The simplest grammar I can come up with that demonstrates the problem is (WS omitted):

B: 'b';
C: 'c';
ID: ('a'..'z')+;
doc : (B | C | ID)* EOF;

Run against:

bc
abc
bcd

What I'd like out of the lexer:

B C
ID (starts with not-a-keyword so it's an ID)
<error> (cannot concat non-keywords)

But what I get is 3 IDs, as expected.

I have looked at making ID non-greedy, but that degenerates into one token per character. I suppose I could glue them back together later, but it feels like there should be a better way.

Any thoughts?

Thanks

user1201210 · Answer 1 · 2013-01-09T22:17:45.730

Here's a start towards a solution, using the lexer to break up the text into tokens. The trick here is that rule ID can emit more than one token per invocation. This is non-standard lexer behavior, so there are some caveats:

I'm confident that this won't work in ANTLR4.
This code assumes all tokens are queued into tokenQueue.
Rule ID doesn't prevent a keyword from repeating, so intintint produces tokens INT INT INT. If that's bad, you'll want to handle that either on the lexer or parser side, depending on which makes more sense in your grammar.
The shorter the keyword, the more fragile this solution becomes. Input internal is an invalid ID because it starts with keyword int but is followed by a non-keyword string.
The grammar produces warnings that I haven't expunged. If you use this code, I recommend attempting to remove them.

Here is the grammar:

MultiToken.g

grammar MultiToken;


@lexer::members{
    private java.util.LinkedList<Token> tokenQueue = new java.util.LinkedList<Token>();

    @Override
    public Token nextToken() {
        Token t = super.nextToken();
        if (tokenQueue.isEmpty()) {
            if (t.getType() == Token.EOF) {
                return t;
            } else {
                throw new IllegalStateException("All tokens must be queued!");
            }
        } else {
            return tokenQueue.removeFirst();
        }
    }

    public void emit(int ttype, int tokenIndex) {
        //This is lifted from ANTLR's Lexer class, 
        //but modified to handle queueing and multiple tokens per rule.
        Token t;

        if (tokenIndex > 0){
            CommonToken last = (CommonToken) tokenQueue.getLast();
            t = new CommonToken(input, ttype, state.channel, last.getStopIndex() + 1, getCharIndex() - 1);
        } else { 
            t = new CommonToken(input, ttype, state.channel, state.tokenStartCharIndex, getCharIndex() - 1);
        }

        t.setLine(state.tokenStartLine);
        t.setText(state.text);
        t.setCharPositionInLine(state.tokenStartCharPositionInLine);
        emit(t);
    }

    @Override
    public void emit(Token t){
        super.emit(t);
        tokenQueue.addLast(t);
    }
}

doc     : (INT | FLOAT | ID | NUMBER)* EOF;

fragment
INT     : 'int';

fragment
FLOAT   : 'float';

NUMBER  : ('0'..'9')+;

ID  
@init {
    int index = 0; 
    boolean rawId = false;
    boolean keyword = false;
}
        : ({!rawId}? INT {emit(INT, index++); keyword = true;}
            | {!rawId}? FLOAT {emit(FLOAT, index++); keyword = true;}
            | {!keyword}? ('a'..'z')+ {emit(ID, index++); rawId = true;} 
          )+
        ;

WS      : (' '|'\t'|'\f'|'\r'|'\n')+ {skip();};
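As a cross-check on what rule ID is meant to accept, here is a rough plain-Java model of the classification. It is only a sketch, not ANTLR code: the class name KeywordRunSketch and the method classify are made up for illustration, and it models the intended outcome for one run of letters, not ANTLR's prediction or error recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the grammar above.
class KeywordRunSketch {
    static final String[] KEYWORDS = {"int", "float"};

    /** Returns the keyword that starts at pos, or null if none does. */
    static String keywordAt(String run, int pos) {
        for (String kw : KEYWORDS) {
            if (run.startsWith(kw, pos)) {
                return kw;
            }
        }
        return null;
    }

    /**
     * Classifies one run of lowercase letters.
     * All keywords: one token type per keyword. No leading keyword: a single ID.
     * Keywords followed by other text: rejected.
     */
    static List<String> classify(String run) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        String kw;
        while ((kw = keywordAt(run, pos)) != null) {
            types.add(kw.toUpperCase(Locale.ROOT));
            pos += kw.length();
        }
        if (pos == run.length()) {
            return types;               // the whole run was keywords
        }
        if (pos == 0) {
            types.add("ID");            // starts with a non-keyword, so plain ID
            return types;
        }
        throw new IllegalArgumentException(
            "keyword followed by non-keyword text at index " + pos + ": " + run);
    }
}
```

Under this model, intfloat gives INT FLOAT, aintfloat gives a single ID, and internal is rejected, which is the same classification the test cases below exercise.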

Test Case 1: Mixed Keywords

Input

intfloat a
int b
float c
intfloatintfloat d

Output (Tokens)

[INT : int] [FLOAT : float] [ID : a] 
[INT : int] [ID : b]
[FLOAT : float] [ID : c] 
[INT : int] [FLOAT : float] [INT : int] [FLOAT : float] [ID : d]

Test Case 2: Ids containing Keywords

Input

aintfloat
bint
cfloat
dintfloatintfloat

Output (Tokens)

[ID : aintfloat] 
[ID : bint] 
[ID : cfloat] 
[ID : dintfloatintfloat]

Test Case 3: Bad Id #1

Input

internal

Output (Tokens & Lexer Error)

[INT : int] [ID : rnal] 
line 1:3 rule ID failed predicate: {!keyword}?

Test Case 4: Bad Id #2

Input

floatation

Output (Tokens & Lexer Error)

[FLOAT : float] [ID : tion] 
line 1:5 rule ID failed predicate: {!keyword}?

Test Case 5: Non-ID Rules

Input

int x
float 3 float 4 float 5
5 a 6 b 7 int 8 d

Output (Tokens)

[INT : int] [ID : x] 
[FLOAT : float] [NUMBER : 3] [FLOAT : float] [NUMBER : 4] [FLOAT : float] [NUMBER : 5] 
[NUMBER : 5] [ID : a] [NUMBER : 6] [ID : b] [NUMBER : 7] [INT : int] [NUMBER : 8] [ID : d]

Shouldn't nextToken only call super.nextToken if the token queue is empty? That should solve the problem of normal tokens disappearing. — Troy Daniels, Jan 09 '13 at 18:16
@TroyDaniels `super.nextToken` manages the call to the lexer rules. If it isn't called, `tokenQueue` won't get populated and `nextToken` won't have anything to return. — user1201210, Jan 09 '13 at 21:01
@TroyDaniels I updated the grammar (and added a test case) to handle other lexer rules: `nextToken` only draws tokens from `tokenQueue`, `emit(int, int)` calls `emit(Token)`, and `emit(Token)` is overridden to push all tokens into the queue. Not as bad as I thought it would be, but give it a try in case I missed something. — user1201210, Jan 09 '13 at 22:04
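The queueing scheme discussed in these comments can be sketched without ANTLR. This is a hand-written toy, not the grammar's generated lexer: QueuedLexerSketch, matchOneRule and the string token format are all invented for illustration. It also differs from the grammar above in one respect: it refills the queue only when the queue is empty, and it does no validation of what follows a keyword.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy lexer; shows one "rule" emitting several tokens via a queue.
class QueuedLexerSketch {
    private final String input;
    private int pos = 0;
    private final Deque<String> tokenQueue = new ArrayDeque<>();

    QueuedLexerSketch(String input) {
        this.input = input;
    }

    /** Plays the role of emit(Token): every token goes through the queue. */
    private void emit(String token) {
        tokenQueue.addLast(token);
    }

    /** Plays the role of one lexer-rule match; may emit zero, one, or many tokens. */
    private void matchOneRule() {
        int start = pos;
        char c = input.charAt(pos);
        if (Character.isWhitespace(c)) {
            pos++;                                  // skipped, nothing emitted
        } else if (Character.isDigit(c)) {
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
            emit("NUMBER:" + input.substring(start, pos));
        } else if (Character.isLetter(c)) {
            while (pos < input.length() && Character.isLetter(input.charAt(pos))) pos++;
            String word = input.substring(start, pos);
            int p = 0;
            while (true) {                          // strip leading keywords
                if (word.startsWith("int", p)) { emit("INT"); p += 3; }
                else if (word.startsWith("float", p)) { emit("FLOAT"); p += 5; }
                else break;
            }
            if (p < word.length()) emit("ID:" + word.substring(p));
        } else {
            throw new IllegalArgumentException("unexpected character at index " + pos);
        }
    }

    /** Tokens are only ever handed out from the queue; "EOF" once input is exhausted. */
    String nextToken() {
        while (tokenQueue.isEmpty() && pos < input.length()) {
            matchOneRule();
        }
        return tokenQueue.isEmpty() ? "EOF" : tokenQueue.removeFirst();
    }
}
```

Calling nextToken() repeatedly on "intfloat a 42" hands out INT, FLOAT, ID:a, NUMBER:42 and then EOF, even though the first two tokens came from a single rule match.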

Sam Harwell · Answer 2 · 2013-01-09T17:19:47.943

Here's an almost-all-grammar solution for ANTLR 4 (only requires one small predicate in the target language):

lexer grammar PackedKeywords;

INT : 'int' -> pushMode(Keywords);
FLOAT : 'float' -> pushMode(Keywords);

fragment ID_CHAR : [a-z];
ID_START : ID_CHAR {Character.isLetter(_input.LA(1))}? -> more, pushMode(Identifier);
ID : ID_CHAR;

// these are the other tokens in the grammar
WS : [ \t]+ -> channel(HIDDEN);
Newline : '\r' '\n'? | '\n' -> channel(HIDDEN);

// The Keywords mode duplicates the default mode, except it replaces ID
// with InvalidKeyword. You can handle InvalidKeyword tokens in whatever way
// suits you best.
mode Keywords;

    Keywords_INT : INT -> type(INT);
    Keywords_FLOAT : FLOAT -> type(FLOAT);
    InvalidKeyword : ID_CHAR;
    // must include every token which can follow the Keywords mode
    Keywords_WS : WS -> type(WS), channel(HIDDEN), popMode;
    Keywords_Newline : Newline -> type(Newline), channel(HIDDEN), popMode;

// The Identifier mode is only entered if we know the current token is an
// identifier with >1 characters and which doesn't start with a keyword. This is
// essentially the default mode without keywords.
mode Identifier;

    Identifier_ID : ID_CHAR+ -> type(ID);
    // must include every token which can follow the Identifiers mode
    Identifier_WS : WS -> type(WS), channel(HIDDEN), popMode;
    Identifier_Newline : Newline -> type(Newline), channel(HIDDEN), popMode;

This grammar also works in the ANTLRWorks 2 lexer interpreter (coming soon!) for everything except single-character identifiers. Since the lexer interpreter can't evaluate the predicate in ID_START, an input like a<space> will (in the interpreter) produce a single token with text a<space> of type WS on the HIDDEN channel.

The docs say modes are only available in lexer-only grammars. If I want this in a combined grammar, do I need to make separate lexer and parser grammars and bridge them in the calling app? P.S. Why lexer-only? — user1959981, Jan 25 '13 at 18:42
The `more` command says the text matched for the current token isn't a complete token, so try to match another token and combine the two for the result. — Sam Harwell, Jan 26 '13 at 19:58
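To make the interplay of modes and `more` a bit more concrete, here is a rough hand-written Java model of how the grammar above moves through its modes. It is a sketch under simplifying assumptions, not ANTLR output: ModeStackSketch and tokenize are invented names, all whitespace is treated alike and dropped instead of being sent to a hidden channel, and the predicate is approximated with a plain [a-z] check.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the mode stack and `more`; not ANTLR-generated code.
class ModeStackSketch {
    enum Mode { DEFAULT, KEYWORDS, IDENTIFIER }

    private static boolean isIdChar(char c) { return c >= 'a' && c <= 'z'; }

    /** Returns visible tokens as "TYPE:text"; whitespace is consumed silently. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int pos = 0;
        int tokenStart = 0;
        boolean more = false;       // true while a partial match waits to be extended
        while (pos < input.length()) {
            if (!more) tokenStart = pos;
            more = false;
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;
                if (modes.peek() != Mode.DEFAULT) modes.pop();   // like popMode
                continue;
            }
            if (!isIdChar(c)) throw new IllegalArgumentException("unexpected character at index " + pos);
            String kw = input.startsWith("int", pos) ? "int"
                      : input.startsWith("float", pos) ? "float" : null;
            switch (modes.peek()) {
                case DEFAULT:
                    if (kw != null) {                            // INT / FLOAT, then pushMode(Keywords)
                        tokens.add(kw.toUpperCase(Locale.ROOT) + ":" + kw);
                        pos += kw.length();
                        modes.push(Mode.KEYWORDS);
                    } else if (pos + 1 < input.length() && isIdChar(input.charAt(pos + 1))) {
                        pos++;                                   // ID_START: keep the text, emit nothing yet
                        more = true;
                        modes.push(Mode.IDENTIFIER);
                    } else {
                        pos++;                                   // single-character ID
                        tokens.add("ID:" + input.substring(tokenStart, pos));
                    }
                    break;
                case KEYWORDS:
                    if (kw != null) {
                        tokens.add(kw.toUpperCase(Locale.ROOT) + ":" + kw);
                        pos += kw.length();
                    } else {
                        tokens.add("InvalidKeyword:" + c);
                        pos++;
                    }
                    break;
                case IDENTIFIER:
                    while (pos < input.length() && isIdChar(input.charAt(pos))) pos++;
                    tokens.add("ID:" + input.substring(tokenStart, pos)); // includes the `more` prefix
                    break;
            }
        }
        return tokens;
    }
}
```

In this model, abc comes out as one ID whose text includes the first character matched under `more`, while internal yields INT followed by one InvalidKeyword per remaining character.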
