Here's a start toward a solution that uses the lexer to break the text into tokens. The trick is that rule ID can emit more than one token per invocation. This is non-standard lexer behavior, so there are some caveats:
I'm confident that this won't work in ANTLR 4; the code below is written against the ANTLR 3 lexer API.
This code assumes all tokens are queued into tokenQueue.
Rule ID doesn't prevent a keyword from repeating, so intintint produces tokens INT INT INT. If that's bad, you'll want to handle that either on the lexer or parser side, depending on which makes more sense in your grammar.
The shorter the keyword, the more fragile this solution becomes. Input internal is an invalid ID because it starts with keyword int but is followed by a non-keyword string.
The grammar produces warnings that I haven't expunged. If you use this code, I recommend attempting to remove them.
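If repeated keywords such as intintint should be rejected, one option is a check over the token types after lexing. The following is only a sketch in plain Java with no ANTLR dependency: the class RepeatedKeywordCheck, the method firstAdjacentRepeat, and the use of type names as strings are all hypothetical, and it assumes that "repeating" means the same keyword appearing twice in a row in the token stream (so it would also flag "int int x", which may or may not be what you want).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the grammar: scans token type names
// and reports the first keyword that directly repeats its predecessor.
class RepeatedKeywordCheck {

    // Returns the index of the first repeated keyword, or -1 if there is none.
    static int firstAdjacentRepeat(List<String> types) {
        for (int i = 1; i < types.size(); i++) {
            String t = types.get(i);
            boolean isKeyword = t.equals("INT") || t.equals("FLOAT");
            if (isKeyword && t.equals(types.get(i - 1))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // intintint lexes to INT INT INT; the second INT is the first repeat.
        System.out.println(firstAdjacentRepeat(Arrays.asList("INT", "INT", "INT")));
        // intfloat a lexes to INT FLOAT ID; nothing repeats.
        System.out.println(firstAdjacentRepeat(Arrays.asList("INT", "FLOAT", "ID")));
    }
}
```

In a real grammar the same idea could live in the parser rule that consumes the keywords instead, which is probably the better place if the restriction is really about declarations rather than tokens.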
Here is the grammar:
MultiToken.g
grammar MultiToken;

@lexer::members{
    private java.util.LinkedList<Token> tokenQueue = new java.util.LinkedList<Token>();

    @Override
    public Token nextToken() {
        Token t = super.nextToken();
        if (tokenQueue.isEmpty()){
            if (t.getType() == Token.EOF){
                return t;
            } else {
                throw new IllegalStateException("All tokens must be queued!");
            }
        } else {
            return tokenQueue.removeFirst();
        }
    }

    public void emit(int ttype, int tokenIndex) {
        //This is lifted from ANTLR's Lexer class,
        //but modified to handle queueing and multiple tokens per rule.
        Token t;
        if (tokenIndex > 0){
            CommonToken last = (CommonToken) tokenQueue.getLast();
            t = new CommonToken(input, ttype, state.channel, last.getStopIndex() + 1, getCharIndex() - 1);
        } else {
            t = new CommonToken(input, ttype, state.channel, state.tokenStartCharIndex, getCharIndex() - 1);
        }
        t.setLine(state.tokenStartLine);
        t.setText(state.text);
        t.setCharPositionInLine(state.tokenStartCharPositionInLine);
        emit(t);
    }

    @Override
    public void emit(Token t){
        super.emit(t);
        tokenQueue.addLast(t);
    }
}

doc : (INT | FLOAT | ID | NUMBER)* EOF;

fragment
INT : 'int';

fragment
FLOAT : 'float';

NUMBER : ('0'..'9')+;

ID
@init {
    int index = 0;
    boolean rawId = false;
    boolean keyword = false;
}
    : ({!rawId}? INT {emit(INT, index++); keyword = true;}
      | {!rawId}? FLOAT {emit(FLOAT, index++); keyword = true;}
      | {!keyword}? ('a'..'z')+ {emit(ID, index++); rawId = true;}
      )+
    ;

WS : (' '|'\t'|'\f'|'\r'|'\n')+ {skip();};
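To see why the queue in nextToken() is needed, it may help to look at the mechanism in isolation. The sketch below is plain Java with no ANTLR dependency, and everything in it (the class TokenQueueSketch, the idea of feeding it pre-made "batches", tokens as strings) is a hypothetical stand-in: each batch plays the role of one lexer rule invocation that may emit several tokens, and nextToken() hands them out one at a time, oldest first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of the queueing idea; not the ANTLR-generated lexer.
class TokenQueueSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Iterator<List<String>> ruleInvocations;

    TokenQueueSketch(List<List<String>> batches) {
        this.ruleInvocations = batches.iterator();
    }

    String nextToken() {
        // Stand-in for super.nextToken(): run rule invocations until one
        // of them queues something (an empty batch models a skipped token).
        while (ruleInvocations.hasNext()) {
            List<String> batch = ruleInvocations.next();
            queue.addAll(batch);
            if (!batch.isEmpty()) {
                break;
            }
        }
        // Return the oldest queued token, or EOF once everything is drained.
        return queue.isEmpty() ? "EOF" : queue.removeFirst();
    }

    // Collects every token up to (not including) EOF.
    static List<String> drain(List<List<String>> batches) {
        TokenQueueSketch lexer = new TokenQueueSketch(batches);
        List<String> out = new ArrayList<>();
        String t;
        while (!(t = lexer.nextToken()).equals("EOF")) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // "intfloat a": one invocation emits INT and FLOAT, the next emits ID.
        System.out.println(drain(Arrays.asList(
                Arrays.asList("INT", "FLOAT"),
                Arrays.asList("ID"))));
        // prints [INT, FLOAT, ID]
    }
}
```

Note that the queue can hold more than one token between calls, so the order in which tokens were emitted is what the parser sees, regardless of how many each rule invocation produced.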
Test Case 1: Mixed Keywords
Input
intfloat a
int b
float c
intfloatintfloat d
Output (Tokens)
[INT : int] [FLOAT : float] [ID : a]
[INT : int] [ID : b]
[FLOAT : float] [ID : c]
[INT : int] [FLOAT : float] [INT : int] [FLOAT : float] [ID : d]
Test Case 2: Ids containing Keywords
Input
aintfloat
bint
cfloat
dintfloatintfloat
Output (Tokens)
[ID : aintfloat]
[ID : bint]
[ID : cfloat]
[ID : dintfloatintfloat]
Test Case 3: Bad Id #1
Input
internal
Output (Tokens & Lexer Error)
[INT : int] [ID : rnal]
line 1:3 rule ID failed predicate: {!keyword}?
Test Case 4: Bad Id #2
Input
floatation
Output (Tokens & Lexer Error)
[FLOAT : float] [ID : tion]
line 1:5 rule ID failed predicate: {!keyword}?
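Test cases 3 and 4 fail because a word that starts with a keyword may only continue with more keywords. As a rough illustration, the following plain-Java sketch models that splitting logic outside ANTLR. It is an approximation, not the generated lexer: the class KeywordPrefixSketch and the method split are hypothetical names, it only handles a single lowercase word, and it throws instead of reproducing ANTLR's error recovery (which is what produces the trailing ID tokens in the outputs above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model of how rule ID splits one lowercase word.
class KeywordPrefixSketch {
    private static final String[] KEYWORDS = {"int", "float"};

    // Returns the token type names for one word, or throws if a keyword
    // prefix is followed by text that is not another keyword.
    static List<String> split(String word) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < word.length()) {
            String kw = keywordAt(word, pos);
            if (kw != null) {
                types.add(kw.toUpperCase(Locale.ROOT));
                pos += kw.length();
            } else if (types.isEmpty()) {
                // No keyword at the start: the whole word is a raw ID.
                types.add("ID");
                pos = word.length();
            } else {
                // Mirrors the failed {!keyword}? predicate.
                throw new IllegalArgumentException(
                        "non-keyword text after keyword at offset " + pos);
            }
        }
        return types;
    }

    private static String keywordAt(String word, int pos) {
        for (String kw : KEYWORDS) {
            if (word.startsWith(kw, pos)) {
                return kw;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(split("intfloat"));  // prints [INT, FLOAT]
        System.out.println(split("aintfloat")); // prints [ID]
        try {
            split("internal");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

For internal and floatation the sketch reports offsets 3 and 5, the same columns as the lexer errors above.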
Test Case 5: Non-ID Rules
Input
int x
float 3 float 4 float 5
5 a 6 b 7 int 8 d
Output (Tokens)
[INT : int] [ID : x]
[FLOAT : float] [NUMBER : 3] [FLOAT : float] [NUMBER : 4] [FLOAT : float] [NUMBER : 5]
[NUMBER : 5] [ID : a] [NUMBER : 6] [ID : b] [NUMBER : 7] [INT : int] [NUMBER : 8] [ID : d]