
I have been working on writing a scanner for my program, and most of the tutorials online include a parser along with the scanner. It doesn't seem possible to write a lexer without also writing a parser at the same time. I am only trying to generate tokens, not interpret them. I want to recognize INT tokens, FLOAT tokens, and keyword tokens like "begin" and "end".

I am confused about how to match keywords. I unsuccessfully tried the following:

KEYWORD : KEY1 | KEY2;

KEY1 : {input.LT(1).getText().equals("BEGIN")}? LETTER+ ;
KEY2 : {input.LT(1).getText().equals("END")}? LETTER+ ;

FLOATLITERAL_INTLITERAL
  : DIGIT+ 
  ( 
    { input.LA(2) != '.' }? => '.' DIGIT* { $type = FLOATLITERAL; }
    | { $type = INTLITERAL; }
  )
  | '.' DIGIT+ {$type = FLOATLITERAL;}
;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

IDENTIFIER 
 : LETTER 
   | LETTER DIGIT (LETTER|DIGIT)+ 
   | LETTER LETTER (LETTER|DIGIT)*
 ;

WS  //Whitespace
  : (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
;  
slimbo

2 Answers


If you only want a lexer, start your grammar with:

lexer grammar FooLexer; // creates: FooLexer.java

`LT(int): Token` can only be used inside parser rules (on a `TokenStream`). Inside lexer rules, you can only use `LA(int): int`, which returns the next `int` (character) from the `IntStream`. But there is no need for all the manual look-ahead. Just do something like this:

lexer grammar FooLexer;

BEGIN
  :  'BEGIN'
  ;

END
  :  'END'
  ;

FLOAT
  :  DIGIT+ '.' DIGIT+
  ;

INT
  :  DIGIT+
  ;

IDENTIFIER 
  :  LETTER (LETTER | DIGIT)*
  ;

WS
  :  (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
  ; 

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

I don't see the need to create a token called KEYWORD that matches all keywords: you'll want to make a distinction between a BEGIN and END token, right? But if you really want this, simply do:

KEYWORD
  :  'BEGIN'
  |  'END'
  ;

and remove the BEGIN and END rules. Just make sure KEYWORD is defined before IDENTIFIER.
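For clarity, a minimal sketch of that ordering (ANTLR tries lexer rules top to bottom when two rules match input of the same length, so KEYWORD must be defined before IDENTIFIER or 'BEGIN' would be tokenized as an IDENTIFIER):

```
lexer grammar FooLexer;

KEYWORD
  :  'BEGIN'
  |  'END'
  ;

IDENTIFIER
  :  LETTER (LETTER | DIGIT)*
  ;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');
```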

EDIT

Test the lexer with the following class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "BEGIN END 3.14159 42 FOO";
    FooLexer lexer = new FooLexer(new ANTLRStringStream(src));
    while(true) {
      Token token = lexer.nextToken();
      if(token.getType() == FooLexer.EOF) {
        break;
      }
      System.out.println(token.getType() + " :: " + token.getText());
    }
  }
}

If you generate the lexer, compile the .java source files, and run the Main class like this:

java -cp antlr-3.3.jar org.antlr.Tool FooLexer.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

the following output will be printed to the console:

4 :: BEGIN
11 ::  
5 :: END
11 ::  
7 :: 3.14159
11 ::  
8 :: 42
11 ::  
10 :: FOO
Bart Kiers
  • In your example FLOAT and INT can't both be recognized because it is an ambiguous case. I would get the following warning: Multiple token rules can match input such as "'0'..'9''0'..'9'": FLOATLITERAL, INTLITERAL As a result, token(s) INTLITERAL were disabled for that input – slimbo Sep 02 '11 at 13:21
  • @macneil, no, that is not true. My guess is that you've not copy-pasted my suggestion. I'll add a little demo shortly. – Bart Kiers Sep 02 '11 at 15:13
  • As you can see, the tokens `3.14159` and `42` are of a different type (FLOAT and INT respectively). – Bart Kiers Sep 02 '11 at 15:16
  • I'm using something similar to debug my lexer and as a quality of life thing you can change the println to `System.out.println(FooLexer.tokenNames[token.getType()] + " :: " + token.getText());` to get the token names (at least in ANTLR4). – Ron Warholic May 21 '14 at 17:28

[From a guy who made a custom lexer tool, and is still trying to learn ANTLR]

Boring extensive answer:

You are right. Many books and courses mix both tools, and sometimes "generating/detecting tokens" and "interpreting tokens" get mixed up.

Sometimes, a developer sets out to write a scanner and still mixes scanning and parsing in their mind ;-)

Usually, when detecting tokens, you also have to perform an action ("interpretation"), as simple as printing a message or the found token as a string. Example: `{ cout << "Hey, I found an integer constant" << "\n" }`

There are also several cases that may make scanning difficult for a beginner in the topic.

One case is that the same text may be used for different tokens.

Example:

"-" as the subtraction binary operator, and "-" as the negative prefix operator. Or treating 5 both as an integer and as a float. In scanners, "-" can be seen as a single token, while in parsers you may treat each use as a different token.

In order to fix this, my favorite approach is to use "generic tokens" in the scanning/lexing process, and later convert them into "custom tokens" in the parsing/syntax process.
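As a rough illustration of that idea (a hypothetical sketch, not ANTLR code: the class, enum, and `retag` method are all made up for the example), the scanner emits one generic MINUS token, and a later pass decides from context whether it is binary subtraction or a unary prefix:

```java
// Sketch: retag generic MINUS tokens as binary or unary based on context.
public class TokenRetag {
  enum Type { INT, MINUS, BINARY_MINUS, UNARY_MINUS }

  static Type[] retag(Type[] tokens) {
    Type[] out = tokens.clone();
    for (int i = 0; i < out.length; i++) {
      if (out[i] == Type.MINUS) {
        // A '-' directly following an operand is subtraction;
        // otherwise it is a negative prefix.
        boolean afterOperand = i > 0 && out[i - 1] == Type.INT;
        out[i] = afterOperand ? Type.BINARY_MINUS : Type.UNARY_MINUS;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // "5 - -3" scans as the generic stream INT MINUS MINUS INT
    Type[] scanned = { Type.INT, Type.MINUS, Type.MINUS, Type.INT };
    // prints: INT, BINARY_MINUS, UNARY_MINUS, INT
    for (Type t : retag(scanned)) System.out.println(t);
  }
}
```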

Quick answer:

As mentioned in previous answers, start by making a grammar. In fact, I suggest trying it on a whiteboard or in a notebook first, and later in your favorite scanning tool (ANTLR or another).

Consider those special cases where tokens could overlap.

Good Luck.

umlcat