
Terence Parr himself says about antlr3: "Unfortunately, it still seems more difficult to build tokenizer with ANTLR than with a traditional lex-like approach". Whereas pygments has lexers for almost any language you can think of: http://pygments.org/languages/

Has anyone tried using a pygments lexer with the antlr Python target? antlr2 had an example of using flex with the C++ target; unfortunately, there are no such examples for antlr3.
Can I just hand-write a grammarname.tokens file that the antlr parser can import? When I use an antlr lexer, there are a bunch of anonymous tokens; can I just remove them? Alternatively, maybe pygments can be modified to accept the antlr .tokens file for its tokens. The pygments token stream just needs to implement the antlr token stream interface.
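For what it's worth, an adapter along these lines might look roughly like the following. This is a minimal sketch in plain Python: it assumes the ANTLR3 runtime's token-source convention (an object with a `nextToken()` method) and the `(index, tokentype, text)` tuples that Pygments' `get_tokens_unprocessed()` yields. The `Tok` class and the type mapping are stand-ins for the real `antlr3.CommonToken` and the integer constants from a generated parser, not actual library code:

```python
# Sketch: adapting a Pygments token stream to ANTLR3's token-source
# protocol, which only requires a nextToken() method.

class Tok(object):
    """Minimal stand-in for antlr3.CommonToken."""
    def __init__(self, type, text, start):
        self.type, self.text, self.start = type, text, start

EOF = -1  # ANTLR3 uses -1 as the EOF token type

class PygmentsTokenSource(object):
    def __init__(self, token_stream, token_types):
        # token_stream: iterable of (index, pygments_tokentype, text),
        # i.e. what Pygments' get_tokens_unprocessed() yields.
        # token_types: maps Pygments token types to ANTLR integer types.
        self._it = iter(token_stream)
        self._types = token_types

    def nextToken(self):
        for index, ptype, text in self._it:
            antlr_type = self._types.get(ptype)
            if antlr_type is None:   # unmapped types (whitespace etc.) are skipped
                continue
            return Tok(antlr_type, text, index)
        return Tok(EOF, None, -1)
```

Hooking this into a real parser should then be a matter of wrapping the source in an `antlr3.CommonTokenStream`, since that is what the generated parser consumes.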

Naveen

2 Answers


Naveen wrote:

Has anyone tried using a pygments lexer with the antlr Python target?

I doubt it. At least, I have never seen anyone mention this either here on SO, or on the ANTLR mailing-lists (which I monitor for quite some time now).

Naveen wrote:

Can I just hand-write a grammarname.tokens file that the antlr parser can import?

No. The parser expects an instance of a Lexer object, which is present in the (Python) runtime. A .tokens file is not meant to be edited by hand.

Naveen wrote:

When I use an antlr lexer, there are a bunch of anonymous tokens; can I just remove them?

Not quite sure what you mean, but removing any of the generated code seems like a bad idea to me. If you're referring to the .tokens file, as I mentioned before: it is not meant to be edited by hand.

I really wouldn't bother trying to "glue" some external lexer grammar, or a complete lexer, into ANTLR. I am pretty sure that would cost you more time than just writing the ANTLR lexer grammar yourself. After all: defining the lexer rules is the easiest part of a language in most cases.
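For comparison, here is what a small hand-written ANTLR 3 lexer grammar tends to look like (an illustrative sketch, not taken from any project mentioned here):

```antlr
lexer grammar Simple;

ID     : ('a'..'z' | 'A'..'Z' | '_')+ ;
INT    : '0'..'9'+ ;
STRING : '"' (~'"')* '"' ;
WS     : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

Four rules like these cover a surprising amount of a typical language's tokenization.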

Bart Kiers
  • Thanks for your answer. Defining lexer rules should be easy, but unfortunately it's not so with antlr and some languages. You really need a lot more magic even for lexers, and pygments has solved the problem, I think. Also, the .tokens file is just a simple dictionary; I don't know why I shouldn't be able to poke at it to adapt a different lexer. Anyway, I will give it a shot myself. – Naveen Sep 05 '11 at 15:48
  • Naveen, changing the `.tokens` file will not affect your lexer in any useful way. And what languages are so difficult to tokenize? – Bart Kiers Sep 05 '11 at 20:55

This other Q&A was very helpful: ANTLR Parser with manual lexer. Also read through the StAX and JFlex snippets: http://www.antlr.org/wiki/display/ANTLR3/Interfacing+StAX+to+ANTLR http://www.antlr.org/pipermail/antlr-interest/2007-October/023957.html

The .tokens file is a non-issue if you import the token types from the generated parser file. Unfortunately, I first tried parsing the .tokens file and forgot to convert the token types to integers, which caused a long bug chase...
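To make that integer-conversion pitfall concrete: a `.tokens` file is just `NAME=number` lines, and the fix is casting the right-hand side. The `load_tokens` helper below is a hypothetical sketch, not part of the ANTLR runtime:

```python
# Parse an ANTLR .tokens file (lines like ID=4 or '='=6) into a
# name -> integer-type dict.  Without the int() cast the values stay
# strings, which never compare equal to the integer token types the
# generated parser uses, hence the long bug chase.

def load_tokens(text):
    types = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # rpartition handles quoted literals such as '='=6 correctly,
        # since only the last '=' separates name from number.
        name, _, value = line.rpartition('=')
        types[name] = int(value)   # the crucial conversion
    return types
```

Importing the constants straight from the generated parser module sidesteps the parsing entirely, which is why the file ends up being a non-issue.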

But I finally figured it out: http://github.com/tinku99/antlr-pygments

Naveen