How to get a syntax tree with comments?

Question

I'm trying to create a documentation generator for several languages. For this I need an AST, so that I know, for instance, that one comment belongs to a class and another belongs to a method of that class.

I started by writing this simple Python script, which displays the tree by walking it recursively:

import sys
import antlr4
from ECMAScriptLexer import ECMAScriptLexer
from ECMAScriptParser import ECMAScriptParser

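def handleTree(tree, lvl=0):
    """Recursively print every terminal node, indented by its depth."""
    for child in tree.getChildren():
        if isinstance(child, antlr4.tree.Tree.TerminalNode):
            print(lvl*'│ ' + '└─', child)
        else:
            handleTree(child, lvl+1)

# Lex and parse the file given on the command line, then print the tree.
input_stream = antlr4.FileStream(sys.argv[1])  # avoids shadowing the built-in input()
lexer = ECMAScriptLexer(input_stream)
stream = antlr4.CommonTokenStream(lexer)
parser = ECMAScriptParser(stream)
tree = parser.program()
handleTree(tree)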

Then I tried to parse this JavaScript code with the ANTLR ECMAScript grammar:

var i = 52; // inline comment

function foo() {
  /** The foo documentation */
  console.log('hey');
}

This outputs:

│ │ │ │ └─ var
│ │ │ │ │ │ └─ i
│ │ │ │ │ │ │ └─ =
│ │ │ │ │ │ │ │ │ │ └─ 52
│ │ │ │ │ └─ ;
│ │ │ └─ function
│ │ │ └─ foo
│ │ │ └─ (
│ │ │ └─ )
│ │ │ └─ {
│ │ │ │ │ │ │ │ │ │ │ │ └─ console
│ │ │ │ │ │ │ │ │ │ │ └─ .
│ │ │ │ │ │ │ │ │ │ │ │ └─ log
│ │ │ │ │ │ │ │ │ │ │ └─ (
│ │ │ │ │ │ │ │ │ │ │ │ │ │ └─ 'hey'
│ │ │ │ │ │ │ │ │ │ │ └─ )
│ │ │ │ │ │ │ │ │ └─ ;
│ │ │ └─ }
└─ <EOF>

All the comments are ignored, probably because of the presence of channel(HIDDEN) in the grammar.

After some googling I found a related question with this answer:

Unless you have a very compelling reason to put the comment inside the parser (which I'd like to hear), you should put it in the lexer.

So, why should comments not be included in the parser, and how can I get a tree that includes comments?

The way that Python associates documentation with language elements is through docstrings, not comments. Docstrings should show up in your AST. With comments, you cannot determine that a particular comment "is for a class and this one is for a method of this class". — larsks, Sep 10 '17 at 10:33
Sorry, maybe it wasn't clear: here I am trying to parse JavaScript code with a parser written in Python. — roipoussiere, Sep 10 '17 at 13:39

Bart Kiers · Accepted Answer · 2017-09-11T11:58:34.370

So, why should comments not be included in the parser, and how can I get a tree that includes comments?

If you remove the -> channel(HIDDEN) from the rule MultiLineComment

MultiLineComment
 : '/*' .*? '*/' -> channel(HIDDEN)
 ;

then MultiLineComment tokens would end up in the parser. But then each of your parser rules would need to include these tokens everywhere they are allowed.

For example, take the arrayLiteral parser rule:

/// ArrayLiteral :
///     [ Elision? ]
///     [ ElementList ]
///     [ ElementList , Elision? ]
arrayLiteral
 : '[' elementList? ','? elision? ']'
 ;

Since this is a valid array literal in JavaScript:

[/* ... */ 1, 2 /* ... */ , 3 /* ... */ /* ... */]

it would mean you'd need to litter all your parser rules with MultiLineComment tokens like this:

/// ArrayLiteral :
///     [ Elision? ]
///     [ ElementList ]
///     [ ElementList , Elision? ]
arrayLiteral
 : '[' MultiLineComment* elementList? MultiLineComment* ','? MultiLineComment* elision? MultiLineComment* ']'
 ;

It would become one big mess.
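
Note that with -> channel(HIDDEN) left in place, the comments are not thrown away: they are still tokens in the token stream, just on a channel the parser skips. A minimal sketch of listing them, assuming the stream object from the question; the helper name comment_tokens and the prefix check on the token text are illustrative assumptions, not part of ANTLR:

```python
HIDDEN_CHANNEL = 1  # same value as antlr4.Token.HIDDEN_CHANNEL


def comment_tokens(tokens, channel=HIDDEN_CHANNEL):
    """Return the tokens on `channel` whose text looks like a JS comment.

    Whitespace is usually sent to the hidden channel as well, so the
    text prefix is checked to keep only '/* ... */' and '// ...' tokens.
    """
    return [
        tok for tok in tokens
        if tok.channel == channel and (tok.text or '').startswith(('/*', '//'))
    ]


# Usage with the objects from the question:
#   stream.fill()                      # make sure every token has been lexed
#   for tok in comment_tokens(stream.tokens):
#       print(tok.line, tok.text)
```

This only tells you which comments exist and where they are in the source; it does not yet relate them to nodes of the parse tree.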

EDIT

From the comments:

So it's not possible to generate a tree including comments with antlr? Are there any hacks or other libraries to do this?

And GRosenberg's answer:

Antlr provides a convenience method for this task: BufferedTokenStream#getHiddenTokensToLeft. In walking the parse tree, access the stream to obtain the node associated comment, if any. Use BufferedTokenStream#getHiddenTokensToRight to get any trailing comment.
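
In the Python runtime these methods are available on the CommonTokenStream from the question, and they return None when there is nothing hidden next to the given token. A rough sketch of how they could be used for a rule context; the helper names comments_before and comments_after are made up for illustration, and the prefix check assumes JavaScript-style comments:

```python
def _comment_texts(tokens):
    """Keep only the text of tokens that look like JavaScript comments."""
    return [
        tok.text for tok in tokens or []  # the stream methods may return None
        if (tok.text or '').startswith(('/*', '//'))
    ]


def comments_before(stream, ctx):
    """Comments on the hidden channel just before the first token of ctx."""
    if ctx.start is None:
        return []
    return _comment_texts(stream.getHiddenTokensToLeft(ctx.start.tokenIndex))


def comments_after(stream, ctx):
    """Comments on the hidden channel just after the last token of ctx."""
    if ctx.stop is None:  # rules that matched no tokens have no stop token
        return []
    return _comment_texts(stream.getHiddenTokensToRight(ctx.stop.tokenIndex))


# Usage with the objects from the question, for some rule context `ctx`
# reached while walking `tree`:
#   print(comments_before(stream, ctx), comments_after(stream, ctx))
```

Several nested rule contexts often start on the same token, so the same comment can be reported for each of them. A documentation generator would probably only query the contexts it cares about, such as the rule your grammar uses for function declarations.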

Thanks for the clarification! So it's not possible to generate a tree including comments with antlr? Are there any hacks or other libraries to do this? — roipoussiere, Sep 10 '17 at 13:44
Antlr provides a convenience method for this task: `BufferedTokenStream#getHiddenTokensToLeft`. In walking the parse tree, access the stream to obtain the node associated comment, if any. Use `BufferedTokenStream#getHiddenTokensToRight` to get any trailing comment. — GRosenberg, Sep 10 '17 at 20:04
I knew there were such utility methods, but didn't know them by heart. Thanks for mentioning them @GRosenberg! Added to the answer. — Bart Kiers, Sep 11 '17 at 11:59
