Including comments in AST

Question

I'm planning on writing a parser for some language. I'm quite confident that I could cobble together a parser in Parsec without too much hassle, but I would like to include comments in the AST so that I can eventually implement a code formatter.

At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having

data Expr = Add Expr Expr | ...

one would have

data Expr a = Add a (Expr a) (Expr a) | ...

and use a for whatever annotation (e.g. for comments that come after the expression).
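A minimal sketch of what such an annotated type could look like in practice. The `Lit` constructor, the `Comments` alias, and the helpers `annotation` and `mapAnn` are illustrative assumptions, not part of the language described here:

```haskell
-- Hypothetical annotated AST: every node carries an annotation of type a.
data Expr a
  = Lit a Int
  | Add a (Expr a) (Expr a)
  deriving (Show, Eq)

-- Assumed annotation type: the comments that come after the expression.
type Comments = [String]

-- Read the annotation stored at the root of an expression.
annotation :: Expr a -> a
annotation (Lit a _)   = a
annotation (Add a _ _) = a

-- Rewrite every annotation, e.g. to strip comments with `mapAnn (const ())`.
mapAnn :: (a -> b) -> Expr a -> Expr b
mapAnn f (Lit a n)   = Lit (f a) n
mapAnn f (Add a l r) = Add (f a) (mapAnn f l) (mapAnn f r)

-- 1 + 2, with a comment on the sum and one on the second operand.
example :: Expr Comments
example = Add ["// sum"] (Lit [] 1) (Lit ["/* two */"] 2)
```

Keeping the annotation as a type parameter means the same tree shape can carry comments for the formatter and `()` for passes that do not care about them.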

However, there are some not so exciting cases. The language features C-like comments (// ..., /* .. */) and a simple for loop like this:

for (i in 1:10)
{
   ... // list of statements
}

Now, excluding the body there are at least 10 places where one could put one (or more) comments:

/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/ 
{ /*K*/
...

In other words, while the for loop could previously be comfortably represented as an identifier (i), two expressions (1 & 10) and a list of statements (the body), we would now have to include at least 10 more parameters or record fields for annotations. This gets ugly and confusing quite quickly, so I wondered whether there is a clearly better way to handle this. I'm certainly not the first person wanting to write a code formatter that preserves comments, so there must be a decent solution. Or is writing a formatter just that messy?

I suspect that for code formatting you want more of a *concrete* syntax tree. And instead of thinking of it like an AST (where only the important information is saved), consider that every character in the source should be marked as some node, and then those nodes combine into bigger nodes. — user253751, Sep 30 '22 at 23:00
I would expect A, C, F, G, H, J, and K to come from the productions for expressions (C, F, G, H) and statements (A, J, K). That just leaves B, E, and I (there is no D) to come from the production for `for`, which doesn't seem so bad. — Daniel Wagner, Oct 01 '22 at 03:40
@DanielWagner Right. So the idea is that usually, expressions (and statements) in the AST contain the comments in front of them, e.g. `/*F*/` would be attached to the node of the expression `1`, right? One thing I missed was that the `:` could be parsed as a binary op. Would `/*J*/` be attached to something like an `EOL` node? And to which statement would `/*A*/` be attached? Wouldn't it also be part of the node for the `for` loop, or am I missing something? — Soeren, Oct 01 '22 at 09:58
@Soeren I would expect production rules like `Stmt -> Comment Stmt | { Stmt* } | "for" ... | ...`. This captures `A`, `J`, and `K` in the `Stmt -> Comment Stmt` production. — Daniel Wagner, Oct 01 '22 at 16:07
@DanielWagner Gotcha, thank you so much! If you want to post an answer, then I'll happily accept it. — Soeren, Oct 01 '22 at 16:24

score 2 · Accepted Answer · answered Oct 01 '22 at 17:18

You can probably capture most of those positions with just two generic comment productions:

Expr -> Comment Expr
Stmt -> Comment Stmt

This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):

Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt

In other words: one before each literal string but the first. Seems not too onerous, ultimately.
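As a rough sketch, these productions could map onto Haskell types like the following. All constructor and function names are hypothetical, each `Comment` slot in the `for` production is assumed to stand for zero or more comments, and the spot before `:` (comment G) is left unmodelled. A small printer shows that the comments survive a round trip through the tree:

```haskell
-- Assumed: a comment is stored as its source text, delimiters included.
type Comment = String

data Expr
  = Var String
  | Num Int
  | CommentE Comment Expr            -- Expr -> Comment Expr
  deriving (Show, Eq)

-- Range -> Expr ":" Expr
data Range = Range Expr Expr
  deriving (Show, Eq)

data Stmt
  = Block [Stmt]
  | CommentS Comment Stmt            -- Stmt -> Comment Stmt
  -- Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
  | For [Comment] Expr [Comment] Range [Comment] Stmt
  deriving (Show, Eq)

-- Comments placed before a token: each one is followed by a space.
leading :: [Comment] -> String
leading = concatMap (++ " ")

-- Comments placed after a token: each one is preceded by a space.
trailing :: [Comment] -> String
trailing = concatMap (' ' :)

ppExpr :: Expr -> String
ppExpr (Var v)        = v
ppExpr (Num n)        = show n
ppExpr (CommentE c e) = c ++ " " ++ ppExpr e

ppRange :: Range -> String
ppRange (Range lo hi) = ppExpr lo ++ ":" ++ ppExpr hi

ppStmt :: Stmt -> String
ppStmt (Block ss)     = "{ " ++ concatMap (\s -> ppStmt s ++ " ") ss ++ "}"
ppStmt (CommentS c s) = c ++ " " ++ ppStmt s
ppStmt (For b v e r i body) =
  "for " ++ leading b ++ "(" ++ ppExpr v ++ " " ++ leading e ++ "in "
    ++ ppRange r ++ trailing i ++ ") " ++ ppStmt body

-- The loop header from the question, minus G and K, with an empty body.
example :: Stmt
example =
  CommentS "/*A*/" $
    For ["/*B*/"] (CommentE "/*C*/" (Var "i")) ["/*E*/"]
        (Range (CommentE "/*F*/" (Num 1)) (CommentE "/*H*/" (Num 10)))
        ["/*I*/"]
        (CommentS "/*J*/" (Block []))
```

Only B, E, and I need dedicated fields in `For`; the other comments ride along on the generic `CommentE` and `CommentS` wrappers.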
