Preserving comments in `Text.Parsec.Token` tokenizers

Question

I'm writing a source-to-source transformation using Parsec, so I have a LanguageDef for my language and I build a TokenParser for it using Text.Parsec.Token.makeTokenParser:

myLanguage = LanguageDef { ...
  , commentStart = "/*"
  , commentEnd = "*/"
  ...
}

-- defines 'stringLiteral', 'identifier', etc...
TokenParser {..} = makeTokenParser myLanguage

Unfortunately, each of the parser combinators in the TokenParser is a lexeme parser implemented in terms of whiteSpace, and since I defined commentStart and commentEnd, whiteSpace eats comments as well as spaces.

What is the right way to preserve comments in this situation?

Approaches I can think of:

Don't define commentStart and commentEnd. Wrap each of the lexeme parsers in another combinator that grabs comments before parsing each token.
Implement my own version of makeTokenParser (or perhaps use some library that generalizes Text.Parsec.Token; if so, which library?)
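The first approach could look roughly like the sketch below. It assumes non-nested `/* ... */` comments and builds the lexer from `emptyDef`, whose comment delimiters are empty, so the generated whiteSpace skips only spaces. The names `lexer`, `blockComment`, `withComments`, and `identifiers` are made up for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import Text.Parsec.Language (emptyDef)
import qualified Text.Parsec.Token as Tok

-- emptyDef leaves commentStart/commentEnd/commentLine empty,
-- so this lexer's whiteSpace does not consume comments.
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

-- Parse one (non-nested) /* ... */ comment, return its text,
-- then skip the plain whitespace that follows it.
blockComment :: Parser String
blockComment =
  try (string "/*") *> manyTill anyChar (try (string "*/")) <* Tok.whiteSpace lexer

-- Wrap a lexeme parser: collect any comments that precede the token.
withComments :: Parser a -> Parser ([String], a)
withComments p = (,) <$> many blockComment <*> p

-- Example use: a sequence of identifiers, each paired with its leading comments.
identifiers :: Parser [([String], String)]
identifiers =
  Tok.whiteSpace lexer *> many (withComments (Tok.identifier lexer)) <* eof
```

Because every token parser still skips trailing spaces, a comment written after a token shows up as a leading comment of the next token. Comments after the last token are not handled by this sketch and would need a separate case.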

What's the done thing in this situation?

AndrewC · Accepted Answer · 2014-06-26T16:00:54.553

In principle, defining commentStart and commentEnd doesn't fit with preserving comments, because you need to treat comments as valid parts of both the source and target language, including them in your grammar and your AST/ADT.

In this way, you'd be able to keep the text of the comment as the payload data of a Comment constructor, and output it appropriately in the target language, something like

data Statement = Comment String | Return Expression | ......

The fact that neither source nor target language sees the comment text as relevant is irrelevant for your translation code.
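As a rough illustration of that idea, here is a minimal hand-written parser (no Text.Parsec.Token) in which a comment is just another statement. The tiny `Expression` type, the `return x;` syntax, and the `{- -}` output format of `render` are all assumptions invented for this sketch, not part of the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical miniature AST: comments are first-class statements.
newtype Expression = Var String deriving (Eq, Show)
data Statement = Comment String | Return Expression deriving (Eq, Show)

-- A non-nested /* ... */ comment; its text becomes the payload.
comment :: Parser Statement
comment = Comment <$> (try (string "/*") *> manyTill anyChar (try (string "*/")))

-- "return <name>;" with a single-variable expression.
returnStmt :: Parser Statement
returnStmt =
  Return . Var
    <$> (try (string "return" *> many1 space) *> many1 letter <* spaces <* char ';')

statement :: Parser Statement
statement = (comment <|> returnStmt) <* spaces

program :: Parser [Statement]
program = spaces *> many statement <* eof

-- Emit each statement in an imagined target language that
-- writes block comments as {- ... -}.
render :: Statement -> String
render (Comment c)      = "{-" ++ c ++ "-}"
render (Return (Var v)) = "return " ++ v
```

The translation code never inspects the comment text; it only carries it from the parser to `render`.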

Major problem with this approach: It doesn't really fit well with makeTokenParser, and fits better with implementing your source language's parser from the ground up.

I guess I'm veering towards editing makeTokenParser to just get the comment parsers to return the String instead of ().
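A standalone sketch of what such an edit might amount to, written outside the library rather than as a patch to it: the comment parser returns its text, the whitespace skipper collects those texts, and the lexeme wrapper hands them back alongside the token. The primed names `whiteSpace'` and `lexeme'` are hypothetical, and only non-nested `/* ... */` comments are handled.

```haskell
import Data.Maybe (catMaybes)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Like the library's multi-line comment parser, but returning the
-- comment text instead of ().
multiLineComment :: Parser String
multiLineComment = try (string "/*") *> manyTill anyChar (try (string "*/"))

-- Like whiteSpace, but returning the comments it skipped over.
whiteSpace' :: Parser [String]
whiteSpace' =
  catMaybes <$> many (Nothing <$ skipMany1 space <|> Just <$> multiLineComment)

-- Like lexeme: run the token parser, then consume trailing whitespace,
-- pairing the token with any comments found there.
lexeme' :: Parser a -> Parser (a, [String])
lexeme' p = (,) <$> p <*> whiteSpace'

-- Example: words paired with the comments that follow them.
wordsWithComments :: Parser [(String, [String])]
wordsWithComments = many (lexeme' (many1 letter)) <* eof
```

Every token parser built on `lexeme'` then has a different result type from its Text.Parsec.Token counterpart, which is why this amounts to a modified copy of makeTokenParser rather than a wrapper around it.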

edited Jun 26 '14 at 16:00

answered Jun 26 '14 at 15:16

Right, so I should probably ask a separate question about the best AST representation to use for keeping track of comments. For example the one you have here may work (provided that `Expression` also has a constructor for comments, etc...), but an alternative may be to attach comments as metadata to each AST node (which seems like it would be more robust under AST-rewriting operations). – Lambdageek Jun 26 '14 at 18:17
@Lambdageek Yikes yes - it doesn't half make a mess of your grammar. Hmm yes, some sort of `Either Term (Comment,Term)` or `(Maybe Comment,Term)` transformation throughout the tree. Wow. That's either very ugly indeed or there's some really neat free monad or algebra thing you can do for it. You can see why the default is to throw the comments away! – AndrewC Jun 26 '14 at 22:39
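The metadata idea from these comments, in the `(Maybe Comment, Term)` style, might be sketched as follows. The `Annotated` wrapper, the two-constructor `Expr` type, and the `rename` pass are invented for illustration; the point is only that a rewrite can rebuild the nodes while leaving the attached comments in place.

```haskell
type Comment = String

-- A node together with an optional comment attached to it.
data Annotated a = Annotated
  { leading :: Maybe Comment
  , node    :: a
  } deriving (Eq, Show)

-- Hypothetical expression type: every child position is annotated.
data Expr
  = Var String
  | Add (Annotated Expr) (Annotated Expr)
  deriving (Eq, Show)

-- An example AST rewrite: rename variables, keeping every comment
-- on the node it was attached to.
rename :: (String -> String) -> Annotated Expr -> Annotated Expr
rename f (Annotated c e) = Annotated c (go e)
  where
    go (Var v)   = Var (f v)
    go (Add l r) = Add (rename f l) (rename f r)
```

Compared with a `Comment` constructor in each syntactic category, this keeps the grammar of the tree itself unchanged, at the cost of an extra wrapper at every child position.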
