Faithfully handle white-spacing in a pretty-printer

Question

I am writing a front-end for a language (using ocamllex and ocamlyacc).

So the front-end can build an Abstract Syntax Tree (AST) from a program. We then often write a pretty-printer, which takes an AST and prints a program. If we only want to compile or analyse the AST later, most of the time we don't need the printed program to match the original exactly in terms of white-spacing. This time, however, I want to write a pretty-printer that prints exactly the same program as the original one, white-spacing included.

Therefore, my question is: what are the best practices for handling white-spacing without modifying the types of the AST too much? I really don't want to add a number (of white-spaces) to each type in the AST.

For example, this is how I currently deal with (ie, skip) white-spacing in lexer.mll:

rule token = parse
  ...
  | [' ' '\t']       { token lexbuf }     (* skip blanks *)
  | eof              { EOF }

Does anyone know how to change this, as well as other parts of the front-end, to correctly take white-spacing into account for later printing?

If your pretty-printer doesn't alter whitespace in any way, what exactly does it do which justifies the word "pretty"? :) In other words, why don't you just regurgitate the entire input text? — rici, Jun 01 '16 at 22:31
I see... that's because for some parts of a program, I don't want to alter white-spaces. For example, for a function call `f(arg0,arg1, arg2,arg3)`, I want to keep as it is, rather than changing it to pretty `f(arg0, arg1, arg2, arg3)`. — SoftTimur, Jun 01 '16 at 22:39

score 1 · Accepted Answer · answered Jun 01 '16 at 22:52

It's quite common to keep source-file location information for each token. This information allows for more accurate errors, for example.

The most general way to do this is to keep the beginning and ending line number and column position for each token, which is a total of four numbers. If it were easy to compute the end position of a token from its value and the start position, that could be reduced to two numbers, but at the price of extra code complexity.
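As a rough sketch of what "four numbers per token" could look like in OCaml: the standard `Lexing` module already reports the start and end position of the last matched lexeme, and each position carries a line number and enough information to compute a column. The `token` and `located` types and the `locate` helper below are made-up names for illustration, not something ocamllex or ocamlyacc provides.

```ocaml
(* Hypothetical token type; a real grammar would have many more cases. *)
type token = IDENT of string | LPAREN | RPAREN | COMMA | EOF

(* A value together with its start and end positions in the source. *)
type 'a located = {
  value : 'a;
  start_pos : Lexing.position;
  end_pos : Lexing.position;
}

(* Line number (1-based) and column (0-based) of a position. *)
let line (p : Lexing.position) = p.Lexing.pos_lnum
let column (p : Lexing.position) = p.Lexing.pos_cnum - p.Lexing.pos_bol

(* Meant to be called from a lexer action, right after a token is matched. *)
let locate (lexbuf : Lexing.lexbuf) (v : 'a) : 'a located =
  { value = v;
    start_pos = Lexing.lexeme_start_p lexbuf;
    end_pos = Lexing.lexeme_end_p lexbuf }
```

Note that a lexer generated by ocamllex does not update line numbers by itself; the rule that matches a newline has to call `Lexing.new_line lexbuf`. Inside ocamlyacc semantic actions, positions are also available through `Parsing.symbol_start_pos ()` and `Parsing.rhs_start_pos n`.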

Bison has some features which simplify the bookkeeping work of remembering location objects; it's possible that ocamlyacc includes similar features, but I didn't see anything in the documentation. In any case, it is straightforward to maintain a location object associated with each input token.

With that information, it is easy to recreate the whitespace between two adjacent tokens, as long as what separated the tokens was whitespace. Comments are another issue.
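To make that concrete, here is a minimal sketch of rebuilding the gaps from positions alone. It uses a simplified `pos` record instead of `Lexing.position`, and the `tok`, `gap` and `render` names are hypothetical. It assumes the gaps contained only spaces and newlines: tabs, comments, and trailing blanks before a newline cannot be recovered this way.

```ocaml
(* Simplified stand-in for Lexing.position: 1-based line, 0-based column. *)
type pos = { line : int; col : int }

(* Hypothetical located token: its text plus start and end positions. *)
type tok = { text : string; start_pos : pos; end_pos : pos }

(* Whitespace that must have separated a token ending at [prev_end]
   from one starting at [next_start]. Assumes only spaces and newlines. *)
let gap prev_end next_start =
  if next_start.line = prev_end.line then
    String.make (next_start.col - prev_end.col) ' '
  else
    String.make (next_start.line - prev_end.line) '\n'
    ^ String.make next_start.col ' '

(* Print the tokens back out, re-inserting the reconstructed gaps. *)
let render toks =
  let buf = Buffer.create 64 in
  let _ =
    List.fold_left
      (fun prev t ->
        Buffer.add_string buf (gap prev t.start_pos);
        Buffer.add_string buf t.text;
        t.end_pos)
      { line = 1; col = 0 } toks
  in
  Buffer.contents buf
```

If the original source text is still around, slicing it between the end offset of one token and the start offset of the next (`pos_cnum` is an absolute character offset) avoids those limitations.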

It's a judgement call whether or not that is simpler than just attaching preceding whitespace (and even comments) to each token as it is lexed.
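For comparison, a small sketch of the "attach preceding whitespace to each token" alternative. This is a hand-written scanner in plain OCaml rather than an ocamllex rule, purely so it is self-contained; the `tok` record and the `tokenize` and `print_tokens` functions are illustrative names, and the token syntax (identifier runs plus single-character punctuation) is an assumption.

```ocaml
(* Hypothetical token record: [leading] holds the blanks that preceded it. *)
type tok = { leading : string; text : string }

let is_blank c = c = ' ' || c = '\t' || c = '\n'
let is_ident c =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  || (c >= '0' && c <= '9') || c = '_'

(* Split [src] into tokens, each carrying its preceding whitespace.
   Returns the tokens and any whitespace left over at end of input. *)
let tokenize src =
  let n = String.length src in
  (* Advance [i] while predicate [p] holds. *)
  let rec span p i = if i < n && p src.[i] then span p (i + 1) else i in
  let rec go i acc =
    let j = span is_blank i in
    let leading = String.sub src i (j - i) in
    if j >= n then (List.rev acc, leading)
    else
      (* An identifier/number run, or else one punctuation character. *)
      let k = if is_ident src.[j] then span is_ident j else j + 1 in
      go k ({ leading; text = String.sub src j (k - j) } :: acc)
  in
  go 0 []

(* Printing is then plain concatenation, so the round trip is exact. *)
let print_tokens (toks, trailing) =
  String.concat "" (List.map (fun t -> t.leading ^ t.text) toks) ^ trailing
```

Because the blanks are stored verbatim, tabs survive, and comments could be kept the same way by treating them as part of `leading`.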

rici

OCaml does have a type [position](http://caml.inria.fr/pub/docs/manual-ocaml/libref/Lexing.html) to get the position of a token. Could you please say more about how to **maintain** the location associated with each token? Do I have to store the location or the number of white-spaces for each element in the AST, so that I can use that information in the pretty-printer? – SoftTimur Jun 02 '16 at 00:16
@softtimur: I would keep the location information as part of the token. But there are probably other alternatives. Augmenting the token is simple. – rici Jun 02 '16 at 03:51
Sorry, what do you mean by "keep the location information as part of the token"? What do the types look like? – SoftTimur Jun 02 '16 at 10:47
Maybe something like this: https://github.com/np/camllexer/blob/master/Located.mli – rici Jun 02 '16 at 16:36
I noticed when I was looking around OCaml solutions that camlp5 seems to keep location information in an array which is indexed by token number. That might be less invasive, but of course it depends on your being able to figure out the number of each token. It's worth saying that pretty-printers do not have the same structure as, for example, compilers; in a pretty-printer, it would be more common to construct a full parse tree rather than abstracting away purely syntactic tokens. Also, comments. And, as with this question, whitespace. – rici Jun 02 '16 at 16:40
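A side table indexed by token number might look roughly like the sketch below. It is not how any existing tool does it: the `trivia_table` type and the `record` and `leading_of` functions are invented for illustration, and it assumes the lexer and the printer agree on how tokens are numbered. The appeal is that the AST types stay untouched.

```ocaml
(* Hypothetical side table: whitespace seen before token number i. *)
type trivia_table = {
  mutable count : int;
  gaps : (int, string) Hashtbl.t;
}

let create () = { count = 0; gaps = Hashtbl.create 64 }

(* Called by the lexer once per token; returns that token's number. *)
let record tbl leading =
  let i = tbl.count in
  Hashtbl.replace tbl.gaps i leading;
  tbl.count <- i + 1;
  i

(* Called by the printer; a token with no entry gets no whitespace. *)
let leading_of tbl i =
  match Hashtbl.find_opt tbl.gaps i with
  | Some s -> s
  | None -> ""
```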

score 0 · Answer 2 · answered Jun 03 '16 at 08:02

You can have match statements that print a different number of spaces depending on the token you're dealing with. I would usually print one space before the token if it is an id, a num, a define statement, or an assignment (=).

If the token is an arithmetic expression, I would print one space before and one space after it.

If you are dealing with an if or while statement, I would indent the body by four spaces.

I think the best bet would be to write a pretty_print function such as:

let rec pretty_print pos ast =
   match ast with
    | Some_token ->
        (* emit 'pos' spaces of indentation; pos starts off as zero *)
        print_string (String.make pos ' ');
        print_string "Some_token"
    | Other_token...

In sum, I would handle white spaces by matching each token individually in a recursive function and printing out the appropriate number of spaces.

I suppose this wouldn't recreate the original program's format exactly but it would create a perfectly indented program — Abhas Arya, Jun 03 '16 at 08:05
