I want to write a small beautifier for a specific language. In the beautifier, we will be able to indent one or several lines (ie, adding white-spaces on the left hand of each line); we will also be able to format the whole code (ie, alter white-spaces and newlines in appropriate locations).
Given a program, my front-end by ocamllex
and ocamlyacc
could build a Abstract Syntax Tree (AST)
:
(* in main.ml *)
let f = open_in file in
let buf = Lexing.from_channel f in
let ast = Parser.main Lexer.token buf in
analyse ast
...
I am familier with working on the AST to analyse, compile and print (not exactly the same) the program. However, it seems that we need to work directly on tokens to write a good beautifier. But I don't know how to manipulate tokens outside the front-end.
For example, is it common to record tokens and their position somewhere while parsing, so that we could still use them outside the front-end? For example, we may go through tokens in this record one by one, and print exactly the same program (including exact white spaces)?
Does anyone have any code snippet?
Edit 1:
Here are some examples that use Lexing.lexeme_start_p
on lexbuf
runtime. However, what I want to know is whether and how people get these information outside (or after) a parsing? For instance, outside (or after) a parsing, how could we get the token from a position?
(* in main.ml *)
let ast = try Parser.main Lexer.token buf with
| Lexer.Lexing_error e ->
let pos = Lexing.lexeme_start_p buf in
let l = pos.pos_lnum in
let c = pos.pos_cnum - pos.pos_bol + 1 in
pffo "File \"%s\", line %d, characters %d-%d:\n" file l (c-1) c
pffo "Unexpected exception, parser top : lexical analysis > %s@." e;
exit 1
...
(* in lexer.mll *)
rule token = parse
...
| "'" '\\' (_ as c)
{ let msg = Printf.sprintf "illegal escape sequence \\%c" c in
let p = Lexing.lexeme_start_p lexbuf in
raise (Lexical_error (msg, p.Lexing.pos_fname, p.Lexing.pos_lnum,
p.Lexing.pos_cnum - p.Lexing.pos_bol + 1)) }