Record tokens and their position to use them outside the front-end

Question

I want to write a small beautifier for a specific language. The beautifier should be able to indent one or several lines (i.e., add white space at the left of each line) and to format the whole code (i.e., alter white space and newlines in the appropriate places).

Given a program, my front-end, built with ocamllex and ocamlyacc, can build an Abstract Syntax Tree (AST):

(* in main.ml *)
let f = open_in file in
let buf = Lexing.from_channel f in
let ast = Parser.main Lexer.token buf in
analyse ast
...

I am familiar with working on the AST to analyse, compile and print (not exactly the same) the program. However, it seems that we need to work directly on tokens to write a good beautifier, and I don't know how to manipulate tokens outside the front-end.

For example, is it common to record tokens and their positions somewhere while parsing, so that we can still use them outside the front-end? Could we then go through the recorded tokens one by one and print exactly the same program (including the exact white space)?

Does anyone have any code snippet?

Edit 1: Here are some examples that use Lexing.lexeme_start_p on the lexbuf while lexing or parsing is running. However, what I want to know is whether and how people get this information outside (or after) parsing. For instance, once parsing is finished, how could we get the token at a given position?

  (* in main.ml *)
  let ast = try Parser.main Lexer.token buf with
    | Lexer.Lexing_error e ->
      let pos = Lexing.lexeme_start_p buf in
      let l = pos.pos_lnum in
      let c = pos.pos_cnum - pos.pos_bol + 1 in
      pffo "File \"%s\", line %d, characters %d-%d:\n" file l (c-1) c;
      pffo "Unexpected exception, parser top : lexical analysis > %s@." e;
      exit 1
    ...

 (* in lexer.mll *)
 rule token = parse 
   ...
   | "'" '\\' (_ as c)
     { let msg = Printf.sprintf "illegal escape sequence \\%c" c in
       let p = Lexing.lexeme_start_p lexbuf in
       raise (Lexical_error (msg, p.Lexing.pos_fname, p.Lexing.pos_lnum, 
              p.Lexing.pos_cnum - p.Lexing.pos_bol + 1)) }

camlspotter · Accepted Answer · 2016-06-05T02:20:24.533


Keeping token positions with tokens is pretty common in practical programming language implementations.
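One way to do this without touching the grammar is to wrap the lexer function before handing it to the parser. The following is only a sketch: `located`, `recording` and `toy_lexer` are names made up for illustration, and `toy_lexer` is a hand-written stand-in for an ocamllex-generated `Lexer.token` so that the example runs on its own. It sets the lexbuf positions by hand, which a generated lexer does for you.

```ocaml
(* A recorded token: the token itself plus its start and end positions. *)
type 'tok located = {
  tok : 'tok;
  start_p : Lexing.position;
  end_p : Lexing.position;
}

(* Wrap any lexer function so that every token it returns is also pushed
   onto [log] (most recent first), together with its positions. *)
let recording (lexer : Lexing.lexbuf -> 'tok) (log : 'tok located list ref)
  : Lexing.lexbuf -> 'tok =
  fun lexbuf ->
    let tok = lexer lexbuf in
    log := { tok;
             start_p = Lexing.lexeme_start_p lexbuf;
             end_p = Lexing.lexeme_end_p lexbuf } :: !log;
    tok

(* Hypothetical token type and toy lexer: splits [src] into words separated
   by spaces and updates the lexbuf positions manually. *)
type token = WORD of string | EOF

let toy_lexer (src : string) : Lexing.lexbuf -> token =
  let i = ref 0 in
  let n = String.length src in
  fun lexbuf ->
    while !i < n && src.[!i] = ' ' do incr i done;
    let start = !i in
    while !i < n && src.[!i] <> ' ' do incr i done;
    lexbuf.Lexing.lex_start_p <-
      { lexbuf.Lexing.lex_curr_p with Lexing.pos_cnum = start };
    lexbuf.Lexing.lex_curr_p <-
      { lexbuf.Lexing.lex_curr_p with Lexing.pos_cnum = !i };
    if start = !i then EOF else WORD (String.sub src start (!i - start))

let src = "let x  = 1"
let log : token located list ref = ref []

let () =
  let lexbuf = Lexing.from_string src in
  let next = recording (toy_lexer src) log in
  (* With a generated parser, this loop would instead be something like
     [Parser.main next lexbuf]. *)
  while next lexbuf <> EOF do () done

(* Tokens in source order, still available after "parsing" is over. *)
let recorded : token located list = List.rev !log
```

After the parser returns, `recorded` can be searched by `pos_cnum` to find the token covering a given offset.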

The easiest way to print out a part of the input code as-is is to keep the input text somewhere and extract the part you want using the token locations. Rebuilding the text from the stream of tokens and their positions, inserting white space appropriately, is hard to implement and very error-prone, I am afraid, and impossible when your lexer discards things other than white space, such as comments.
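A minimal sketch of that idea, assuming the whole source is kept in a string and that token positions carry absolute offsets in `pos_cnum` (as ocamllex positions do). The helper names `read_file`, `text_between`, `reprint` and `pos` are made up for illustration, and the spans are written by hand here rather than produced by a real lexer.

```ocaml
(* Read a whole file into a string so it stays available after parsing. *)
let read_file (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Exact source text between two positions, using absolute offsets. *)
let text_between (src : string) (s : Lexing.position) (e : Lexing.position)
  : string =
  String.sub src s.Lexing.pos_cnum (e.Lexing.pos_cnum - s.Lexing.pos_cnum)

(* Rebuild the input from a list of token spans given in source order.
   The text between two consecutive tokens (white space, comments) is
   copied from [src]; a beautifier would rewrite those gaps instead. *)
let reprint (src : string)
    (spans : (Lexing.position * Lexing.position) list) : string =
  let buf = Buffer.create (String.length src) in
  let last =
    List.fold_left
      (fun prev (s, e) ->
         Buffer.add_string buf (String.sub src prev (s.Lexing.pos_cnum - prev));
         Buffer.add_string buf (text_between src s e);
         e.Lexing.pos_cnum)
      0 spans
  in
  Buffer.add_string buf (String.sub src last (String.length src - last));
  Buffer.contents buf

(* Hand-made positions for the demo; only [pos_cnum] matters here. *)
let pos (cnum : int) : Lexing.position =
  { Lexing.dummy_pos with Lexing.pos_cnum = cnum }

let src = "let x  = 1 (* c *)"

(* Spans of the tokens [let], [x], [=] and [1]; the comment is not a token. *)
let spans = [ (pos 0, pos 3); (pos 4, pos 5); (pos 7, pos 8); (pos 9, pos 10) ]

let x_text = text_between src (pos 4) (pos 5)
let rebuilt = reprint src spans
```

Because the gaps are taken from the kept source, the comment survives even though the lexer never produced a token for it.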

An example of printing input code as-is can be found in the OCaml compiler implementation. For example, Location.highlight_dumb tries to print the code around an error using the lexer's lex_buffer field, which carries the input text, though this is sometimes impossible since lex_buffer does not keep the whole input.

edited Jun 05 '16 at 02:20

answered Jun 05 '16 at 02:06


Thanks for your comment... I see how they use lexer's `lex_buffer` to get the input text. But how could we record tokens and their position while parsing? – SoftTimur Jun 05 '16 at 03:17
Additionally, how could we get the token from a position in `lex_buffer`? – SoftTimur Jun 05 '16 at 03:47

Positions of tokens can be obtained with functions of the `Lexing` module, such as `lexeme_start_p`. OCaml's `parser.mly` is a good example of getting positions from the lexer. What you get from `lex_buffer` is not the token but the source string. `Location.highlight_dumb` shows an example of getting the original source code from `lex_buffer`. – camlspotter Jun 05 '16 at 06:13
So far, every use of `lexeme_start_p` I have seen happens during parsing. However, my question is about how to do this outside (or after) parsing... please see my updates in the OP... After parsing, if I need to extract the part I want using the **token locations** as you said, I should be able to get the location from a token or get the token from a location, right? – SoftTimur Jun 05 '16 at 11:45

During lexing or parsing, get the locations of tokens and keep them in the parsed AST nodes. The best example is OCaml's `parser.mly`: it keeps locations in various record fields named `xxx_loc`. See functions like `mkexp`, `mktyp` and `mkpat`, etc. You could attach locations to the tokens themselves too, but I think it is not a good idea since you flood your lexer tokens with locations. – camlspotter Jun 06 '16 at 05:28
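
A rough sketch of the approach from the last comment, not taken from OCaml's `parser.mly`: the types `loc`, `expr`, `expr_desc` and the helpers `mk_expr`, `source_of` and `pos` are hypothetical. In an ocamlyacc action the two positions would typically come from `Parsing.symbol_start_pos ()` and `Parsing.symbol_end_pos ()`; here they are written by hand so the example runs on its own.

```ocaml
(* A source span, stored in every AST node. *)
type loc = { loc_start : Lexing.position; loc_end : Lexing.position }

type expr = { desc : expr_desc; loc : loc }
and expr_desc =
  | Int of int
  | Add of expr * expr

(* Smart constructor, meant to be called from grammar actions, e.g.
   { mk_expr (Add ($1, $3))
       (Parsing.symbol_start_pos ()) (Parsing.symbol_end_pos ()) } *)
let mk_expr (desc : expr_desc) (loc_start : Lexing.position)
    (loc_end : Lexing.position) : expr =
  { desc; loc = { loc_start; loc_end } }

(* After parsing: recover the exact source text of any node from the kept
   input string [src]. *)
let source_of (src : string) (e : expr) : string =
  let s = e.loc.loc_start.Lexing.pos_cnum
  and stop = e.loc.loc_end.Lexing.pos_cnum in
  String.sub src s (stop - s)

(* Hand-made positions for the demo; only [pos_cnum] matters here. *)
let pos (cnum : int) : Lexing.position =
  { Lexing.dummy_pos with Lexing.pos_cnum = cnum }

let src = "1 +  2"
let one = mk_expr (Int 1) (pos 0) (pos 1)
let two = mk_expr (Int 2) (pos 5) (pos 6)
let sum = mk_expr (Add (one, two)) (pos 0) (pos 6)
```

With locations on every node, the original spacing inside `sum` can be recovered from `src` even though the AST itself does not store white space.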
