
I am writing a lexer for Brainfuck with ocamllex, and to implement its loop, I need to change the state of lexbuf so it can return to a previous position in the stream.

Background info on Brainfuck (skippable)

In Brainfuck, a loop is accomplished by a pair of square brackets with the following rules:

  • [ -> proceed and evaluate the next token
  • ] -> if the current cell's value is not 0, return to the matching [

Thus, the following code evaluates to 15:

+++ [ > +++++ < - ] > .

It reads:

  • In the first cell, assign 3 (increment 3 times)
  • Enter loop, move to the next cell
  • Assign 5 (increment 5 times)
  • Move back to the first cell, and subtract 1 from its value
  • Hit the closing square bracket; the current (first) cell now equals 2, so jump back to [ and proceed into the loop again
  • Keep going until the first cell equals 0, then exit the loop
  • Move to the second cell and output the value with .

The value in the second cell would have been incremented to 15 (incremented by 5 for 3 times).
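To make the trace above concrete, here is a minimal sketch of these loop semantics in plain OCaml (no ocamllex), using an explicit stack of `[` positions and following the simplified `[` rule above (always proceed). The function name `run` and the 100-cell tape are illustrative, not from the question:

```ocaml
(* A sketch of the loop semantics: keep a stack of `[` positions,
   and on `]` jump back when the current cell is non-zero. *)
let run (prog : string) : int list =
  let tape = Array.make 100 0 in
  let pos = ref 0 in             (* current cell *)
  let out = ref [] in            (* values printed by `.` *)
  let stack = ref [] in          (* positions of open `[` *)
  let i = ref 0 in
  while !i < String.length prog do
    (match prog.[!i] with
     | '+' -> tape.(!pos) <- tape.(!pos) + 1
     | '-' -> tape.(!pos) <- tape.(!pos) - 1
     | '>' -> incr pos
     | '<' -> decr pos
     | '.' -> out := tape.(!pos) :: !out
     | '[' -> stack := !i :: !stack
     | ']' ->
       (match !stack with
        | hd :: tl ->
          if tape.(!pos) = 0
          then stack := tl       (* loop done: pop the `[` *)
          else i := hd           (* jump back to the matching `[` *)
        | [] -> ())
     | _ -> ());                 (* any other character is a comment *)
    incr i
  done;
  List.rev !out
```

Running it on the example program yields `[15]`, matching the trace.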

Problem:

Basically, I wrote two functions in the header section of the brainfuck.mll file, push_curr_p and pop_last_p, to push and pop the position of the last [. They push and pop the lexbuf's current position to and from an int list ref named loopstack:

{ (* Header *)
  let tape = Array.make 100 0
  let tape_pos = ref 0
  let loopstack = ref []

  let push_curr_p (lexbuf: Lexing.lexbuf) =
    let p = lexbuf.Lexing.lex_curr_p in
      let curr_pos = p.Lexing.pos_cnum in
        (* Saving / pushing the position of `[` to loopstack *)
        ( loopstack := curr_pos :: !loopstack
        ; lexbuf
        )

  let pop_last_p (lexbuf: Lexing.lexbuf) =
    match !loopstack with
    | []       -> lexbuf
    | hd :: tl ->
      (* This is where I attempt to bring lexbuf back *)
      ( lexbuf.Lexing.lex_curr_p <- { lexbuf.Lexing.lex_curr_p with Lexing.pos_cnum = hd }
      ; loopstack := tl
      ; lexbuf
      )
}

(* Rules *)
rule brainfuck = parse
| '['  { brainfuck (push_curr_p lexbuf) }
| ']'  { (* current cell's value must be 0 to exit the loop *)
         if tape.(!tape_pos) = 0
         then brainfuck lexbuf
         (* this needs to bring lexbuf back to the previous `[`
          * and proceed with the parsing
          *)
         else brainfuck (pop_last_p lexbuf)
       }
(* ... other rules ... *)

The other rules work just fine, but the lexer seems to ignore [ and ]. The problem obviously lies in loopstack and in how I get and set the lex_curr_p state. Would appreciate any leads.

Pandemonium
  • What's the benefit of putting the interpreter inside the lexer like this? – sepp2k Oct 12 '17 at 20:13
  • @sepp2k just for the purpose of learning ocamllex. For Brainfuck it is possible to write recursive parser in plain Ocaml (which I've already done). – Pandemonium Oct 12 '17 at 20:15
  • I don't mean to sound (or be) contrary, but if you want to learn ocamllex, wouldn't it make more sense to use it for its intended purpose (that is, using it to write the lexer, not the whole interpreter)? I'm not even sure that what you're trying to do (i.e. looping in the lexer) is possible, and if it is, what are you going to learn from it that would be useful when using ocamllex in real projects? – sepp2k Oct 12 '17 at 20:25
  • I understand your point now. You are saying a lexer should only "lex" and emit tokens instead of interpreting grammars. Is that right? I think Brainfuck is simple enough that it doesn't require parsing tokens into an AST, and at some point I might add a parser anyway. – Pandemonium Oct 12 '17 at 20:44
  • 2
    Yes, that way my point (or more precisely I was trying to find out whether you were purposefully swimming against the stream to find out how far you can go or whether you just didn't know that that's not how you're supposed to ocamllex). What you do after lexing (interpret the token stream directly or parse it into an AST) is another matter, but you'll have a hard time if you try to make the lexer do much more than produce the token stream. – sepp2k Oct 12 '17 at 20:50
  • 2
    @PieOhPah technically the lexer does interpret grammar, namely the lexical grammar, but sepp2k is right. A lexer’s goal is to produce members of an alphabet that a parser can understand. You might be thinking of a scannerless parser which is basically where you combine the lexing and parsing phase, but ocamllex is a tool solely for building a lexer. So you could try to write a parser with it but it’d be like using a screwdriver to hammer in a nail – Nick Zuber Oct 13 '17 at 18:52
  • According to the ocaml source [`runtime/lexing.c`](https://github.com/ocaml/ocaml/blob/trunk/runtime/lexing.c#L94), the lexing engine seems to keep track of what it is looking at with `lex_curr_pos`. I think at least from what the code is suggesting, manipulating that should affect the lexing engine. – Yuning Feb 01 '21 at 15:57

1 Answer


lex_curr_p is meant to keep track of the current position, so that you can use it in error messages and the like. Setting it to a new value won't make the lexer actually seek back to an earlier position in the file. In fact I'm 99% sure that you can't make the lexer loop like that no matter what you do.

So you can't use ocamllex to implement the whole interpreter like you're trying to do. What you can do (and what ocamllex is designed to do) is to translate the input stream of characters into a stream of tokens.

In other languages that means translating a character stream like var xyz = /* comment */ 123 into a token stream like VAR, ID("xyz"), EQ, INT(123). So lexing helps in three ways: it finds where one token ends and the next begins, it categorizes tokens into different types (identifiers vs. keywords etc.) and discards tokens you don't need (white space and comments). This can simplify further processing a lot.

Lexing Brainfuck is a lot less helpful, as every Brainfuck token consists of a single character anyway. So finding out where each token ends and the next begins is a no-op, and finding out the type of a token just means comparing the character against '[', '+' etc. So the only useful thing a Brainfuck lexer does is discard whitespace and comments.

So what your lexer would do is turn the input [,[+-. lala comment ]>] into something like LOOP_START, IN, LOOP_START, INC, DEC, OUT, LOOP_END, MOVE_RIGHT, LOOP_END, where LOOP_START etc. belong to a discriminated union that you (or your parser generator if you use one) defined and made the lexer output.
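As a sketch of what that discriminated union could look like in plain OCaml (the `token` type and the function names here are illustrative, not from the question; a real ocamllex rule would do the classification in its patterns):

```ocaml
(* An illustrative token type for Brainfuck. *)
type token =
  | LOOP_START | LOOP_END
  | INC | DEC
  | MOVE_LEFT | MOVE_RIGHT
  | IN | OUT

(* What the lexer does per character: classify it, or skip it
   as a comment if it is not a Brainfuck command. *)
let token_of_char = function
  | '[' -> Some LOOP_START
  | ']' -> Some LOOP_END
  | '+' -> Some INC
  | '-' -> Some DEC
  | '<' -> Some MOVE_LEFT
  | '>' -> Some MOVE_RIGHT
  | ',' -> Some IN
  | '.' -> Some OUT
  | _   -> None

(* Turn a whole input string into a token stream. *)
let lex (s : string) : token list =
  List.filter_map token_of_char
    (List.init (String.length s) (String.get s))
```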

If you want to use a parser generator, you'd define the token types in the parser's grammar and make the lexer produce values of those types. Then the parser can just parse the token stream.

If you want to do the parsing by hand, you'd call the lexer's token function in a loop to get all the tokens. In order to implement loops, you'd have to store the already-consumed tokens somewhere so you can loop back. In the end it's more work than just reading the input into a string, but for a learning exercise I suppose that doesn't matter.
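A sketch of that approach: buffer all the tokens up front, then interpret by index, so a stack of token indices replaces the attempt to rewind lexbuf. Here `next_token` stands in for the ocamllex-generated `brainfuck lexbuf` call, and all names are illustrative assumptions:

```ocaml
type token = LOOP_START | LOOP_END | INC | DEC
           | LEFT | RIGHT | OUT | EOF

(* Drain the lexer into an array so we can jump back by index. *)
let collect (next_token : unit -> token) : token array =
  let rec go acc =
    match next_token () with
    | EOF -> List.rev acc
    | t   -> go (t :: acc)
  in
  Array.of_list (go [])

(* Interpret the buffered tokens; `stack` holds indices of open `[`. *)
let interpret (toks : token array) : int list =
  let tape = Array.make 100 0 and pos = ref 0 in
  let out = ref [] and stack = ref [] in
  let i = ref 0 in
  while !i < Array.length toks do
    (match toks.(!i) with
     | INC   -> tape.(!pos) <- tape.(!pos) + 1
     | DEC   -> tape.(!pos) <- tape.(!pos) - 1
     | LEFT  -> decr pos
     | RIGHT -> incr pos
     | OUT   -> out := tape.(!pos) :: !out
     | LOOP_START -> stack := !i :: !stack
     | LOOP_END ->
       (match !stack with
        | hd :: tl ->
          if tape.(!pos) = 0 then stack := tl else i := hd
        | [] -> ())
     | EOF -> ());
    incr i
  done;
  List.rev !out
```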

That said, I would recommend going all the way and using a parser generator to create an AST. That way you don't have to create a buffer of tokens for looping and having an AST actually saves you some work (you no longer need a stack to keep track of which [ belongs to which ]).
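For illustration, an AST along those lines might look like this in plain OCaml (constructor and function names are assumptions): loops nest structurally, so the evaluator just recurses on the loop body and no bracket stack is needed.

```ocaml
(* Loops contain their body, so matching [ and ] is resolved at
   parse time, not at run time. *)
type instr =
  | Inc | Dec | Left | Right | Out
  | Loop of instr list          (* body between matching [ and ] *)

(* Evaluate a program on a tape; returns the final cell position. *)
let rec eval tape pos = function
  | [] -> pos
  | Inc :: rest   -> tape.(pos) <- tape.(pos) + 1; eval tape pos rest
  | Dec :: rest   -> tape.(pos) <- tape.(pos) - 1; eval tape pos rest
  | Left :: rest  -> eval tape (pos - 1) rest
  | Right :: rest -> eval tape (pos + 1) rest
  | Out :: rest   -> print_int tape.(pos); eval tape pos rest
  | Loop body :: rest ->
    if tape.(pos) = 0 then eval tape pos rest
    else
      let pos' = eval tape pos body in
      eval tape pos' (Loop body :: rest)   (* re-test the loop *)
```

On the example program, `+++ [ > +++++ < - ] > .` would parse to `[Inc; Inc; Inc; Loop [Right; Inc; Inc; Inc; Inc; Inc; Left; Dec]; Right; Out]`.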

sepp2k