I want to add support for structured references to Excel tables in my lexer and parser for Excel formulas.

I added the following regular expressions to lexer_structref.mll:

let lex_table_name = "DeptSales"
let lex_column_header = "Sales Amount"

(* EG: =[Sales Amount] *)
let lex_ColumnWOTable = "[" lex_column_header "]"

(* EG: =[Region]:[% Commission] *)
let lex_RangeWOTable = lex_ColumnWOTable ":" lex_ColumnWOTable

(* EG: =DeptSales[Sales Amount] *)
let lex_Column' = lex_table_name lex_ColumnWOTable

let lex_structref = lex_ColumnWOTable | lex_RangeWOTable | lex_Column'

In lexer_e.mll, I added the token as follows. parser_e.mly will then call Parser_structref.mly, which parses the structured reference.

| lex_structref as r           { STRUCTREF r }

However, compiling the whole program gave me the following error:

741 states, 34313 transitions, table size 141698 bytes
File "frontend/gen/lexer_e.mll":
transition table overflow, automaton is too big
make: *** [frontend/gen/lexer_e.ml] Error 3

Removing | lex_Column' from let lex_structref makes the compilation succeed.

Did I write something wrong, or was my previous lexer and parser (which work fine) already so big that adding a little more makes the automaton explode? How can I diagnose that?

SoftTimur
  • I can't see much of your code, so I'm not really sure but it seems like you are expecting the lexer to recognize highly structured data, perhaps with the intention of reparsing it later. Is that true? If so, you have probably created a monster for the lexer state machine. It's almost always better to just use the lexer to split the input into indivisible lexical units, and let the parser figure out the actual syntactic structure. – rici Aug 13 '20 at 21:25
  • Note that the easiest way to blow up a lexer description is to use finite repetition with even moderate repetition factors. I don't know if you're doing that -- again, your snippet is way too small to offer an informed opinion -- but if you are, you might want to reconsider. In general terms, repetitions (whether in lexical analysis or grammatical analysis) should be very short or indefinite; if you need to enforce a limit like "up to eight X" or, worse, "no more than 64 characters", use an indefinite repetition and add a semantic action which checks the constraint. – rici Aug 13 '20 at 21:28
  • Thanks for the comment. "you are expecting the lexer to recognize highly structured data, perhaps with the intention of reparsing it later" ==> That's true. And, what is a "repetition"? – SoftTimur Aug 13 '20 at 21:30
  • A repetition is usually the `{n,m}` regex operator (if your tool supports it), but it can also be written manually by simply repeating the same subcomponent some number of times. Eg: `id | id "." id | id "." id "." id | id "." id "." id "." id`, which could be written as `id ( "." id ( "." id ( "." id )? )? )?` but either way expect a state blow-up. Indefinite repetition is produced by the Kleene `*` or `+` operators (or their equivalent), and does not create a state blow-up. – rici Aug 13 '20 at 21:37
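
The advice in these comments (split the input into small tokens, use indefinite repetition, and check length limits in a semantic action) might look roughly like the ocamllex sketch below. It is only an illustration under stated assumptions: the file name, the token constructors `IDENT`, `COLUMN`, `COLON`, the `ident` pattern, and the `max_header_length` value are all hypothetical and not taken from the original project, and the sketch ignores details such as escaped brackets inside column headers.

```ocaml
(* structref_sketch.mll: hypothetical file name, not part of the original project *)
{
  (* Token type declared inline so the sketch is self-contained; in a real
     project these constructors would normally come from the parser. *)
  type token =
    | IDENT of string   (* e.g. a table name such as DeptSales *)
    | COLUMN of string  (* e.g. [Sales Amount], with the brackets stripped *)
    | COLON
    | EOF

  (* Assumed limit, used only to show a length check done in an action
     instead of a bounded repetition in the regular expression. *)
  let max_header_length = 255
}

(* Assumed shape of a table name; adjust to the real naming rules. *)
let ident = ['A'-'Z' 'a'-'z' '_'] ['A'-'Z' 'a'-'z' '0'-'9' '_']*

rule token = parse
  | [' ' '\t']+                  { token lexbuf }
  | ident as s                   { IDENT s }
  (* Indefinite repetition (+) keeps the automaton small; the length
     constraint is enforced in the semantic action. *)
  | '[' ([^ '[' ']']+ as h) ']'  { if String.length h > max_header_length
                                   then failwith "column header too long"
                                   else COLUMN h }
  | ':'                          { COLON }
  | eof                          { EOF }
  | _ as c                       { failwith (Printf.sprintf "unexpected character %C" c) }
```

With tokens like these, the three forms from the question could be expressed in the grammar rather than in the lexer, for example as `COLUMN`, `COLUMN COLON COLUMN`, and `IDENT COLUMN`, so the lexer no longer has to encode the combinations itself.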

1 Answer

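Besides the advice in the comments, which helps slim down the lexer, there is a workaround: with the `-ml` option (e.g. `ocamllex -ml lexer_e.mll`), ocamllex encodes the automaton as OCaml functions instead of transition tables, so the limits on the number of states, transitions, and table size no longer apply. Use it only when there is no other option.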

SoftTimur