I have a problem replacing ebnf rules with regex in a Tatsu grammar

Question

I have developed a syntax checker for the Gerber format, using Tatsu. It works fine, my thanks to the Tatsu developers. However, it is not overly fast, and I am now optimizing the grammar.

The Gerber format is a stream of commands, handled by the main loop of the grammar, which is as follows:

start =

{
    | ['X' integer] ['Y' integer] ((['I' integer 'J' integer] 'D01*')|'D02*'|'D03*')
    | ('G01*'|'G02*'|'G03*'|'G1*'|'G2*'|'G3*')
    ... about 25 rules
}*
M02
$;

with integer = /[+-]?[0-9]+/;

In big files, where performance matters, the vast majority of the statements are covered by the first rule in the choice. (It is actually three commands. Putting them first, and merging them to eliminate common elements, made the checker 2-3 times faster.) Now I am trying to replace the first rule with a regex, assuming a regex is faster because the regex engine is implemented in C.

In the first step I inlined the integer:

    | ['X' /[+-]?[0-9]+/] ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')

This worked fine and gave a modest speedup.

Then I tried to regex the whole rule. Failure. As a test I only modified the first element in the sequence:

    | /(X[+-]?[0-9]+)?/ ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')

This fails to recognize the following command: X81479571Y-38450761D01*

I cannot see the difference between ['X' /[+-]?[0-9]+/] and /(X[+-]?[0-9]+)?/

What am I missing?

score 0 · Answer 1 · answered Jan 28 '22 at 11:07

The difference is that an optional expression with [] will advance over whitespace and comments, while a pattern expression with // will not. It's in the documentation. A trick for this case is to place the pattern in its own, initial-lower-case rule, so that whitespace and comments are skipped before the pattern is applied, though I don't think adding that indirection will help performance.
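
The whitespace point can be illustrated outside TatSu. The sketch below uses only Python's `re` module and is a simplified model, not TatSu's actual implementation. The helper names `match_token` and `match_pattern` are hypothetical, and the input is assumed to have one command per line, so a newline sits between commands.

```python
import re

WHITESPACE = re.compile(r"\s*")


def match_token(text, pos, literal):
    """Token-style match: skip whitespace first, then compare the literal."""
    pos = WHITESPACE.match(text, pos).end()
    if text.startswith(literal, pos):
        return pos + len(literal)
    return None


def match_pattern(text, pos, regex):
    """Pattern-style match: apply the regex exactly at pos, with no skipping."""
    m = re.compile(regex).match(text, pos)
    return m.end() if m else None


# Assumption: two commands separated by a newline.
text = "X11479571Y-38450761D01*\nX21552142Y-38354000D02*\n"
pos = text.index("\n")  # parser position right after the first command

# A token 'X' skips the newline and then matches the X of the second command.
token_x = match_token(text, pos, "X")

# A bare pattern is tried at the newline itself and fails.
pattern_x = match_pattern(text, pos, r"X[+-]?[0-9]+")

# An optional group inside the pattern succeeds, but only with an empty match,
# so the position does not move past the newline.
optional_x = match_pattern(text, pos, r"(X[+-]?[0-9]+)?")

# The following 'Y' token skips the newline, finds 'X', and fails.
token_y = match_token(text, optional_x, "Y")

print(token_x, pattern_x, optional_x, token_y)
```

In this model the second command is rejected not because the regex is wrong, but because the pattern is applied before the newline has been consumed.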

As to optimization, a trick I've used in the "...25 more rules" case is to group rules with similar prefixes under a &lookahead, for example &/G0/ in your case.
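
The idea behind the lookahead guard can be sketched in plain Python. This is only an illustration of the principle, not how TatSu evaluates a choice. The function names `match_any` and `match_guarded` are hypothetical, and the guard uses the prefix 'G' as an assumption so that it also covers 'G1*', 'G2*' and 'G3*'.

```python
G_COMMANDS = ("G01*", "G02*", "G03*", "G1*", "G2*", "G3*")


def match_any(text, pos, literals):
    """Ordered choice: return (end position or None, number of literals tried)."""
    tried = 0
    for literal in literals:
        tried += 1
        if text.startswith(literal, pos):
            return pos + len(literal), tried
    return None, tried


def match_guarded(text, pos, prefix, literals):
    """Same choice, but one cheap prefix check rejects the whole group early."""
    if not text.startswith(prefix, pos):
        return None, 0  # none of the alternatives is attempted
    return match_any(text, pos, literals)


command = "X81479571Y-38450761D01*"

# Without a guard, every G alternative is tried and fails.
unguarded = match_any(command, 0, G_COMMANDS)

# With the guard, the group is skipped after a single check.
guarded = match_guarded(command, 0, "G", G_COMMANDS)

# A real G command still matches through the guard.
g_hit = match_guarded("G02*", 0, "G", G_COMMANDS)

print(unguarded, guarded, g_hit)
```

The saving grows with the number of alternatives grouped behind one guard, which is why it suits a choice with many rules that share a prefix.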

TatSu is designed to be friendly to grammar writers rather than to be performant. If you need blazing speeds, through generation of parsers in C, you may want to take a look at pegen, the predecessor to the new PEG parser in CPython.

answered Jan 28 '22 at 11:07

Apalala


Thanks for answering. One note first. I am OK with the performance of Tatsu, and like its convenience. I wanted to develop a formal grammar for the Gerber format, and needed a means to test it. I chose Tatsu, and it was indeed convenient for debugging the grammar, writing an interpreter and all that. Performance was irrelevant. I then developed a syntax checker for public use that reports deprecated elements and common mistakes. Extending the grammar was again very convenient in Tatsu. Performance is of some importance here, so I did some grammar optimizations and hit the problem above. Performance is acceptable. – Karel Tavernier Jan 29 '22 at 10:37

I do not quite understand how whitespace explains the difference. As far as I know there is no whitespace involved. I made a simple test file and test grammars to clarify my question. Here is a Gerber-like test file:

    X11479571Y-38450761D01*
    X21552142Y-38354000D02*
    X31552142Y-38354000D03*
    G01*
    G02*
    G03*
    M02*

It parses fine with the following grammar:

    start =
    {
        | [/X[+-]?[0-9]+/] ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')
        | &'G'('G01*'|'G02*'|'G03*'|'G1*'|'G2*'|'G3*')
    }*
    'M02*'
    $;

– Karel Tavernier Jan 29 '22 at 10:39

However, with the following 'optimized' grammar:

    start =
    {
        | [/X[+-]?[0-9]+/] ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')
        | &'G'('G01*'|'G02*'|'G03*'|'G1*'|'G2*'|'G3*')
    }*
    'M02*'
    $;

I get a parse problem at the second line:

    Unexpected content at line:char (2:1) expecting 'M02*' :
    X21552142Y-38354000D02*
    ^
    start

The only difference in the grammar is that the X coordinate is 'optimized'. I do not understand what is going on. Help would be much appreciated! – Karel Tavernier Jan 29 '22 at 10:42

The layout of the grammars is garbled. I sent them via email. – Karel Tavernier Jan 29 '22 at 11:12
