1

I have the following grammar in TatSu. To reduce parse time, I implemented cut operations (i.e., commit to a particular rule option once a particular token is seen).

However, I still see long runtimes. On a file with about 830K lines, it takes about 25min (without cut expressions it was close to 40min). I think further improvement is possible but I am not sure how to rewrite the following logic in a bettery way.

The main issue that I believe is taking the bulk of the time (by observing the TatSu grammar matching traces) is the vec_data_string/vec_data_strings. Any suggestions on how to improve this further?

pattern_stmt
    =
    (('V'|'C')&'{' | 'Vector' | 'Condition') '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'
    | ('W' | 'WaveformTable') ~ identifier ';'
    | annotation
    | 'Loop' ~ integer '{' ~ [pattern_statements] '}'
    | 'MatchLoop' ~ ('Infinite' | integer) ~ '{' ~ pattern_statements 'BreakPoint' ~ '{' ~ pattern_statements '}' ~ '}'
    | ('GoTo' | 'ScanChain') ~ identifier ';'
    | 'BreakPoint' '{' ~ pattern_statements '}'
    | ('BreakPoint' | 'IddqTestPoint' | 'Stop') ~ ';'
    | 'TimeUnit' ~ "'" ~ number [siunit] "'" ~ ';'
    ;
vec_data_block
        =
        | signal_reference_expr '=' ~ vec_data_string ';'
        signal_reference_expr '{' ~ {vec_data_strings} '}'
    vec_data_strings
        =
        {vec_data_string ';'}+
        ;
    vec_data_string
        =
        {wfc_data}+
        | {hex_data}+
        | {dec_data}+
        ;
    wfc_data
        =
        ['\\r' ~ integer] wfcs
        | hex_mode
        | dec_mode
        ;
    hex_data
        =
        ['\\r' ~ integer] hex
        | wfc_mode
        | dec_mode
        ;
    dec_data
        =
        ['\\r' ~ integer] integer
        | wfc_mode
        | hex_mode
        ;
    hex_mode
        =
        '\\h' ~ [wfcs] {hex_data}+
        ;
    wfc_mode
        =
        '\\w' ~ {wfc_data}+
        ;
    dec_mode
        =
        '\\d' ~ [wfcs] {dec_data}+
        ;
    wfcs
        =
        /[a-zA-Z0-9#%]+/
        ;
    hex
        =
        /[0-9a-fA-F]+/
        ;

    integer::int
        =
        /\d+/
        ;

My test file has lot of sequences like this:

 "ALLPIs" = 001 \r27 Z 10001ZZ0 \r22 Z 0 \r22 Z 0 \r22 Z 0 \r20 Z 1111 \r133 Z 0Z0010; 
    "ALLPOs" = \r243 X ; 
    "ALLCIOs" = \r557 Z 0 \r10 Z 0ZZ0001001 \r19 Z ;
user4979733
  • 3,181
  • 4
  • 26
  • 41
  • Take a look at the trace to know what kind of recursion is going on. Also, the generated Python parser may be faster than the in-memory model. – Apalala Nov 20 '19 at 14:37
  • You are right. I was able to debug it from look at the trace and seeing all the 'red' (failing) rules. The generated Parser is also a little bit faster than memory-model. – user4979733 Nov 20 '19 at 18:36

2 Answers2

2

I was able to get the runtime down significantly (about 50%) by doing two things:

  1. Instead of trying to parse to create an AST for the long strings. I just read them all in one go using a regex expression. I will post-process the string later.
  2. I noticed that the bulk of time is being consumed by {vec_data_block}. After every match it tries to 4 more matches and fails before proceeding to match the '}' token. To get around this problem, I added a lookahead rule (&'}') to pass if it sees the '}' token. This kind of acts like the 'cut' expression and prevents the 4 failing matches + recursion from happening. This is the bulk of the speedup. Not sure if this is the correct way to do it, but it works.

Here is the updated grammar with the two changes:

pattern_stmt
    =
    (('V'|'C')&'{' | 'Vector' | 'Condition') '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'
    | ('W' | 'WaveformTable') ~ identifier ';'
    | annotation
    | 'Loop' ~ integer '{' ~ [pattern_statements] '}'
    | 'MatchLoop' ~ ('Infinite' | integer) ~ '{' ~ pattern_statements 'BreakPoint' ~ '{' ~ pattern_statements '}' ~ '}'
    | ('GoTo' | 'ScanChain') ~ identifier ';'
    | 'BreakPoint' '{' ~ pattern_statements '}'
    | ('BreakPoint' | 'IddqTestPoint' | 'Stop') ~ ';'
    | 'TimeUnit' ~ "'" ~ number [siunit] "'" ~ ';'
    ;
vec_data_block
    =
    | &'}'  <-- Short circuit the next 4 matches if lookahead and find a '}'
    | signal_reference_expr '=' ~ vec_data_string ';'
    | signal_reference_expr '{' ~ {vec_data_strings} '}'
    | annotation
    | non_cyclized_data
    ;
non_cyclized_data
    =
    '@' integer event_pair ';'
     | '@' integer '{' ~ ';'.{event_pair} '}'
    ;
event_pair
    =
    signal_reference_expr '=' ~ event
    | annotation
    ;
vec_data_strings
    =
    {vec_data_string ';'}+
    | annotation
    ;
foo
    =
    /[0-9A-Za-z#%\\r\\h\\d\\w]+/
    ;
vec_data_string
    =
    {foo}+   <-- Just read the tokens into one big string for post-processing later.
    ;
user4979733
  • 3,181
  • 4
  • 26
  • 41
2

You can considerably reduce the number of calls by factoring some common subexpressions.

For example:

    | ('Macro' | 'Call') identifier '{' ~ {vec_data_block} '}'
    | ('Macro' | 'Call') identifier ';'

can be written as:

    | ('Macro' | 'Call') identifier (';' | '{' ~ {vec_data_block} '}')

Using TatSu's support for reserved words should also help.

Apalala
  • 9,017
  • 3
  • 30
  • 48
  • Thank you very much. I have made this mistake in quite a few places. I will fix it up and see what further improvement can be had. I have declared 'Macro', 'Call' as @@keywords in my grammar already. – user4979733 Nov 20 '19 at 18:36
  • Remember to add `@name` to the `identifier` rule so it doesn't match keywords. – Apalala Nov 21 '19 at 11:51