I am using plyplus to design a simple grammar and I have been struggling with some weird error for a while. Please bear in mind I am a newbie. Here is a piece of code that reproduces the issue:

from plyplus import Grammar

list_parser = Grammar("""
    start: context* ;
    context : WORD '{' (rule)* '}' ;
    rule: 'require' space_marker ;
    space_marker: 'newline'
        | 'tab'
        | 'space'
        ;

    WORD: '\w+' ;
    SPACES: '[ \t\n]+' (%ignore) ;
    """, auto_filter_tokens=False)

res = list_parser.parse("test { require tab }")

If my input string contains require space or require newline, it works perfectly fine. However, as soon as I provide require tab, an exception is thrown:

Traceback (most recent call last):
  File "/Users/bore/Projects/ThesisCode/CssCoco/coco/plytest.py", line 18, in <module>
    res = list_parser.parse("test { require tab }")
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/plyplus/plyplus.py", line 584, in parse
    return self._grammar.parse(text)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/plyplus/plyplus.py", line 668, in parse
    raise ParseError('\n'.join(self.errors))
plyplus.plyplus.ParseError: Syntax error in input at 'tab' (type WORD) line 1 col 16

Oddly, I do not get this exception every time I run the code, but once in every three runs. I noticed that if I change both the grammar and the input from tab to ta, I get the same exception every time I run the code. If I change it to tabb instead, the error is gone.

The error suggests that tab is parsed as a WORD instead of a space_marker. However, tabb is also a WORD. From my trial and error it seems that plyplus is sensitive to the specific string I provide as a keyword. Am I missing something? Any help/hints/comments will be highly appreciated!

1 Answer

PlyPlus is built on PLY, where L and Y stand for Lex and Yacc, so it is (for better or worse, probably worse) an LR parser, which works strictly bottom-up. This also means 'tab' cannot be tokenised as TAB (or _ANON_X, or whatever name it generates for the token) because of your very generous definition of WORD. The only way around it is to make the definition more restrictive. For instance:

WORD: '\w+' (%unless
    TAB: 'tab';
    REQ: 'require';
  );

My guess is that it works for 'newline' and 'space' because an implicitly defined preterminal somewhere gets a higher priority than WORD, but the documentation of PlyPlus is not exactly top class, so to be sure one would have to look at the actual implementation of PlyPlus’s tokeniser.

grammarware
    Actually, the reason some words work and others don't is that PLY's tokenizer attempts to match the regexps in order of their length. Since '\w+' is three characters long, only tokens longer than three characters have a chance to be matched. Having said that, your definition of WORD is indeed the correct solution. Source: I'm the author of PlyPlus. – Erez Dec 30 '15 at 19:55
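A rough way to see the effect described in the comment above is a toy lexer built on Python's standard `re` module. This is only a sketch, not PlyPlus or PLY code: the `lex` function and the `WORD_FIRST` and `TAB_FIRST` rule lists are made up for illustration, and it assumes that string-defined rules are tried in order of decreasing regex length, as the comment says. How real ties are broken is not covered in this thread; here they simply keep their list order.

```python
import re

def lex(text, rules):
    """Toy lexer. `rules` is a list of (token_name, regex_source) pairs."""
    # Assumption taken from the comment above: rules are tried longest regex
    # source first. sorted() is stable, so equal lengths keep their list order.
    ordered = sorted(rules, key=lambda rule: len(rule[1]), reverse=True)
    master = re.compile("|".join("(?P<%s>%s)" % rule for rule in ordered))
    whitespace = re.compile(r"\s+")
    pos, kinds = 0, []
    while pos < len(text):
        skipped = whitespace.match(text, pos)
        if skipped:
            pos = skipped.end()
            continue
        match = master.match(text, pos)
        if match is None:
            raise SyntaxError("cannot tokenise at position %d" % pos)
        kinds.append(match.lastgroup)  # name of the alternative that matched
        pos = match.end()
    return kinds

# Hypothetical rule sets; only the relative order of WORD and TAB differs.
KEYWORDS = [("REQUIRE", "require"), ("SPACE", "space"), ("TA", "ta")]
WORD_FIRST = [("WORD", r"\w+"), ("TAB", "tab")] + KEYWORDS
TAB_FIRST = [("TAB", "tab"), ("WORD", r"\w+")] + KEYWORDS

print(lex("require space", WORD_FIRST))  # ['REQUIRE', 'SPACE']
print(lex("require ta", WORD_FIRST))     # ['REQUIRE', 'WORD']
print(lex("require tab", WORD_FIRST))    # ['REQUIRE', 'WORD']
print(lex("require tab", TAB_FIRST))     # ['REQUIRE', 'TAB']
```

In this sketch 'ta' (two characters) always loses to '\w+' (three characters), 'space' always wins, and 'tab' ties with '\w+', so its fate depends on how the tie is ordered. Whether such a tie is what makes the failure in the question intermittent is a guess, not something the thread confirms.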