I'm having an issue building a grammar which is capable of parsing Python 3's AST dump format and converting it into an AST format that's easier for me to play around with. I decided to write an ANTLR grammar to do so, but I'm having an issue processing keyword blocks (but only keyword blocks, for some reason). I isolated out the keyword grammar, as shown:
grammar kwds;
options {output=AST;}
keywords: 'keywords=['((', '?)keyword)*']' -> keyword*
;
keyword : 'keyword(arg='STRING', value='str')'
;
str : 'Str(s='STRING')' -> STRING
;
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
EMPTYBRACKETS
: '[]';
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
This is designed to accept a list of keywords (0 or more with comma separators), with the format shown in the keyword rule.
If you give the above grammar the following (valid) input,
keywords=[keyword(arg='name', value=Str(s='UGA')), keyword(arg='rank', value=Str(s='2'))]
the grammar will recognize this, as it should.
However, using the "complete" python 3 AST format grammar that I've written (found at http://pastebin.com/ETrSVXvf for space-saving purposes, with the two above rules found on lines 106 and 109, respectively), which uses virtually the exact same grammar rules, the token stream seems to be a few characters off after parsing the first keyword match from the sample shown above, producing the following output when parsing against the keywords rule:
sample3.txt line 1:52 mismatched character 'e' expecting 'w'
sample3.txt line 1:53 no viable alternative at character 'y'
sample3.txt line 1:54 no viable alternative at character 'w'
sample3.txt line 1:55 no viable alternative at character 'o'
sample3.txt line 1:56 no viable alternative at character 'r'
sample3.txt line 1:57 no viable alternative at character 'd'
sample3.txt line 1:58 no viable alternative at character '('
sample3.txt line 1:59 missing ENDBR at 'arg='
I can only think of one possibility for this occurring: Something is being tokenized incorrectly due to an ambiguity in the grammar, since the pattern I used to detect multiple keyword statements works for other types of statements. However, I'm completely stuck as to where that ambiguity actually is in the grammar.
Also, any general improvement tips as to how I could improve my grammar in general would be appreciated!