
I'm trying to use reserved words in my grammar:

reserved = {
   'if' : 'IF',
   'then' : 'THEN',
   'else' : 'ELSE',
   'while' : 'WHILE',
}

tokens = [
 'DEPT_CODE',
 'COURSE_NUMBER',
 'OR_CONJ',
 'ID',
] + list(reserved.values())

t_DEPT_CODE = r'[A-Z]{2,}'
t_COURSE_NUMBER  = r'[0-9]{4}'
t_OR_CONJ = r'or'

t_ignore = ' \t'

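def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # Look the lexeme up among the reserved-word keys ('if', 'then', ...).
    if t.value in reserved:
        t.type = reserved[t.value]
        return t
    # Anything that is not a reserved word is discarded.
    return None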

However, the t_ID rule somehow swallows up DEPT_CODE and OR_CONJ. How can I get around this? I'd like those two to take higher precedence than the reserved words.

Nick Heiner

2 Answers


Mystery Solved!

OK, I ran into this issue myself today and looked for a solution. I did not find one on S/O, but I did find it in the manual: http://www.dabeaz.com/ply/ply.html#ply_nn6

When building the master regular expression, rules are added in the following order:

  • All tokens defined by functions are added in the same order as they appear in the lexer file.
  • Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).

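That is why t_ID "beats" the string definitions: it is defined by a function, so it is added to the master regular expression ahead of every token defined by a string. A trivial (although brutal) fix is to turn DEPT_CODE into a function as well, def t_DEPT_CODE(token): r'[A-Z]{2,}'; return token, and to place it before def t_ID. The same change would be needed for t_OR_CONJ.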
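To see the ordering rule in action without PLY itself, here is a minimal sketch that imitates it with Python's re module only. The helpers build_master and tokenize are hypothetical names made up for this illustration, not part of PLY, and the sketch leaves out the reserved-word lookup; it only assumes the two ordering rules quoted above plus the fact that a Python alternation takes the first branch that matches.

```python
import re

# Patterns copied from the question.
ID = r'[a-zA-Z_][a-zA-Z_0-9]*'
DEPT_CODE = r'[A-Z]{2,}'
COURSE_NUMBER = r'[0-9]{4}'
OR_CONJ = r'or'


def build_master(function_rules, string_rules):
    """Combine (name, pattern) rules in the order the manual describes.

    Hypothetical helper: function rules keep their definition order,
    string rules follow, longest regular expression first.
    """
    ordered = list(function_rules) + sorted(
        string_rules, key=lambda rule: len(rule[1]), reverse=True)
    return re.compile('|'.join(
        '(?P<%s>%s)' % (name, pattern) for name, pattern in ordered))


def tokenize(master, text):
    """Return (token type, lexeme) pairs, skipping spaces and tabs."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in ' \t':
            pos += 1
            continue
        match = master.match(text, pos)
        if match is None:
            raise ValueError('illegal character %r' % text[pos])
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# Layout from the question: only ID is defined by a function.
before = build_master(
    [('ID', ID)],
    [('DEPT_CODE', DEPT_CODE), ('COURSE_NUMBER', COURSE_NUMBER),
     ('OR_CONJ', OR_CONJ)])

# Layout after the fix: DEPT_CODE and OR_CONJ are functions placed before ID.
after = build_master(
    [('DEPT_CODE', DEPT_CODE), ('OR_CONJ', OR_CONJ), ('ID', ID)],
    [('COURSE_NUMBER', COURSE_NUMBER)])

text = 'CS 1234 or MATH 2000'
print(tokenize(before, text))
print(tokenize(after, text))
```

With the first layout, CS, or and MATH all come back as ID; with the second they come back as DEPT_CODE and OR_CONJ. One caveat of putting a bare 'or' rule ahead of ID is that an identifier such as order would be split into or plus der, so a pattern like r'or\b' may be worth considering.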

Nas Banov

Two things spring to mind:

  • Obviously, 'or' is a reserved word, like 'if', 'then', etc.
  • Your RE for t_ID matches a superset of the strings that are matched by DEPT_CODE.

Therefore I would solve it as follows: include 'or' as a reserved word, and in t_ID check whether the string has length 2 and consists of uppercase letters only. If so, return DEPT_CODE.

Ingo
  • No, this is not how PLY works! Matching is done in order of definition, so in theory it should have picked t_DEPT_CODE over t_ID. Overriding the lexer with a manual check in t_ID is **definitely** not the way to go, trust me – Nas Banov May 02 '12 at 22:52