The syntax of many programming languages requires that source text be tokenized according to the "maximal munch" principle: each token is built from the longest possible sequence of characters from the input stream.
PLY's lexer does not seem to apply this principle. For example:
import ply.lex as lex
tokens = ('ASSIGNMENT', 'EQUALITY')
t_ASSIGNMENT = r'[+\-*/]?='
t_EQUALITY = r'=='
lexer = lex.lex()
lexer.input('==')
for tok in lexer:
    print(tok)
According to "maximal munch", the output of this code should be LexToken(EQUALITY,'==',1,0), but it is LexToken(ASSIGNMENT,'=',1,0) followed by LexToken(ASSIGNMENT,'=',1,1).
This seems to be because the lexer prefers ASSIGNMENT over EQUALITY, prioritizing the longer regular expression rather than the longer match.
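I can work around this particular case by defining the rules as functions, since (as I understand it) PLY adds function-defined rules to the master regular expression in the order they appear in the file rather than sorted by pattern length:

import ply.lex as lex
tokens = ('ASSIGNMENT', 'EQUALITY')

# Function rules are tried in definition order, so EQUALITY comes first.
def t_EQUALITY(t):
    r'=='
    return t

def t_ASSIGNMENT(t):
    r'[+\-*/]?='
    return t

lexer = lex.lex()
lexer.input('==')
for tok in lexer:
    print(tok)  # LexToken(EQUALITY,'==',1,0)

However, that only fixes the priority by hand for this one pair of tokens; it does not make the lexer prefer the longest match in general.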
Is it possible to get PLY's lexer to follow the "maximal munch" principle?
If not, are there any guidelines for how tokens should be specified to avoid "less-than-maximal munch" situations such as the one above?