I'm writing a language parser/interpreter, and I figured I could use the shlex module for generating tokens, but ran into an issue when working with multi-character operators, such as += or **. The shlex module will lex those as two separate operators, which is not ideal.
>>> import shlex
>>> t = shlex.shlex('x += 3')
>>> t.get_token()
'x'
>>> t.get_token()
'+'
>>> t.get_token()
'='
I thought I'd try adding the operator characters to the lexer's wordchars attribute, but that breaks on code written without whitespace, because the operator gets glued to its operands:
>>> t = shlex.shlex('x+=3')
>>> t.wordchars += '+=*-/'
>>> t.get_token()
'x+=3'
So then I had the idea that I could just manually rebuild operators from tokens when I have multiple tokens in a row that could be a valid operator. For example, if I have a + token followed by a =, then I would concatenate them to make a '+='. However, this solution creates a problem with expressions like x - -3. It would get tokenized into x, --, and 3, which also isn't what I want.
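To make that concrete, here is a minimal sketch of the merging idea. The operator set MULTI_CHAR_OPERATORS and the function name merged_tokens are made up for illustration and are not part of shlex; the only shlex feature used is iterating over a shlex.shlex instance with its default settings.

```python
import shlex

# Hypothetical operator set, for illustration only; a real language
# would define its own list.
MULTI_CHAR_OPERATORS = {'+=', '-=', '**', '--'}


def merged_tokens(source):
    """Tokenize with shlex, then join adjacent tokens that form a known operator."""
    merged = []
    # Iterating a shlex instance yields tokens until the input is exhausted.
    for tok in shlex.shlex(source):
        # Merge only when the previous token plus this one is a known operator.
        if merged and merged[-1] + tok in MULTI_CHAR_OPERATORS:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


print(merged_tokens('x += 3'))   # the + and = are rejoined
print(merged_tokens('x - -3'))   # the two - tokens are wrongly rejoined too
```

The first call gives me the '+=' I want, but the second turns the binary minus and the unary minus into a single '--', because shlex has already thrown away the whitespace that separated them by the time I see the tokens.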
Is there any way to do what I want simply with the shlex module? Or am I probably going to have to write a tokenizer myself?