How can I deal with multi-character operators when lexing with shlex in Python?

Question

I'm writing a language parser/interpreter, and I figured I could use the shlex module for generating tokens, but ran into an issue when working with multi-character operators, such as += or **. The shlex module will lex those as two separate operators, which is not ideal.

>>> import shlex
>>> t = shlex.shlex('x += 3')
>>> t.get_token()
'x'
>>> t.get_token()
'+'
>>> t.get_token()
'='

I thought I'd try adding the operator characters to the lexer's wordchars attribute, but that breaks on code written without whitespace, because the operators get glued to the adjacent names and numbers:

>>> t = shlex.shlex('x+=3')
>>> t.wordchars += '+=*-/'
>>> t.get_token()
'x+=3'

So then I had the idea that I could just manually rebuild operators from tokens when I have multiple tokens in a row that could be a valid operator. For example, if I have a + token followed by a =, then I would concatenate them to make a '+='. However, this solution creates a problem with expressions like x - -3. It would get tokenized into x, --, and 3, which also isn't what I want.
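For illustration, here is roughly what that merging idea looks like. It is only a sketch: the OPERATORS set and the merge_operators helper are made-up names for this example, and I'm assuming -- is a valid operator in the language.

```python
import shlex

# Hypothetical operator set, for illustration only.
OPERATORS = {'+=', '-=', '**', '--'}

def merge_operators(tokens):
    """Glue two consecutive tokens together when they form a known operator."""
    merged = []
    for tok in tokens:
        if merged and merged[-1] + tok in OPERATORS:
            merged[-1] += tok      # e.g. '+' followed by '=' becomes '+='
        else:
            merged.append(tok)
    return merged

print(merge_operators(list(shlex.shlex('x += 3'))))   # ['x', '+=', '3']
print(merge_operators(list(shlex.shlex('x - -3'))))   # ['x', '--', '3']
```

The second result is the problem: shlex has already thrown away the whitespace, so the merge step cannot tell - - apart from --.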

Is there a simple way to do this with the shlex module, or will I have to write a tokenizer myself?
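One thing that might be worth trying, assuming Python 3.6 or later, is the punctuation_chars argument to shlex.shlex, which returns a run of adjacent punctuation characters as a single token. The sketch below is not a full solution; the tokenize wrapper and its default operator_chars string are names I made up for the example.

```python
import shlex

def tokenize(source, operator_chars='+-*/='):
    """Lex `source`, returning each run of adjacent operator characters as one token."""
    # punctuation_chars (Python 3.6+) groups consecutive characters from this
    # set into one token; whitespace still ends the run.
    lexer = shlex.shlex(source, punctuation_chars=operator_chars)
    return list(lexer)

print(tokenize('x += 3'))   # ['x', '+=', '3']
print(tokenize('x+=3'))     # ['x', '+=', '3']
print(tokenize('x - -3'))   # ['x', '-', '-', '3']
print(tokenize('2**-3'))    # ['2', '**-', '3']
```

Because whitespace ends a run, x - -3 keeps its two separate minus signs. The last line shows the limit of this approach: any unbroken run of operator characters comes back as one token, so something like **- would still need to be split afterwards, or handled by a hand-written tokenizer.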
