
I'm implementing an interpreter for a long-outdated text editor's scripting language, and I'm having some trouble getting a lexer to work properly.

Here's an example of the problematic part of the language:

T
L /LOCATE ME/
C /LOCATE ME/CHANGED ME/ * *
C ;CHANGED ME;CHANGED ME AGAIN; 1 *

The / characters appear to quote strings, and they also act as delimiters for the C (CHANGE) command in a sed-like syntax, although the command accepts any character as the delimiter.

I've probably implemented about half the most common commands, just using parse_tokens(line.split()) until now. That was quick and dirty, but it worked surprisingly well.

To avoid writing my own lexer, I tried shlex.

It works pretty well, except for the CHANGE cases:

import shlex

def shlex_test(cmd_str):
    lex = shlex.shlex(cmd_str)
    lex.quotes = '/'
    return list(lex)

print(shlex_test('L /spaced string/'))
# OK! gives: ['L', '/spaced string/']

print(shlex_test('C /spaced string/another string/ * *'))
# gives   : ['C', '/spaced string/', 'another', 'string/', '*', '*']
# desired : any format that doesn't split on a space between /'s

print(shlex_test('C ;a b;b a;'))
# gives   : ['C', ';', 'a', 'b', ';', 'b', 'a', ';']
# desired : same format as CHANGE command above

Does anyone know an easy way to accomplish this (with shlex or otherwise)?

EDIT:

If it helps, here's the CHANGE command syntax given in the help file:

'''
C [/stg1/stg2/ [n|n m]]

    The CHANGE command replaces the m-th occurrence of "stg1" with "stg2"
for the next n lines.  The default value for m and n is 1.'''

The X and Y commands are similarly difficult to tokenize:

'''
X [/command/[command/[...]]n]
Y [/command/[command/[...]]n]

    The X and Y commands allow the execution of several commands contained
in one command.  To define an X or Y "command string", enter X (or Y)
followed by a space, then individual commands, each separated by a
delimiter (e.g. a period ".").  An unlimited number of commands may be
placed in the X or Y command string.  Once the command string has been
defined, entering X (or Y) followed optionally by a count n will execute
the defined command string n times.  If n is not specified, it will
default to 1.'''
Robbie Rosati

1 Answer


The problem is probably that / does not act as a quote character here, only as a delimiter. I'm guessing that the third character of the command (index 2) always defines the delimiter. Also, you don't need the / or ; in the output, do you?

I did the following using only split, for the L and C commands:

>>> def parse(cmd):
...     delim = cmd[2]
...     return cmd.split(delim)
...
>>> c_cmd = "C /LOCATE ME/CHANGED ME/ * *"
>>> parse(c_cmd)
['C ', 'LOCATE ME', 'CHANGED ME', ' * *']

>>> c_cmd2 = "C ;a b;b a;"
>>> parse(c_cmd2)
['C ', 'a b', 'b a', '']

>>> l_cmd = "L /spaced string/"
>>> parse(l_cmd)
['L ', 'spaced string', '']
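
The hard-coded index 2 only works when the command name is a single letter followed by exactly one space. A slightly more general sketch is below. The name split_delimited is made up for illustration, and it assumes that the first non-space character after the command name is the delimiter and that the closing delimiter is always present. A bare count such as "X 3" is not handled, because the count would be mistaken for a delimiter.

```python
def split_delimited(cmd):
    """Split 'NAME <d>field<d>field<d> tail' into (name, fields, tail).

    Assumptions (not confirmed by the help file): the first non-space
    character after the command name is the delimiter, and whatever
    follows the last delimiter is the optional count part.
    """
    name, _, rest = cmd.strip().partition(" ")
    rest = rest.lstrip()
    if not rest:
        return name, [], ""  # bare command such as "T"
    delim = rest[0]
    parts = rest[1:].split(delim)
    # The last piece comes after the closing delimiter.
    return name, parts[:-1], parts[-1].strip()


print(split_delimited("C /LOCATE ME/CHANGED ME/ * *"))
# ('C', ['LOCATE ME', 'CHANGED ME'], '* *')
print(split_delimited("X .T.L /LOCATE ME/. 3"))
# ('X', ['T', 'L /LOCATE ME/'], '3')
```

The second call shows why this shape might also suit the X and Y commands: as long as the outer delimiter differs from the ones used inside, the inner commands come back intact and can be parsed again one by one.
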

For the optional " * *" part, you could call split() on the last list element. With no argument it splits on any whitespace, so the leading space does not produce an empty string:

>>> parse(c_cmd)[-1].split()
['*', '*']
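
Putting the two steps together for CHANGE, here is a rough sketch that follows the C [/stg1/stg2/ [n|n m]] syntax from your help file. The function name parse_change is hypothetical. The counts are kept as strings because the quoted help text doesn't say what * means, and the default of 1 for both n and m is taken from the help text.

```python
def parse_change(cmd):
    """Parse 'C <d>stg1<d>stg2<d> [n [m]]' into (stg1, stg2, n, m).

    Assumes, as above, that the delimiter is the character at index 2.
    Raises ValueError if there are fewer than three delimiters or more
    than two counts.
    """
    delim = cmd[2]
    # maxsplit=3 keeps everything after the closing delimiter together.
    _, stg1, stg2, tail = cmd.split(delim, 3)
    counts = tail.split()
    if len(counts) > 2:
        raise ValueError("expected at most two counts: %r" % tail)
    # Missing counts default to "1", as stated in the help file.
    n, m = counts + ["1"] * (2 - len(counts))
    return stg1, stg2, n, m


print(parse_change("C /LOCATE ME/CHANGED ME/ * *"))
# ('LOCATE ME', 'CHANGED ME', '*', '*')
print(parse_change("C ;a b;b a;"))
# ('a b', 'b a', '1', '1')
```
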
Cwt