I have the following kind of text that I want to tokenize.
Text:
<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2
I want to tokenize it into three kinds of tokens: COMMENT_START, COMMENT_END, and OTHER.
For example, for the above text, I want the following output.
COMMENT_START <!--
OTHER foo-bar
COMMENT_END -->
OTHER Text1 <!-!>
COMMENT_START <!--
OTHER bar-baz
COMMENT_END -->
OTHER Text2
Inspired by https://docs.python.org/3.4/library/re.html#writing-a-tokenizer I wrote this program.
import re

def tokenize(code):
    token_specification = [
        ('COMMENT_START', '<!--'),
        ('COMMENT_END', '-->'),
        ('OTHER', '.*')
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        print(kind, value)

test_string = '<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2'
tokenize(test_string)
But it doesn't give the desired output. This is the output I get.
COMMENT_START <!--
OTHER foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2
OTHER
The problem is that the regular expression for OTHER consumes the entire string. It is meant to match everything apart from the special tokens such as <!-- and -->.
How can I write this program so that the regular expression for OTHER does not consume <!-- or -->, but leaves them alone to be matched later by the regular expressions for COMMENT_START and COMMENT_END?
More generally, how do I write a tokenizer that yields the special tokens I am interested in, and also yields everything in between as tokens?
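For reference, here is one sketch of what I am imagining the fix might look like (I am not sure this is the idiomatic approach): "temper" the dot in OTHER with a negative lookahead so it never steps onto the start of a special token, then strip and skip the whitespace-only OTHER matches.

```python
import re

def tokenize(code):
    token_specification = [
        ('COMMENT_START', r'<!--'),
        ('COMMENT_END', r'-->'),
        # Tempered dot: consume any run of characters, but stop before
        # a position where a special token begins.
        ('OTHER', r'(?:(?!<!--|-->).)+'),
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind).strip()
        if value:  # skip OTHER matches that are only whitespace
            yield kind, value

test_string = '<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2'
for kind, value in tokenize(test_string):
    print(kind, value)
```

With the test string above this prints the desired eight tokens in order, and the stray <!-!> stays inside an OTHER token because the lookahead only guards against <!-- and -->.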