
I have the following kind of text that I want to tokenize.

Text:

<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2

I want to tokenize it into three kinds of tokens, COMMENT_START, COMMENT_END and OTHER.

For example, for the above text, I want the following output.

COMMENT_START <!--
OTHER  foo-bar 
COMMENT_END -->
OTHER  Text1 <!-!>
COMMENT_START <!--
OTHER  bar-baz
COMMENT_END -->
OTHER Text2

Inspired by https://docs.python.org/3.4/library/re.html#writing-a-tokenizer I wrote this program.

import re

def tokenize(code):
    token_specification = [
        ('COMMENT_START', '<!--'),
        ('COMMENT_END', '-->'),
        ('OTHER', '.*')
    ]

    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        print(kind, value)

test_string = '<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2'
tokenize(test_string)

But it doesn't give the desired output. This is the output I get.

COMMENT_START <!--
OTHER  foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2
OTHER

The problem is that the regular expression for OTHER is consuming the entire string.

The regular expression for OTHER is meant to match everything else apart from the special tokens such as <!-- and -->.

How can I write this program so that the regular expression for OTHER does not consume <!-- or -->, leaving them to be matched later by the regular expressions for COMMENT_START and COMMENT_END?

More generally, how can I write a tokenizer that yields the special tokens we are interested in and also yields everything else as tokens?

Lone Learner
  • I had exactly the same problem. In my case I have human-entered text in a CRM and I want to a) replace IDs in the form of a SHA-1 hash with links to that entity, b) turn everything that resembles a link into a clickable link, and c) convert `<` and `>` to the printable strings `&lt;` and `&gt;`. I asked here https://stackoverflow.com/q/68988193/1315009 and the answer @rici gave suggested a left-to-right parser. Then I ran into exactly the problem you describe, with tokens "ID", "LINK" and "HTML-CHAR" and then "ANYTHING-ELSE": https://stackoverflow.com/q/69059734/1315009 – Xavi Montero Sep 05 '21 at 09:26

2 Answers


The problem is that your OTHER expression, .*, is greedy: it matches everything up to the end of the line, comment delimiters included. To get around this, you have two options. One is to make OTHER match just one character, and then later collapse runs of consecutive OTHER tokens into a single OTHER token. Like this:

token_specification = [
    ('COMMENT_START', '<!--'),
    ('COMMENT_END', '-->'),
    ('OTHER', '.')
]

For a shortened input, '<!-- foo --> Text1 <!-- bar -->', the output then is:

COMMENT_START <!--
OTHER  
OTHER f
OTHER o
OTHER o
OTHER  
COMMENT_END -->
OTHER  
OTHER T
OTHER e
OTHER x
OTHER t
OTHER 1
(etc. . . .)

By matching just one character in "other", you give it a chance to look for a comment at every position. You'd then have to iterate over the token list and combine consecutive "other" tokens.
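
The combining step can be sketched as follows. This is a minimal sketch, not the only way to do it: the names TOKEN_SPEC and TOK_REGEX are illustrative, itertools.groupby is one possible way to merge the runs, and re.DOTALL is an added assumption so that '.' also matches newlines.

```python
import re
from itertools import groupby

# Same specification as above: OTHER matches exactly one character.
TOKEN_SPEC = [
    ('COMMENT_START', r'<!--'),
    ('COMMENT_END', r'-->'),
    ('OTHER', r'.'),
]
# re.DOTALL is an assumption added here so that newlines also become OTHER.
TOK_REGEX = re.compile(
    '|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC), re.DOTALL)

def tokenize(code):
    """Yield (kind, value) pairs, merging runs of one-character OTHER tokens."""
    raw = ((mo.lastgroup, mo.group()) for mo in TOK_REGEX.finditer(code))
    for kind, group in groupby(raw, key=lambda tok: tok[0]):
        if kind == 'OTHER':
            # Join consecutive single characters into one OTHER token.
            yield kind, ''.join(value for _, value in group)
        else:
            # Adjacent special tokens (e.g. '<!--<!--') stay separate.
            for tok in group:
                yield tok

tokens = list(tokenize('<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2'))
for kind, value in tokens:
    print(kind, repr(value))
```

Because every character is examined individually, text after the last delimiter (" Text2" here) is kept as a final OTHER token.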

The other way is to make your OTHER pattern non-greedy and add a lookahead for the other token types:

token_specification = [
    ('COMMENT_START', '<!--'),
    ('COMMENT_END', '-->'),
    ('OTHER', r'.*?(?=-->|<!--)')
]

For a shortened input, '<!-- foo --> Text1 <!-- bar -->', this gives the desired tokens:

COMMENT_START <!--
OTHER  foo 
COMMENT_END -->
OTHER  Text1 
COMMENT_START <!--
OTHER  bar 
COMMENT_END -->

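However, this solution has two drawbacks. Text after the last delimiter (such as " Text2" in the question's string) is never matched, because the lookahead requires a delimiter further on; using r'.+?(?=-->|<!--|$)' for OTHER covers that case. It is also less extensible, because you have to repeat the other tokens' patterns inside OTHER. If you had more kinds of tokens, this would become unwieldy.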
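
One way to reduce the repetition is to generate the lookahead from the list of special tokens instead of writing it by hand. The sketch below is illustrative only: SPECIAL and build_regex are hypothetical names, the special patterns are assumed to contain no capturing groups of their own, and re.DOTALL is an added assumption. It uses .+? rather than .*? so OTHER can never be empty, and adds \Z to the lookahead so that trailing text with no delimiter after it is still matched.

```python
import re

# Only the special tokens are listed; OTHER is derived from them.
SPECIAL = [
    ('COMMENT_START', r'<!--'),
    ('COMMENT_END', r'-->'),
]

def build_regex(special):
    """Build the combined regex, with OTHER stopping before any special token."""
    stop = '|'.join(pattern for _, pattern in special)
    # Non-greedy, at least one character, up to the next special token
    # or the end of the input (\Z).
    other = ('OTHER', r'.+?(?=%s|\Z)' % stop)
    spec = special + [other]
    return re.compile('|'.join('(?P<%s>%s)' % pair for pair in spec), re.DOTALL)

TOK_REGEX = build_regex(SPECIAL)

def tokenize(code):
    """Yield (kind, value) pairs for every token in code."""
    for mo in TOK_REGEX.finditer(code):
        yield mo.lastgroup, mo.group()

for kind, value in tokenize('<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2'):
    print(kind, repr(value))
```

Adding a new token kind then means adding one entry to SPECIAL; the OTHER pattern follows automatically.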

I'd recommend you take a look at parsing libraries like parcon or pyparsing, which are better suited for doing this sort of parsing than are plain regexes.

BrenBarn
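
(?:(?!<!--|-->).)*

Use this as the pattern for OTHER. It consumes one character at a time, but only while neither <!-- nor --> begins at the current position. See the demo: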

http://regex101.com/r/sD1lU8/5
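
Plugged into the tokenizer from the question, the pattern can be used roughly as follows. This is a sketch under two assumptions that are not part of the answer: the trailing * is changed to +, because with * the OTHER alternative can also match the empty string (for example at the end of the input, which produces an empty OTHER token), and re.DOTALL is added so that '.' also matches newlines.

```python
import re

TOK_REGEX = re.compile(
    r'(?P<COMMENT_START><!--)'
    r'|(?P<COMMENT_END>-->)'
    # One or more characters, none of which starts a '<!--' or '-->'.
    r'|(?P<OTHER>(?:(?!<!--|-->).)+)',
    re.DOTALL,
)

def tokenize(code):
    """Return a list of (kind, value) pairs for every token in code."""
    return [(mo.lastgroup, mo.group()) for mo in TOK_REGEX.finditer(code)]

for kind, value in tokenize('<!-- foo-bar --> Text1 <!-!> <!-- bar-baz --> Text2'):
    print(kind, repr(value))
```

Unlike a lookahead that requires a following delimiter, this pattern also picks up text after the last delimiter.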

vks