
I'm trying to remove the characters \/<>~?`% if they appear within three &lt;'s and &gt;'s.

For the string:

<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to &lt;&lt;&lt;test/verify&gt;&gt;&gt; in &lt;&lt;&lt;one% g?o&gt;&gt;&gt;</body></html>

(reads like Multiple <<<parameter>>> options %to <<<test/verify>>> in <<<one% g?o>>>.)

The final string I want is:

<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to &lt;&lt;&lt;testverify&gt;&gt;&gt; in &lt;&lt;&lt;one go&gt;&gt;&gt;</body></html>

Note that the '%' in '%to' is not removed since it's not within three &lt;'s and &gt;'s.

I tried these regexes so far:

>>> import re
>>> s = '<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to &lt;&lt;&lt;test/verify&gt;&gt;&gt; in &lt;&lt;&lt;one% g?o&gt;&gt;&gt;</body></html>'
>>>
>>> # just getting everything between <<< and >>> is easy
... re.sub(r'((?:&lt;){3})(.*?)((?:&gt;){3})', r'\1\2\3', s)
'<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to &lt;&lt;&lt;test/verify&gt;&gt;&gt; in &lt;&lt;&lt;one% g?o&gt;&gt;&gt;</body></html>'
>>> re.findall(r'((?:&lt;){3})(.*?)((?:&gt;){3})', s)
[('&lt;&lt;&lt;', 'parameter', '&gt;&gt;&gt;'),
 ('&lt;&lt;&lt;', 'test/verify', '&gt;&gt;&gt;'),
 ('&lt;&lt;&lt;', 'one% g?o', '&gt;&gt;&gt;')]

But matching only a sequence of non-\/<>~?`% characters between the delimiters doesn't work, because any block that contains one of those characters fails to match at all:

>>> re.findall(r'((?:&lt;){3})([^\\/<>~?`%]*?)((?:&gt;){3})', s)
[('&lt;&lt;&lt;', 'parameter', '&gt;&gt;&gt;')]
>>> re.findall(r'((?:&lt;){3})((?:[^\\/<>~?`%]*?)*?)((?:&gt;){3})', s)
[('&lt;&lt;&lt;', 'parameter', '&gt;&gt;&gt;')]
>>> re.findall(r'((?:&lt;){3})((?:[^\\/<>~?`%])*?)((?:&gt;){3})', s)
[('&lt;&lt;&lt;', 'parameter', '&gt;&gt;&gt;')]
aneroid

1 Answer


The solution I went with keeps the original <<<.*?>>> regex and passes a function as the repl argument of re.sub (this is Python 2, where str.translate accepts a deletechars argument):

>>> def illrepl(matchobj):
...     return ''.join([matchobj.group(1),
...                     matchobj.group(2).translate(None, r'\/<>~?`%'),
...                     matchobj.group(3)])
...
>>> re.sub(r'((?:&lt;){3})(.*?)((?:&gt;){3})', illrepl, s)
'<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to &lt;&lt;&lt;testverify&gt;&gt;&gt; in &lt;&lt;&lt;one go&gt;&gt;&gt;</body></html>'
>>> # verify that this is the final string I wanted:
... re.sub(r'((?:&lt;){3})(.*?)((?:&gt;){3})', illrepl, s) == '<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to &lt;&lt;&lt;testverify&gt;&gt;&gt; in &lt;&lt;&lt;one go&gt;&gt;&gt;</body></html>'
True
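
On Python 3, str.translate takes a single table argument, so the two-argument call above would raise a TypeError. A minimal sketch of the same idea, assuming Python 3 and using str.maketrans to build a deletion table (the names ILLEGAL, DELETE_TABLE, BLOCK and strip_illegal are made up for this example):

```python
import re

# Characters to delete inside <<< ... >>> blocks (backslash written as '\\').
ILLEGAL = '\\/<>~?`%'

# Third argument of str.maketrans lists characters that map to None (deleted).
DELETE_TABLE = str.maketrans('', '', ILLEGAL)

# Same pattern as in the answer: three &lt;, lazy content, three &gt;.
BLOCK = re.compile(r'((?:&lt;){3})(.*?)((?:&gt;){3})')

def strip_illegal(match):
    # Keep the delimiters as they are and clean only the enclosed text.
    return match.group(1) + match.group(2).translate(DELETE_TABLE) + match.group(3)

s = ('<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to '
     '&lt;&lt;&lt;test/verify&gt;&gt;&gt; in '
     '&lt;&lt;&lt;one% g?o&gt;&gt;&gt;</body></html>')

cleaned = BLOCK.sub(strip_illegal, s)
print(cleaned)
```

Since the table is built once and works on str (which is always Unicode in Python 3), this also sidesteps the str/unicode difference in Python 2's translate.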

Since the &lt;'s and &gt;'s don't need to change and the match only covers the text within them, I can make those parts of the regex non-capturing and simplify illrepl to clean the full match at group(0). That is safe because the &lt; and &gt; entities contain none of the illegal/invalid characters being removed:

>>> def illrepl(matchobj):
...     # return matchobj.group(0).translate(None, r'\/<>~?`%')  # may have unicode so can't use this
...     return re.sub(r'[\\/<>~?`%]+', '', matchobj.group(0))
...
>>> re.sub(r'(?:&lt;){3}(.*?)(?:&gt;){3}', illrepl, s)
'<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to &lt;&lt;&lt;testverify&gt;&gt;&gt; in &lt;&lt;&lt;one go&gt;&gt;&gt;</body></html>'

I'm not certain whether this could have been done with a single regex, without the illrepl function generating the replacements and calling re.sub again inside it.
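
One way it might be done in a single pass is with a lookahead: remove an illegal character only when a closing &gt;&gt;&gt; can be reached from it without first crossing an opening &lt;&lt;&lt;. This is only a sketch. It assumes the delimiters are balanced (a stray &gt;&gt;&gt; with no opener would cause characters before it to be removed too) and that a block does not span lines. The names ILLEGAL_IN_BLOCK and strip_illegal_single_pass are invented for this example:

```python
import re

ILLEGAL_IN_BLOCK = re.compile(
    r'[\\/<>~?`%]'               # one illegal character ...
    r'(?='                       # ... only if what follows is:
    r'(?:(?!(?:&lt;){3}).)*?'    #   any text that does not start a new <<< block
    r'(?:&gt;){3}'               #   and then a closing >>>
    r')'
)

def strip_illegal_single_pass(text):
    # Each illegal character is checked on its own; the lookahead consumes nothing,
    # so the delimiters and the rest of the text are left untouched.
    return ILLEGAL_IN_BLOCK.sub('', text)

s = ('<html><body>Multiple &lt;&lt;&lt;parameter&gt;&gt;&gt; options %to '
     '&lt;&lt;&lt;test/verify&gt;&gt;&gt; in '
     '&lt;&lt;&lt;one% g?o&gt;&gt;&gt;</body></html>')

result = strip_illegal_single_pass(s)
print(result)
```

The lookahead rescans the text after every candidate character, so on long inputs the two-step version with a replacement function is probably both faster and easier to read.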

aneroid