I'm trying to remove the characters \/<>~?`% if they appear within three <'s and >'s.
For the string:
<html><body>Multiple <<<parameter>>> options %to <<<test/verify>>> in <<<one% g?o>>></body></html>
(reads like Multiple <<<parameter>>> options %to <<<test/verify>>> in <<<one% g?o>>>.)
The final string I want is:
<html><body>Multiple <<<parameter>>> options %to <<<testverify>>> in <<<one go>>></body></html>
Note that the '%' in '%to' is not removed since it's not within three <'s and >'s.
I've tried these regexes so far:
>>> import re
>>> s = '<html><body>Multiple <<<parameter>>> options %to <<<test/verify>>> in <<<one% g?o>>></body></html>'
>>>
>>> # just getting everything between <<< and >>> is easy
... re.sub(r'((?:<){3})(.*?)((?:>){3})', r'\1\2\3', s)
'<html><body>Multiple <<<parameter>>> options %to <<<test/verify>>> in <<<one% g?o>>></body></html>'
>>> re.findall(r'((?:<){3})(.*?)((?:>){3})', s)
[('<<<', 'parameter', '>>>'),
('<<<', 'test/verify', '>>>'),
('<<<', 'one% g?o', '>>>')]
But trying to match a sequence of characters other than \/<>~?`% doesn't work, since any block containing one of those characters simply gets excluded from the matches:
>>> re.findall(r'((?:<){3})([^\\/<>~?`%]*?)((?:>){3})', s)
[('<<<', 'parameter', '>>>')]
>>> re.findall(r'((?:<){3})((?:[^\\/<>~?`%]*?)*?)((?:>){3})', s)
[('<<<', 'parameter', '>>>')]
>>> re.findall(r'((?:<){3})((?:[^\\/<>~?`%])*?)((?:>){3})', s)
[('<<<', 'parameter', '>>>')]
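One direction that might work is to keep the simple <<<...>>> match from above and clean the captured text in a replacement function, rather than trying to exclude the characters inside the pattern itself. This is only a minimal sketch under that assumption; the names UNWANTED, BLOCK and clean_blocks are hypothetical, and it assumes the blocks are never nested.

```python
import re

# Characters to strip, taken from the list in the question.
UNWANTED = re.compile(r'[\\/<>~?`%]')
# Same idea as the working pattern above: delimiters and body captured separately.
BLOCK = re.compile(r'(<{3})(.*?)(>{3})')

def clean_blocks(text):
    """Remove the unwanted characters only inside <<< ... >>> blocks."""
    def strip_body(match):
        # Keep the delimiters untouched and clean just the captured body.
        body = UNWANTED.sub('', match.group(2))
        return match.group(1) + body + match.group(3)
    return BLOCK.sub(strip_body, text)

s = '<html><body>Multiple <<<parameter>>> options %to <<<test/verify>>> in <<<one% g?o>>></body></html>'
result = clean_blocks(s)
print(result)
```

Text outside the blocks, such as the '%' in '%to', is never passed to strip_body, so it is left alone.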