What I am trying to do:
Remove suspicious comments from HTML mails with bs4. I have now run into a problem with so-called conditional comments of the downlevel-revealed type.
import bs4
html = 'A<!--[if expression]>a<![endif]-->' \
'B<![if expression]>b<![endif]>'
soup = bs4.BeautifulSoup(html, 'html5lib')
for comment in soup.find_all(text=lambda text: isinstance(text, bs4.Comment)):
    comment.extract()
Before extracting comments:
'A',
'[if expression]>a<![endif]',
'B',
'[if expression]',
'b',
'[endif]',
After extracting comments:
'A',
'B',
'b',
Problem:
The lowercase 'b' should also be removed. bs4 parses the first conditional comment as one single Comment object, but the second one as three separate objects: Comment('[if expression]'), NavigableString('b') and Comment('[endif]'). Extraction only removes the two Comment objects, so the NavigableString with content 'b' remains in the DOM.
Is there a solution to this?
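One direction that might work, as a minimal sketch rather than a confirmed fix: treat a comment of the form [if ...] as the start of a block, walk its following siblings until an [endif] comment is found, and remove everything in between together with both markers. The names remove_comments, OPEN_RE and CLOSE_RE are hypothetical, and the regular expressions are only an assumption about what the markers look like. The sketch only covers the case where both markers share the same parent and blocks are not nested.

```python
import re

import bs4

# Assumed shape of the markers of a downlevel-revealed block.
# OPEN_RE must not match the downlevel-hidden form, whose comment text
# continues after the first ']' (e.g. '[if expression]>a<![endif]').
OPEN_RE = re.compile(r'\s*\[if\b[^\]]*\]\s*', re.IGNORECASE)
CLOSE_RE = re.compile(r'\s*\[endif\]\s*', re.IGNORECASE)


def is_comment(node):
    return isinstance(node, bs4.Comment)


def remove_comments(soup):
    """Remove all comments, plus the content enclosed by
    [if ...] / [endif] comment pairs that are siblings."""
    for comment in soup.find_all(string=is_comment):
        if comment.parent is None:
            # Already removed as the closing marker of an earlier block.
            continue
        if OPEN_RE.fullmatch(comment):
            enclosed = []
            closer = None
            # Collect first, extract afterwards, so the tree is not
            # modified while its siblings are being iterated.
            for sibling in comment.next_siblings:
                if is_comment(sibling) and CLOSE_RE.fullmatch(sibling):
                    closer = sibling
                    break
                enclosed.append(sibling)
            # Only drop the enclosed nodes if a closing marker exists;
            # otherwise fall back to removing just the comment itself.
            if closer is not None:
                for node in enclosed:
                    node.extract()
                closer.extract()
        comment.extract()
```

With the soup from the example above, remove_comments(soup) should leave only 'A' and 'B' as strings. If the closing marker sits under a different parent than the opening one, the enclosed content is left untouched, so that case would need separate handling.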