I'm using the lxml library in Python to clean HTML pages of potentially harmful code and parts I don't want. I noticed some strange behavior in clean_html: when given an empty <li> node, it removes the closing </li> tag but not the opening one.
For example,
from lxml.html.clean import Cleaner
text = '<ul><li></li><li>FooBar</li></ul>'
cleaner = Cleaner()
print(cleaner.clean_html(text))
will output <ul><li><li>FooBar</li></ul>
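One way to check whether the output is actually malformed, rather than just serialized differently, is to re-parse the cleaned string and see how many <li> elements lxml finds. This is only a minimal sketch I put together, using the same input and default Cleaner settings as above:

# Minimal check: re-parse the cleaned string and count the <li> nodes
# that lxml recognizes in the resulting tree.
from lxml.html import fromstring
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaned = cleaner.clean_html('<ul><li></li><li>FooBar</li></ul>')
parsed = fromstring(cleaned)
print([li.tag for li in parsed.findall('.//li')])  # how many <li> come back?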
...
As far as I can tell, this only happens with <li> tags. Is this a bug in the lxml library, or am I doing something wrong?
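For comparison, here is a rough sketch that runs the same empty-element test against a few other tags (the tag choices are just arbitrary examples, not an exhaustive check):

# Compare how the default Cleaner serializes an empty element for
# several different tags, to see whether only <li> is affected.
from lxml.html.clean import Cleaner

cleaner = Cleaner()
for snippet in ('<ul><li></li><li>FooBar</li></ul>',
                '<div><p></p><p>FooBar</p></div>',
                '<table><tr><td></td><td>FooBar</td></tr></table>'):
    print(cleaner.clean_html(snippet))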
Any insight would be appreciated. Thanks!