
I'm using the lxml library in Python to clean HTML pages of potentially harmful or unwanted code. I noticed strange behavior: when given an empty <li> node, the cleaner removes the closing </li> tag but not the opening one.

For example,

from lxml.html.clean import Cleaner
text = '<ul><li></li><li>FooBar</li></ul>'
cleaner = Cleaner()
print(cleaner.clean_html(text))

will output <ul><li><li>FooBar</li></ul>.

As far as I can tell, this only happens with <li> tags. Is this a bug in the lxml library, or am I doing something wrong?

Any insight would be appreciated. Thanks!

Robin

1 Answer


The closing tag for <li> is optional in HTML, so it's not a bug, though it may not be the behavior you want.

You could force a closing tag by printing it as XML:

from lxml.html.clean import Cleaner
import lxml.html as LH
text = '<ul><li></li><li>FooBar</li></ul>'
cleaner = Cleaner()
root = LH.fromstring(cleaner.clean_html(text))
print(LH.tostring(root, method='xml'))

yields

<ul><li/><li>FooBar</li></ul>
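To see the serializer difference directly, you can parse the cleaner's output from the question and print the same tree both ways (a minimal sketch; the input string here is the output reported in the question, not a call to Cleaner itself):

```python
import lxml.html as LH

# Parse the cleaned markup from the question, where the closing tag
# of the empty list item was omitted. The HTML parser still builds
# two sibling <li> elements from it.
root = LH.fromstring('<ul><li><li>FooBar</li></ul>')

# The HTML serializer is allowed to omit optional end tags like </li>;
# the XML serializer must close every element, so the empty item
# comes out as a self-closed <li/>.
html_out = LH.tostring(root, method='html')
xml_out = LH.tostring(root, method='xml')
```

Note that tostring returns bytes by default; pass encoding='unicode' if you want a str.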
unutbu
  • Thank you very much for your help. I didn't know that the closing `li` tag was optional, but it is still weird that the function treats empty tags differently than the ones with content... Thanks for your solution! – Robin May 24 '13 at 13:46