
I'm using the lxml library in Python to clean HTML pages of potentially harmful or unwanted code. I noticed strange behavior: when given an empty <li> node, the cleaner removes the closing </li> tag but not the opening one.

For example,

from lxml.html.clean import Cleaner
text = '<ul><li></li><li>FooBar</li></ul>'
cleaner = Cleaner()
print(cleaner.clean_html(text))

will output <ul><li><li>FooBar</li></ul>.

As far as I can tell, this only happens with <li> tags. Is this a bug in the lxml library, or am I doing something wrong?

Any insight would be appreciated. Thanks!

Robin

1 Answer


The closing tag for <li> is optional in HTML, so it's not a bug, though it may not be the behavior you want.

You could force a closing tag by printing it as XML:

from lxml.html.clean import Cleaner
import lxml.html as LH
text = '<ul><li></li><li>FooBar</li></ul>'
cleaner = Cleaner()
root = LH.fromstring(cleaner.clean_html(text))
print(LH.tostring(root, method='xml'))

yields

<ul><li/><li>FooBar</li></ul>
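To see the serializer difference directly, you can parse the cleaner's output from the question and print the same tree both ways (a minimal sketch; the input string here is the output reported in the question, not a call to Cleaner itself):

```python
import lxml.html as LH

# Parse the cleaned markup from the question, where the closing tag
# of the empty list item was omitted. The HTML parser still builds
# two sibling <li> elements from it.
root = LH.fromstring('<ul><li><li>FooBar</li></ul>')

# The HTML serializer is allowed to omit optional end tags like </li>;
# the XML serializer must close every element, so the empty item
# comes out as a self-closed <li/>.
html_out = LH.tostring(root, method='html')
xml_out = LH.tostring(root, method='xml')
```

Note that tostring returns bytes by default; pass encoding='unicode' if you want a str.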
unutbu
  • Thank you very much for your help. I didn't know that the closing `li` tag was optional, but it is still weird that the function treats empty tags differently than the ones with content... Thanks for your solution! – Robin May 24 '13 at 13:46