I'm trying to use cssselect on some HTML page parsed by lxml, but I found that only one parser gives the expected result:

This works just fine:

lxml.html.fromstring("...").cssselect("div.foo")

This returns no results:

lxml.html.html5parser.fromstring("...").cssselect("div.foo")

What's the difference? And can I get cssselect to work with html5parser?

viraptor

1 Answer

These two answers explain the reason:

How to remove namespace value from inside lxml.html.html5paser element tag

lxml html5parser ignores "namespaceHTMLElements=False" option

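In short, the html5lib-based parser puts every element in the XHTML namespace (http://www.w3.org/1999/xhtml), while lxml's other parsers leave elements un-namespaced. A selector such as div.foo only matches an un-namespaced div, so it finds nothing in the html5parser tree.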
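The effect of the namespace can be reproduced without lxml. The sketch below is only an analogy using the standard library's xml.etree.ElementTree, not the html5parser code path itself. The variable names and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Same markup twice: once without a namespace, once in the XHTML namespace
# (the latter mimics what the html5lib tree builder produces by default).
plain = ET.fromstring('<html><body><div class="foo"/></body></html>')
namespaced = ET.fromstring(
    '<html xmlns="%s"><body><div class="foo"/></body></html>' % XHTML
)

# A bare "div" only matches elements whose tag is exactly "div".
plain_hits = plain.findall(".//div")
bare_hits = namespaced.findall(".//div")

# Qualifying the name with the namespace makes the lookup succeed again.
qualified_hits = namespaced.findall(".//h:div", {"h": XHTML})

print(namespaced.tag)  # {http://www.w3.org/1999/xhtml}html
print(len(plain_hits), len(bare_hits), len(qualified_hits))  # 1 0 1
```

The bare lookup fails for the same reason div.foo does: the element's real tag is {http://www.w3.org/1999/xhtml}div, not div.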

This may be a bug on the lxml side. To work around it, build the html5lib parser yourself with namespaceHTMLElements=False and pass it to fromstring:

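import lxml.html.html5parser
from html5lib import HTMLParser
from html5lib.treebuilders.etree_lxml import TreeBuilder

# Build an html5lib parser that does not put elements in the XHTML namespace.
parser = HTMLParser(tree=TreeBuilder, namespaceHTMLElements=False)
root = lxml.html.html5parser.fromstring("<div class=\"foo\"></div>", parser=parser)
print(root)  # the tag is now plain "div", with no namespace prefix

# With un-namespaced tags, the selector from the question can match.
print(root.cssselect("div.foo"))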
David Jones
Sraw