I want to extract some HTML elements with Python 3 and the HTML parser provided by lxml.
Consider this HTML:
<!DOCTYPE html>
<html>
<body>
<span class="foo">
<span class="bar">bar</span>
foo
</span>
</body>
</html>
Consider this program:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))
In a browser, the query selector "span.bar" selects only the inner span element, which is what I want. However, the program above prints:
[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo
It looks like my XPath does not behave like the query selector: the sibling text node "foo" is picked up along with the span element. How can I adjust the XPath to select only the bar element, without the text "foo"?