
How can I get the content of the <body> element using html5lib in Python?

Example input data: <html><head></head><body>xxx<b>yyy</b></hr></body></html>

Expected output: xxx<b>yyy</b></hr>

It should work even if the HTML is broken (unclosed tags, etc.).

sorin

1 Answer


html5lib allows you to parse your documents using a variety of standard tree formats. You can do this using lxml, as I've done below, or you can follow the instructions in their user documentation to do it either with minidom, ElementTree or BeautifulSoup.

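import html5lib

with open("mydocument.html", "rb") as f:
    # namespaceHTMLElements=False keeps tag names un-namespaced, so "body" matches
    doc = html5lib.parse(f, treebuilder="lxml", namespaceHTMLElements=False)
content = doc.findtext("body")  # the path is relative to the root <html> element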
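Note that findtext returns only the text that precedes the first child element, so for the example input it yields "xxx" and drops the markup. To get everything inside <body>, one option is to fetch the element with find and serialize its text plus its children. The sketch below illustrates this with the standard library's ElementTree on a well-formed stand-in for the parsed tree, not with html5lib itself; inner_html is a hypothetical helper name, and the same Element methods (find, text, iteration over children) are assumed to be available on the tree your chosen tree builder returns.

```python
import xml.etree.ElementTree as ET


def inner_html(element):
    """Serialize everything inside `element` (text and child markup),
    leaving out the element's own start and end tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the child's tail text, so nothing is lost
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


# Well-formed stand-in for a tree that a real HTML parser would build
root = ET.fromstring("<html><head></head><body>xxx<b>yyy</b><hr/></body></html>")
body = root.find("body")

print(body.text)         # only the text before the first child element
print(inner_html(body))  # text and markup inside <body>
```

With a broken document, the serialized result reflects the tree after the parser has repaired it, so it may not match the input byte for byte.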

Response to comment

It is possible to achieve this without installing any external libraries by using html5lib's own simpletree.py, but judging by the comment at the start of the file I would guess this is not the recommended way...

# Really crappy basic implementation of a DOM-core like thing

If you still want to do this, however, you can parse the html document like so:

import html5lib

with open("mydocument.html", "rb") as f:
    doc = html5lib.parse(f)

and then find the element you're looking for by doing a breadth-first search of the child nodes in the document. The nodes are kept in an array named childNodes and each node has a name stored in the field name.
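The breadth-first search described above could be sketched as follows. The Node class is a hypothetical stand-in that only mimics the two fields mentioned (name and childNodes), so the example runs without html5lib; find_first is likewise an illustrative name, not part of the library.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for a tree node exposing `name` and `childNodes`."""

    def __init__(self, name, children=None):
        self.name = name
        self.childNodes = list(children or [])


def find_first(root, name):
    """Return the first node called `name` in breadth-first order, or None."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.name == name:
            return node
        # Visit this node's children only after everything already queued
        queue.extend(node.childNodes)
    return None


# Tiny hand-built tree shaped like a parsed document
doc = Node("#document", [Node("html", [Node("head"), Node("body", [Node("b")])])])
body = find_first(doc, "body")
print(body.name)
```

On a real parsed document you would pass the object returned by html5lib.parse as root, assuming its nodes expose the same two fields.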

Mia Clarke
  • Don't you have a solution that does not require me to install other python modules? – sorin May 28 '11 at 12:53
  • http://code.google.com/p/html5lib/wiki/UserDocumentation, under "Parsing HTML," can help you there. BeautifulSoup is probably the best choice if you don't have a reason to trust that the HTML is well-formed. – Steve Howard May 28 '11 at 14:24
  • Maybe I wasn't clear, I wanted the entire data inside the `<body>`, and it looks like this returns only the text (not entities). – sorin May 28 '11 at 16:21
  • @sorin I've added a new block in my answer in response to your first comment. With regards to this second comment, changing the call from `findtext` to simply `find` will give you the entire element. – Mia Clarke May 28 '11 at 17:35