Assume that I have some HTML code, like this (generated from Markdown or Textile or something):

<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>
<!-- and so on -->

How could I generate a table of contents for it using Python?

LeafStorm

2 Answers

Use an HTML parser such as lxml or BeautifulSoup to find all header elements.
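
To make that concrete without installing anything, here is a minimal sketch that swaps in the standard library's html.parser for lxml or BeautifulSoup. The names HeaderCollector and collect_headers are made up for this example, and it only handles plain text inside the headers, so treat it as a starting point rather than a finished solution.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderCollector(HTMLParser):
    """Collect (level, text) pairs for every h1..h6 element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headers = []    # finished (level, text) pairs, in document order
        self._level = None   # level of the header currently open, if any
        self._chunks = []    # text pieces seen inside the open header

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._level = int(tag[1])
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears while a header is open.
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._level is not None:
            self.headers.append((self._level, "".join(self._chunks).strip()))
            self._level = None


def collect_headers(html_text):
    """Return a list of (level, text) pairs for the headers in html_text."""
    parser = HeaderCollector()
    parser.feed(html_text)
    parser.close()
    return parser.headers


sample = """<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>"""

# Indent each entry by its header level to get a plain-text outline.
for level, text in collect_headers(sample):
    print("  " * (level - 1) + text)
```

With lxml or BeautifulSoup the collection step shrinks to a single query, but the resulting list of (level, text) pairs would look the same.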

Ignacio Vazquez-Abrams
  • @van: BeautifulSoup is pure Python but not very 3.x-compatible. lxml is good, but requires a C compiler to build. – Ignacio Vazquez-Abrams Feb 05 '10 at 21:03
  • I was really looking more for some example code, but I think I have figured it out. I'll probably use lxml, since (a) it's useful for more stuff, and (b) having a module that looks like a class name disturbs my sense of aesthetics. – LeafStorm Feb 05 '10 at 21:28
  • I'm not sure about 2010, but these days BeautifulSoup can use lxml as its parser, so you can have the best of both worlds (speed and convenience). – Mark Mar 02 '16 at 12:38

Here's an example using lxml and XPath.

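from lxml import etree

# HTMLParser tolerates HTML that is not well-formed XML.
doc = etree.parse("test.xml", etree.HTMLParser())
# Select every header element, in document order.
for node in doc.xpath('//h1|//h2|//h3|//h4|//h5|//h6'):
    print(node.tag, node.text)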
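
The loop above only prints the headers. One possible way to turn them into an actual table of contents is to number them by nesting level. This is a rough sketch: number_headers is a made-up name, and it assumes the headers have already been reduced to (level, text) pairs, for example with int(node.tag[1]) and node.text.

```python
def number_headers(headers):
    """Turn (level, text) pairs into outline-numbered TOC lines (hypothetical helper)."""
    counters = []  # counters[i] is the current number at depth i + 1
    lines = []
    for level, text in headers:
        # Open new depths as needed; a skipped level shows up as a 0 in the number.
        while len(counters) < level:
            counters.append(0)
        # Moving back up the outline drops the deeper counters.
        del counters[level:]
        counters[level - 1] += 1
        number = ".".join(str(n) for n in counters)
        lines.append("  " * (level - 1) + number + " " + text)
    return lines


# The headers from the question, written out by hand as (level, text) pairs.
headers = [
    (1, "A header"),
    (2, "Another header"),
    (2, "Different header"),
    (1, "Another toplevel header"),
]

for line in number_headers(headers):
    print(line)
```

For the sample in the question this prints entries numbered 1, 1.1, 1.2 and 2, each indented by its level.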
kloffy