Generate a table of contents from HTML with Python

Question

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.

My plan so far was to:

1. Extract a list of headers using BeautifulSoup.
2. Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents). There might be a method for doing this replacement inside BeautifulSoup?
3. Output a nested list of links to the headers in a predefined spot.

It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.

Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?

An example:

<p>This is an introduction</p>

<h2>This is a sub-header</h2>
<p>...</p>

<h3>This is a sub-sub-header</h3>
<p>...</p>

<h2>This is a sub-header</h2>
<p>...</p>

Note: lxml isn't an option as the document is not a whole XML document and also may not be otherwise valid under all circumstances. — Oli, Mar 25 '10 at 11:17
"Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents)" this probably should be: add ids to each header tag so that they can be referenced using #id anchor tags, right? — Łukasz, Mar 25 '10 at 11:21
@Łukasz Possibly. I thought IE had problems with giving random things IDs like that but, no you might be right. — Oli, Mar 25 '10 at 11:29

score 3 · Answer 1 · answered Mar 25 '10 at 12:43

Some quickly hacked ugly piece of code:

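from bs4 import BeautifulSoup

# `html` is the content fragment; html.parser does not wrap it in <html>/<body>
soup = BeautifulSoup(html, 'html.parser')

toc = []
header_id = 1
current_list = toc
previous_tag = None

for header in soup.find_all(['h2', 'h3']):
    # Give every header an id so the TOC links have a target
    header['id'] = str(header_id)

    if previous_tag == 'h2' and header.name == 'h3':
        # Start a sub-list for the h3s under the current h2
        current_list = []
    elif previous_tag == 'h3' and header.name == 'h2':
        # Back to the top level: store the finished sub-list
        toc.append(current_list)
        current_list = toc

    # get_text() also works when the header contains nested tags
    current_list.append((header_id, header.get_text()))

    header_id += 1
    previous_tag = header.name

# Store a sub-list that is still open when the headers run out
if current_list is not toc:
    toc.append(current_list)


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#%s">%s</a></li>' % item)
    result.append("</ul>")
    return "\n".join(result)

# Table of contents
print(list_to_html(toc))

# Modified HTML
print(soup)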
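
The `list_to_html` above emits a nested `<ul>` as a direct child of the outer `<ul>`, which browsers tolerate but HTML validators reject. Below is a minimal sketch of one alternative that puts each sub-list inside the `<li>` of its parent header. It assumes the headers have already been collected as `(level, id, text)` tuples; the name `render_toc` and that tuple layout are illustrative and not part of the answer.

```python
from html import escape


def render_toc(headers):
    """Render (level, id, text) tuples as nested <ul> lists.

    Each sub-list is emitted inside the still-open <li> of its parent.
    """
    out = []
    stack = []  # header levels that currently have an open <ul>
    for level, header_id, text in headers:
        # Close every list that is deeper than this header.
        while len(stack) > 1 and level < stack[-1]:
            out.append("</li></ul>")
            stack.pop()
        if not stack or level > stack[-1]:
            # First header, or a deeper one: open a new (sub-)list.
            out.append("<ul>")
            stack.append(level)
        else:
            # Sibling of the previous item: close that item first.
            out.append("</li>")
            stack[-1] = level
        out.append('<li><a href="#%s">%s</a>'
                   % (escape(str(header_id), quote=True), escape(text)))
    # Close whatever is still open.
    while stack:
        out.append("</li></ul>")
        stack.pop()
    return "\n".join(out)


toc_html = render_toc([
    (2, 1, "This is a sub-header"),
    (3, 2, "This is a sub-sub-header"),
    (2, 3, "This is a sub-header"),
])
print(toc_html)
```

The stack keeps track of which levels are open, so the same function should cope with h4 and deeper without extra bookkeeping variables.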

score 1 · Accepted Answer · answered Mar 25 '10 at 11:24

Use lxml.html.

It can deal with invalid HTML just fine.
It is very fast.
It allows you to easily create the missing elements and move elements around between the trees.

nosklo

Okay I've got it parsed but how do I output? `lxml.html.tostring(doc)` wants to encase the whole thing in a `...` construct, which is no good for me as this content is going back into an existing document. – Oli Mar 25 '10 at 12:08
@Oli: Well, it doesn't do that. You're probably generating the entire document in `doc`. `lxml.html.tostring(E.A(href='foo.bar'))` returns only `'<a href="foo.bar"></a>'`. Perhaps you want to show your code, in another question? – nosklo Mar 26 '10 at 20:38
Sorry I thought I posted an update. I got around it by wrapping my code fragment in a `<div>` and parsing it as a fragment instead of a document. When I `tostring`ed it that time, it worked perfectly and I just used a slice to remove the div from the string. – Oli Mar 26 '10 at 21:43
@Oli: You can `tostring` just some element, you know, then it will `tostring` the element and its subelements. – nosklo Mar 27 '10 at 06:31
lxml is at https://lxml.de/ now. Debian/Ubuntu package is `python3-lxml`. – Alexey Vazhnov Sep 10 '22 at 19:08
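
The comments above (parse the content as a fragment inside a wrapper div, then `tostring` individual elements) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `add_header_ids` and the `toc-N` id scheme are made up here, and it relies on `lxml.html.fragment_fromstring` with `create_parent` to supply the wrapper.

```python
import lxml.html


def add_header_ids(fragment_html):
    """Give each <h2>/<h3> an id; return (toc_entries, new_html)."""
    # create_parent wraps the fragment in a <div>, so loose top-level
    # elements (and any leading text) have a single root to live under.
    root = lxml.html.fragment_fromstring(fragment_html, create_parent="div")

    entries = []
    for number, header in enumerate(root.iter("h2", "h3"), start=1):
        header_id = "toc-%d" % number  # hypothetical id scheme
        header.set("id", header_id)
        entries.append((header.tag, header_id, header.text_content()))

    # Serialize the children rather than the wrapper <div>, so nothing
    # has to be sliced off the resulting string afterwards.
    parts = [root.text or ""]
    parts += [lxml.html.tostring(child, encoding="unicode") for child in root]
    return entries, "".join(parts)


entries, new_html = add_header_ids(
    "<p>Intro</p>\n<h2>First</h2>\n<h3>Sub</h3>"
)
print(entries)
print(new_html)
```

Serializing each child separately keeps the text between elements too, because `tostring` includes an element's tail text by default.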

score 1 · Answer 3 · answered Jul 30 '18 at 15:14

I have come up with an extended version of the solution proposed by Łukasz. It handles headers from h2 down to h6.

from bs4 import BeautifulSoup
# Assumption: the original does not show where slugify comes from.
# python-slugify is used here; django.utils.text.slugify also works.
from slugify import slugify


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#{}">{}</a></li>'.format(item[0], item[1]))
    result.append("</ul>")
    return "\n".join(result)

# `article` is the HTML content to process
soup = BeautifulSoup(article, 'html5lib')

toc = []
# Index of the most recent entry at each level within its parent list.
# Assumes header levels are never skipped (e.g. no h4 directly under an h2).
h2_prev = 0
h3_prev = 0
h4_prev = 0
h5_prev = 0

for header in soup.find_all(['h2', 'h3', 'h4', 'h5', 'h6']):
    title = header.get_text()
    slug = slugify(title)
    # Set the id on the header so the TOC link has a target
    header['id'] = slug
    data = [(slug, title)]

    if header.name == "h2":
        toc.append(data)
        h3_prev = 0
        h4_prev = 0
        h5_prev = 0
        h2_prev = len(toc) - 1
    elif header.name == "h3":
        toc[h2_prev].append(data)
        h3_prev = len(toc[h2_prev]) - 1
    elif header.name == "h4":
        toc[h2_prev][h3_prev].append(data)
        h4_prev = len(toc[h2_prev][h3_prev]) - 1
    elif header.name == "h5":
        toc[h2_prev][h3_prev][h4_prev].append(data)
        h5_prev = len(toc[h2_prev][h3_prev][h4_prev]) - 1
    elif header.name == "h6":
        toc[h2_prev][h3_prev][h4_prev][h5_prev].append(data)

toc_html = list_to_html(toc)
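
One thing to watch with slug-based ids: the sample in the question has two headers with identical text, so both would get the same slug and the second TOC link would jump to the first header. Here is a minimal sketch of one way to de-duplicate, using only the standard library. The `simple_slug` and `unique_slug` helpers are hypothetical stand-ins written for this example, not the `slugify` used in the answer.

```python
import re


def simple_slug(text):
    """Lowercase the text and join runs of letters/digits with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def unique_slug(text, seen):
    """Return a slug not yet in `seen`, adding -2, -3, ... on collisions."""
    base = simple_slug(text) or "section"
    slug = base
    counter = 2
    while slug in seen:
        slug = "%s-%d" % (base, counter)
        counter += 1
    seen.add(slug)
    return slug


seen = set()
titles = [
    "This is a sub-header",
    "This is a sub-sub-header",
    "This is a sub-header",
]
slugs = [unique_slug(title, seen) for title in titles]
print(slugs)
# ['this-is-a-sub-header', 'this-is-a-sub-sub-header', 'this-is-a-sub-header-2']
```

Passing the `seen` set in explicitly keeps the helper free of global state, so one set can be used per document.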

score 0 · Answer 4 · edited May 23 '17 at 11:53

How do I generate a table of contents for HTML text in Python?

But I think you are on the right track and reinventing the wheel will be fun.

answered Mar 25 '10 at 12:08

Pratik Deoghare
