I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.

My plan so far was to:

  • Extract a list of headers using BeautifulSoup

  • Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents). There might be a method for replacing inside BeautifulSoup?

  • Output a nested list of links to the headers in a predefined spot.

It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.

Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?

An example:

<p>This is an introduction</p>

<h2>This is a sub-header</h2>
<p>...</p>

<h3>This is a sub-sub-header</h3>
<p>...</p>

<h2>This is a sub-header</h2>
<p>...</p>
Oli
  • Note: lxml isn't an option as the document is not a whole XML document and also may not be otherwise valid under all circumstances. – Oli Mar 25 '10 at 11:17
  • Can you provide an example of the documents in your question? – nosklo Mar 25 '10 at 11:19
  • "Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents)" this probably should be: add ids to each header tag so that they can be referenced using #id anchor tags, right? – Łukasz Mar 25 '10 at 11:21
  • @Łukasz Possibly. I thought IE had problems with giving random things IDs like that but no, you might be right. – Oli Mar 25 '10 at 11:29
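
On the id suggestion above: the example in the question has two headers with the same text, so ids derived from the header text would need de-duplicating. Here is a minimal sketch using only the standard library. `make_slug` and `unique_ids` are hypothetical helper names (not part of BeautifulSoup or lxml), and the `-2`, `-3` suffix scheme is just one assumed way to resolve clashes.

```python
import re
from collections import Counter


def make_slug(text):
    """Lowercase the text and collapse runs of non-alphanumerics into '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "section"  # fall back when nothing usable is left


def unique_ids(titles):
    """Return one id per title, adding -2, -3, ... to repeated slugs."""
    seen = Counter()
    ids = []
    for title in titles:
        slug = make_slug(title)
        seen[slug] += 1
        ids.append(slug if seen[slug] == 1 else "%s-%d" % (slug, seen[slug]))
    return ids


titles = ["This is a sub-header", "This is a sub-sub-header", "This is a sub-header"]
print(unique_ids(titles))
```

Each id can then be set on the matching header tag and used as the `#id` target of the link in the table of contents.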

4 Answers

Some quickly hacked ugly piece of code:

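from bs4 import BeautifulSoup

# html is the fragment to process (a string)
soup = BeautifulSoup(html, 'html.parser')

toc = []
header_id = 1
current_list = toc
previous_tag = None

for header in soup.find_all(['h2', 'h3']):
    # Give each header an id so the table of contents can link to it
    header['id'] = str(header_id)

    if previous_tag == 'h2' and header.name == 'h3':
        # Start a sub-list for the h3 headers under the current h2
        current_list = []
    elif previous_tag == 'h3' and header.name == 'h2':
        # Back at the top level: store the finished sub-list
        toc.append(current_list)
        current_list = toc

    # get_text() also works when the header contains nested tags
    current_list.append((header_id, header.get_text()))

    header_id += 1
    previous_tag = header.name

# Store a sub-list that is still open at the end of the content
if current_list is not toc:
    toc.append(current_list)


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#%s">%s</a></li>' % item)
    result.append("</ul>")
    return "\n".join(result)

# Table of contents
print(list_to_html(toc))

# Modified HTML
print(soup)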
Łukasz

Use lxml.html.
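
A minimal sketch of how this could look, assuming lxml is installed. `fragment_fromstring(..., create_parent="div")` parses the content as a fragment rather than a whole document. The function name `add_header_ids` and the `section-N` id scheme are made up for illustration.

```python
import lxml.html


def add_header_ids(fragment):
    """Give every h2/h3 an id; return (entries, modified_html).

    entries is a flat list of (level, id, text) tuples in document order.
    """
    # The temporary <div> parent lets lxml accept content with several
    # top-level elements without adding <html> or <body>.
    root = lxml.html.fragment_fromstring(fragment, create_parent="div")

    entries = []
    for number, header in enumerate(root.iter("h2", "h3"), start=1):
        anchor = "section-%d" % number  # hypothetical id scheme
        header.set("id", anchor)
        entries.append((int(header.tag[1]), anchor, header.text_content()))

    # Serialize only the children of the wrapper, so the <div> is not output.
    body = (root.text or "") + "".join(
        lxml.html.tostring(child, encoding="unicode") for child in root
    )
    return entries, body


fragment = """<p>This is an introduction</p>

<h2>This is a sub-header</h2>
<p>...</p>

<h3>This is a sub-sub-header</h3>
<p>...</p>

<h2>This is a sub-header</h2>
<p>...</p>
"""

entries, body = add_header_ids(fragment)
for level, anchor, text in entries:
    print(level, anchor, text)
print(body)
```

The `entries` list still has to be turned into nested `<ul>` markup; the sketch only covers parsing, adding ids and writing the fragment back out.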

nosklo
  • Okay I've got it parsed but how do I output? `lxml.html.tostring(doc)` wants to encase the whole thing in a `...` construct, which is no good for me as this content is going back into an existing document. – Oli Mar 25 '10 at 12:08
  • @Oli: Well, it doesn't do that. You're probably generating the entire document in `doc`. `lxml.html.tostring(E.A(href='foo.bar'))` returns only `'<a href="foo.bar"></a>'`. Perhaps you want to show your code, in another question? – nosklo Mar 26 '10 at 20:38
  • Sorry, I thought I posted an update. I got around it by wrapping my code fragment in a `<div>` and parsing it as a fragment instead of a document. When I `tostring`ed it that time, it worked perfectly and I just used a slice to remove the div from the string. – Oli Mar 26 '10 at 21:43
  • @Oli: You can `tostring` just some element, you know, then it will `tostring` the element and its subelements. – nosklo Mar 27 '10 at 06:31
  • lxml is at https://lxml.de/ now. Debian/Ubuntu package is `python3-lxml`. – Alexey Vazhnov Sep 10 '22 at 19:08

I have come up with an extended version of the solution proposed by Łukasz.

from bs4 import BeautifulSoup
# Assumption: slugify comes from the python-slugify package; any function
# that turns text into a URL-safe string (e.g. Django's slugify) would do.
from slugify import slugify


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#{}">{}</a></li>'.format(item[0], item[1]))
    result.append("</ul>")
    return "\n".join(result)

# article is the HTML content to process (a string)
soup = BeautifulSoup(article, 'html5lib')

toc = []
# Index of the most recent header at each level within its parent list
h2_prev = 0
h3_prev = 0
h4_prev = 0
h5_prev = 0

for header in soup.find_all(['h2', 'h3', 'h4', 'h5', 'h6']):
    text = header.get_text()
    slug = slugify(text)
    # Set the id that the table-of-contents link will target
    header['id'] = slug
    data = [(slug, text)]

    if header.name == "h2":
        toc.append(data)
        h3_prev = 0
        h4_prev = 0
        h5_prev = 0
        h2_prev = len(toc) - 1
    elif header.name == "h3":
        toc[h2_prev].append(data)
        h3_prev = len(toc[h2_prev]) - 1
    elif header.name == "h4":
        toc[h2_prev][h3_prev].append(data)
        h4_prev = len(toc[h2_prev][h3_prev]) - 1
    elif header.name == "h5":
        toc[h2_prev][h3_prev][h4_prev].append(data)
        h5_prev = len(toc[h2_prev][h3_prev][h4_prev]) - 1
    elif header.name == "h6":
        toc[h2_prev][h3_prev][h4_prev][h5_prev].append(data)

toc_html = list_to_html(toc)
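
For comparison, the nesting can also be tracked with a stack of open levels instead of one index per header level. This is only a sketch under assumptions: `render_toc` is a hypothetical name, and it expects a flat list of `(level, id, title)` tuples that has already been extracted from the headers. Unlike `list_to_html`, it puts each sub-list inside the `<li>` of its parent header.

```python
from html import escape


def render_toc(entries):
    """Render (level, anchor_id, title) tuples as nested <ul> lists."""
    out = []
    stack = []  # header levels that currently have an open <ul>
    for level, anchor, title in entries:
        # Moving back up: close the open item and its list for each deeper level
        while stack and level < stack[-1]:
            out.append("</li></ul>")
            stack.pop()
        if stack and level == stack[-1]:
            out.append("</li>")  # same level: close the previous sibling
        else:
            out.append("<ul>")   # first entry, or deeper than the open list
            stack.append(level)
        out.append('<li><a href="#%s">%s</a>' % (escape(anchor), escape(title)))
    # Close whatever is still open at the end
    while stack:
        out.append("</li></ul>")
        stack.pop()
    return "\n".join(out)


entries = [
    (2, "first", "This is a sub-header"),
    (3, "nested", "This is a sub-sub-header"),
    (2, "second", "This is a sub-header"),
]
print(render_toc(entries))
```

Because the stack only stores levels, the same function handles h2 through h6 without a separate branch per level.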
Sergiu Vlad

How do I generate a table of contents for HTML text in Python?

But I think you are on the right track and reinventing the wheel will be fun.

Pratik Deoghare