Get table of contents from a PDF with Python

Question

I'm trying to get the table of contents (ToC) from a PDF using PyMuPDF. However, it only extracts the ToC if the PDF contains bookmarks; otherwise it returns an empty list.

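import fitz  # PyMuPDF

def get_Table_Of_Contents(doc):
    # get_toc() (named getToC() in older PyMuPDF releases) reads the PDF's bookmarks
    toc = doc.get_toc()
    return toc

file = fitz.open("your_file.pdf")  # placeholder path, replace with your own PDF
toc = get_Table_Of_Contents(file)
print(toc)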

I am facing the same scenario. Did you find any approach that extracts the ToC when the PDF doesn't contain bookmarks? - user3734568, May 25 '21 at 19:11

score -1 · Answer 1 · answered Nov 06 '20 at 02:14

Convert the PDF to HTML using a PDF-to-HTML converter. You can then parse the HTML to extract whatever data you want using a parser such as BeautifulSoup.
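
As a rough sketch of the parsing step, the snippet below pulls the visible text lines out of converter output. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The sample HTML, the TextLineCollector class, and the assumption that each ToC entry sits in its own block element are all hypothetical, since real converter output varies widely.

```python
from html.parser import HTMLParser


class TextLineCollector(HTMLParser):
    """Collect the text of each block-level element as one line (hypothetical helper)."""

    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._parts = []

    def _flush(self):
        # Join buffered text, collapse whitespace, and keep non-empty lines only.
        text = " ".join("".join(self._parts).split())
        if text:
            self.lines.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._parts.append(data)


def html_to_lines(html):
    parser = TextLineCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside block elements
    return parser.lines


# Made-up sample of what a converter might emit for a ToC page.
sample_html = """
<div><p>Contents</p>
<p>1 Introduction <span>3</span></p>
<p>2 Methods <span>7</span></p></div>
"""
lines = html_to_lines(sample_html)
print(lines)
```

The resulting list of lines can then be filtered for entries that look like ToC rows, for example lines ending in a page number.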

Anjaly Vijayan


score -1 · Answer 2 · answered Nov 14 '20 at 01:40

Usually the TOC is represented as regular text on a page.

Try pdfreader to extract the text and/or the PDF "markdown".

Here is sample code that extracts both from a page:

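from pdfreader import SimplePDFViewer

your_pdf_file_name = "your_file.pdf"  # placeholder: path to your PDF
toc_page_number = 2                   # placeholder: page that holds the TOC

with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)

    # navigate to the TOC page
    viewer.navigate(toc_page_number)

    # render the page so that the canvas is populated
    viewer.render()
    pdf_markdown = viewer.canvas.text_content
    plain_text = "".join(viewer.canvas.strings)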

Then you can parse plain_text or pdf_markdown as regular strings.
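
A minimal sketch of that parsing step, assuming the extracted text has already been split into one TOC entry per line (this depends on the PDF and is not guaranteed). The TOC_LINE pattern, the parse_toc_lines function, and the sample lines are hypothetical. The output is shaped like the [level, title, page] entries that PyMuPDF returns for bookmarked PDFs, with the nesting level guessed from the section number.

```python
import re

# Matches rows such as "1.2 Background ........ 5": an optional section number,
# a title, then dot leaders or whitespace, then a trailing page number.
TOC_LINE = re.compile(
    r"^\s*(?:(?P<number>\d+(?:\.\d+)*)\.?\s+)?"
    r"(?P<title>\S.*?)"
    r"(?:\s*\.{2,}\s*|\s+)"
    r"(?P<page>\d+)\s*$"
)


def parse_toc_lines(lines):
    """Return [level, title, page] entries for lines that look like TOC rows."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line)
        if not match:
            continue  # skip headings such as "Contents" and unrelated text
        number = match.group("number")
        # "1" is level 1, "1.1" is level 2, and so on; unnumbered rows get level 1.
        level = number.count(".") + 1 if number else 1
        entries.append([level, match.group("title"), int(match.group("page"))])
    return entries


# Made-up sample lines standing in for text extracted from a TOC page.
sample_lines = [
    "Contents",
    "1 Introduction ........ 3",
    "1.1 Background ........ 5",
    "2 Methods 7",
]
toc = parse_toc_lines(sample_lines)
print(toc)
```

With pdfreader, viewer.canvas.strings is already a list, but whether each element is a whole TOC row depends on how the PDF was produced, so some regrouping may be needed before a pattern like this applies.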
