
I'm trying to get the table of contents (ToC) from a PDF using PyMuPDF. However, it only extracts the ToC if the PDF contains bookmarks; otherwise it returns an empty list.

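import fitz  # PyMuPDF

def get_Table_Of_Contents(doc):
    # get_toc() (spelled getToC() in older PyMuPDF releases) reads the PDF outline, i.e. the bookmarks
    return doc.get_toc()

file = fitz.open("your_file.pdf")  # placeholder path
toc = get_Table_Of_Contents(file)
print(toc)  # [] when the PDF has no bookmarks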
snwflk
sheshank
  • I am facing the same scenario. Did you find an approach that extracts the ToC when the PDF doesn't contain bookmarks? – user3734568 May 25 '21 at 19:11

2 Answers


Convert the PDF to HTML with a PDF-to-HTML converter. You can then parse the HTML to extract whatever data you want with a parser such as BeautifulSoup.
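As a rough sketch of the parsing step, the snippet below pulls one text line per block-level element out of an HTML string. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The names TextLineExtractor and html_to_lines are hypothetical, and sample_html is a made-up stand-in: real converter output will likely be structured differently.

```python
from html.parser import HTMLParser


class TextLineExtractor(HTMLParser):
    """Collect the visible text of each block-level element as one line."""

    # Tags treated as line boundaries (an assumption about the converter's output).
    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = []

    def flush(self):
        # Join buffered text, collapse whitespace, and store non-empty lines.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def html_to_lines(html):
    parser = TextLineExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text that follows the last block tag
    return parser.lines


# Made-up sample, standing in for converted PDF output.
sample_html = (
    "<div><p>1 Introduction <span>3</span></p>"
    "<p>2 Methods <span>7</span></p></div>"
)
print(html_to_lines(sample_html))
```

The resulting lines can then be filtered for entries that look like ToC rows, for example those ending in a page number.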

Anjaly Vijayan

Usually the TOC is represented as regular text on a page.

Try pdfreader to extract the page's text and/or its PDF "markdown".

Here is sample code that extracts both from a page:

from pdfreader import SimplePDFViewer

# your_pdf_file_name and toc_page_number are placeholders to fill in
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)

    # navigate to the page that holds the TOC
    viewer.navigate(toc_page_number)

    # render the page so the canvas gets populated
    viewer.render()
    pdf_markdown = viewer.canvas.text_content
    plain_text = "".join(viewer.canvas.strings)

Then you can parse plain_text or pdf_markdown as regular strings.
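For illustration, here is one way the string parsing might look. This is only a sketch: it assumes the extracted text has one ToC entry per line in the form "section number, title, optional dot leaders, page number", which depends entirely on the PDF (text joined without separators would first need to be split into lines). The names TOC_LINE and parse_toc_text are hypothetical, and the sample text is made up. Entries are returned as [level, title, page] lists, the same shape PyMuPDF uses for bookmark-based results.

```python
import re

# Matches lines such as "1.2 Background ........ 14":
# optional section number, title, optional dot leaders, page number.
TOC_LINE = re.compile(
    r"^\s*(?:(?P<number>\d+(?:\.\d+)*)\.?\s+)?"  # optional "1", "1.2", "1." ...
    r"(?P<title>\S.*?)"                          # title, matched lazily
    r"[\s.]+"                                    # spaces and/or dot leaders
    r"(?P<page>\d+)\s*$"                         # trailing page number
)


def parse_toc_text(text):
    """Return [level, title, page] for every line that looks like a ToC entry."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if not match:
            continue  # headings like "Contents" have no page number
        number = match.group("number")
        # Nesting level is inferred from the section number: "1.1" -> level 2.
        level = number.count(".") + 1 if number else 1
        entries.append([level, match.group("title"), int(match.group("page"))])
    return entries


# Made-up sample text standing in for plain_text.
sample = """Contents
1 Introduction ........ 3
1.1 Background ........ 4
2 Methods 7
"""

for entry in parse_toc_text(sample):
    print(entry)
```

A pattern like this will misread body lines that happen to end in a number, so it is best applied only to the pages known to hold the ToC.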

Maksym Polshcha