1

I am trying to extract the TOC/outlines from PDFs, together with their page numbers, using Python (PyPDF2). I am aware of reader.outlines, but it does not return the correct page numbers.

PDF example: https://www.annualreports.com/HostedData/AnnualReportArchive/l/NASDAQ_LOGM_2018.pdf

The output of reader.outlines is:

[{'/Title': '2018 Highlights', '/Page': IndirectObject(5, 0), '/Type': '/Fit'},
{'/Title': 'Letter to Stockholders', '/Page': IndirectObject(6, 0), '/Type': '/Fit'}, 
...
{'/Title': 'Part I', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, 
[{'/Title': 'Item 1. Business', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, 
{'/Title': 'Item 1A. Risk Factors', '/Page': IndirectObject(19, 0), '/Type': '/Fit'}
...

For instance, Part I was not expected to begin at page 10. Am I missing something? Does anyone have an alternative?

I've tried PyMuPDF, Tabula and the getDestinationPageNumber method, with no luck.

Thank you in advance.

Marrluxia
  • @KJ I just read the PDF using PdfFileReader (from PyPDF2) and printed the outlines, which is why it seemed strange to me. – Marrluxia Jul 16 '21 at 17:13

3 Answers

1

Martin Thoma's answer is exactly what I needed (PyMuPDF). Diblo Dk's answer is an interesting workaround as well (PyPDF2).

Here is Martin Thoma's code, quoted exactly (note that newer PyMuPDF releases name the method get_toc() instead of getToC()):

from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        toc = doc.getToC()  # [[lvl, title, page, …], …]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))
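
As the warning in the code notes, a dict keyed by page keeps only the last bookmark on each page. Below is a minimal sketch of one way around that, assuming a recent PyMuPDF where the method is called get_toc(). The helper names group_toc_by_page and get_bookmarks_multi are made up for illustration and are not part of the quoted answer.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_toc_by_page(toc: Sequence[Sequence]) -> Dict[int, List[str]]:
    """Collect every bookmark title per page instead of overwriting."""
    pages = defaultdict(list)
    # Each entry starts with [level, title, page]; extra fields are ignored.
    for _level, title, page, *_rest in toc:
        pages[page].append(title)
    return dict(pages)


def get_bookmarks_multi(filepath: str) -> Dict[int, List[str]]:
    # Assumption: PyMuPDF is installed and exposes Document.get_toc().
    import fitz  # pip install pymupdf

    with fitz.open(filepath) as doc:
        return group_toc_by_page(doc.get_toc())
```

Calling get_bookmarks_multi("my.pdf") should then return a list of titles for each page, so that a section heading and its first sub-item sharing a page are both kept.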
Marrluxia
0

You should refer to "PDF outlines and their Page Number". The code below maps each outline entry to its page number:

from PyPDF2 import PdfFileReader

targetPDFFile = 'your_pdf_filename.pdf'
pdfFileObj = open(targetPDFFile, 'rb')
pdfReader = PdfFileReader(pdfFileObj)
# use outlines instead of bookmarks; outlines are more accurate
result = {}
def outline_dict(bookmark_list):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            outline_dict(item)
        else:
            try:
                pageNum = pdfReader.getDestinationPageNumber(item) + 1
                # print("key=" + str(pageNum) + ",title=" + item.title)
                # an item on the same page replaces the earlier one
                result[pageNum] = item.title
            except Exception:
                # pass item as a separate argument; "str" + item would raise TypeError
                print("except:", item)

outline_dict(pdfReader.getOutlines())
print(result)
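
For context, the '/Page' value shown by reader.outlines is a reference to a page object, not a page index, which is why it has to be resolved through something like getDestinationPageNumber. The sketch below is a variation on the recursion above that keeps every entry, including several on the same page, along with its nesting level. The function name flatten_outline and the resolve_page parameter are hypothetical, and each outline entry is assumed to be dict-like with a '/Title' key, as in the output printed in the question.

```python
from typing import Any, Callable, Iterator, List, Tuple


def flatten_outline(
    outline: List[Any],
    resolve_page: Callable[[Any], int],
    level: int = 1,
) -> Iterator[Tuple[int, str, int]]:
    """Yield (level, title, 1-based page) for every entry of a nested outline."""
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the preceding entry.
            yield from flatten_outline(item, resolve_page, level + 1)
        else:
            # resolve_page is expected to return a 0-based page index.
            yield level, item["/Title"], resolve_page(item) + 1


# Possible usage with the reader object from the code above (PyPDF2 1.x names):
# entries = list(flatten_outline(pdfReader.getOutlines(),
#                                pdfReader.getDestinationPageNumber))
```

Because the result is a list of tuples rather than a dict keyed by page, no entry is lost when two titles point at the same page.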
K J
-1

Check out the package called Tabula. It makes extracting tables really easy, and it has options for extracting the content of tables that extend over multiple pages.

Here is a link worth checking out: https://towardsdatascience.com/scraping-table-data-from-pdf-files-using-a-single-line-in-python-8607880c750