2

I am trying to read a PDF document using Python with the PyPDF2 package. The objective is to read all the bookmarks in the PDF and construct a dictionary with the page numbers of the bookmarks as keys and the titles of the bookmarks as values.

There is not much support on the internet on how to achieve this, except for this article. The code posted in it doesn't work, and I am not enough of a Python expert to correct it. PyPDF2's reader object has a property named outlines, which gives you a list of all bookmark objects, but there are no page numbers for the bookmarks, and traversing the list is a little difficult because there are no parent/child relationships between bookmarks.

I am sharing below my code to read a pdf document and inspect outlines property.

import PyPDF2

reader = PyPDF2.PdfFileReader('SomeDocument.pdf')

print(reader.numPages)
print(reader.outlines[1][1])
Martin Thoma
mdowes

4 Answers

7

The parent/child relationships are preserved by having the lists nested in each other. This sample code will display bookmarks recursively as an indented table of contents:

import PyPDF2


def show_tree(bookmark_list, indent=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call with increased indentation
            show_tree(item, indent + 4)
        else:
            print(" " * indent + item.title)


reader = PyPDF2.PdfFileReader("[your filename]")

show_tree(reader.getOutlines())
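
To see how the nesting encodes parent/child relationships without needing a PDF at hand, here is a minimal sketch that walks the same list-of-lists shape. The `Bookmark` class and `flatten_outline` function are hypothetical stand-ins for illustration only; real outline items are PyPDF2 Destination objects, of which only `.title` is used here.

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    """Hypothetical stand-in for an outline item; only .title is used."""
    title: str


def flatten_outline(bookmark_list, level=0):
    """Return (level, title) pairs in document order."""
    flat = []
    for item in bookmark_list:
        if isinstance(item, list):
            # A nested list holds the children of the bookmark before it.
            flat.extend(flatten_outline(item, level + 1))
        else:
            flat.append((level, item.title))
    return flat


# Same shape as an outline: children follow their parent as a nested list.
outline = [
    Bookmark("Chapter 1"),
    [Bookmark("Section 1.1"), Bookmark("Section 1.2")],
    Bookmark("Chapter 2"),
]

for level, title in flatten_outline(outline):
    print("    " * level + title)
```

Keeping the level as a number rather than printing immediately makes it easier to reuse the result later, for example to filter only top-level bookmarks.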

I don't know how to retrieve the page numbers. I tried with a few files, and the page attribute of a Destination object is always an instance of IndirectObject, which doesn't seem to contain any information about page number.

UPDATE:

There is a getDestinationPageNumber method to get page numbers from Destination objects. Modified code to create your desired dictionary:

import PyPDF2


def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            result[reader.getDestinationPageNumber(item)] = item.title
    return result


reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))

However, note that you will overwrite and lose some values if there are multiple bookmarks on the same page (dictionary keys must be unique).
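
One possible way around the overwriting is to keep a list of titles per page instead of a single title. The sketch below is only an illustration: `bookmarks_by_page` is a name made up here, and the page lookup is passed in as a callable (with PyPDF2 you would presumably pass `reader.getDestinationPageNumber`). The demo uses stand-in objects with a plain integer `page` attribute, which real Destination objects do not have.

```python
from collections import defaultdict
from types import SimpleNamespace


def bookmarks_by_page(bookmark_list, get_page_number):
    """Map each page number to the list of bookmark titles on that page.

    get_page_number: callable turning a bookmark into a page number
    (assumed wiring: reader.getDestinationPageNumber).
    """
    result = defaultdict(list)
    for item in bookmark_list:
        if isinstance(item, list):
            # Merge the children's titles into the lists collected so far.
            nested = bookmarks_by_page(item, get_page_number)
            for page, titles in nested.items():
                result[page].extend(titles)
        else:
            result[get_page_number(item)].append(item.title)
    return dict(result)


# Stand-in bookmarks for the demo; "page" is a hypothetical integer field.
def bm(title, page):
    return SimpleNamespace(title=title, page=page)


outline = [
    bm("Cover", 0),
    bm("Chapter 1", 3),
    [bm("Section 1.1", 3), bm("Section 1.2", 5)],
]

grouped = bookmarks_by_page(outline, lambda b: b.page)
print(grouped)
```

Here "Chapter 1" and "Section 1.1" both land on page 3 and both survive, whereas the plain dictionary would keep only the last one.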

mportes
  • 2
    Notice that most methods are now deprecated and have been renamed, but the error traceback tells you how to correct them; see for example the doc of [`outline`](https://pypdf2.readthedocs.io/en/3.0.0/modules/PdfReader.html#PyPDF2.PdfReader.outline) – cards Jan 30 '23 at 22:40
  • Use the bookmark's name as the key (which is supposed to be unique). To retrieve the page number, use `item.page`. Then add the entries to the dictionary with `result[item.title] = item.page` – cards Jan 30 '23 at 22:50
  • UPDATE : `outline = reader.outline` `show_tree = show_tree(outline)` `print(show_tree)` – i2_ Jun 05 '23 at 14:58
  • an update for deprecated functions of PyPDF2: def bookmark_dict(bookmark_list): result = [] for item in bookmark_list: if isinstance(item, list): # recursive call result.extend(bookmark_dict(item)) else: result.append({'PageNumber': reader.get_destination_page_number(item), 'Title': item.title}) return result – Omer RVU Jun 28 '23 at 07:22
6

edit: PyPDF2 is not dead! I'm the new maintainer.

edit: PyPDF2 moved to pypdf. I'm now also the maintainer of that project.

Using pypdf

This is an updated / improved version of mportes' answer:

from typing import Dict, Union

from pypdf import PdfReader


def bookmark_dict(
    bookmark_list, use_labels: bool = False
) -> Dict[Union[str, int], str]:
    """
    Extract all bookmarks as a flat dictionary.

    Args:
        bookmark_list: The reader.outline or a recursive call
        use_labels: If true, use page labels. If False, use page indices.

    Returns:
        A dictionary mapping page labels (or page indices) to their title

    Examples:
        Download the PDF from https://zenodo.org/record/50395 to give it a try
    """
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item, use_labels))
        else:
            page_index = reader.get_destination_page_number(item)
            if use_labels:
                # page_labels requires pypdf>=3.3.0
                result[reader.page_labels[page_index]] = item.title
            else:
                result[page_index] = item.title
    return result


if __name__ == "__main__":
    reader = PdfReader("GeoTopo-A5.pdf")
    bms = bookmark_dict(reader.outline, use_labels=True)

    for page_nb, title in sorted(bms.items(), key=lambda n: f"{str(n[0]):>5}"):
        print(f"{page_nb:>3}: {title}")

My old answer

Here is how you do it with PyMuPDF and type annotations:

from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        toc = doc.get_toc()  # [[lvl, title, page], …]; getToC is deprecated
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))
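
Since the table of contents comes back as flat `[lvl, title, page]` rows, the hierarchy can be rebuilt from the level column, and keeping a list of rows avoids the one-title-per-page problem mentioned in the warning. A minimal sketch, assuming levels start at 1 and never jump by more than one; `toc_paths` is a hypothetical helper and the rows below are made-up sample data, not output from a real PDF:

```python
def toc_paths(toc):
    """Turn [level, title, page] rows into (path, page) pairs."""
    stack = []  # titles of the current chain of ancestors
    paths = []
    for level, title, page in toc:
        # Drop everything at this level or deeper, then push this title.
        del stack[level - 1:]
        stack.append(title)
        paths.append((" > ".join(stack), page))
    return paths


# Made-up rows in the same shape as the table of contents above.
sample_toc = [
    [1, "Chapter 1", 1],
    [2, "Section 1.1", 1],
    [2, "Section 1.2", 3],
    [1, "Chapter 2", 7],
]

for path, page in toc_paths(sample_toc):
    print(f"{page:>3}: {path}")
```

Because the result is a list of pairs, two bookmarks on the same page (like "Chapter 1" and "Section 1.1" here) are both kept.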
Martin Thoma
  • 1
    `getToC` is now deprecated -> [`get_toc`](https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_toc) – cards Jan 30 '23 at 23:43
  • 1
    quite fun, I used PyPDF2 to edit a pdf with bookmarks and I am testing if everything is right. With PyPDF2 (answer of _mportes_) it shows them properly but with `fitz` `get_toc` returns an empty list – cards Jan 30 '23 at 23:51
  • 1
    @cards Thanks for the hint. I've just created an updated version of mportes answer. You should use `pypdf` and not `PyPDF2` – Martin Thoma Jan 31 '23 at 12:37
  • I wasn't aware of it! I found very confusing all the `PyPDFX` changing in the last years... Thanks anyway for your work! Here the link with [history](https://pypdf.readthedocs.io/en/latest/meta/history.html) of the project available in the doc – cards Jan 31 '23 at 16:38
  • And how can you extract the content of each page? – misterkandle Feb 09 '23 at 11:33
  • `page.extract_text()` – Martin Thoma Feb 09 '23 at 11:37
  • AttributeError: 'PdfReader' object has no attribute 'page_labels' – Omer RVU Jun 28 '23 at 07:27
  • Are you using `pypdf>=3.3.0`? We are at `pypdf==3.11.1` – Martin Thoma Jun 28 '23 at 13:00
  • 1
    You also just introduced me to F-string formatting. Thanks. – Ninga Jul 29 '23 at 06:21
1

@myrmica provides the correct answer. The function needs some additional error handling to handle a situation where a bookmark is defective. I've also added 1 to the page numbers because they are zero-based.

import PyPDF2

def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            try:
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                # skip defective bookmarks whose page cannot be resolved
                pass
    return result

reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))
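
Silently passing over defective bookmarks can hide problems, so a variation is to return the skipped titles alongside the dictionary. This is only a sketch: `bookmark_dict_with_skipped` is a made-up name, the page lookup is injected as a callable (assumed to be `reader.getDestinationPageNumber` in practice), and the demo uses stand-in objects whose `page` attribute is a hypothetical integer or `None`.

```python
from types import SimpleNamespace


def bookmark_dict_with_skipped(bookmark_list, get_page_number):
    """Return ({one_based_page: title}, [titles that could not be resolved])."""
    pages = {}
    skipped = []
    for item in bookmark_list:
        if isinstance(item, list):
            # Recurse and merge both the pages and the skipped titles.
            sub_pages, sub_skipped = bookmark_dict_with_skipped(
                item, get_page_number
            )
            pages.update(sub_pages)
            skipped.extend(sub_skipped)
            continue
        try:
            # +1 because the lookup is zero-based.
            page = get_page_number(item) + 1
        except Exception:
            skipped.append(item.title)
            continue
        pages[page] = item.title
    return pages, skipped


# Stand-ins for the demo; a page of None simulates a defective bookmark.
def bm(title, page):
    return SimpleNamespace(title=title, page=page)


outline = [bm("Intro", 0), [bm("Broken", None)], bm("End", 4)]
pages, skipped = bookmark_dict_with_skipped(outline, lambda b: b.page)
print(pages)
print("skipped:", skipped)
```

Printing or logging the skipped list makes it easy to tell a clean PDF from one where some bookmarks were dropped.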
shawmat
0

An update to @mportes' answer, needed because several PyPDF2 functions were deprecated:

import pandas as pd
import PyPDF2


def bookmark_dict(bookmark_list):
    result = []
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.extend(bookmark_dict(item))
        else:
            result.append({'PageNumber': reader.get_destination_page_number(item),
                           'Title': item.title})
    return result


reader = PyPDF2.PdfReader(file)  # file: path to your PDF

bookmarks = bookmark_dict(reader.outline)
bkmarks_df = pd.DataFrame(bookmarks)
print(bkmarks_df)
Omer RVU