How can I extract the TOC with PyPDF2?

Question

Take this PDF as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf:

<outlines>
    <outline level="1" title="1 Introduction">
        <dest>
            <list size="5">
                <ref id="513"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>14</pageno>
    </outline>
    <outline level="1" title="2 Convolutional Neural Networks">
        <dest>
            <list size="5">
                <ref id="554"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>16</pageno>
    </outline>
...

Can I do something similar with PyPDF2?

score 3 · Accepted Answer · answered Jan 08 '18 at 20:15

Found it:

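from PyPDF2 import PdfFileReader

# PyPDF2 < 3.0 API; in 3.x the class is PdfReader and the attribute is reader.outline
with open("1707.09725.pdf", "rb") as f:
    reader = PdfFileReader(f)
    # Nested lists in the result hold the children of the preceding entry
    print(reader.outlines)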

gives:

[{'/Title': '1 Introduction', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(513, 0)},
 {'/Title': '2 Convolutional Neural Networks', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
[{'/Title': '2.1 Linear Image Filters', '/Left': 99.213, '/Type': '/XYZ', '/Top': 486.791, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
 {'/Title': '2.2 CNN Layer Types', '/Left': 70.866, '/Type': '/XYZ', '/Top': 316.852, '/Zoom': ..., '/Page': IndirectObject(580, 0)},
[{'/Title': '2.2.1 Convolutional Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 562.722, '/Zoom': ..., '/Page': IndirectObject(608, 0)},
 {'/Title': '2.2.2 Pooling Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 299.817, '/Zoom': ..., '/Page': IndirectObject(654, 0)},
 {'/Title': '2.2.3 Dropout', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(689, 0)},
 {'/Title': '2.2.4 Normalization Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 193.779, '/Zoom': <PyPDF2.generic.NullObject object at 0x7fbe49d14350>, '/Page': IndirectObject(689, 0)}]
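
In the output above, a nested list holds the sub-entries of the bookmark that precedes it. A minimal sketch of flattening such a structure into (depth, title) pairs follows. It uses plain dicts with a '/Title' key as stand-ins for the PyPDF2 objects; the helper name walk_outline and the sample data are hypothetical, not part of PyPDF2.

```python
def walk_outline(entries, depth=0):
    """Yield (depth, title) pairs from a nested outline list."""
    for entry in entries:
        if isinstance(entry, list):
            # A nested list holds the children of the preceding entry.
            yield from walk_outline(entry, depth + 1)
        else:
            yield depth, entry["/Title"]


# Hypothetical sample shaped like the printed output (only titles kept).
sample = [
    {"/Title": "1 Introduction"},
    {"/Title": "2 Convolutional Neural Networks"},
    [
        {"/Title": "2.1 Linear Image Filters"},
        {"/Title": "2.2 CNN Layer Types"},
        [{"/Title": "2.2.1 Convolutional Layers"}],
    ],
]

for depth, title in walk_outline(sample):
    # Indent each title by its nesting depth.
    print("    " * depth + title)
```

With a real file, walk_outline(reader.outlines) should behave the same way, provided each entry supports the ['/Title'] lookup as the printed dict-like objects suggest.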

To elaborate further, you can use the following to get just the titles and page numbers. Per @shawmat:

import PyPDF2

def bookmark_dict(bookmark_list):
    # Maps 1-based page number -> bookmark title.
    # Bookmarks that share a page overwrite each other.
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # Nested list of child bookmarks: recursive call
            result.update(bookmark_dict(item))
        else:
            try:
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                # Skip bookmarks whose destination cannot be resolved
                pass
    return result

reader = PyPDF2.PdfFileReader("[your filename]")
print(bookmark_dict(reader.getOutlines()))

(comment by Cazforshort, Apr 30 '21 at 11:30)

Gabriel Sandoval · Answer 2 · 2020-08-18T00:06:55.957

score 2

Alternatively, as suggested by this answer, you can use pikepdf:

from pikepdf import Pdf

path = "path/to/file.pdf"

with Pdf.open(path) as pdf:
    outline = pdf.open_outline()
    for title in outline.root:
        print(title)
        for subtitle in title.children:
            print('\t', subtitle)
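
The loop above stops at two levels. Below is a sketch of a recursive variant that handles any depth, assuming each item exposes title and children attributes as the pikepdf outline items in the snippet do. The Item dataclass and the iter_titles helper are hypothetical stand-ins so the example runs without a PDF.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for an outline item (title + children)."""
    title: str
    children: list = field(default_factory=list)


def iter_titles(items, depth=0):
    """Yield (depth, title) for every item, depth-first."""
    for item in items:
        yield depth, item.title
        # Descend into the item's children before moving to its sibling.
        yield from iter_titles(item.children, depth + 1)


toc = [
    Item("1 Introduction"),
    Item("2 Convolutional Neural Networks", [
        Item("2.1 Linear Image Filters"),
        Item("2.2 CNN Layer Types", [Item("2.2.1 Convolutional Layers")]),
    ]),
]

for depth, title in iter_titles(toc):
    print("\t" * depth + title)
```

With pikepdf, calling iter_titles(outline.root) inside the with block would be the analogous use.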

edited Aug 18 '20 at 00:06

answered Aug 17 '20 at 23:49
