I am trying to extract the TOC/outlines from PDFs, together with their page numbers, using Python (PyPDF2). I am aware of reader.outlines,
but it does not return the correct page numbers.
Example PDF: https://www.annualreports.com/HostedData/AnnualReportArchive/l/NASDAQ_LOGM_2018.pdf
The output of reader.outlines is:
[{'/Title': '2018 Highlights', '/Page': IndirectObject(5, 0), '/Type': '/Fit'},
{'/Title': 'Letter to Stockholders', '/Page': IndirectObject(6, 0), '/Type': '/Fit'},
...
{'/Title': 'Part I', '/Page': IndirectObject(10, 0), '/Type': '/Fit'},
[{'/Title': 'Item 1. Business', '/Page': IndirectObject(10, 0), '/Type': '/Fit'},
{'/Title': 'Item 1A. Risk Factors', '/Page': IndirectObject(19, 0), '/Type': '/Fit'}
...
For instance, Part I was not expected to begin at page 10. Am I missing something? Does anyone have an alternative?
I've also tried PyMuPDF, Tabula, and the getDestinationPageNumber method, with no luck.
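To make the getDestinationPageNumber attempt concrete, here is a minimal sketch of that kind of lookup (not necessarily the exact code that was run). The numbers in IndirectObject(10, 0) are a PDF object number and generation number, not a page number, so each outline entry has to be resolved against the reader's page list. The helper names flatten_outline and print_toc are hypothetical. The sketch assumes PyPDF2 2.x naming (PdfReader, get_destination_page_number); older releases spell these PdfFileReader and getDestinationPageNumber, and newer ones expose reader.outline instead of reader.outlines.

```python
def flatten_outline(outline, resolve_page, depth=0):
    """Yield (depth, title, page) for every entry of a nested outline.

    `outline` has the shape shown above: dict-like entries, where a nested
    list holds the children of the entry just before it.
    `resolve_page` maps one entry to a 0-based page index; 1 is added so
    the result matches the page count shown by a PDF viewer.
    """
    for item in outline:
        if isinstance(item, list):
            # A nested list means "children of the previous entry".
            yield from flatten_outline(item, resolve_page, depth + 1)
        else:
            yield depth, item["/Title"], resolve_page(item) + 1


def print_toc(pdf_path):
    # Imported lazily so flatten_outline stays usable without PyPDF2.
    # Assumption: PyPDF2 2.x names; adjust for the installed version.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    entries = flatten_outline(reader.outlines, reader.get_destination_page_number)
    for depth, title, page in entries:
        print("  " * depth + f"{title}: page {page}")
```

Calling print_toc("NASDAQ_LOGM_2018.pdf") on a local copy of the file (the local filename is an assumption) should print one indented line per outline entry. The page it reports is the physical position in the file, which can differ from the page number printed on the page itself.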
Thank you in advance.