I’m using pdfminer.six to extract text from a PDF file. I’ve tried other PDF extractors, but only pdfminer handles the text the way I need.
I want to extract the text from a specific outline (bookmark) that matches a search criterion.
The PDFDocument class has the method get_outlines for extracting outlines. It returns a generator of tuples that contain the outline's level, title, destination, and other information. The "destination" value is a list made of a PDFObjRef class instance and other information.
This is what the data returned from get_outlines looks like:
(...)
(1, 'account information client 20', [PDFObjRef:3918, /'FitH', 36], None, None)
(1, 'account information client 21', [PDFObjRef:3931, /'FitH', 36], None, None)
(...)
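To make the search step concrete, here is a minimal sketch of filtering those tuples by title. The helper name find_outlines and the case-insensitive substring match are assumptions chosen for illustration, not part of pdfminer; the sample entries use plain strings as stand-ins for the real destination objects.

```python
def find_outlines(outlines, needle):
    """Yield the outline entries whose title contains `needle`, ignoring case.

    `outlines` is any iterable of (level, title, dest, action, se) tuples,
    which is the shape shown in the output above.
    """
    needle = needle.casefold()
    for level, title, dest, action, se in outlines:
        if needle in title.casefold():
            yield level, title, dest, action, se


# Stand-in data shaped like the output above (destinations are placeholders).
sample = [
    (1, 'account information client 20', ['ref-3918', 'FitH', 36], None, None),
    (1, 'account information client 21', ['ref-3931', 'FitH', 36], None, None),
]
matches = list(find_outlines(sample, 'Client 21'))
print(matches[0][1])
```

With a real document, the first argument would be doc.get_outlines() instead of the sample list.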
The pdfminer documentation page says, ‘Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page’.
The number of the PDFObjRef in the example above is not a page number: the PDF I used for this example has only 933 pages.
As I said in the beginning, I need to extract the text only from one of the many outlines the PDF file has. With the following snippet, I can create a generator and extract every page in a sequence:
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('samples/simple1.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

result = output_string.getvalue()
But I don’t know how to point to a specific page destination (or a page range between two destinations) and extract the text from only that fragment.
Can someone please help? How do I convert PDFObjRef:3918 and PDFObjRef:3931 to a page number, or how do I extract data from an interval like this?
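One direction that might work, sketched here under assumptions that have not been checked against this particular file: the number in a PDFObjRef appears to be the object id of the page object, and each PDFPage carries that same id in its pageid attribute, so enumerating the pages gives a lookup from object id to page position. The helper names below are hypothetical, and the demo uses stand-in objects with made-up ids instead of a real PDF. Named or action-based destinations are not handled.

```python
from types import SimpleNamespace


def build_pageid_index(pages):
    """Map each page's object id to its zero-based position in the document.

    `pages` is any iterable of objects with a `pageid` attribute, for
    example the pages yielded by PDFPage.create_pages(doc).
    """
    return {page.pageid: index for index, page in enumerate(pages)}


def dest_to_page_index(dest, pageid_to_index):
    """Return the page index for an explicit destination list.

    Assumes dest[0] is a reference-like object with an `objid` attribute.
    """
    return pageid_to_index[dest[0].objid]


def outline_page_range(outlines, title, pageid_to_index, page_count):
    """Return the page indexes covered by the outline titled `title`.

    The range runs from the outline's own page up to (not including) the
    page of the next outline at the same or a higher level. If both start
    on the same page, that single page is returned. Text that spills onto
    the next outline's page is not included.
    """
    entries = list(outlines)
    for pos, (level, entry_title, dest, _action, _se) in enumerate(entries):
        if entry_title != title:
            continue
        start = dest_to_page_index(dest, pageid_to_index)
        end = page_count
        for next_level, _t, next_dest, _a, _s in entries[pos + 1:]:
            if next_level <= level:
                end = dest_to_page_index(next_dest, pageid_to_index)
                break
        return range(start, max(end, start + 1))
    raise KeyError(title)


# Stand-in demo: three fake pages and two fake outline entries.
fake_pages = [SimpleNamespace(pageid=i) for i in (3900, 3918, 3931)]
index = build_pageid_index(fake_pages)
outlines = [
    (1, 'account information client 20',
     [SimpleNamespace(objid=3918), 'FitH', 36], None, None),
    (1, 'account information client 21',
     [SimpleNamespace(objid=3931), 'FitH', 36], None, None),
]
pages = outline_page_range(
    outlines, 'account information client 20', index, len(index))
print(list(pages))  # zero-based page indexes
```

With a real document, the index would be built from PDFPage.create_pages(doc) and the outlines taken from doc.get_outlines(); the extraction loop from the snippet above could then wrap the pages in enumerate() and call interpreter.process_page(page) only when the position is in the returned range.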
I’m using Python 3.8.5 and pdfminer.six.
Thanks!