
I’m using pdfminer.six to extract text from a PDF file. I’ve tried other PDF extractors, but only pdfminer handles the text the way I need.

I want to extract the text from a specific outline (bookmark) that matches a search criterion.

The PDFDocument class has the method get_outlines for extracting outlines. It returns a generator of tuples, each containing the outline's level, title, destination, and other information. The "destination" value is a list made of a PDFObjRef class instance and other information.

This is what the data returned by get_outlines looks like:

(...)

(1, 'account information client 20', [PDFObjRef:3918, /'FitH', 36], None, None)

(1, 'account information client 21', [PDFObjRef:3931, /'FitH', 36], None, None)

(...)

The pdfminer documentation page says, ‘Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page’. The number of the PDFObjRef in the example above is not a page number: the PDF I used for this example has only 933 pages.
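For what it's worth, the number in PDFObjRef looks like a PDF object id rather than a page number, so one possible approach is to look it up against the object ids of the pages. The following is only a minimal sketch, assuming that each page yielded by PDFPage.create_pages exposes its object id as `pageid`, that the reference exposes its id as `objid`, and that the destination is an explicit list as in the sample output above (named destinations and action dictionaries are not handled). The helper names `build_page_index` and `destination_page` are made up for illustration.

```python
def build_page_index(pages):
    """Map each page's PDF object id to its zero-based position.

    `pages` is any iterable of objects with a `pageid` attribute,
    for example the pages yielded by PDFPage.create_pages(doc).
    """
    return {page.pageid: number for number, page in enumerate(pages)}


def destination_page(dest, page_index):
    """Return the zero-based page number for an outline destination.

    Returns None when the destination is not an explicit list or when
    its target is not a known page.
    """
    if not isinstance(dest, (list, tuple)) or not dest:
        return None
    target = dest[0]
    # Some documents store a plain page number instead of a reference.
    if isinstance(target, int):
        return target
    # Otherwise assume a reference object carrying an `objid` attribute.
    return page_index.get(getattr(target, "objid", None))
```

With a parsed document this might be used as `index = build_page_index(PDFPage.create_pages(doc))`, followed by `destination_page(dest, index)` for each tuple returned by `doc.get_outlines()`.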

As I said in the beginning, I need to extract the text only from one of the many outlines the PDF file has. With the following snippet, I can create a generator and extract every page in a sequence:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('samples/simple1.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

result = output_string.getvalue()

But I don’t know how to point to a specific page destination (or a page range between two destinations) and extract the text from only that fragment.

Can someone please help? How do I convert PDFObjRef:3918 and PDFObjRef:3931 to a page number, or how do I extract data from an interval like this?

I'm using Python 3.8.5 and pdfminer.six.
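To make the "interval" part concrete: once each outline title is paired with a zero-based starting page, the range for one outline could be taken as everything up to the page where the next outline starts. This is a rough sketch under that assumption, not a tested solution against a real PDF; `pages_for_outline`, `entries`, and `matches` are hypothetical names.

```python
def pages_for_outline(entries, matches, total_pages):
    """Return the zero-based page numbers covered by the first matching outline.

    `entries` is a list of (title, first_page) pairs in document order,
    `matches` is a predicate applied to each title, and `total_pages`
    is the number of pages in the document.
    """
    for position, (title, first_page) in enumerate(entries):
        if not matches(title):
            continue
        if position + 1 < len(entries):
            # Stop before the page where the next outline starts, but
            # always keep at least the page this outline starts on.
            stop = max(entries[position + 1][1], first_page + 1)
        else:
            # The last outline runs to the end of the document.
            stop = total_pages
        return set(range(first_page, stop))
    return set()
```

The resulting set could then be passed as the `pagenos` argument of `PDFPage.get_pages(in_file, pagenos=...)` in place of `PDFPage.create_pages(doc)`, or as `page_numbers` to `pdfminer.high_level.extract_text`. Note that the page on which the next outline begins is excluded here, so any text of the wanted section that spills onto that page would be lost.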

Thanks!

  • Did you find a solution for this? – Ali Kareem Raja Mar 23 '21 at 11:41
  • Unfortunately I didn't. But I developed a glitchy workaround: I use PyPDF2 to find the outline and its pages, extract them to a new PDF object, save it as a temporary PDF file (using Python's built-in "tempfile" library), and then pass it to pdfminer to extract its text. It is neither beautiful nor optimal, but it works fine for me. – liquidpasta Mar 24 '21 at 15:40
