
I'm having trouble with PDFPageInterpreter in pdfminer. The code below has worked on every PDF I've tried so far, but I recently found that on a page with an extreme amount of text (for example, a condensed data table set in 3pt font) it gets stuck on the following line. It neither continues nor throws an error:

interpreter.process_page(page)

Code:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Import this to raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTPage, LTChar
from pdfminer.converter import PDFPageAggregator

path = "file_path_within_current_directory"

with open(path, 'rb') as f:

    parser = PDFParser(f)
    document = PDFDocument(parser)

    # Check if document is extractable, if not abort
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Create PDFResourceManager object that stores shared resources
    # such as fonts or images
    rsrcmgr = PDFResourceManager()

    # set parameters for analysis
    laparams = LAParams()

    # Create the page aggregator device to get LT layout objects
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)

    # Create interpreter object to process page content from
    # PDFDocument. Interpreter needs to be connected to resource
    # manager for shared resources and device. 
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Now that we have everything needed to process a PDF document,
    # let's process it page by page
    for page in PDFPage.create_pages(document):
        # As the interpreter processes the page stored in PDFDocument
        # object
        interpreter.process_page(page)

        # The device renders the layout from interpreter
        layout = device.get_result()

Once I have the layout object I can do everything else I need, but if the interpreter hasn't run first, device.get_result() returns an essentially empty layout tree. Is there a way to make the interpreter handle these extremely text-dense pages? That would be my ideal solution. If that's impossible, how can I put a timeout on the call, so that if it gets stuck on that line the loop simply moves on to the next page? I've tried the following code, but it spawns a large number of subprocesses that don't get joined properly.

import multiprocessing
import time

# Now that we have everything needed to process a PDF document,
# let's process it page by page
for page in PDFPage.create_pages(document):
    # As the interpreter processes the page stored in PDFDocument
    # object
    p = multiprocessing.Process(target=interpreter.process_page,
                                args=(page,))
    p.start()
    # Wait for 10 seconds or until process finishes
    p.join(10)

    # Give up on the page if it took longer than 10 seconds to
    # interpret.
    if p.is_alive():
        p.terminate()
        p.join()
        time.sleep(1)
        continue

    # The device renders the layout from interpreter
    layout = device.get_result()
Malcoto

1 Answer


I had the same problem. After some searching, I realized that interpreter.process_page() is a very slow function and you have to give it plenty of time to run. It will eventually finish. If you want to make sure it is working at all, test interpreter.process_page() on a very simple PDF first.

Note: As of 2020, PDFMiner is not actively maintained. See: https://github.com/euske/pdfminer/commit/423f851fc20ebd701bc4c8b5b7ba0e7904e18e3b

Instead, you can use pdfminer.six: https://github.com/pdfminer/pdfminer.six
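If waiting indefinitely is not an option, one way to cap the time spent on a page without spawning subprocesses is an alarm-signal timeout. The following is only a minimal sketch, not something pdfminer provides: `PageTimeout`, `time_limit`, `call_with_limit` and `slow_stand_in` are hypothetical names made up for illustration, and the approach assumes a Unix system with the call running in the main thread (`signal.SIGALRM` is not available on Windows).

```python
import signal
import time
from contextlib import contextmanager


class PageTimeout(Exception):
    """Raised when the guarded block exceeds its time budget (hypothetical name)."""


@contextmanager
def time_limit(seconds):
    """Raise PageTimeout if the with-block runs longer than `seconds`.

    Assumption: Unix only, and only usable from the main thread.
    """
    def _on_alarm(signum, frame):
        raise PageTimeout(f"gave up after {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel the pending alarm and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def call_with_limit(func, seconds, *args):
    """Return (True, result), or (False, None) if func(*args) timed out."""
    try:
        with time_limit(seconds):
            return True, func(*args)
    except PageTimeout:
        return False, None


def slow_stand_in(delay):
    """Stand-in for interpreter.process_page(page); it only sleeps."""
    time.sleep(delay)
    return "done"


if __name__ == "__main__":
    print(call_with_limit(slow_stand_in, 1.0, 0.01))  # finishes in time
    print(call_with_limit(slow_stand_in, 0.2, 5))     # hits the limit
```

In the page loop from the question, this would be used roughly as `ok, _ = call_with_limit(interpreter.process_page, 10, page)`, followed by `continue` when `ok` is false. An interrupted call may leave the device and interpreter in a half-finished state, so it is probably safer to create fresh ones before the next page. With pdfminer.six it may also be worth trying `LAParams(boxes_flow=None)`, which the documentation describes as disabling the advanced layout analysis; whether that is the slow step for a given file is an assumption to check.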

Ali A.