I'm having trouble with the PDFPageInterpreter in pdfminer. The code below has worked on every PDF file I've tried so far, but I recently found that on a page with an extreme amount of text (for example, a condensed data table set in 3pt font) it gets stuck on the following line, and neither continues nor raises an error:
interpreter.process_page(page)
Code:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Import this to raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTPage, LTChar
from pdfminer.converter import PDFPageAggregator
path = "file_path_within_current_directory"
with open(path, 'rb') as f:
    parser = PDFParser(f)
    document = PDFDocument(parser)
    # Check if document is extractable, if not abort
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create PDFResourceManager object that stores shared resources
    # such as fonts or images
    rsrcmgr = PDFResourceManager()
    # Set parameters for analysis
    laparams = LAParams()
    # Create the page aggregator device to get LT object elements
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create interpreter object to process page content from
    # PDFDocument. Interpreter needs to be connected to resource
    # manager for shared resources and device.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Ok now that we have everything to process a pdf document, lets
    # process it page by page
    for page in PDFPage.create_pages(document):
        # As the interpreter processes the page stored in PDFDocument
        # object
        interpreter.process_page(page)
        # The device renders the layout from interpreter
        layout = device.get_result()
Once I have the layout object I'm home free, but if the interpreter hasn't run first, device.get_result() returns what is essentially an empty layout tree. Does anyone know of a way to make the interpreter handle these extremely information-dense pages? That would be my ideal solution. If that's impossible, does anyone know how to put a timeout on the call, so that if it gets stuck on that line the loop just moves on to the next page? I've tried the following code, but it ends up creating a large number of subprocesses that don't get joined properly.
import multiprocessing
import time
# Ok now that we have everything to process a pdf document, lets
# process it page by page
for page in PDFPage.create_pages(document):
    # As the interpreter processes the page stored in PDFDocument
    # object
    p = multiprocessing.Process(target=interpreter.process_page,
                                args=(page,))
    p.start()
    # Wait for 10 seconds or until process finishes
    p.join(10)
    # Give up on the page if it took longer than 10 seconds to
    # interpret.
    if p.is_alive():
        p.terminate()
        p.join()
        time.sleep(1)
        continue
    # The device renders the layout from interpreter
    layout = device.get_result()
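
For comparison, here is a minimal sketch of the kind of timeout wrapper I have in mind. It is not pdfminer-specific: `run_with_timeout` and `_worker` are hypothetical names, and it assumes a POSIX system (it uses the "fork" start method) and that the return value can be pickled. The child sends its result back through a queue, because anything a child process computes stays in the child unless it is explicitly passed back to the parent.

```python
import multiprocessing as mp
import queue
import time  # only needed for the sleep-based demo call in the note below


def _worker(func, args, out):
    # Runs in the child: send back either the result or the exception.
    try:
        out.put(("ok", func(*args)))
    except Exception as exc:
        out.put(("error", exc))


def run_with_timeout(func, args=(), timeout=10.0):
    """Run func(*args) in a child process.

    Returns (True, result) if the call finished within `timeout` seconds,
    or (False, None) if it had to be terminated.
    """
    # Assumption: POSIX only. "fork" lets the child inherit func and args
    # directly instead of pickling them.
    ctx = mp.get_context("fork")
    out = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, args, out))
    proc.start()
    try:
        # Read the result before joining, so a large result cannot block
        # the child while it is still writing to the queue.
        status, value = out.get(timeout=timeout)
    except queue.Empty:
        # Timed out: stop the child and reap it so no process is left over.
        proc.terminate()
        proc.join()
        return False, None
    proc.join()
    if status == "error":
        raise value
    return True, value
```

A call such as `run_with_timeout(time.sleep, (30,), timeout=0.5)` gives up after about half a second. For the PDF case, the function passed in would presumably need to call both interpreter.process_page(page) and device.get_result() inside the child and return the layout, which assumes the layout objects survive pickling. I have not verified that.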