Extracting text in known bbox from pdf, PDFQuery too slow

Question

I've found the bbox coordinates in the lxml file and managed to extract the wanted data with PDFQuery. Then I write the data to a csv file.

from pathlib import Path

import pandas as pd
import pdfquery

def pdf_scrape(pdf):
    """
    Extract each relevant information individually
    input: pdf to be scraped
    returns: dataframe of scraped data
    """
    # Define coordinates of text to be extracted
    CUSTOMER             = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 563.285, 624.656, 580.888")').text() 
    CUSTOMER_REF         = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 534.939, 443.186, 552.542")').text()
    SALES_ORDER          = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 504.692, 414.352, 522.295")').text()
    ITEM_NUMBER          = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 478.246, 395.129, 495.849")').text()
    KEY                  = '0000'+ SALES_ORDER + '-' + '00' + ITEM_NUMBER
    # Combine all relevant information into a single pandas dataframe
    page = pd.DataFrame({
        'KEY'          : KEY,
        'CUSTOMER'     : CUSTOMER,
        'CUSTOMER REF.': CUSTOMER_REF,
        'SALES ORDER'  : SALES_ORDER,
        'ITEM NUMBER'  : ITEM_NUMBER
    }, index=[0])
    return page

pdf_search = Path("files/").glob("*.pdf")

pdf_files = [str(file.absolute()) for file in pdf_search]

master = []
for pdf_file in pdf_files:
    pdf = pdfquery.PDFQuery(pdf_file)
    pdf.load(0)  # parse only the first page

    # Scrape the first page and collect the resulting one-row dataframe
    page = pdf_scrape(pdf)
    master.append(page)

master = pd.concat(master, ignore_index=True)
master.to_csv(r'scraped_PDF_as_csv\scraped_PDF_DataFrame.csv', index=False)

The problem is that I need to read through hundreds of PDFs each day, and this script takes ~13-14 seconds to extract four elements from the first page of only 10 PDFs.

Is there a way to speed up my code? I've looked at this: https://github.com/py-pdf/benchmarks, which suggests that PDFQuery is very slow compared to other libraries.

I've tried using PyMuPDF as it's supposed to be faster, but I'm having trouble implementing it to give the same output as PDFQuery. Does anyone know how to do this?

To reiterate, I know where in the document the desired text is, but I don't necessarily know what it says.

Zach Young · Accepted Answer · 2022-06-08T05:37:42.503

I've explored PyMuPDF a little while answering other questions here on SO, but I have no personal/practical experience with it. I knew nothing of PDFQuery before this post. Still, I can show my take on a very basic sample of getting a single piece of text based on location with PyMuPDF.

Also, you don't need to infer from those timings that PDFQuery is slow; the author points this out multiple times in the docs:

Performance Note: The initial call to pdf.load() runs very slowly, because the underlying pdfminer library has to compare every element on the page to every other element. See the Caching section to avoid this on subsequent runs.

PDFQuery

import pdfquery

query1 = (176.4, 629.28, 176.4, 629.28)  # "Text 1" in simple.pdf
pdf = pdfquery.PDFQuery("simple.pdf")

# query1 = (130, 407, 130, 407)  # Looking for "Gaussian" in more_complicated.pdf
# pdf = pdfquery.PDFQuery("more_complicated.pdf")

pdf.load(0)

text1 = pdf.pq('LTTextLineHorizontal:overlaps_bbox("%d, %d, %d, %d")' % query1).text()

print(text1)

PyMuPDF

I'm still not sure how to best approach this task with PyMuPDF, but here's a way that at least gives me the target texts for both simple and complicated:

from fitz import open as fitz_open, Document, Page, Rect

query1 = Rect(165.6, 165.6, 165.6, 165.6)  # "Text 1" in simple.pdf
doc: Document = fitz_open("simple.pdf")

# query1 = Rect(130, 381, 130, 381)  # Looking for "Gaussian" in more_complicated.pdf
# doc: Document = fitz_open("more_complicated.pdf")

page: Page = doc.load_page(0)

page_dict: dict = page.get_text("dict")

bbox: Rect  # a variable we'll reuse as we work down to our query
text1 = ""  # the text we're looking for with query1

block: dict
for block in page_dict["blocks"]:
    if block["type"] == 1:  # skip, it's an image
        continue

    bbox = Rect(block["bbox"])
    if not bbox.contains(query1):
        continue

    line: dict
    for line in block["lines"]:

        bbox = Rect(line["bbox"])
        if not bbox.contains(query1):
            continue

        span: dict
        for span in line["spans"]:

            bbox = Rect(span["bbox"])
            if not bbox.contains(query1):
                continue

            text1 = span["text"]

print(text1)

Analysis

(You might have noticed that the query coordinates are different between PDFQuery and PyMuPDF, and that's because PDFQuery uses the bottom-left as the origin, and PyMuPDF uses the upper-left as the origin.)
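
The flip between the two coordinate systems can be sketched as a small helper. This is only a sketch and not part of either library: the function name pdfquery_bbox_to_top_left is hypothetical, the page height of 600 points is an assumed value for illustration, and it assumes an unrotated page with no cropbox offset.

```python
def pdfquery_bbox_to_top_left(bbox, page_height):
    """Convert (x0, y0, x1, y1) from a bottom-left origin to a top-left origin.

    bbox: coordinates as PDFQuery reports them (y grows upwards).
    page_height: height of the page in points.
    """
    x0, y0, x1, y1 = bbox
    # x is unchanged; y is mirrored, so the old top edge (y1) becomes the new y0
    return (x0, page_height - y1, x1, page_height - y0)


# CUSTOMER bbox from the question; 600.0 is an assumed page height in points
customer_bbox = (356.684, 563.285, 624.656, 580.888)
converted = pdfquery_bbox_to_top_left(customer_bbox, page_height=600.0)
print(converted)
```

With PyMuPDF, the real height should be available as page.rect.height, and the converted tuple can be passed to Rect(...).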

I also measured the run times with the time command on macOS 12.4; average of 3 runs. Here are my results for running both PDFQuery and PyMuPDF against simple.pdf and more_complicated.pdf:

                       simple.pdf    more_complicated.pdf
PDFQuery timing (s)    0.123         0.258
PyMuPDF timing (s)     0.069         0.070

PyMuPDF processes both PDFs in almost the same time, and I think we're seeing PDFQuery take longer on the more complicated file because of those n**2/2 cross-comparisons.
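
To get a feel for how quickly pairwise comparisons grow, here is a small arithmetic sketch. The element counts are hypothetical and were not measured from the two test PDFs; the helper name pair_count is made up for illustration.

```python
def pair_count(n):
    """Number of unordered pairs among n elements: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical numbers of layout elements on a page
for n in (10, 100, 1000):
    print(n, pair_count(n))
```

Going from 10 to 1000 elements multiplies the element count by 100 but the number of pairs by roughly 10,000, which would fit a busier page taking noticeably longer to load.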

I think you'll be giving up a lot of convenience if you try to do this yourself. If your PDFs are consistent you could probably tune PyMuPDF and get it just right, but if there's variation in how they were created it might take longer to get right (if ever, because text in PDFs is deceptively tricky).

Thank you very much for the explanation, and the analysis. Do you have any tips on how to find the coordinates of the desired text? I'm able to do it for PDFQuery by converting to lxml file and searching for the text with ctrl + f, noting the bbox coords. This works as the text will be in the same location each time. But, as you mentioned, the coordinates are different with PyMuPDF. Can't find a built in way to do it. — NOVEREI, Jun 08 '22 at 06:39
The docs mentioned just using a measuring tool in software, like Acrobat’s “Measure” tool. And then doing the inches to points conversion, if needed. — Zach Young, Jun 08 '22 at 07:37
