I've found the bbox coordinates in the lxml file and managed to extract the wanted data with PDFQuery. Then I write the data to a csv file.
def pdf_scrape(pdf):
"""
Extract each relevant information individually
input: pdf to be scraped
returns: dataframe of scraped data
"""
# Define coordinates of text to be extracted
CUSTOMER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 563.285, 624.656, 580.888")').text()
CUSTOMER_REF = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 534.939, 443.186, 552.542")').text()
SALES_ORDER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 504.692, 414.352, 522.295")').text()
ITEM_NUMBER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 478.246, 395.129, 495.849")').text()
KEY = '0000'+ SALES_ORDER + '-' + '00' + ITEM_NUMBER
# Combine all relevant information into a single pandas dataframe
page = pd.DataFrame({
'KEY' : KEY,
'CUSTOMER' : CUSTOMER,
'CUSTOMER REF.': CUSTOMER_REF,
'SALES ORDER' : SALES_ORDER,
'ITEM NUMBER' : ITEM_NUMBER
}, index=[0])
return(page)
pdf_search = Path("files/").glob("*.pdf")
pdf_files = [str(file.absolute()) for file in pdf_search]
master = list()
for pdf_file in pdf_files:
pdf = pdfquery.PDFQuery(pdf_file)
pdf.load(0)
# Iterate over all pages in document and add scraped data to df
page = pdf_scrape(pdf)
master.append(page)
master = pd.concat(master, ignore_index=True)
master.to_csv('scraped_PDF_as_csv\scraped_PDF_DataFrame.csv', index = False)
The problem is that I need to read through hundres of PDFs each day, and this script takes ~13-14 seconds to mine four elements from the first page of only 10 PDFs.
Is there a way to speed up my code? I've looked at the this: https://github.com/py-pdf/benchmarks which implies that PDFQuery is very slow compared to other libraries.
I've tried using PyMuPDF as it's supposed to be faster, but I'm having trouble implementing it to give the same output as PDFQuery. Does anyone know how to do this?
To reiterate, I know where in the document the desired text is, but I don't necessarily know what it says.