
I'm trying to extract information from a PDF using the package PDFQuery. The information is not in the same location every time so I need to have a query tag. First, I wrote the function:

def clean_text_data(text):
    return text.split(':')[1]

I then wrote this line to extract the text:

Date = clean_text_data(pdf.pq('LTTextLineHorizontal:contains("Date")').text())

The problem, however, is that (for some reason) almost all of the data is in the next 'LTTextLineHorizontal' element.

The XML looks like this:

<LTTextLineHorizontal bbox="[58.501, 377.094, 78.501, 385.094]" height="8.0" width="20.0" word_margin="0.1" x0="58.501" x1="78.501" y0="377.094" y1="385.094"><LTTextBoxHorizontal bbox="[58.501, 377.094, 78.501, 385.094]" height="8.0" index="39" width="20.0" x0="58.501" x1="78.501" y0="377.094" y1="385.094">Date: </LTTextBoxHorizontal></LTTextLineHorizontal>
<LTTextLineHorizontal bbox="[107.249, 377.334, 147.281, 385.334]" height="8.0" width="40.032" word_margin="0.1" x0="107.249" x1="147.281" y0="377.334" y1="385.334"><LTTextBoxHorizontal bbox="[107.249, 377.334, 147.281, 385.334]" height="8.0" index="40" width="40.032" x0="107.249" x1="147.281" y0="377.334" y1="385.334">02/26/2020 </LTTextBoxHorizontal></LTTextLineHorizontal>

Here the Date is 02/26/2020, but it is in the box immediately following. How do I create a function to extract the following box?

Alex

1 Answer


You can do something like this:

    label = pdf.pq('LTTextLineHorizontal:contains("Date")')
    left_corner = float(label.attr('x0'))
    bottom_corner = float(label.attr('y0'))

In this first part, I'm finding the area of the PDF that contains "Date" and extracting the coordinates of its bounding box, so (x0, y0) is the lower-left corner of wherever "Date" is written.

    name = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (
        left_corner, bottom_corner - 12, left_corner + 350, bottom_corner)).text()

Afterward, I offset those coordinates to create a new bbox that contains the information I'm actually looking for, and I get its .text().

The coordinates are offset in points, which you can measure with Acrobat's ruler.
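
Note that in_bbox only matches elements that fit entirely inside the box, and the offsets above look below the label. In the XML from the question, the value sits to the right of the label on the same line, so the box probably needs to extend rightward instead. Here is a minimal sketch of that adjustment using only the coordinates posted in the question. The helper names (box_right_of, fits_inside, in_bbox_selector) are made up for illustration and are not part of PDFQuery, and the width and pad values are guesses that would need tuning against the real PDF.

```python
def box_right_of(x0, y0, x1, y1, width=200.0, pad=2.0):
    """Return (x0, y0, x1, y1) of a search box to the right of a label box.

    width and pad are in points; both are assumptions to tune per document.
    """
    return (round(x1, 3), round(y0 - pad, 3),
            round(x1 + width, 3), round(y1 + pad, 3))


def fits_inside(outer, inner):
    """True if box `inner` lies entirely within box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def in_bbox_selector(box):
    """Build the selector string that would be passed to pdf.pq()."""
    return 'LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % box


# Coordinates copied from the XML in the question.
label_box = (58.501, 377.094, 78.501, 385.094)    # "Date: "
value_box = (107.249, 377.334, 147.281, 385.334)  # "02/26/2020 "

search_box = box_right_of(*label_box)
selector = in_bbox_selector(search_box)
```

With a loaded document, the label coordinates would come from label.attr('x0'), label.attr('y0'), label.attr('x1') and label.attr('y1') (converted with float), and pdf.pq(selector).text() should then return the text of the box to the right. The small vertical padding is there because the value's bbox in the question is shifted up slightly relative to the label's.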

Source is here: https://pypi.org/project/pdfquery/#quick-start

The quickstart guide has a really good example.

SteelMasimo