This is the first time I've used PDFQuery to scrape PDFs.

What I need to do is get prices from a multi-page price list: I want to give PDFQuery a product code and have it find the code and return the price next to it. The very first example on the GitHub page gets the location of the text, and it clearly says "Note that we don't have to know where the name is on the page, or what page it's on". That's the case with my price list, but all the other examples specify the page number ( LTPage[pageid=1] ), and I don't see where the page number comes from.

And if I don't specify the page number, it returns ALL the text at the same location on ALL the pages.

Also, I added an exactText function because the codes could be, for example, "92005", "92005C", or "92005G", so :contains alone would match more than one of them.

I've tried selecting the page where the element is located, and I've tried jQuery's .closest, both with no luck.

I checked the PDFMiner and PyQuery documentation, but I see nothing that helps me =(

My code looks like this right now:

import pdfquery

pdf = pdfquery.PDFQuery("tests/samples/priceList.pdf")
pdf.load()

code = "92005G"

def exactText():
    element = str(vars(this))
    text = str("u'" + code + "\\n'")
    if text in element:
        return True
    return False

#This should work if i could select the page where the element is located
#page = pdf.pq('LTPage:contains("'+code+'")')
#pageNum = page.attr('pageid')

#Here I would replace the "8" with the page number I get, or remove the LTPage
#selector altogether if I need to find the element first and then the page
label = pdf.pq('LTPage[page_index="8"] LTTextLineHorizontal:contains("'+code+'")').filter(exactText)

#Since we could use "JQuery selectors" i tried using ".closest", but it returns nothing
#page = label.closest('LTPage')
#pageNum = page.attr('pageid')

left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))

#Here I would replace the "8" with the page number I get
price = pdf.pq('LTPage[page_index="8"] LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (left_corner+110, bottom_corner, left_corner+140, bottom_corner+20)).text()
print price

Any help is much appreciated!

aampudia

2 Answers

There may be a more elegant way, but to find the page an element is on I used .iterancestors('LTPage'). The example code below finds every instance of "My Text" and tells you which page it is on:

for pq in pdf.pq('LTTextLineHorizontal:contains("My Text")'):
    page_pq = pq.iterancestors('LTPage').next()   # Use just the first ancestor
    print 'Found the text "%s" on page %s' % ( pq.layout.get_text(), page_pq.layout.pageid)
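
For anyone who wants to try the "find the line, then walk up to its page" idea without a PDF at hand, here is a minimal sketch using only the standard library. The XML is a hand-written stand-in that I made up for illustration, not real pdfquery output, and find_page_of is a hypothetical helper name. The standard library's ElementTree has no iterancestors(), so the sketch builds a parent map instead; with pdfquery's lxml elements you would call iterancestors('LTPage') directly, as above.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a pdfquery tree (an assumption, not real output).
MOCK_TREE = """
<pdfxml>
  <LTPage page_index="0" pageid="1">
    <LTTextLineHorizontal x0="50" y0="700">92005C</LTTextLineHorizontal>
  </LTPage>
  <LTPage page_index="1" pageid="2">
    <LTTextLineHorizontal x0="50" y0="640">92005G</LTTextLineHorizontal>
    <LTTextLineHorizontal x0="170" y0="640">12.50</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""


def find_page_of(root, code):
    """Return the LTPage element holding a line whose text is exactly `code`."""
    # ElementTree has no iterancestors(), so record each node's parent first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for line in root.iter("LTTextLineHorizontal"):
        if (line.text or "").strip() == code:
            node = line
            # Walk upward until we reach the enclosing page (or run out of parents).
            while node is not None and node.tag != "LTPage":
                node = parent_of.get(node)
            return node
    return None


root = ET.fromstring(MOCK_TREE)
page = find_page_of(root, "92005G")
print(page.get("page_index"), page.get("pageid"))  # prints: 1 2
```

Note that the two attributes differ by one in this mock, which mirrors how I understand pdfquery numbers its pages (page_index from 0, pageid from 1), so it matters which one you feed back into a selector.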

I hope that helps! :)

albrnick
  • Hi! I'll try it out. I had "solved" my problem with a loop that goes through every page in the PDF: search the first page, and if the code isn't there, go to page 2, and so on, so I always know which page I'm looking at. I'm not sure whether that causes a performance problem, but I tried it with a 200-page PDF and it worked. Thank you so much for your answer! I'll try it out and come back. – aampudia May 26 '16 at 20:41

This should work in Python 3 (note the call to next(iterator) to get the first page ancestor):

code = "92005G"

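# Take the first text line containing the code and walk up to its page.
# (iterancestors is a method of the lxml element, not of the PyQuery result.)
line = pdf.pq('LTTextLineHorizontal:contains("{}")'.format(code))[0]
page_pq = next(line.iterancestors('LTPage'))
# page_index is 0-based while pageid is 1-based, so read page_index here.
pageNum = int(page_pq.get('page_index'))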

label = pdf.pq('LTPage[page_index="{0}"] LTTextLineHorizontal:contains("{1}")'.format(pageNum, code)).filter(exactText)

left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))

price = pdf.pq('LTPage[page_index="{0}"] LTTextLineHorizontal:in_bbox("{1}, {2}, {3}, {4}")'.format(pageNum, left_corner+110, bottom_corner, left_corner+140, bottom_corner+20)).text()
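
The exactText filter from the question is reused above. As a rough sketch of a simpler way to express the same check, the comparison can be done on the stripped text of the line rather than on the string form of vars(this). The helper name is_exact_code is hypothetical, and how you obtain each line's text from pdfquery is left out here on purpose, since I have not verified that part.

```python
def is_exact_code(line_text, code):
    """True only when the line's text is exactly `code`, ignoring surrounding whitespace."""
    return line_text.strip() == code


# Text lines as they might come out of a PDF, each with a trailing newline.
lines = ["92005\n", "92005C\n", "92005G\n"]

# A substring test such as :contains("92005") would accept all three lines;
# the exact comparison keeps only the first.
matches = [text for text in lines if is_exact_code(text, "92005")]
print(matches)
```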
Benji