I followed the example from this answer to get the editable field values from a PDF document:

How to extract PDF fields from a filled out form in Python?

For each field I get a data structure like the one below, but the list includes the fields from every page. How can I determine which page each field is on? In the debugger I looked into the 'AP' and 'P' items, which are PDFObjRefs, but that didn't lead me anywhere.

'AP' = {dict: 1} {'N': <PDFObjRef:1947>}
'DA' = {bytes: 23} b'0 0 0 rg /ArialMT 10 Tf'
'F' = {int} 4
'FT' = {PSLiteral} /'Tx'
'M' = {bytes: 23} b"D:20200129121854-06'00'"
'MK' = {dict: 0} {}
'P' = {PDFObjRef} <PDFObjRef:1887>
'Rect' = {list: 4} [36.3844, 28.5617, 254.605, 55.1097]
'StructParent' = {int} 213
'Subtype' = {PSLiteral} /'Widget'
'T' = {bytes: 12} b'CustomerName'
'TU' = {bytes: 13} b'Customer Name'
'Type' = {PSLiteral} /'Annot'
'V' = {bytes: 21} b'Ball-Mart Stores, Inc.'

TIA

naren8642

2 Answers

I had the same problem. It took me two hours until I found page.annots by inspecting the PDF.

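It works with pdfminer (PDFPage and PDFObjRef are pdfminer classes, not PyPDF2). doc has to be a PDFDocument, not a bare file handle:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

doc = PDFDocument(PDFParser(open('sample.pdf', 'rb')))

idtopg = {}  # field name -> zero-based page index
for pge, page in enumerate(PDFPage.create_pages(doc)):
    # page.annots may be None, a list, or a reference to a list
    for annot in resolve1(page.annots) or []:
        por = resolve1(annot)
        if 'T' in por:  # skip annotations that carry no field name
            idtopg[por['T'].decode('utf-8')] = pge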

Now look at the 'T' of each of your fields. The dict built above gives you the zero-based page index for each 'T':

myfieldid = thenameofyourfield['T'].decode('utf-8')
print("The field id {0} is on page {1}".format(myfieldid, idtopg[myfieldid]))
Tom
  • This will only work for form field objects which are merged with their widget annotation, in particular not for fields with more than one widget. – mkl Jul 24 '20 at 08:35
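
To cover the case raised in the comment above, here is a minimal sketch, not a tested drop-in. It assumes that a widget which is not merged with its field has no 'T' of its own and reaches the field name through its 'Parent' entry. The helper names qualified_name and pages_by_field are made up for illustration, and the resolve parameter is meant to be pdfminer's resolve1 (it defaults to the identity so the logic can be tried on plain dicts).

```python
from collections import defaultdict


def qualified_name(annot, resolve=lambda obj: obj):
    """Join the 'T' entries along the 'Parent' chain, root first."""
    parts = []
    node = resolve(annot)
    for _ in range(64):  # arbitrary depth cap to guard against broken files
        if not isinstance(node, dict):
            break
        t = node.get('T')
        if t is not None:
            parts.append(t.decode('utf-8') if isinstance(t, bytes) else str(t))
        node = resolve(node.get('Parent'))
    return '.'.join(reversed(parts))


def pages_by_field(pages_annots, resolve=lambda obj: obj):
    """Map each field name to the list of page indexes holding its widgets.

    pages_annots is an iterable with one annotation list (or None) per page.
    """
    result = defaultdict(list)
    for page_index, annots in enumerate(pages_annots):
        for annot in resolve(annots) or []:
            widget = resolve(annot)
            subtype = widget.get('Subtype')
            # pdfminer gives a PSLiteral here; plain strings are accepted too
            if getattr(subtype, 'name', subtype) != 'Widget':
                continue
            name = qualified_name(widget, resolve)
            if name and page_index not in result[name]:
                result[name].append(page_index)
    return dict(result)
```

With pdfminer this would presumably be called as pages_by_field((page.annots for page in PDFPage.create_pages(doc)), resolve1). A field with several widgets then maps to several page indexes instead of a single one.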

I was able to get the page number for the fields by doing the following:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

# PdfUtility.resource_path is not part of pdfminer; substitute the path to your PDF.
with open(PdfUtility.resource_path(filename), 'rb') as fp:
    doc = PDFDocument(PDFParser(fp))
    kids = resolve1(resolve1(doc.catalog['Pages'])['Kids'])
    field_list = []
    # count pages from 1 while walking the top-level 'Kids' array
    for page, kid in enumerate(kids, start=1):
        # a page without annotations has no 'Annots' entry
        kid_fields = resolve1(resolve1(kid).get('Annots')) or []
        for i in kid_fields:
            field = resolve1(i)
            name, position = field.get('T'), field.get('Rect')
            if name:
                field_dict = {
                    'name': name.decode('utf-8'),
                    'page': page,
                    'position': position,
                }
                print(field_dict)
                field_list.append(field_dict)
  • This will only work for form field objects which are merged with their widget annotation, in particular not for fields with more than one widget. Furthermore, it will not work for non-flat page trees. – mkl Jan 06 '21 at 06:07
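
For the non-flat page tree case mentioned in the comment above, one option is to walk 'Kids' recursively instead of reading only the top level. This is a rough sketch under the assumption that intermediate page tree nodes have a 'Kids' entry and leaf pages do not. The name iter_leaf_pages is hypothetical, and resolve is meant to be pdfminer's resolve1 (identity by default so it can be tried on plain dicts).

```python
def iter_leaf_pages(node, resolve=lambda obj: obj):
    """Yield leaf page dictionaries of a page tree in document order."""
    node = resolve(node)
    kids = node.get('Kids')
    if kids is None:
        # No 'Kids' entry: treat this node as an actual page
        yield node
        return
    for kid in resolve(kids):
        # Intermediate node: descend into each child in order
        yield from iter_leaf_pages(kid, resolve)
```

In the loop above, enumerate(iter_leaf_pages(doc.catalog['Pages'], resolve1), start=1) could then stand in for enumerate(kids, start=1). Alternatively, PDFPage.create_pages(doc) already traverses the tree for you.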