pdfminer doesn't extract data from filled-out pdf form

Question

I'm trying to use pdfminer to extract the filled-out contents in a pdf form. The instructions for accessing the pdf are:

Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
Click "Create Report" next to the fourth report from the top (i.e., Banking Organization Systemic Risk Report (FR Y-15))
Click "Your request for a financial report is ready"

To extract the contents in blue, I copied code from this post:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = 'FRY15_1073757_20160630.PDF'
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # Top-level entries of the AcroForm field array
    fields = resolve1(doc.catalog['AcroForm'])['Fields']

    # Print the partial name (/T) and value (/V) of each top-level entry
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))

This didn't extract the data fields as expected: nothing was printed. I tried the same code on another pdf and it worked, so I suspect the failure has to do with the security settings of the first pdf, which are shown below

For the second pdf on which the code worked, the security setting shows "Allowed" for all the actions. I also tried using pdfminer's pdf2txt.py functionality (see here) but the filled-out data in the fields in the original pdf form (which is what I want) was not in the converted text file; only the "flat" non-fillable part of the pdf was converted. Interestingly, if I use Adobe Reader's Save As Text to convert the pdf to a text file, the fillable part was in the converted text file. This is what I've been doing to get around the failed code.

Any idea how I can extract data directly from the pdf form? Thanks.

Concerning the PDF at https://www.ffiec.gov/nicpubweb/NICDataCache/FRY15/FRY15_1073757_20160630.PDF: *The resource you are looking for has been removed, had its name changed, or is temporarily unavailable.* Considering the URL saying "...DataCache..." this is not really surprising... — mkl, Dec 16 '16 at 09:28

score 0 · Answer 1 · answered Dec 18 '16 at 20:44


I can only explain what the problem is but cannot present a solution because I have no working Python knowledge.

Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.

While this expectation is often fulfilled, it only covers a special case: form fields are arranged as a tree structure with that Fields array as the root element. Your sample document, for example, contains a large tree.

Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
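
For illustration only, here is a minimal Python sketch of such a descent. It has not been run against the sample document, which is no longer available. The names `walk_fields` and `read_form` are hypothetical helpers, not part of pdfminer. The sketch assumes that child fields are listed under `Kids` and carry their own `T` entry, that kids without `T` are widget annotations of their parent, and that decoding byte names as Latin-1 is good enough (real names may use other encodings).

```python
def walk_fields(fields, resolve=lambda obj: obj, prefix=""):
    """Yield (qualified_name, value) for each terminal field in the tree.

    `resolve` turns an indirect reference into the object it points to;
    the default identity function lets the walker run on plain dicts.
    """
    for ref in (resolve(fields) or []):
        field = resolve(ref)
        partial = field.get("T")
        if isinstance(partial, bytes):
            # Assumption: Latin-1 is an adequate decoding for the name.
            partial = partial.decode("latin-1")
        name = ".".join(p for p in (prefix, partial) if p)
        kids = [resolve(k) for k in (resolve(field.get("Kids")) or [])]
        # Kids with their own /T are child fields; kids without it are
        # treated as widget annotations belonging to this field.
        child_fields = [k for k in kids if "T" in k]
        if child_fields:
            for item in walk_fields(child_fields, resolve, name):
                yield item
        else:
            yield name, resolve(field.get("V"))


def read_form(path):
    """Hypothetical helper: list all (name, value) pairs of a PDF form."""
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        acroform = resolve1(doc.catalog["AcroForm"])
        # Materialize the list while the file is still open, because
        # references are resolved lazily from the file.
        return list(walk_fields(acroform["Fields"], resolve1))
```

Calling `read_form('FRY15_1073757_20160630.PDF')` would then return the fields from every level of the tree rather than only the top-level entries. Values inherited from a parent field are not handled in this sketch.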

mkl


Which software did you use to dissect the pdf as shown above? Thanks. – zzhengnan Dec 19 '16 at 15:10
@Nero I use [iText RUPS](http://itextpdf.com/de/node/7) but any other pdf inspection software should show something like that, too. – mkl Dec 20 '16 at 12:57
I managed to extract the info I needed by opening the pdf in a text editor, but ran into a different problem which I believe is also tied to the internal workings of pdfs. Could you please take a look at this post: http://stackoverflow.com/questions/41232492/pdf-contents-dont-show-up-in-text-editor? Thanks. – zzhengnan Dec 20 '16 at 15:10
