
I have a Word document that was converted to PDF, and now I have to reverse engineer it to get the content of the fields. I am able to parse the text fields with TextAbsorber, but I can't find any way to tell whether a radio button is checked or not. In the extracted text it comes out like Yes {{r11}} No {{r12}}, and I can't tell which one is checked. I can't find any field in the form or on the page.

The PDF content looks like the following: enter image description here

When the content is read with TextAbsorber, it looks like the following: enter image description here

Kamran Shahid
  • @KJ in programming, not everything is our choice. At the moment I have a requirement to extract data about fields, and I am able to extract almost 90% of it, everything other than this radio button state – Kamran Shahid Mar 16 '23 at 08:01
  • @KJ I have added a screenshot of the PDF content and of the content read with TextAbsorber – Kamran Shahid Mar 16 '23 at 17:48
  • @KJ due to security policy I can't share the PDF file. Any hint about which property/method I should look into after loading the Document? The data is on page 1, where TextAbsorber gives me the textual data I need. – Kamran Shahid Mar 16 '23 at 19:50

1 Answer


When an MS Word document is processed with Acrobat's forms generation/export, the result should be something like this "perfect" example from Adobe (slightly modified to show that radio buttons are exactly the same as checkboxes, except for their shape). The red outline signals that the entry is mandatory (required); I have not scripted that requirement, it is just the appearance of such a field that differs.

Note that the page text and the tooltip shown when hovering over (No) do not have to agree; the tip could say "Rejected" as its hint.

Also note that the field names on the left do not need to reflect the body text in any way. They do not contain spaces, and a label could simply be R101, to indicate radio buttons, page 10, group 1, or any other choice such as doUnionAgree.

enter image description here

So when the data is imported and exported, the fields carry that label:

enter image description here

However, at this stage there is no data, and the normal way to build an importable version is to populate the form with dummy values.

enter image description here

Now, of course, we can see where to read or write /Yes or /No for import or export thousands of times, e.g. once per user's PDF AcroForm. Also note that the order does not matter: closeOffice on the left comes after byAgreement on the right, and the two could be at opposite ends of the fields data file.

enter image description here

How can that data be imported or exported for analysis? It is simplest to export it with a single command.

pdftk create-forms-sample-radio.pdf generate_fdf output Template.fdf verbose
Command Line Data is valid.

Input PDF Filenames & Passwords in Order
( <filename>[, <password>] )
   create-forms-sample-radio.pdf

The operation to be performed:
   generate_fdf - Generate a dummy FDF file from a PDF.

The output file will be named:
   Template.fdf

Output PDF encryption settings:
   Output PDF will not be encrypted.

No compression or uncompression being performed on output.

Creating Output ...

and the blank template may end like this (output from that run):

/T (total)
>> 
<<
/V /
/T (byAgreement)
>> 
<<
/V ()
/T (visitorsPurpose)
>>]
>>
>>
endobj 
trailer

<<
/Root 1 0 R
>>
%%EOF
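
For analysis, the /T and /V pairs in such a file can be pulled out with a few lines of script. The following is only a minimal sketch in Python, assuming the flat, uncompressed layout shown above (one small dictionary per field, no nested /Kids); the function name read_fdf_fields is hypothetical and not part of PDFtk or any library.

```python
import re

# Innermost << ... >> dictionaries only (assumes no nested dictionaries per field).
FIELD_DICT = re.compile(r"<<([^<>]*)>>")
# /T (name): the field name as a literal string.
NAME = re.compile(r"/T\s*\(([^)]*)\)")
# /V /Name for buttons, or /V (text) for text fields.
VALUE = re.compile(r"/V\s*(?:/([^\s/<>()\[\]]*)|\(([^)]*)\))")


def read_fdf_fields(fdf_text):
    """Map each field name to its value; an empty string means 'not set'."""
    fields = {}
    for body in FIELD_DICT.findall(fdf_text):
        name = NAME.search(body)
        if name is None:
            continue  # e.g. the trailer dictionary, which has no /T
        value = VALUE.search(body)
        if value is None:
            fields[name.group(1)] = None
        elif value.group(1) is not None:
            fields[name.group(1)] = value.group(1)  # button state such as Yes
        else:
            fields[name.group(1)] = value.group(2)  # text field content
    return fields


if __name__ == "__main__":
    sample = "<<\n/V /\n/T (byAgreement)\n>>\n<<\n/T (closeOffice)\n/V /Yes\n>>"
    for field, state in read_fdf_fields(sample).items():
        print(field, repr(state))
```

Because each dictionary is searched as a whole, the sketch does not care whether /V comes before or after /T.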

Clearly the order of /V and /T does not matter (like much else in a PDF). What is important is that byAgreement has /V / ready for appending Yes or No, which is perfect for edits but not for your case of finding out which field is which. However, if we edit that single entry (as if for bulk entry) to /Yes ...

<<
/V /Yes
/T (byAgreement)
>> 

and open the FDF with Acrobat or similar, we see instantly which field may be R101 or byAgreement or otherwise.

enter image description here
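
The manual edit to /Yes can also be scripted when many templates need probing. This is a rough sketch in Python under the same assumption of a flat FDF with one small dictionary per field; set_button_value is a hypothetical helper name, and the patched text would still have to be saved and opened in a viewer to see which button lights up.

```python
import re


def set_button_value(fdf_text, field_name, state):
    """Return fdf_text with the button /V entry of field_name set to /state."""
    name_pattern = re.compile(r"/T\s*\(" + re.escape(field_name) + r"\)")

    def patch(match):
        body = match.group(1)
        if not name_pattern.search(body):
            return match.group(0)  # some other field: leave untouched
        # Replace "/V /" or "/V /Old" with "/V /<state>", first occurrence only.
        new_body = re.sub(
            r"/V\s*/[^\s/<>()\[\]]*",
            lambda m: "/V /" + state,
            body,
            count=1,
        )
        return "<<" + new_body + ">>"

    # Visit every innermost << ... >> dictionary in the file.
    return re.sub(r"<<([^<>]*)>>", patch, fdf_text)


if __name__ == "__main__":
    template = "<<\n/V /\n/T (byAgreement)\n>>\n<<\n/V ()\n/T (visitorsPurpose)\n>>"
    print(set_button_value(template, "byAgreement", "Yes"))
```

Text fields, which use /V (...), are deliberately left alone by the pattern.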

So, for your answer:

Which field is yes or no? Simply blank all entries except for the unknown button. Export the fields with all but that one blanked using PDFtk (https://www.pdflabs.com/), and the entry that says yes or no is the target of your question.

<<
/V /No
/T (byAgreement)
>> 

All of the above assumes, as first stated in your question, that you have a Word-generated PDF form. However, the button field values do NOT have to be /Yes and /No (they can be /Oui or /Nein :-)). Here is a form very much like the one above and yours, where the author thought it would be "cool" to use that for yes:

18 0 obj
<</Ff 49152/FT/Btn/Kids[32 0 R 33 0 R]/T(think)/V/Cool>>
endobj
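
An object like this can be read directly when the PDF is stored uncompressed and unencrypted. The sketch below, in Python, is only an illustration under that assumption (secured files or object streams would need a real PDF library first); button_field_state is a hypothetical name. The bit masks follow the PDF field-flag convention, where 49152 = 32768 (Radio) + 16384 (NoToggleToOff).

```python
import re

RADIO = 1 << 15             # field flag bit 16: the button group is a radio group
NO_TOGGLE_TO_OFF = 1 << 14  # field flag bit 15: one button must stay selected


def button_field_state(obj_text):
    """Return name, value and radio flag of a /FT /Btn dictionary, else None."""
    if not re.search(r"/FT\s*/Btn\b", obj_text):
        return None  # not a button field
    name = re.search(r"/T\s*\(([^)]*)\)", obj_text)
    value = re.search(r"/V\s*/([^\s/<>()\[\]]+)", obj_text)
    flags = re.search(r"/Ff\s*(\d+)", obj_text)
    ff = int(flags.group(1)) if flags else 0
    return {
        "name": name.group(1) if name else None,
        "value": value.group(1) if value else None,  # e.g. Yes, No, Cool
        "radio": bool(ff & RADIO),
    }


if __name__ == "__main__":
    obj = "<</Ff 49152/FT/Btn/Kids[32 0 R 33 0 R]/T(think)/V/Cool>>"
    print(button_field_state(obj))
```

The selected state is whatever name follows /V, so the caller should compare it against the states the form actually defines rather than against a fixed Yes or No.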
K J
  • Wow, thanks a lot @KJ for such a detailed post. I only have the PDF file, Aspose.PDF and my C# code to extract that data. One thing I found out is that my PDFs are secured as well (but I am able to read the data using TextAbsorber) – Kamran Shahid Mar 17 '23 at 17:39
  • I only have the PDF file, and the above reference is for XML to FDF conversion. Do I need to convert my PDF to some XML first and then that XML to FDF? I have no experience with PDF to XML or FDF conversion at the moment – Kamran Shahid Mar 19 '23 at 11:22