I am using pdfbox-1.8.12 to extract the XFA form data from PDF files. For most files I get the complete XFA, with no field values missing.
The trouble is with some files, such as error.pdf. In the extracted XFA many fields, for example CIN, have no value, yet when I open the file in a PDF viewer (Foxit or Acrobat) the field's value is displayed.
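import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDXFA;

public static byte[] getParsableXFAForm(File file) {
    if (file == null)
        return null;
    PDDocument doc = null;
    try {
        doc = PDDocument.load(file);
        PDDocumentCatalog catalog = doc.getDocumentCatalog();
        PDAcroForm acroForm = catalog.getAcroForm();
        // getAcroForm() and getXFA() return null when the PDF has no form or no XFA entry.
        PDXFA xfa = (acroForm == null) ? null : acroForm.getXFA();
        return (xfa == null) ? null : xfa.getBytes();
    } catch (IOException e) {
        // Happens when the file is corrupt or cannot be read.
        System.out.println("IOException: " + e.getMessage());
        return null;
    } finally {
        // Close the document even when an exception was thrown above.
        if (doc != null) {
            try { doc.close(); } catch (IOException ignored) { }
        }
    }
}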
The returned byte[] is then converted to a String.
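A minimal sketch of that conversion, assuming the XFA XML is UTF-8 encoded (the class name `XfaText` and the method `toXml` are made up for this illustration and are not part of PDFBox). Passing the charset explicitly avoids depending on the platform default encoding:

```java
import java.nio.charset.StandardCharsets;

public class XfaText {
    // Decodes the extracted XFA bytes as UTF-8 text.
    // Returns null when no XFA was extracted (null input).
    // Assumption: the XFA packets are UTF-8 encoded XML.
    public static String toXml(byte[] xfaBytes) {
        return xfaBytes == null ? null : new String(xfaBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in bytes; in the real flow these come from getParsableXFAForm(file).
        byte[] sample = "<data><CIN/></data>".getBytes(StandardCharsets.UTF_8);
        System.out.println(toXml(sample));
    }
}
```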
This is the XFA extracted from this file; if you search it for 'U72300DL1996PLC075672' (the CIN value), you will not find it.
This is a normal file, for which all fields are extracted.
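To compare the two files field by field instead of searching the text by hand, something like the following could be used. It is only a sketch: the class `XfaFieldLookup`, the method `fieldValue` and the assumption that the field is stored in an element literally named `CIN` are mine, and it uses the JDK's DOM parser only, not PDFBox.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XfaFieldLookup {
    // Returns the text of the first element named tagName, an empty string if
    // the element exists but holds no value, or null if no such element exists.
    // Assumption: the bytes form one well-formed XML document.
    public static String fieldValue(byte[] xfaBytes, String tagName) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xfaBytes));
            NodeList nodes = dom.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("XFA bytes are not parsable XML", e);
        }
    }
}
```

Calling `XfaFieldLookup.fieldValue(xfaBytes, "CIN")` on both files would show whether the element is missing altogether or present but empty in the problem file.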
Any ideas? I have tried everything I could think of. Since the viewers can display the value, it must be stored somewhere in the file, so I should be able to extract it as well.
EDIT: You will have to download the files; you might not be able to view them in the browser.