Read data from a PDF document that does not have an XFA-form

Question

I use iText to read a PDF document containing an XFA form. I convert it to XML, read the data from the XML and insert it into a database. But if the PDF does not contain an XFA form, how can I efficiently read data from it?

score 0 · Accepted Answer · answered Aug 09 '17 at 09:06

It depends on your expectations.

You can use text extraction to retrieve all the text on a certain page. How you then process the text is up to you (e.g. with regular expressions).
Alternatively, you can use pdf2Data, an iText 7 add-on that allows you to match documents against templates. pdf2Data seems like a good fit, since it produces XML files as its output.

More information on pdf2Data can be found here: http://itextpdf.com/itext7/pdf2Data
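
For the text-extraction route, the regular-expression step could look roughly like the sketch below. It uses only the Java standard library and assumes the page text has already been extracted into a `String` (for example with iText's text extraction). The class name `FieldExtractor`, the method `parseFields` and the "Label: value" line layout are hypothetical, chosen only for illustration; a real document will most likely need its own patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractor {
    // Assumption: the extracted text holds one "Label: value" pair per line.
    // Group 1 is the label (no colon allowed), group 2 is the rest of the line.
    private static final Pattern FIELD = Pattern.compile(
            "(?m)^[ \\t]*([^:\\r\\n]+?)[ \\t]*:[ \\t]*(.+?)[ \\t]*$");

    // Maps each label found in the page text to its value, in document order.
    static Map<String, String> parseFields(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(pageText);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF page (made-up sample data).
        String pageText = "Invoice No: 12345\nCustomer: Jane Doe\nTotal: 99.50 EUR\n";
        parseFields(pageText).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The resulting map can then be written to the database in the same way as the values read from the XFA XML. Lines without a colon are simply skipped, so free text on the page does not end up in the result.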

Joris Schellekens

Text extraction is not very helpful, as the values cannot be mapped. – hrishi Aug 09 '17 at 09:39
It depends. You can use TextExtractionStrategies that take a specific location (Rectangle) as their input. This allows you a more targeted approach. Once you have the text at a certain (roughly defined) position, you can use regular expressions to further refine the result. – Joris Schellekens Aug 09 '17 at 09:40
OK, thanks, I will check it. I am not very familiar with PDFs. I use iText Java code to read XFA forms. Can you share a link to sample code that shows how to use it programmatically? – hrishi Aug 09 '17 at 09:49
Sample code, both for pdf2Data and for text extraction, can be found on the website. Also, upvote my answer (or mark it as accepted) if it helped you. – Joris Schellekens Aug 09 '17 at 09:51