PDF data extraction

Question

Is there a way for me to take a scanned PDF image and extract data from the image by highlighting the fields that are needed? We scan thousands of PDF images of real estate deeds daily and would like to be able to automate the data entry process. The problem that we are facing is that no two deeds are the same.

You're new here, and your question is missing your code and any errors you encountered. Please read How to Ask http://stackoverflow.com/help/how-to-ask and How to create a Minimal, Complete, and Verifiable example http://stackoverflow.com/help/mcve. — davejal, Nov 24 '15 at 02:22

score 0 · Answer 1 · answered Nov 24 '15 at 09:38

It has been said in the comments that Stack Overflow is mainly about programming issues.

Nevertheless, there are possibilities, depending on the actual documents and the volumes to be processed.

On the high end, there is a product called Teleform, originally developed by Cardiff, and now owned by HP, which is used to process paper forms; you may also look at the Business Process application Cardiff LiquidOffice, now HP LiquidOffice.

On the low end, I have developed a PDF-based application, running under Acrobat, which can take a scanned and OCR'd form and transfer the data to a specially prepared fillable form, from where the data can be exported to a database, for example. For more information, a demo and a quote, feel free to contact me in private.

If you want to develop something using Acrobat, you could also begin with an OCR'd document, then use the capabilities of the Redaction function (or the industrial-strength redaction tool Redax by Appligent) to find keywords, and then use the positional information of those keywords to extract more data.
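To illustrate the keyword-plus-position idea outside of Acrobat, here is a minimal Python sketch. It is not the Acrobat or Redax API. It assumes your OCR step already gives you a list of words with bounding boxes; the `Word` class, the `value_right_of` function, the coordinates and the tolerance values are all hypothetical and would need tuning for real deeds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One OCR'd word with its bounding box (hypothetical layout, top-left origin)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def value_right_of(words: List[Word], keyword: str,
                   max_gap: float = 200.0, line_tol: float = 5.0) -> Optional[str]:
    """Return the words on the same line as `keyword`, to its right.

    max_gap and line_tol are assumed tolerances in page units.
    """
    # Locate the first word matching the keyword (case-insensitive, trailing colon ignored).
    anchor = next(
        (w for w in words if w.text.rstrip(":").lower() == keyword.lower()), None
    )
    if anchor is None:
        return None

    anchor_mid = (anchor.y0 + anchor.y1) / 2
    # Keep words whose vertical centre is close to the keyword's centre and
    # whose left edge lies within max_gap of the keyword's right edge.
    same_line = [
        w for w in words
        if w is not anchor
        and abs((w.y0 + w.y1) / 2 - anchor_mid) <= line_tol
        and 0 <= w.x0 - anchor.x1 <= max_gap
    ]
    same_line.sort(key=lambda w: w.x0)  # restore reading order
    return " ".join(w.text for w in same_line) or None


# Made-up sample data standing in for real OCR output.
sample_words = [
    Word("Grantor:", 50, 100, 110, 112),
    Word("John", 120, 100, 160, 112),
    Word("Smith", 165, 101, 210, 113),
    Word("Grantee:", 50, 130, 112, 142),
    Word("Jane", 120, 130, 158, 142),
    Word("Doe", 163, 130, 195, 142),
]

print(value_right_of(sample_words, "Grantor"))
print(value_right_of(sample_words, "Grantee"))
```

Because no two deeds are laid out the same, anchoring on keywords such as "Grantor" rather than on fixed page coordinates is what makes this kind of approach workable. Multi-line values and labels that sit above their values would need extra rules.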
