My goal is to extract information from several different types of invoices and transform it into a standard output. For now, all the invoices are in PDF format (original digital PDFs, not scans of printed documents), so I don't think I need OCR. In the future we may also support printed invoices, and then OCR will be needed. The backend technology is C#.
I've been studying several ways to extract content from a PDF. The best libraries I've tested so far are:
- pdf2data (IText) (paid)
- pdfsharp (free)
- ironpdf (paid)
Cloud services:
Cloud/Library services:
They are very different from each other.
For example, the cloud services from Amazon, Google, and Azure expose an API that returns the OCR result as JSON, whereas pdf2data from IText lets you create templates with several selector rules that extract specific pieces of information from the result. That makes the results much easier to interpret, and it also comes with visual tools that show how and where each piece of information was extracted. This simplifies the extraction work a lot, since I have no idea how to write simple extraction rules against the JSON results of a cloud OCR service.
My question: is there any library (C# if possible) that abstracts these extraction concepts and provides functionality such as:
- Boundary search
- Font Type
- Font Size
- Paragraph
- Line
- Prefix-Suffix Pattern
- Table (columns/rows)
- Key-Value (forms)
- etc,
from a JSON result? That way I could use a cloud service, for example Azure, with the same extraction functionality as IText. Otherwise, it will be too complex to extract information from many types of invoices.
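To make concrete what I mean by "extraction functionality over a JSON result", here is a minimal sketch of three of the selectors from the list (boundary search, line grouping, prefix-suffix pattern). It is written in Python only for brevity; the logic should port directly to C#. Everything in it is an assumption for illustration: the `{"words": [{"text", "x", "y", "w", "h"}]}` schema is a made-up normalized format, not the actual response of Azure, Amazon, or Google, and the function names `in_boundary`, `lines`, and `between` are hypothetical, not part of any library.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical normalized OCR output; real cloud responses use different schemas
# and would first need to be mapped into this shape.
SAMPLE = """
{"words": [
  {"text": "Invoice", "x": 10, "y": 10,  "w": 40, "h": 10},
  {"text": "No:",     "x": 55, "y": 10,  "w": 20, "h": 10},
  {"text": "INV-001", "x": 80, "y": 11,  "w": 45, "h": 10},
  {"text": "Total:",  "x": 10, "y": 200, "w": 35, "h": 10},
  {"text": "120.50",  "x": 50, "y": 200, "w": 40, "h": 10}
]}
"""


@dataclass
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def load_words(ocr_json):
    """Parse the assumed schema into Word objects."""
    return [Word(**item) for item in json.loads(ocr_json)["words"]]


def in_boundary(words, left, top, right, bottom):
    """Boundary search: keep words whose box lies fully inside the rectangle."""
    return [
        wd for wd in words
        if wd.x >= left and wd.y >= top
        and wd.x + wd.w <= right and wd.y + wd.h <= bottom
    ]


def lines(words, tolerance=3.0):
    """Line selector: group words with a similar top coordinate, left to right."""
    rows = []
    for wd in sorted(words, key=lambda v: (v.y, v.x)):
        if rows and abs(rows[-1][0].y - wd.y) <= tolerance:
            rows[-1].append(wd)
        else:
            rows.append([wd])
    return [" ".join(v.text for v in sorted(row, key=lambda v: v.x)) for row in rows]


def between(text, prefix, suffix=None):
    """Prefix-suffix pattern: text after `prefix`, up to `suffix` or end of text."""
    end = re.escape(suffix) if suffix else "$"
    match = re.search(re.escape(prefix) + r"\s*(.+?)\s*(?:" + end + ")", text)
    return match.group(1) if match else None


words = load_words(SAMPLE)
header_line = lines(in_boundary(words, 0, 0, 200, 50))[0]
footer_line = lines(in_boundary(words, 0, 150, 200, 250))[0]
invoice_no = between(header_line, "Invoice No:")
total = between(footer_line, "Total:")
print(invoice_no, total)
```

A template per invoice type would then just be a list of such selector calls with their rectangles and prefixes, which is roughly what I understand the IText templates to be. What I am looking for is an existing library that already provides this layer (plus tables, key-value pairs, font rules, and so on) on top of the cloud JSON.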