The objective is to extract data from invoices in PDF format.
PDF data format: selectable text (not scanned images) consisting of lines of text, name-value pairs, and tables of varying lengths.
Invoice data includes: invoice_no, invoice_date, order_no, and order_date as name-value pairs; item details (item_code, name, rate, quantity, discount, price, etc.) in table format; and final_taxation_info and gross_total.
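To make the target structure concrete, the extracted record could look roughly like the sketch below. It is written in Python only for brevity; the field types (dates, decimals, free-text tax info) and the `items_total` helper are assumptions and are not taken from any actual invoice.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List


@dataclass
class InvoiceItem:
    # One row of the items table; the column types are assumed.
    item_code: str
    name: str
    rate: Decimal
    quantity: Decimal
    discount: Decimal
    price: Decimal


@dataclass
class Invoice:
    # Header fields that appear as name-value pairs.
    invoice_no: str
    invoice_date: date
    order_no: str
    order_date: date
    # The table can hold any number of rows.
    items: List[InvoiceItem] = field(default_factory=list)
    # Kept as free text here because its layout differs per format (assumption).
    final_taxation_info: str = ""
    gross_total: Decimal = Decimal("0")

    def items_total(self) -> Decimal:
        # Hypothetical helper: sum of row prices, useful as a sanity check
        # on an extraction before the record goes into the database.
        return sum((item.price for item in self.items), Decimal("0"))
```

A record like this maps naturally to one header table and one line-items table in the database.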
Inputs: A batch of invoices is received weekly, in a mix of similar and distinct formats.
Outputs: Extracted invoice data, inserted into a database.
Approaches tried or considered so far:
- Writing a custom algorithm in C# using libraries such as iText7, PDFix, GemBox.Pdf, GroupDocs.Parser, Bytescout.PDFExtractor, Sautinsoft.pdffocus, Spire.PDF, etc. Downside: I have to modify the algorithm or write a new one for each new PDF format.
- Data extraction tools such as SmallPDF, Convertapi.com, cometdocs.com, groupdocs.app. Downside: No control over the extraction algorithm.
- Template-guided extraction with tools such as Pdf_Element, Tabula, Docparser, iText pdf2Data. Downside: Fails when the table length varies.
- AI/ML-based extraction and automation tools/services such as AWS Textract, UiPath, KlearStack, IQ Bot. I have not tried this last approach in depth, only scratched the surface. Downside: Not sure, but the learning curve or cost could be stumbling blocks.
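For illustration, here is a minimal sketch of how varying table lengths could be handled without a fixed template, assuming the PDF text has already been extracted into lines by one of the libraries above. Instead of relying on row positions, it starts the table at a header line and ends it at the first non-empty line that does not parse as an item row. The labels ("Invoice No", "Item Code", "Gross Total"), the column order, and the regexes are all hypothetical, and Python is used only for brevity.

```python
import re
from decimal import Decimal

NUM = r"\d+(?:\.\d+)?"

# Name-value pairs such as "Invoice No: INV-001" (labels are assumed).
FIELD_RE = re.compile(
    r"^\s*(Invoice No|Invoice Date|Order No|Order Date)\s*:\s*(.+?)\s*$",
    re.IGNORECASE,
)
# An item row: code, free-text name, then four numeric columns (assumed order).
ROW_RE = re.compile(
    r"^\s*(?P<item_code>\S+)\s+(?P<name>.+?)\s+(?P<rate>" + NUM + r")\s+"
    r"(?P<quantity>" + NUM + r")\s+(?P<discount>" + NUM + r")\s+"
    r"(?P<price>" + NUM + r")\s*$"
)
TOTAL_RE = re.compile(r"^\s*Gross Total\s*:?\s*(" + NUM + r")\s*$", re.IGNORECASE)


def parse_invoice_text(lines):
    """Parse already-extracted text lines into header fields, items, and total."""
    fields, items, gross_total = {}, [], None
    in_table = False
    for line in lines:
        match = FIELD_RE.match(line)
        if match:
            key = match.group(1).lower().replace(" ", "_")
            fields[key] = match.group(2)
            continue
        match = TOTAL_RE.match(line)
        if match:
            gross_total = Decimal(match.group(1))
            in_table = False
            continue
        if line.strip().lower().startswith("item code"):
            in_table = True  # header row found; item rows follow
            continue
        if in_table:
            match = ROW_RE.match(line)
            if match:
                row = match.groupdict()
                for column in ("rate", "quantity", "discount", "price"):
                    row[column] = Decimal(row[column])
                items.append(row)
            elif line.strip():
                in_table = False  # first non-row line ends the table
    return {"fields": fields, "items": items, "gross_total": gross_total}
```

This only covers formats whose labels and column order match the patterns, so each distinct layout would still need its own set of patterns; it does not remove the per-format effort described in the first approach.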
Considering this whole scenario, can anybody suggest which approach I should follow?