The objective is to extract data from invoices in PDF format.

PDF data format: selectable text (not scanned images), consisting of lines of text, name-value pairs, and tables of varying lengths.

Invoice data includes: invoice_no, invoice_date, order_no, and order_date as name-value pairs; item details (item_code, name, rate, quantity, discount, price, etc.) in table format; and final_taxation_info and gross_total.

Inputs: Invoices are received in bulk every week, in both similar and distinct formats.

Outputs: Extracted invoice data, inserted into a database.
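
To make the target concrete, here is a minimal sketch of the output shape, written in Python for brevity (the same structure maps onto C# classes). The class names, table names, and the use of SQLite are assumptions for illustration only; final_taxation_info is left out because its structure is not specified above.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List


@dataclass
class InvoiceItem:
    item_code: str
    name: str
    rate: float
    quantity: float
    discount: float
    price: float


@dataclass
class Invoice:
    invoice_no: str
    invoice_date: str
    order_no: str
    order_date: str
    gross_total: float
    # Tables vary in length, so items are held in a list.
    items: List[InvoiceItem] = field(default_factory=list)


# Hypothetical schema: one header row per invoice, one row per table line.
SCHEMA = """
CREATE TABLE IF NOT EXISTS invoice (
    invoice_no TEXT PRIMARY KEY, invoice_date TEXT,
    order_no TEXT, order_date TEXT, gross_total REAL);
CREATE TABLE IF NOT EXISTS invoice_item (
    invoice_no TEXT, item_code TEXT, name TEXT,
    rate REAL, quantity REAL, discount REAL, price REAL);
"""


def save_invoice(conn: sqlite3.Connection, inv: Invoice) -> None:
    """Insert one invoice and all of its items in a single transaction."""
    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO invoice VALUES (?, ?, ?, ?, ?)",
            (inv.invoice_no, inv.invoice_date, inv.order_no,
             inv.order_date, inv.gross_total),
        )
        conn.executemany(
            "INSERT INTO invoice_item VALUES (?, ?, ?, ?, ?, ?, ?)",
            [(inv.invoice_no, i.item_code, i.name, i.rate,
              i.quantity, i.discount, i.price) for i in inv.items],
        )
```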

Approaches tried or considered so far:

  1. Writing a custom algorithm in C# using libraries such as iText7, PDFix, GemBox.Pdf, GroupDocs.Parser, Bytescout.PDFExtractor, Sautinsoft.pdffocus, Spire.PDF, etc. Downside: I have to modify the algorithm or write a new one for each new PDF format.
  2. Data extraction tools such as SmallPDF, Convertapi.com, cometdocs.com, groupdocs.app. Downside: No control over the extraction algorithm.
  3. Template-guided extraction, such as Pdf_Element, Tabula, Docparser, iText pdf2Data. Downside: Fails when the table length varies.
  4. AI/ML-based extraction and automation tools/services such as AWS Textract, UiPath, KlearStack, IQ Bot (I have not tried this last approach in depth, just scratched the surface). Downside: I am not sure, but the learning curve or cost could be stumbling blocks.

Considering this whole scenario, can anybody suggest which approach I should follow?

1 Answer

We used approach 1 at our org. You have to come up with a pipeline of PDF -> free text -> formulated expressions to extract the data. AI tools would work only if you have a large set of documents that you can "train" the AI with.

http://www.puntechsolutions.com.au/smartdt.html
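
As a rough illustration of the "free text -> formulated expressions" step, here is a minimal sketch in Python; the same regular expressions should port to C# or Java with little change. It assumes the PDF text has already been extracted by a library, and the label wording ("Invoice No", "Order Date", and so on) is a guess at what the invoices contain, not something taken from real samples.

```python
import re

# Hypothetical label patterns; adjust them to the wording used on the actual invoices.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No|Number)\b\.?\s*[:#]?\s*(\S+)", re.IGNORECASE),
    "invoice_date": re.compile(r"Invoice\s*Date\s*:?\s*(\S+)", re.IGNORECASE),
    "order_no": re.compile(r"Order\s*(?:No|Number)\b\.?\s*[:#]?\s*(\S+)", re.IGNORECASE),
    "order_date": re.compile(r"Order\s*Date\s*:?\s*(\S+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return the first match for each name-value field, or None if the label is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

A field that comes back as None is a useful signal that the invoice uses a layout the expressions do not cover yet.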
  • Thanks, @Jyotheendra. I have started off with approach 1, writing algorithms in C# that use the iText7 library to extract PDF data. It appears very tedious, but it also offers more control over the process. May I ask which platform and library you used for writing the extraction algorithm? – Amit Bhagat May 21 '20 at 14:37
  • Hi @AmitBhagat, we used PDFBox -> text -> algorithm, all in Java. The Java algorithm was our area of focus, and once we figured out a proper way to template and find text matches, it became easy. – Jyotheendra May 26 '20 at 02:49
  • Thanks for replying, @Jyotheendra. I am currently developing in C#, so I will check out PDFBox; it seems an assembly for .NET is available on NuGet. – Amit Bhagat May 28 '20 at 14:12
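
The "template and find text matches" idea from the comments might look something like the sketch below, again in Python for brevity. It is only one possible reading of that idea, not the commenter's actual implementation. The template (TEMPLATE_A), its column layout, and the marker lines are hypothetical. Rows are read from a header line until a terminator line, so the table can be any length.

```python
import re

# Hypothetical template for one invoice layout: where the table starts,
# where it ends, and what a single row looks like.
TEMPLATE_A = {
    "table_start": re.compile(r"^Item\s+Code\s+Name\s+Rate\s+Qty", re.IGNORECASE),
    "table_end": re.compile(r"^(?:Sub\s*total|Tax|Gross\s+Total)", re.IGNORECASE),
    "row": re.compile(
        r"^(?P<item_code>\S+)\s+(?P<name>.+?)\s+"
        r"(?P<rate>\d+(?:\.\d+)?)\s+(?P<quantity>\d+(?:\.\d+)?)\s+"
        r"(?P<price>\d+(?:\.\d+)?)$"
    ),
}


def extract_rows(text, template):
    """Collect table rows between the start and end markers, however many there are."""
    rows = []
    in_table = False
    for line in text.splitlines():
        line = line.strip()
        if not in_table:
            # Skip everything until the header line of the table is seen.
            if template["table_start"].search(line):
                in_table = True
            continue
        if template["table_end"].search(line):
            break  # totals section reached; the table is over
        match = template["row"].match(line)
        if match:
            rows.append(match.groupdict())  # values are kept as strings
    return rows
```

Each new invoice format would then need only a new template dictionary rather than a new algorithm, although formats whose columns wrap across lines would need more than a single row pattern.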