In my use case, when batch processing invoices, there's a chance that multiple invoices might be included in the same file.

It appears that each file processed is treated as a single invoice. Is there a way to get an output with the content of each invoice within a specific file?

I tried looking at the Splitter processor but it appears to just give metadata on each split page.

  • It's a one-to-one process. You have to split your invoices beforehand and then submit them to Document AI. Alternatively, you can split the data extracted by Document AI afterwards. Whether you split before or after depends on which is easier for you. – guillaume blaquiere Apr 03 '23 at 11:53
  • @guillaumeblaquiere How would you do the split after the processing? As I understand it, if you upload multiple invoices in a single PDF, there'd only be one returned value for each type (ex. "supplier_email"). I think it'd be able to split the line items after the fact though. – user2132770 Apr 03 '23 at 13:03
  • Can you iterate over the pages? – guillaume blaquiere Apr 03 '23 at 13:28
  • The "pages" section on the response just has metadata. The "entities" section only has one value per "type" (all the "line_items" for all the separate invoices show up however) – user2132770 Apr 03 '23 at 13:44

1 Answer

You will need to split the source file so that each resulting file contains a single invoice; then you can send those files to Document AI for batch processing.

You can use the Procurement Document Splitter & Classifier processor to identify split points. (See the documentation on handling the processing response.)

Then you can use the identified split points to create new PDFs with one invoice per file. Many PDF libraries can do this, and the Document AI Toolbox SDK has a built-in function for splitting a PDF file after it has been processed by a Splitter/Classifier processor.
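As a rough illustration of the "identify split points" step, here is a minimal sketch that reads the page ranges out of a splitter response that has been loaded as a plain dict from its JSON form. It assumes each sub-document appears as an entity whose `pageAnchor.pageRefs` list holds zero-based `page` values (possibly encoded as strings, and possibly omitted for page 0). The helper name `split_points` and the sample type name `invoice_statement` are only illustrative, so check them against your actual response.

```python
from collections import namedtuple

# One sub-document found by the splitter: its type and its zero-based page indexes.
Split = namedtuple("Split", ["doc_type", "pages"])


def split_points(document: dict) -> list:
    """Collect (type, pages) for every entity in a splitter response dict.

    Assumption: the dict follows the JSON form of a Document, where each
    entity carries pageAnchor.pageRefs and "page" may be a string or absent.
    """
    splits = []
    for entity in document.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # A missing "page" key is treated as page 0.
        pages = sorted(int(ref.get("page", 0)) for ref in refs)
        if pages:
            splits.append(Split(entity.get("type", ""), pages))
    # Order the sub-documents by their first page.
    return sorted(splits, key=lambda s: s.pages[0])


# Hypothetical response: a 3-page file containing two invoices.
response = {
    "entities": [
        {"type": "invoice_statement", "pageAnchor": {"pageRefs": [{"page": "2"}]}},
        {"type": "invoice_statement", "pageAnchor": {"pageRefs": [{}, {"page": "1"}]}},
    ]
}

for split in split_points(response):
    print(split.doc_type, split.pages)
```

Each returned page list can then be handed to whichever PDF library you prefer to write one file per invoice.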

Holt Skinner
  • For the splitter pricing, does "per document" mean that if the splitter splits a file into 7 different documents, the cost would be 7 * (per-document price)? – user2132770 Apr 03 '23 at 16:59
  • Based on the information here: https://cloud.google.com/document-ai/pricing#procurement It means that pricing is based on each "sub document" that is classified in a file. The [footnote](https://cloud.google.com/document-ai/pricing#footnote3) also says that if a sub document is classified as `other`, then you are not billed. – Holt Skinner Apr 03 '23 at 21:26
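To make the billing rule from the comment above concrete, here is a small sketch that counts the billable sub-documents in one file, assuming billing is per classified sub-document and that anything classified as `other` is free. The helper name `billable_count` and the type name `invoice_statement` are hypothetical, and no actual price is assumed; see the pricing page for rates.

```python
def billable_count(doc_types):
    """Count sub-documents that would be billed, skipping those typed "other"."""
    return sum(1 for doc_type in doc_types if doc_type != "other")


# Hypothetical file split into 7 sub-documents, one of which is classified as "other".
types_in_file = ["invoice_statement"] * 6 + ["other"]
print(billable_count(types_in_file))
```

Under these assumptions the file would be billed as 6 documents, not 7.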