
I'm experimenting with Google Document AI. Documentation from Google and other sources often states that Document AI can classify documents, not just extract data by label. However, I can't see how to do that.

Does anybody have any ideas on how to do that?

2 Answers


You can perform document classification with what are called Specialized Processors.

There is a codelab that explains how to work with those specialized processors (including document classification).

Another way to classify documents is Vertex AI AutoML image classification: you create a dataset of document images (i.e. scanned documents) and train a model that takes a new document image and predicts whether it is document type 1, type 2, type 3, etc.
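
As a rough illustration of where the document type shows up: with a classifier or splitter processor, the predictions are returned as `entities` on the Document, each with a `type` and a `confidence` (splitters also attach a `pageAnchor`). The sketch below assumes you have the response as JSON with the camelCase field names; the function name and the sample labels (`invoice`, `receipt`) are made up for the example.

```python
from __future__ import annotations

from typing import Any


def classification_results(document: dict[str, Any]) -> list[dict[str, Any]]:
    """List predicted document types from a Document AI JSON response.

    Assumes `document` is the JSON form of a Document returned by a
    classifier or splitter processor (hypothetical helper, not a library API).
    """
    results = []
    for entity in document.get("entities", []):
        page_refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # JSON encodes page indices as strings and may omit the field for page 0.
        pages = [int(ref.get("page", 0)) for ref in page_refs]
        results.append(
            {
                "type": entity.get("type", ""),
                "confidence": float(entity.get("confidence", 0.0)),
                "pages": pages,
            }
        )
    # Most confident prediction first.
    return sorted(results, key=lambda r: r["confidence"], reverse=True)


# Made-up sample response; real labels depend on the processor you use.
sample_document = {
    "entities": [
        {"type": "receipt", "confidence": 0.12, "pageAnchor": {"pageRefs": [{}]}},
        {
            "type": "invoice",
            "confidence": 0.88,
            "pageAnchor": {"pageRefs": [{"page": "0"}, {"page": "1"}]},
        },
    ]
}

for result in classification_results(sample_document):
    print(result["type"], result["confidence"], result["pages"])
```

If `entities` is empty, the processor you called is most likely an extractor or OCR processor rather than a classifier.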

Luciano Martins
  • I saw this video and it's still unclear how to get it to work. I don't see any mentions of a document type in the resulting JSON that I get from Document AI. – Vladimir Mischenko Feb 22 '23 at 17:23
  • The steps to get it working are in the codelab I mention in the answer. – Luciano Martins Feb 22 '23 at 17:42
  • Here is more information on handling the Document object response for Splitting/Classification: https://cloud.google.com/document-ai/docs/handle-response#splitting You have to use a processor that performs classification, such as the Procurement Splitter/Classifier or Lending Splitter/Classifier. https://cloud.google.com/document-ai/docs/processors-list#processor_procurement-document-splitter https://cloud.google.com/document-ai/docs/processors-list#processor_lending-splitter-classifier – Holt Skinner Feb 22 '23 at 17:57
  • Do I understand correctly that a custom processor cannot classify documents? – Vladimir Mischenko Feb 22 '23 at 18:33
  • A Custom Document Extractor cannot classify documents; it can only extract entities. Refer to the release notes for updates on future Custom Processors that can classify documents. https://cloud.google.com/document-ai/docs/release-notes – Holt Skinner Feb 22 '23 at 22:11

Update on the product: Document AI now supports Custom Document Classifier processors in GA, which allow classification of custom document types. So you no longer need AutoML Image or Text Classification to classify documents that don't have a dedicated Specialized Splitter/Classifier.

Here are the instructions for creating one:

https://cloud.google.com/document-ai/docs/workbench/build-custom-classification-processor
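
Once a classifier processor is trained and deployed, calling it looks the same as calling any other processor. The following is a minimal sketch, not an official sample: it assumes the `google-cloud-documentai` Python client, takes the client as a parameter, and picks the entity with the highest confidence as the predicted type. The helper name `classify_document` and the project, location, and processor IDs in the comments are placeholders.

```python
def classify_document(client, processor_name, content, mime_type="application/pdf"):
    """Send one document to a classifier processor.

    Returns (label, confidence) for the most confident prediction,
    or None if the processor returned no entities.
    """
    # The client accepts the request as a plain dict as well as a ProcessRequest.
    result = client.process_document(
        request={
            "name": processor_name,
            "raw_document": {"content": content, "mime_type": mime_type},
        }
    )
    entities = result.document.entities
    if not entities:
        return None
    # The Python client exposes the entity's `type` field as `type_`.
    best = max(entities, key=lambda entity: entity.confidence)
    return best.type_, best.confidence


# Usage sketch (requires credentials and a deployed classifier; IDs are placeholders):
#
#   from google.cloud import documentai
#
#   client = documentai.DocumentProcessorServiceClient()
#   name = client.processor_path("my-project", "us", "my-processor-id")
#   with open("document.pdf", "rb") as f:
#       print(classify_document(client, name, f.read()))
```

Taking the top-scoring entity is only one possible policy; you may prefer to apply a confidence threshold and route low-confidence documents to manual review.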

Holt Skinner
  • Google Document AI custom classifier works really well on a good set of documents with a similar structure. I wish there were an API to upload labeled documents into a dataset. – Vladimir Mischenko Apr 11 '23 at 00:40
  • The API doesn't currently support uploading documents to a dataset as a standalone operation. However, the [`processorVersions.train()`](https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors.processorVersions/train) method does support providing a dataset to train a processor. This page has a code sample for REST/Python: https://cloud.google.com/document-ai/docs/workbench/train-processor#train_processor_version – Holt Skinner Apr 11 '23 at 15:17