I am training a new document processor with Google's Document AI. I have 16 training documents and 10 testing documents, which is comfortably above the minimums documented by Google. However, when I attempt to train the processor, I keep getting errors that either cite input types that don't exist or claim that I don't have enough annotated labels, even though I have verified that every document I provided is labeled and meets the defined minimums.

As I have seen on Stack Overflow, the errors that people report are very ambiguous, and I am seeing the same thing. I have tried training the processor 4 different times and got the same errors each time. Any help would be appreciated.

Incorrect input types

This is a sample of the error I am getting for the input type. The invalid document error cites an invalid num_fields value. However, I don't have any num_fields in my schema.

"documentErrors": [
        {
          "code": 3,
          "message": "Invalid document.",
          "details": [
            {
              "@type": "type.googleapis.com/google.rpc.ErrorInfo",
              "reason": "INVALID_DOCUMENT",
              "domain": "documentai.googleapis.com",
              "metadata": {
                "annotation_name": "product_inventory_result/reorder_point",
                "field_name": "entities.text_anchor.text_segments",
                "num_fields": "0",
                "num_fields_needed": "1",
                "document": "3ef767351034410f.json"
              }
            }
          ]
        }
]

Invalid Dataset Errors

This error says that I only have 8 documents with annotations, which is incorrect. As I said before, I have verified that I have 16 training documents and 10 testing documents.

"datasetErrors": [
        {
          "code": 3,
          "message": "Invalid dataset.",
          "details": [
            {
              "@type": "type.googleapis.com/google.rpc.ErrorInfo",
              "reason": "INVALID_DATASET",
              "domain": "documentai.googleapis.com",
              "metadata": {
                "num_documents_with_annotation": "8",
                "num_documents_required": "10",
                "annotation_name": "DOCUMENTS_WITH_ENTITIES"
              }
            }
          ]
        }
]
Doug Niccum

1 Answer

The issue seems to be that the dataset has several documents with empty fields for product_inventory_result/reorder_point (and possibly other fields). The entities.text_anchor.text_segments value is empty, meaning that a bounding box was labeled but no text was found inside it. This is also the cause of the second error, INVALID_DATASET, because the dataset then doesn't have enough valid documents.
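
To find every affected label before retrying training, something like the following might help. It is only a rough sketch, not an official tool: it assumes the labeled documents have been exported as Document JSON files into a local folder, that child labels are nested under a parent entity's properties list, and that field names may appear in either camelCase or snake_case. The helper names (find_empty_anchors, scan_directory) are made up for this example.

```python
import json
from pathlib import Path


def _get(obj, snake, camel):
    """Read a field that may be serialized in camelCase or snake_case."""
    return obj.get(camel, obj.get(snake))


def find_empty_anchors(entity, prefix=""):
    """Yield label paths of entities whose text anchor has no text segments.

    Assumption: nested labels such as
    product_inventory_result/reorder_point are stored as child entities
    in the parent's "properties" list.
    """
    name = prefix + entity.get("type", "<untyped>")
    anchor = _get(entity, "text_anchor", "textAnchor") or {}
    segments = _get(anchor, "text_segments", "textSegments") or []
    if not segments:
        yield name
    for child in entity.get("properties") or []:
        yield from find_empty_anchors(child, prefix=name + "/")


def scan_directory(directory):
    """Map each JSON file name to the labels that have empty text segments."""
    report = {}
    for path in sorted(Path(directory).glob("*.json")):
        document = json.loads(path.read_text(encoding="utf-8"))
        empty = []
        for entity in document.get("entities") or []:
            empty.extend(find_empty_anchors(entity))
        if empty:
            report[path.name] = empty
    return report
```

Calling scan_directory("exported_docs") (a hypothetical folder name) returns a dictionary of file names and the label paths to re-check in the labeling UI. Parent entities without their own anchor would also be listed, so treat the output as a list of candidates rather than a definitive set of errors.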

Holt Skinner
  • Ok, I think I found the culprit, and it appears to happen with "0" values. I had to manually assign the value for the annotated bounding box as it was not automatically picked up. But apparently the training process doesn't like how I did it. [I recorded a quick GIF](https://share.cleanshot.com/R3r9kdW5TxhG1MtqybHn) so you can see what I am talking about – Doug Niccum May 18 '23 at 15:45
  • Oh, that's a very interesting finding. I'll report this to the product team in case this is a bug. – Holt Skinner May 18 '23 at 16:08
  • Did the product team report back any issues? Unfortunately this could very well be a blocker for us to use the platform. – Doug Niccum May 22 '23 at 16:00