Errors with Training and Testing datasets for Document AI: INVALID_DATASET

Question

I am using Google's Document AI to parse through some invoice PDF's and pull out relevant info like account number, invoice number, invoice date, billing period, etc. I created a custom processor and added and disabled some of the labels in the schema. After uploading and labeling my training and testing sets of documents, I clicked the Uptrain New Version button and began training the processor. But errors were thrown for both the training and test sets of data and the training could not finish.

The trainingDatasetValidation.datasetErrors listed multiple errors with the reason listed as "INVALID_DATASET". For some of the errors, the metadata listed specific labels and a count of how many times the label is included in the training set and how many times it's required. The count given in the error message is different than the actual amount showing in the TRAIN section for the processor.

{
          "code": 3,
          "message": "Invalid dataset.",
          "details": [
            {
              "@type": "type.googleapis.com/google.rpc.ErrorInfo",
              "reason": "INVALID_DATASET",
              "domain": "documentai.googleapis.com",
              "metadata": {
                "num_documents_with_annotation": "8",
                "num_documents_required": "10",
                "annotation_name": "charges"
              }
            }
          ]
        },

Other errors in the trainingDatasetValidation.datasetErrors included something similar, a count of the labels in the training set and a required amount. But these other errors used different terminology and included a constraint parameter with the value "text_anchor". I'm wondering if that's any indication of the reason for the error.

{
          "code": 3,
          "message": "Invalid dataset.",
          "details": [
            {
              "@type": "type.googleapis.com/google.rpc.ErrorInfo",
              "reason": "INVALID_DATASET",
              "domain": "documentai.googleapis.com",
              "metadata": {
                "entity_type_path": "account_number",
                "constraint": "text_anchor",
                "entities_count": "8",
                "min_entities_count": "10"
              }
            }
          ]
        }

The testDatasetValidation.datasetErrors included two of the same constraint errors above.

I have been unable to find much documentation online or questions here on Stackoverflow that address these types of errors in my training and testing sets.

score 0 · Answer 1 · answered May 02 '23 at 22:47

This looks to be related to an issue that is in the process of being resolved.

Note: You do need to include a minimum number of training/test documents and annotations for the processor to be able to train.

https://cloud.google.com/document-ai/quotas#release_limitations_for

I've seen this error before when some sections of the document are highlighted/labeled but don't have any detected text in them. (Which is why the error is showing text_anchor, that's where the entity is linked to the document text.)

I would recommend double-checking the labeled data to ensure that every entity label has associated text.

Thank you Holt. I was hoping you'd find this post and offer some advice. I'll double check the labeled data and make sure it includes the desired text. Great job on the YouTube videos! — H.I. McDunnough, May 03 '23 at 13:34

Errors with Training and Testing datasets for Document AI: INVALID_DATASET

1 Answers1