I am using Google's Document AI to parse invoice PDFs and pull out relevant info such as account number, invoice number, invoice date, and billing period. I created a custom processor, added some labels to the schema, and disabled others. After uploading and labeling my training and test sets of documents, I clicked the Uptrain New Version button to start training the processor. But errors were thrown for both the training and test sets, and training could not finish.
The trainingDatasetValidation.datasetErrors field listed multiple errors with the reason "INVALID_DATASET". For some of the errors, the metadata named a specific label, along with a count of how many times that label is included in the training set and how many times it is required. The count given in the error message differs from the actual count shown in the TRAIN section for the processor.
{
  "code": 3,
  "message": "Invalid dataset.",
  "details": [
    {
      "@type": "type.googleapis.com/google.rpc.ErrorInfo",
      "reason": "INVALID_DATASET",
      "domain": "documentai.googleapis.com",
      "metadata": {
        "num_documents_with_annotation": "8",
        "num_documents_required": "10",
        "annotation_name": "charges"
      }
    }
  ]
}
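To double-check the counts myself, I tried tallying labels from the exported labeled documents. This is just a sketch, assuming the exported Document JSON shape where each document has an entities list and each entity carries a type field; the sample documents below are made up:

```python
from collections import Counter

def count_documents_per_label(documents):
    """Count, per label, how many documents contain at least one
    entity of that type (my reading of num_documents_with_annotation)."""
    counts = Counter()
    for doc in documents:
        # A document counts once per label, no matter how many
        # entities of that type it contains.
        labels_in_doc = {e.get("type") for e in doc.get("entities", [])}
        labels_in_doc.discard(None)
        counts.update(labels_in_doc)
    return counts

# Hypothetical sample: two exported Document JSON payloads
docs = [
    {"entities": [{"type": "charges"}, {"type": "account_number"}]},
    {"entities": [{"type": "account_number"}]},
]
print(count_documents_per_label(docs))
```

Running this over my real exports gives numbers matching what the console shows, not what the error reports.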
Other errors in trainingDatasetValidation.datasetErrors were similar, again giving a count of a label in the training set and a required minimum, but they used different terminology and included a constraint parameter with the value "text_anchor". I'm wondering whether that is any indication of the cause of the error.
{
  "code": 3,
  "message": "Invalid dataset.",
  "details": [
    {
      "@type": "type.googleapis.com/google.rpc.ErrorInfo",
      "reason": "INVALID_DATASET",
      "domain": "documentai.googleapis.com",
      "metadata": {
        "entity_type_path": "account_number",
        "constraint": "text_anchor",
        "entities_count": "8",
        "min_entities_count": "10"
      }
    }
  ]
}
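My guess is that the "text_anchor" constraint only counts entities whose annotation is anchored to OCR text, so a label drawn without an underlying text span might not count. To test that guess I sketched a check over one exported Document JSON; the field names (entities, textAnchor, textSegments) follow the Document JSON format, but the sample document itself is hypothetical:

```python
def entities_missing_text_anchor(doc):
    """Return the types of entities in one exported Document JSON whose
    annotation lacks a text anchor (my guess at what the
    'text_anchor' constraint is checking)."""
    missing = []
    for entity in doc.get("entities", []):
        anchor = entity.get("textAnchor") or {}
        # An anchored entity should carry at least one text segment.
        if not anchor.get("textSegments"):
            missing.append(entity.get("type"))
    return missing

# Hypothetical sample document
doc = {
    "entities": [
        {"type": "account_number",
         "textAnchor": {"textSegments": [{"startIndex": "0", "endIndex": "9"}]}},
        {"type": "invoice_date", "textAnchor": {}},  # labeled, but no anchored text
    ],
}
print(entities_missing_text_anchor(doc))  # ['invoice_date']
```

If that guess is right, the mismatch between the console's label counts and the error's entities_count could come from unanchored annotations being excluded, but I haven't found anything confirming it.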
The testDatasetValidation.datasetErrors included two errors with the same text_anchor constraint shown above.
I have been unable to find much documentation online, or questions here on Stack Overflow, that address these types of errors in training and test sets.