I want to train and test a custom document classifier using Python code, and I found this train processor. I started implementing it following this documentation, but I get an error when I call the function:
train_processor_version_sample(497857003374, 'us','a530739de44a7ca6',"Version-1","gs://documentai-bucket-123/pdfs","gs://documentai-bucket-123/test")
Error:
InvalidArgument Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 train_processor_version_sample(497857003374, 'us','a530739de44a7ca6',"Version-1","gs://documentai-bucket-123/pdfs","gs://documentai-bucket-123/test")
Input In [17], in train_processor_version_sample(project_id, location, processor_id, processor_version_display_name, train_data_uri, test_data_uri)
52 print(operation.operation.name)
53 # Wait for operation to complete
---> 54 response = documentai.TrainProcessorVersionResponse(operation.result())
56 metadata = documentai.TrainProcessorVersionMetadata(operation.metadata)
58 print(f"New Processor Version:{response.processor_version}")
File ~/anaconda3/lib/python3.9/site-packages/google/api_core/future/polling.py:261, in PollingFuture.result(self, timeout, retry, polling)
256 self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
258 if self._exception is not None:
259 # pylint: disable=raising-bad-type
260 # Pylint doesn't recognize that this is valid in this case.
--> 261 raise self._exception
263 return self._result
InvalidArgument: 400 Invalid dataset. See operation metadata for specific errors
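The message says to look at the operation metadata, so here is a minimal sketch of how the per-dataset errors might be pulled out of it. The attribute names (`training_dataset_validation`, `test_dataset_validation`, `dataset_errors`, `document_errors`, and `message` on each error) are what I understand `TrainProcessorVersionMetadata` to expose, so treat them as assumptions and check them against the client library version in use. `summarize_validation` is a hypothetical helper, and the demo object at the bottom is a made-up stand-in, not real output.

```python
from types import SimpleNamespace


def summarize_validation(metadata):
    """Collect error messages from the train/test dataset validation blocks.

    `metadata` is expected to look like TrainProcessorVersionMetadata
    (assumed field names, see the note above). Missing fields are skipped.
    """
    lines = []
    for name in ("training_dataset_validation", "test_dataset_validation"):
        validation = getattr(metadata, name, None)
        if validation is None:
            continue
        for kind in ("dataset_errors", "document_errors"):
            for status in getattr(validation, kind, []):
                lines.append(f"{name} / {kind}: {status.message}")
    return lines


# Stand-in object with an invented message, only to show the output shape.
fake_metadata = SimpleNamespace(
    training_dataset_validation=SimpleNamespace(
        dataset_errors=[SimpleNamespace(message="example dataset error")],
        document_errors=[],
    ),
    test_dataset_validation=None,
)
for line in summarize_validation(fake_metadata):
    print(line)

# In train_processor_version_sample this could be used roughly as
# (assumption: operation.metadata is still readable after the failure):
#
# try:
#     operation.result()
# except Exception:
#     metadata = documentai.TrainProcessorVersionMetadata(operation.metadata)
#     print("\n".join(summarize_validation(metadata)))
#     raise
```

The helper only reads attributes, so it does not need the Document AI client to be imported and can be tried on its own first.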
I have some idea about the cause. I think it is because the custom document classifier has some training dataset requirements:
Training guidelines
Minimum 2 labels required in the schema
Each label exists on 10 training documents
Each label exists on 2 test documents
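The guidelines above can be sketched as a local check that could be run before calling the training function. This is only an illustration under my own assumptions: `check_dataset` is a hypothetical helper, and it takes one label per document as a plain list, which I would have to build myself from however the documents are labeled. It does not call Document AI.

```python
from collections import Counter

# Thresholds taken from the training guidelines listed above.
MIN_LABELS = 2
MIN_TRAIN_DOCS_PER_LABEL = 10
MIN_TEST_DOCS_PER_LABEL = 2


def check_dataset(train_labels, test_labels):
    """Return a list of guideline violations (empty list means none found).

    train_labels / test_labels: one label string per document.
    """
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    labels = set(train_counts) | set(test_counts)

    problems = []
    if len(labels) < MIN_LABELS:
        problems.append(
            f"schema has {len(labels)} label(s), need at least {MIN_LABELS}"
        )
    for label in sorted(labels):
        if train_counts[label] < MIN_TRAIN_DOCS_PER_LABEL:
            problems.append(
                f"label {label!r}: {train_counts[label]} training document(s), "
                f"need at least {MIN_TRAIN_DOCS_PER_LABEL}"
            )
        if test_counts[label] < MIN_TEST_DOCS_PER_LABEL:
            problems.append(
                f"label {label!r}: {test_counts[label]} test document(s), "
                f"need at least {MIN_TEST_DOCS_PER_LABEL}"
            )
    return problems


# Example with invented label names: "receipt" has too few training documents.
train = ["invoice"] * 10 + ["receipt"] * 4
test = ["invoice"] * 2 + ["receipt"] * 2
for problem in check_dataset(train, test):
    print(problem)
```

A check like this would at least show whether the counts are the problem before a training operation is started.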
I don't know how to get the labeled dataset URL, or how to pass the two bucket directories for the training and test sets from Python code. Can anyone help me with this?