
I want to train and test a custom document classifier using Python code, and I found this train processor sample. I started implementing it by following this documentation, but I get an error when I call the function:

train_processor_version_sample(497857003374, 'us','a530739de44a7ca6',"Version-1","gs://documentai-bucket-123/pdfs","gs://documentai-bucket-123/test") 

error:

InvalidArgument                           Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 train_processor_version_sample(497857003374, 'us','a530739de44a7ca6',"Version-1","gs://documentai-bucket-123/pdfs","gs://documentai-bucket-123/test")

Input In [17], in train_processor_version_sample(project_id, location, processor_id, processor_version_display_name, train_data_uri, test_data_uri)
     52 print(operation.operation.name)
     53 # Wait for operation to complete
---> 54 response = documentai.TrainProcessorVersionResponse(operation.result())
     56 metadata = documentai.TrainProcessorVersionMetadata(operation.metadata)
     58 print(f"New Processor Version:{response.processor_version}")

File ~/anaconda3/lib/python3.9/site-packages/google/api_core/future/polling.py:261, in PollingFuture.result(self, timeout, retry, polling)
    256 self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
    258 if self._exception is not None:
    259     # pylint: disable=raising-bad-type
    260     # Pylint doesn't recognize that this is valid in this case.
--> 261     raise self._exception
    263 return self._result

InvalidArgument: 400 Invalid dataset. See operation metadata for specific errors

I have some idea about the cause: the custom document classifier has some training dataset requirements.

Training guidelines

Minimum 2 labels required in the schema

Each label exists on 10 training documents

Each label exists on 2 test documents

I don't know how to get the URI of a labeled dataset, or how to pass the two bucket directories for the training and test sets, using Python code. Can anyone help me with this?

Nitin Saini

1 Answer


This answer should cover your use case.

The code sample you linked requires the Document.JSON files in Google Cloud Storage to be labeled already.
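Before calling the training API, it may help to check the labeled files against the training guidelines quoted in the question (at least 2 labels, each on 10 training documents and 2 test documents). The following is only a rough sketch, not part of the Document AI client library. It assumes each labeled file is a Document JSON whose class label is stored as the `type` of an entry in `entities`, and that the files have been copied locally first; the function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

# Thresholds taken from the training guidelines quoted in the question.
MIN_LABELS = 2
MIN_TRAIN_DOCS_PER_LABEL = 10
MIN_TEST_DOCS_PER_LABEL = 2


def load_documents(directory):
    """Load every *.json file in a local directory as a dict (hypothetical helper)."""
    paths = sorted(Path(directory).glob("*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def count_labels(docs):
    """Count, per label, how many documents carry it at least once.

    Assumption: the label is the "type" field of an item in "entities".
    """
    counts = Counter()
    for doc in docs:
        labels = {e["type"] for e in doc.get("entities", []) if e.get("type")}
        counts.update(labels)
    return counts


def check_dataset(train_docs, test_docs):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    train_counts = count_labels(train_docs)
    test_counts = count_labels(test_docs)
    labels = set(train_counts) | set(test_counts)
    if len(labels) < MIN_LABELS:
        problems.append(f"only {len(labels)} label(s) found, need {MIN_LABELS}")
    for label in sorted(labels):
        if train_counts[label] < MIN_TRAIN_DOCS_PER_LABEL:
            problems.append(
                f"label {label!r}: {train_counts[label]} training docs, "
                f"need {MIN_TRAIN_DOCS_PER_LABEL}"
            )
        if test_counts[label] < MIN_TEST_DOCS_PER_LABEL:
            problems.append(
                f"label {label!r}: {test_counts[label]} test docs, "
                f"need {MIN_TEST_DOCS_PER_LABEL}"
            )
    return problems
```

An empty result from `check_dataset(load_documents("train"), load_documents("test"))` only means this local check passed; the service may apply further validation.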

There's no public API to explicitly label documents. The recommended process is to use the Cloud Console to create the labeled data; you can then use the training API to trigger the training process.
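If training still fails with "Invalid dataset", the error text says to look at the operation metadata. Here is a minimal sketch of pulling the messages out of it. It assumes the metadata object exposes `training_dataset_validation` and `test_dataset_validation`, each with `dataset_errors` and `document_errors` lists whose items have a `message` attribute, which is my understanding of `TrainProcessorVersionMetadata`; the helper name is hypothetical, so check the field names against the client library version you have installed.

```python
def dataset_error_messages(metadata):
    """Collect validation error messages from a training operation's metadata.

    Assumed layout (verify against your installed client library):
      metadata.training_dataset_validation.dataset_errors[i].message
      metadata.training_dataset_validation.document_errors[i].message
      and the same two lists under metadata.test_dataset_validation
    """
    messages = []
    for split in ("training_dataset_validation", "test_dataset_validation"):
        validation = getattr(metadata, split, None)
        if validation is None:
            continue
        for kind in ("dataset_errors", "document_errors"):
            for status in getattr(validation, kind, []):
                messages.append(f"{split}.{kind}: {status.message}")
    return messages


# Possible use with the sample from the question, where `operation` is the
# long-running operation returned by the train call:
#
#   try:
#       operation.result()
#   except Exception:
#       metadata = documentai.TrainProcessorVersionMetadata(operation.metadata)
#       for line in dataset_error_messages(metadata):
#           print(line)
```

The helper only reads attributes, so it works on any object with that shape and does not itself import the Document AI library.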

Holt Skinner