The Operation data returned from get_operation() is in a slightly different format than the data returned directly from batch_process_documents(). This seems to be a quirk of how Google APIs handle operations.
The code sample and documentation don't cover this, but I figured out how to do it using the built-in methods. (I'm in the process of adding features to the Document AI Toolbox SDK that pull the Document output from the GCS URIs in BatchProcessMetadata or from an Operation name to make this easier.)
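One practical consequence is that the result of get_operation() is a one-off snapshot, so the caller has to poll for completion by hand. Here is a minimal sketch of such a polling loop with a delay and a timeout. The helper name wait_until_done and its fetch parameter are hypothetical and not part of any Google library; fetch stands in for whatever call returns the current Operation.

```python
import time
from typing import Any, Callable


def wait_until_done(
    fetch: Callable[[], Any],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
) -> Any:
    """Call fetch() repeatedly until the returned object reports done.

    `fetch` is a hypothetical zero-argument callable that returns an object
    with a boolean `done` attribute, such as the result of get_operation().
    """
    deadline = time.monotonic() + timeout
    while True:
        operation = fetch()
        if operation.done:
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation not done after {timeout} seconds")
        # Pause between requests so the API is not polled in a tight loop.
        time.sleep(poll_interval)
```

With the Document AI client, fetch could be something like lambda: client.get_operation(request=GetOperationRequest(name=operation_name)).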
Update: Code for Document AI Toolbox
from google.cloud import documentai
from google.cloud.documentai_toolbox import document
project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"
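# `client` is a DocumentProcessorServiceClient and `request` is a BatchProcessRequest,
# both created beforehand as in the standard batch processing sample.
operation = client.batch_process_documents(request=request)
# Format: projects/{project_id}/locations/{location}/operations/15842030886767182557
operation_name = operation.operation.name
# Use the wrapped documents to get the extraction information you need.
# (In recent Toolbox versions this call returns a list of wrapped documents.)
wrapped_documents = document.Document.from_batch_process_operation(
    location=location, operation_name=operation_name
)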
Main APIs
import time

from google.api_core.client_options import ClientOptions
from google.cloud import documentai
from google.longrunning.operations_pb2 import GetOperationRequest

project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"
operation_name = (
    f"projects/{project_id}/locations/{location}/operations/15842030886767182557"
)

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
)

# Poll until the operation finishes, pausing between requests
while True:
    operation = client.get_operation(
        request=GetOperationRequest(name=operation_name)
    )
    if operation.done:
        break
    time.sleep(5)

# The BatchProcessMetadata is serialized and must be deserialized to access its values
metadata = documentai.BatchProcessMetadata.deserialize(operation.metadata.value)

# Get the individual_process_statuses
for process in metadata.individual_process_statuses:
    # Handle the response however you need
    print(process.output_gcs_destination)
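Each output_gcs_destination is a gs:// URI, and most storage calls want the bucket name and the object prefix separately. Here is a minimal sketch of splitting one apart using only the standard library. The helper name split_gcs_uri is hypothetical, and the example URI is made up for illustration.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split a gs://bucket/prefix URI into (bucket, prefix).

    Hypothetical helper for illustration; not part of the Document AI SDK.
    """
    parsed = urlparse(gcs_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {gcs_uri!r}")
    # The path keeps its leading slash, which is not part of the object prefix.
    return parsed.netloc, parsed.path.lstrip("/")


# Example with a made-up destination URI
bucket, prefix = split_gcs_uri("gs://my-bucket/output/123/0")
print(bucket, prefix)
```

The bucket and prefix can then be passed to a storage client, for example list_blobs(bucket, prefix=prefix) in google-cloud-storage, to find the output Document JSON files.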