I am using the Document AI API and want to serialize/deserialize the Document object

https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document

so that I can avoid repeating the API call while developing. I guess I should use Protocol Buffers, but do I have to write the .proto file myself? Isn't it available somewhere? I am using Python.

anonaka

2 Answers

OK, by "serialize", do you mean saving the Document object as JSON, as an actual .proto file, or as a binary file?

When using the Document AI API with a client library (such as the Python one), you can serialize the Document proto to a JSON string with the Document.to_json() method and then write that string to a file.

Note: if you use Batch Processing instead of Online Processing, the results are written as JSON files in Google Cloud Storage.

Example with Online Processing:

from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()

# Create Processing Request
# Refer to https://cloud.google.com/document-ai/docs/send-request

# Send Processing Request
result = client.process_document(request=request)

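# Serialize the Document Proto to JSON
# (Document is reached through the imported documentai module)
json_string = documentai.Document.to_json(result.document, including_default_value_fields=False)

# Write JSON String to File (json_file is the output path of your choice)
with open(json_file, "w") as outfile:
    outfile.write(json_string)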
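
If the goal is only to avoid repeating the API call during development, a small on-disk cache around the call may be enough. Below is a minimal sketch of that idea. `load_or_fetch` is a hypothetical helper, not part of any Google library. The commented usage assumes the `client` and `request` from the example above and relies on the `to_json` / `from_json` class methods of the client library's Document type.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_fetch(
    cache_path: str,
    fetch: Callable[[], T],
    to_text: Callable[[T], str],
    from_text: Callable[[str], T],
) -> T:
    """Return the cached value if the file exists, else fetch and cache it."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: rebuild the object from the saved text, no API call made.
        return from_text(path.read_text(encoding="utf-8"))
    # Cache miss: make the (expensive) call once and save its result.
    value = fetch()
    path.write_text(to_text(value), encoding="utf-8")
    return value


# Hypothetical usage with Document AI (assumes `client` and `request` exist):
#
# document = load_or_fetch(
#     "document.json",
#     fetch=lambda: client.process_document(request=request).document,
#     to_text=documentai.Document.to_json,
#     from_text=documentai.Document.from_json,
# )
```

Delete the cache file whenever you change the request or the processor. Otherwise the helper keeps returning the stale result.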

Also, the docs you linked in your post are for the Document AI Toolbox, which is an additional Python SDK for Document AI with helper functions for pre- and post-processing, meant to be used in conjunction with the Document AI client library.

Here's information about using Document AI Toolbox on documents processed by Document AI.

https://cloud.google.com/document-ai/docs/handle-response#toolbox

Holt Skinner

Google publishes interface definitions (protos) for its services that support REST/gRPC in the googleapis/googleapis repository.

Google's libraries for the services in this repo combine a higher-level REST abstraction and a lower-level gRPC implementation.

So, if you are using Google's Python SDK for Document AI, it's probable that the Python stubs for messages such as Document are already generated and shipped with the SDK, and you can use those. Alternatively, you can use protoc to generate the stubs yourself, though it's slightly gnarly because you need to configure --proto_path correctly so that protoc can find the imported protos.

Assuming you've cloned (or sparse-checked out) googleapis/googleapis and are in the clone's root directory:

.
├── google
│   └── cloud
│       └── documentai
│           └── v1beta3
└── protoc-22.2-linux-x86_64
    ├── bin
    └── include

Then you can generate the Python stubs for document.proto using the following command. The stubs will be located alongside the document.proto source:

protoc \
--proto_path=${PWD} \
--python_out=${PWD} \
--pyi_out=${PWD} \
${PWD}/google/cloud/documentai/v1beta3/document.proto

Once you have the Protobuf messages, you can serialize them with SerializeToString or use the text format's MessageToString.

NOTE SerializeToString serializes to a binary format. Here's an example using it.
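
As a rough illustration of both round trips (this is not the example linked above), here is a minimal sketch. It assumes only that the `protobuf` package is installed, and it uses the well-known `Timestamp` message as a stand-in, because the generated Document stubs are not available here. Messages generated by protoc should expose the same `SerializeToString` / `ParseFromString` methods.

```python
from google.protobuf import text_format, timestamp_pb2

# Stand-in message; with generated stubs this would be a Document instance.
original = timestamp_pb2.Timestamp(seconds=1700000000, nanos=5)

# Binary wire format: compact bytes, not human-readable.
payload = original.SerializeToString()
restored = timestamp_pb2.Timestamp()
restored.ParseFromString(payload)

# Text format: human-readable, handy for eyeballing during development.
text = text_format.MessageToString(original)
reparsed = text_format.Parse(text, timestamp_pb2.Timestamp())
```

Write `payload` to a file opened in binary mode ("wb") and read it back with "rb". The text-format string can go into an ordinary text file.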

DazWilkin