I am using the Document AI API and want to serialize/deserialize the Document object
to avoid repeating the API call while developing. I guess I should use Protocol Buffers, but do I have to write the .proto file myself? Isn't it available somewhere? I am using Python.
OK, by "serialize" are you referring to saving the Document object as JSON, as an actual .proto file, or as a binary file?
When using the Document AI API with a client library (such as Python), you can save the Document proto as a JSON string or file by using the Document.to_json() method.
Note: if you use Batch Processing instead of Online Processing, the results will be in JSON Files in Google Cloud Storage.
Example with Online Processing:
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()

# Create Processing Request (build `request` as described in the docs)
# Refer to https://cloud.google.com/document-ai/docs/send-request

# Send Processing Request
result = client.process_document(request=request)

# Serialize the Document Proto to JSON
json_string = documentai.Document.to_json(result.document, including_default_value_fields=False)

# Write JSON String to File
json_file = "document.json"  # example output path
with open(json_file, "w") as outfile:
    outfile.write(json_string)
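Since the goal is to skip the API call during development, the JSON file can double as a cache. The following is a minimal sketch, not part of the client library: the helper name load_or_fetch and the file name are hypothetical, and the commented usage assumes that documentai.Document.from_json is used as the counterpart of Document.to_json.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_fetch(
    cache_path: Path,
    fetch: Callable[[], T],
    to_json: Callable[[T], str],
    from_json: Callable[[str], T],
) -> T:
    """Return the cached object if cache_path exists, else fetch and cache it."""
    if cache_path.exists():
        # Cache hit: rebuild the object from the saved JSON, no API call.
        return from_json(cache_path.read_text(encoding="utf-8"))
    # Cache miss: call the API once, then persist the result for next time.
    obj = fetch()
    cache_path.write_text(to_json(obj), encoding="utf-8")
    return obj


# Hypothetical usage with Document AI (client and request built as in the
# example above; the file name is an assumption):
# document = load_or_fetch(
#     Path("document.json"),
#     lambda: client.process_document(request=request).document,
#     documentai.Document.to_json,
#     documentai.Document.from_json,
# )
```

Delete the cache file whenever you change the processor or the input document, otherwise the stale result is returned.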
Also, the docs you linked in your post are for the Document AI Toolbox, which is an additional Python SDK for Document AI with helper functions for pre- and post-processing, meant to be used in conjunction with the Document AI client library.
Here's information about using Document AI Toolbox on documents processed by Document AI.
https://cloud.google.com/document-ai/docs/handle-response#toolbox
Google publishes interface definitions (protos) for its services that support REST/gRPC in the googleapis/googleapis repository on GitHub.
Google's libraries for services in this repo combine a higher-level REST abstraction and a lower-level gRPC implementation.
So, if you are using Google's Python SDK for Document AI, it's probable that the Python stubs for e.g. Document are already generated and part of the SDK, and you can leverage those. Alternatively, you can use protoc to generate the stubs yourself, though it's slightly gnarly as you'll need to correctly configure --proto_path to access import'ed protos.
Assuming you've cloned (or sparse-checked out) googleapis/googleapis and are in the clone's root directory:
.
├── google
│   └── cloud
│       └── documentai
│           └── v1beta3
└── protoc-22.2-linux-x86_64
    ├── bin
    └── include
Then you can generate the Python stubs for document.proto using the following command. The stubs will be located alongside the document.proto source:
protoc \
--proto_path=${PWD} \
--python_out=${PWD} \
--pyi_out=${PWD} \
${PWD}/google/cloud/documentai/v1beta3/document.proto
Once you have Protobuf messages, you can serialize them with SerializeToString or use the text format via text_format.MessageToString.
NOTE: SerializeToString serializes to a binary format. Here's an example using it.
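A minimal sketch of the binary round trip, assuming any generated protobuf message: it relies only on the standard SerializeToString and ParseFromString methods. The helper names save_message and load_message, the file name, and the import path in the commented usage are hypothetical and not part of any library.

```python
from pathlib import Path


def save_message(message, path):
    """Write a protobuf message to path in the binary wire format."""
    Path(path).write_bytes(message.SerializeToString())


def load_message(message_cls, path):
    """Read a binary file written by save_message back into message_cls."""
    message = message_cls()
    message.ParseFromString(Path(path).read_bytes())
    return message


# Hypothetical usage with the stubs generated above (the import path is an
# assumption; it depends on where document_pb2.py ends up on sys.path):
# from google.cloud.documentai.v1beta3 import document_pb2
# save_message(document, "document.bin")
# document = load_message(document_pb2.Document, "document.bin")
```

If you stay with the google-cloud-documentai client library instead of raw stubs, its message classes are proto-plus wrappers, so documentai.Document.serialize(document) and documentai.Document.deserialize(data) should give the same binary round trip without these helpers.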