How to convert JSON back to a Document object for Document AI

Question

Using the general Form Parser, I want to fetch the entities and append them to the Document object. (The general Form Parser output has no "entities", so I need to create them.)

I am using to_json() to convert the Document object to JSON. Is there a function that reverses this operation, converting the JSON back into a Document object?

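import json
from google.cloud import documentai

# result is the response returned by process_document()
doc_json = documentai.Document.to_json(result.document)
my_image = json.loads(doc_json)

my_list = []
my_dict = {}
# creating entities: this stores a TextAnchor proto object, not plain JSON data
my_dict['textAnchor'] = result.document.pages[0].form_fields[0].field_value.text_anchor
my_list.append(my_dict)

my_image['entities'] = my_list

toDoc = json.dumps(my_image)  # raises the TypeError below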

This raises the error "TypeError: Object of type TextAnchor is not JSON serializable".

I also tried adding to_json to the result JSON, but it is still not serializable.

score 0 · Answer 1 · answered Apr 11 '23 at 15:06

Yes, there is a method for the reverse direction: documentai.Document.from_json(json_as_string). There is also documentai.Document.from_dict(document_as_dict).
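The TypeError in the question appears to come from placing a TextAnchor proto object inside a dict that json.loads produced, which json.dumps cannot encode. Here is a minimal sketch of one way around that, assuming the camelCase keys that to_json emits by default (formFields, fieldValue, textAnchor, textSegments). It builds each entity from the already-parsed dict, so everything stays plain JSON data. The helper names are hypothetical, and using the field label as the entity type is an assumption, not something the Form Parser prescribes.

```python
import json


def anchor_text(text_anchor: dict, full_text: str) -> str:
    """Return the document text that a JSON textAnchor points at."""
    return "".join(
        # int64 values are serialized as strings in JSON, hence int()
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in text_anchor.get("textSegments", [])
    )


def form_fields_to_entities(doc_dict: dict) -> list:
    """Build entity dicts from pages[].formFields[] (hypothetical helper)."""
    full_text = doc_dict.get("text", "")
    entities = []
    for page in doc_dict.get("pages", []):
        for field in page.get("formFields", []):
            name = field.get("fieldName", {})
            value = field.get("fieldValue", {})
            anchor = value.get("textAnchor", {})
            entities.append({
                # Assumption: the field label doubles as the entity type.
                "type": anchor_text(name.get("textAnchor", {}), full_text).strip(),
                "mentionText": anchor_text(anchor, full_text).strip(),
                "confidence": value.get("confidence", 0.0),
                "textAnchor": anchor,  # a plain dict, so json.dumps can encode it
            })
    return entities


def dict_to_document(doc_dict: dict):
    """Convert the edited dict back to a Document (needs google-cloud-documentai)."""
    from google.cloud import documentai  # lazy import; the helpers above do not need it

    return documentai.Document.from_json(json.dumps(doc_dict))
```

With the names from the question, usage would look like: my_image = json.loads(doc_json), then my_image["entities"] = form_fields_to_entities(my_image), then dict_to_document(my_image).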

It seems like the issue is how you're trying to convert the form fields into entities. Is there a reason you're trying to do this?

It would likely make more sense to extract the form fields as shown in "Handle the processing response" and keep them as Form Fields.

Note: The newest version of the Form Parser, pretrained-form-parser-v2.0-2022-11-10, supports generic entity extraction, so this might serve your use case.
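As a rough illustration of keeping the data as form fields, the sketch below reads name/value pairs straight off the Document object by slicing document.text with each text anchor. The function names are hypothetical. The SimpleNamespace object at the end is only a stand-in with the same attribute names as a Document, so the snippet can be dry-run without calling the API.

```python
from types import SimpleNamespace as NS


def layout_text(layout, full_text: str) -> str:
    """Join the text segments that a Layout's text_anchor refers to."""
    return "".join(
        full_text[int(seg.start_index):int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )


def list_form_fields(document) -> list:
    """Return (name, value, confidence) tuples for every form field."""
    rows = []
    for page in document.pages:
        for field in page.form_fields:
            rows.append((
                layout_text(field.field_name, document.text).strip(),
                layout_text(field.field_value, document.text).strip(),
                field.field_value.confidence,
            ))
    return rows


def _layout(start, end, confidence=0.0):
    # Stand-in for a Layout message (illustration only, not the real type).
    anchor = NS(text_segments=[NS(start_index=start, end_index=end)])
    return NS(text_anchor=anchor, confidence=confidence)


# Stand-in document; in real use, pass result.document instead.
sample = NS(
    text="Name: Ada\n",
    pages=[NS(form_fields=[NS(field_name=_layout(0, 5),
                              field_value=_layout(6, 10, 0.9))])],
)
rows = list_form_fields(sample)
```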

If you're having difficulty with managing the Document object output, you can also try using the Document AI Toolbox SDK which has a simpler interface for reading common fields.

answered Apr 11 '23 at 15:06

Holt Skinner

Yes, there is a reason for doing this. I want to reduce the "human in the loop" work as much as possible and train on the documents in the Custom Document Extractor. So I am sending the document to the Form Parser, retrieving the entities from ["pages"]["formFields"] (explicitly, because the entities list is empty), and manually retrieving "textAnchor", "confidence", "boundingPolyForDemoFrontEnd", "mentionText" and "type". Thanks, – Asit Panda Apr 12 '23 at 05:46
Ok, that makes sense. You're trying to do reverse-annotation using the Form Parser. I just added a Notebook to the document-ai-samples repository that does what you're trying to do. It's designed for a specific dataset, and it also includes the batch processing of the documents but most of the concepts should transfer over. https://github.com/GoogleCloudPlatform/document-ai-samples/tree/main/form-parser-to-cde – Holt Skinner Apr 12 '23 at 18:00