Using Google Cloud Document AI Processors for PDF analysis and document generation

Question

Is it plausible to train a Document AI processor to analyze a PDF file containing instructions for a document outline and content (such as a government Request for Proposals) and output a new text document with an outline and draft content based on the input document?

My gut answer is no ... not using Document AI from Google. The Google Doc AI service can extract fields (entities) from documents. Imagine a PDF that contains a completed form. If we *just* ran OCR over it, we would get an unstructured blob of text. If we run Document AI over it, we get field/value pairs with the information extracted. That is (basically) what Doc AI does. In your question, you then asked for output generation ... this feels like advanced AI capabilities to GENERATE semantic output. Not in the remit of Doc AI which is parsing ... not construction. — Kolban, Jan 08 '23 at 14:54
Thanks. Do you think it can be trained to extract entities that are similar in concept between documents (proposal instructions, sections of text with similar headings such as "evaluation criteria") but with different formats in the documents? So a dozen documents contain similar type of information, but it's not always in the same place in the document with the same surrounding text. If I were to train it on a hundred of these documents, do you think it could be used on documents with different section headings and so on? — Jeff Nosanov, Jan 09 '23 at 15:08
I'm tempted to say "yes" to your last question. To me, DocAI is trained against labeled documents where YOU define a set of entities (text) you want to extract and give DocAI as many examples as you can of such labeled documents. DocAI will build "some" model ... the question becomes how accurate it will be. I believe DocAI uses its knowledge of language AND information you specify on page layouts both as guidance on how to extract information. If both the language and document structure is heavily variable, accuracy may be poor. — Kolban, Jan 09 '23 at 17:41

score 0 · Answer 1 · answered Jan 25 '23 at 00:58

As Kolban noted in the comments, Google Document AI cannot generate semantic output. It extracts structured data from unstructured ("dark") data, so it is a tool for analyzing and parsing documents, not for generating new ones.

Regarding your other question in the comments, it may be possible with Custom Document AI, which lets you build models suited to your document types. You can train custom models from scratch and evaluate them on your own data.
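To make the "extraction, not generation" point concrete, here is a minimal sketch of what you would typically do with a processor's result. It assumes the response has been saved as Document JSON, where each item in `entities` carries `type`, `mentionText`, and `confidence`. The labels `evaluation_criteria` and `submission_deadline`, the sample text, the `group_entities` helper, and the 0.5 threshold are all hypothetical, standing in for whatever schema you define when labeling your own RFPs.

```python
from collections import defaultdict


def group_entities(document: dict, min_confidence: float = 0.5) -> dict:
    """Group extracted entity text by entity type, skipping low-confidence hits."""
    grouped = defaultdict(list)
    for entity in document.get("entities", []):
        # Entities without a confidence score are treated as 0.0 and dropped.
        if entity.get("confidence", 0.0) < min_confidence:
            continue
        label = entity.get("type", "unknown")
        grouped[label].append(entity.get("mentionText", ""))
    return dict(grouped)


# Hypothetical Document JSON for one RFP (labels and values are made up).
sample_document = {
    "text": "Proposals are scored on technical merit. Due 1 March.",
    "entities": [
        {"type": "evaluation_criteria", "mentionText": "technical merit", "confidence": 0.93},
        {"type": "submission_deadline", "mentionText": "1 March", "confidence": 0.88},
        {"type": "evaluation_criteria", "mentionText": "Due", "confidence": 0.12},
    ],
}

fields = group_entities(sample_document)
print(fields)
```

The result is a set of label/value pairs pulled out of the input. Turning those pairs into an outline or draft text is a separate step that Document AI does not perform.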

