I am experimenting with Google Document AI, but I am unsure what the possibilities are. Has anyone created a model that can read a PDF and split it into appropriate DITA topics, or into separate JSON files, one for each identified DITA topic? Any tips or help are appreciated.

2 Answers

To split documents you can use Document Splitter in Document AI.

Splitter output contains split information for the input document, including a confidence score. The Document AI API outputs a Document JSON object, and the output format uses the entities field for representing document splits.
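As a rough illustration of reading those splits, here is a minimal sketch that works on the Document JSON after it has been loaded into a plain Python dict. It assumes the REST-style camelCase field names (`entities`, `pageAnchor`, `pageRefs`, `page`) and the usual proto3 JSON behaviour where zero values are omitted and 64-bit integers arrive as strings. The `split_ranges` helper, the `min_confidence` parameter, and the sample entity types are made up for this example.

```python
import json
from typing import Any


def split_ranges(document: dict[str, Any], min_confidence: float = 0.0) -> list[dict[str, Any]]:
    """Return one record per split entity: type, confidence, first/last page index."""
    ranges = []
    for entity in document.get("entities", []):
        confidence = float(entity.get("confidence", 0.0))
        if confidence < min_confidence:
            continue
        page_refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # Assumption: a missing "page" key means page index 0, and page
        # numbers may be encoded as strings, so normalise with int().
        pages = sorted(int(ref.get("page", 0)) for ref in page_refs)
        if not pages:
            continue
        ranges.append(
            {
                "type": entity.get("type", ""),
                "confidence": confidence,
                "first_page": pages[0],
                "last_page": pages[-1],
            }
        )
    return ranges


# Hand-written sample in the shape described above (not real processor output).
sample = {
    "entities": [
        {"type": "invoice", "confidence": 0.97,
         "pageAnchor": {"pageRefs": [{}, {"page": "1"}]}},
        {"type": "receipt", "confidence": 0.42,
         "pageAnchor": {"pageRefs": [{"page": "2"}]}},
    ]
}

print(json.dumps(split_ranges(sample, min_confidence=0.5), indent=2))
```

The page indices are zero-based, so the resulting ranges can be handed to whatever PDF library you use for the physical split.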

The splitter is not designed to split logical documents that are over 30 pages long. Logical documents that are more than 30 pages long (e.g. a 40-page bank statement) may be split into two or more docs and classified separately.

Splitters identify page boundaries, but do not actually split the input document for you. Here is a code sample that physically splits a PDF file by using the page boundaries:

Document AI PDF Splitter Sample.

For more information about Document Splitter you can refer to this document.

To create a custom classification processor, you can follow this documentation.

Prajna Rai T

Slight clarification for https://stackoverflow.com/a/76021683/6216983


The general Document Splitter processor isn't recommended for production use cases.

It is recommended to use Custom Document Splitter (currently requires allowlisting) or the Procurement Splitter & Classifier or Lending Splitter & Classifier depending on the types of documents.

Splitters identify page boundaries, but do not actually split the input document for you.

You can use the Document AI Toolbox SDK to split the original PDF based on page boundaries identified.

Document AI doesn't currently have built-in support for DITA topics. If you can provide more context for the use case, I can report this as a feature request to the product development team.
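In the meantime, the DITA side can be handled in your own code once the text has been extracted. The following is only a sketch, not a Document AI feature: it uses Python's standard `xml.etree.ElementTree` to emit a bare `concept` or `task` topic. The `build_topic` function and its parameters are hypothetical, the output has no XML declaration or DOCTYPE, and deciding whether content is procedural is left to the caller (passing `steps` produces a task).

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def build_topic(topic_id: str, title: str, paragraphs: Sequence[str],
                steps: Sequence[str] = ()) -> str:
    """Build a minimal DITA task (if steps are given) or concept topic as a string."""
    is_task = bool(steps)
    root = ET.Element("task" if is_task else "concept", {"id": topic_id})
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "taskbody" if is_task else "conbody")

    if is_task:
        # Introductory paragraphs go into <context>, ahead of the steps.
        if paragraphs:
            context = ET.SubElement(body, "context")
            for text in paragraphs:
                ET.SubElement(context, "p").text = text
        steps_el = ET.SubElement(body, "steps")
        for command in steps:
            step = ET.SubElement(steps_el, "step")
            ET.SubElement(step, "cmd").text = command
    else:
        for text in paragraphs:
            ET.SubElement(body, "p").text = text

    return ET.tostring(root, encoding="unicode")


# Made-up content, purely to show the shape of the output.
print(build_topic("password-policy", "Password policy",
                  ["Passwords expire periodically."]))
print(build_topic("reset-password", "Reset a password",
                  ["Use this when a password has expired."],
                  steps=["Open the account page.", "Select Reset password."]))
```

Validating the result against the DITA DTDs would still need a proper DITA toolchain.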

Holt Skinner
  • A simple use case: a customer has hundreds of policy and procedure PDF documents that need to be converted to DITA topics. Each heading in the PDF is the title of a new topic. If the content is procedural, it should be converted to a task topic; if it is conceptual or descriptive, it should be a concept topic. Paragraphs should be converted to `<p>`, ordered lists to `<ol>`/`<li>`, etc. Is this a candidate for a feature request?
    – user3618078 Apr 18 '23 at 17:59
  • Ok, so the DITA topics would be a desired output format? This wouldn't be embedded into the input PDF in some way? If that's the case, then it's very unlikely that this would be implemented into the product as all API output from processors is in the [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) format. It could make sense to integrate this into the Document AI Toolbox SDK as a converter if this follows a standard format. – Holt Skinner Apr 18 '23 at 21:03
  • JSON output would be acceptable as well. I think the key would be to add an additional key/value pair to annotate the type of content. For example, if the content is a paragraph, the key/value pair would be something like "type":"para". Likewise, for a title, "type":"title". – user3618078 Apr 19 '23 at 13:47
  • Ok, you can refer to this documentation which shows how to handle the text and layout information from the processing response. (Including Paragraphs, blocks, tokens) https://cloud.google.com/document-ai/docs/handle-response#basic_text – Holt Skinner Apr 19 '23 at 15:27
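
To make the "type" annotation idea from the comments concrete, here is a minimal sketch that walks the paragraphs in a Document JSON dict and emits `{"type": ..., "text": ...}` records. It assumes the REST-style field names (`pages`, `paragraphs`, `layout`, `textAnchor`, `textSegments`, `startIndex`, `endIndex`), with omitted indices treated as 0 and string-encoded integers. Document AI does not label titles in this output, so `guess_type` is a purely hypothetical heuristic (short text with no sentence-ending punctuation), and the sample document is hand-written.

```python
import json
from typing import Any


def anchored_text(full_text: str, layout: dict[str, Any]) -> str:
    """Join the slices of the document text that a layout's text anchor points at."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in segments
    )


def guess_type(text: str) -> str:
    """Hypothetical heuristic: short text without closing punctuation is a title."""
    is_short = len(text.split()) <= 8
    ends_like_sentence = text.endswith((".", ":", ";", "?", "!"))
    return "title" if is_short and not ends_like_sentence else "para"


def typed_blocks(document: dict[str, Any]) -> list[dict[str, str]]:
    """Return one {"type", "text"} record per non-empty paragraph, in page order."""
    full_text = document.get("text", "")
    blocks = []
    for page in document.get("pages", []):
        for paragraph in page.get("paragraphs", []):
            text = anchored_text(full_text, paragraph.get("layout", {})).strip()
            if text:
                blocks.append({"type": guess_type(text), "text": text})
    return blocks


# Hand-written sample document (not real processor output).
heading = "Password Policy\n"
body = "Passwords must be rotated every 90 days.\n"
sample = {
    "text": heading + body,
    "pages": [{"paragraphs": [
        {"layout": {"textAnchor": {"textSegments": [
            {"endIndex": str(len(heading))}]}}},
        {"layout": {"textAnchor": {"textSegments": [
            {"startIndex": str(len(heading)),
             "endIndex": str(len(heading) + len(body))}]}}},
    ]}],
}

print(json.dumps(typed_blocks(sample), indent=2))
```

A real pipeline would probably want a stronger signal for headings, such as font size or style information where the processor provides it.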