Is there a way to parse the Document AI OCR response into pdf format?

Question

I am passing scanned PDFs to the Google Cloud Document AI OCR processor. The JSON response (or the Document object returned when using the Python API) contains the content of the PDF in a structured format, as described here. I would like to be able to output a PDF file as well (or XML if that's easier). Does such functionality exist? Any hints on possible implementations are appreciated.

Note: the PDFs have already been OCRed by another tool before they reach me, but the quality is not as good as that of the Document AI OCR.

Thank you

score 1 · Accepted Answer · answered Apr 20 '21 at 15:50

Sharing in case anyone else is looking for this. I found the gcv2hocr repository, which has a script to convert the Google Cloud Vision response (for image input) to hOCR format. The hOCR output can then be converted to other formats, including PDF, using hocr-tools.

I suppose it would not be very difficult to adapt this code to work with the DocumentAI response.
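
As a rough illustration of what such an adaptation might look like, here is a minimal sketch that builds a simplified hOCR string (pages and words only) from a Document AI response loaded as a plain dict. It assumes the camelCase field names of the JSON response (text, pages, dimension, tokens, layout.textAnchor.textSegments, layout.boundingPoly.normalizedVertices). The helper names layout_text, layout_bbox and document_to_hocr are made up for this example and are not part of gcv2hocr or any Google library. Tools such as hocr-tools generally expect line elements (ocr_line) as well, so a real converter would have to add those.

```python
from html import escape


def layout_text(full_text, layout):
    """Return the text a layout points at via its textAnchor segments."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # Assumption: indices arrive as strings, and startIndex is omitted when 0.
    return "".join(
        full_text[int(s.get("startIndex", 0)):int(s.get("endIndex", 0))]
        for s in segments
    )


def layout_bbox(layout, width, height):
    """Scale normalized vertices (0..1) to a pixel bbox (x0, y0, x1, y1)."""
    vertices = layout.get("boundingPoly", {}).get("normalizedVertices", [])
    if not vertices:
        return (0, 0, 0, 0)
    xs = [v.get("x", 0.0) * width for v in vertices]
    ys = [v.get("y", 0.0) * height for v in vertices]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


def document_to_hocr(doc):
    """Build a simplified hOCR string (pages and words only) from a dict."""
    full_text = doc.get("text", "")
    out = ["<html><body>"]
    for p, page in enumerate(doc.get("pages", []), start=1):
        dim = page.get("dimension", {})
        width, height = dim.get("width", 0), dim.get("height", 0)
        out.append(
            f"<div class='ocr_page' id='page_{p}' "
            f"title='bbox 0 0 {round(width)} {round(height)}'>"
        )
        for w, token in enumerate(page.get("tokens", []), start=1):
            layout = token.get("layout", {})
            # Token text usually carries trailing whitespace; drop it.
            word = layout_text(full_text, layout).strip()
            if not word:
                continue
            x0, y0, x1, y1 = layout_bbox(layout, width, height)
            out.append(
                f"<span class='ocrx_word' id='word_{p}_{w}' "
                f"title='bbox {x0} {y0} {x1} {y1}'>{escape(word)}</span>"
            )
        out.append("</div>")
    out.append("</body></html>")
    return "\n".join(out)
```

With the response saved to a file, something like document_to_hocr(json.load(open("response.json"))) would give a string to write out as an .hocr file. The file name here is only a placeholder.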

fx.paquette

ocrmypdf is also a good solution. I have scanned documents as PDFs and want to convert them to searchable PDFs. Any idea how to do the same with Document AI? – jagad89 Jul 26 '23 at 19:50

score 1 · Answer 2 · answered Aug 17 '23 at 22:12

Google's Document AI Toolbox library can convert the response to hOCR and other formats, but unfortunately not to PDF.

Here's an hOCR example:

from google.cloud.documentai_toolbox import document

def convert_document_to_hocr(documentai_document, document_title):
    wrapped_document = document.Document.from_documentai_document(documentai_document)

    # Converting wrapped_document to hOCR format
    hocr_string = wrapped_document.export_hocr_str(title=document_title)

    print("Document converted to hOCR!")
    return hocr_string
