1

I am passing scanned PDFs into the Google Cloud Document AI OCR. The JSON response (or the Document object returned when using the Python API) contains the content of the PDF in a structured format, as described here. I would like to be able to output a PDF file as well (or XML if that's easier). Is there such a functionality? Any hints on possible implementations are appreciated.

Note: the PDFs are already OCRed by another tool prior to my tasks, but the quality is not as good as the Document AI OCR.

Thank you

2 Answers2

1

Sharing if anyone else is looking for this. I found this repository gcv2hocr which has a script to convert the Google Cloud Vision response (for image input) to hOCR format. The hOCR output can then be converted to other formats, including PDF using hocr-tools.

I suppose it would not be very difficult to adapt this code to work with the DocumentAI response.

  • ocrmypdf is also a good solution. I have scanned documents as PDF, I want to convert it searchable PDF. Any idea how to do same with Document AI? – jagad89 Jul 26 '23 at 19:50
1

Google's Document AI Toolbox Library can convert it to hOCR or other formats but unfortunately not pdf.

here's hOCR example:

from google.cloud.documentai_toolbox import document

def convert_document_to_hocr(documentai_document, document_title):
    wrapped_document = document.Document.from_documentai_document(documentai_document)

    # Converting wrapped_document to hOCR format
    hocr_string = wrapped_document.export_hocr_str(title=document_title)

    print("Document converted to hOCR!")
    return hocr_string
Raad Altaie
  • 1,025
  • 1
  • 15
  • 28