Using the Document AI processor to extract text from PDFs (English, German, Italian) works quite well, but the OCR sometimes misreads characters. This happens mostly when the "word" is not a dictionary word, for example part numbers that mix letters and digits (mostly O, 0, L, 1, 5, and S). Is there a way to tell Document AI to use the text that is already embedded in the PDF (as text)? To my knowledge, Document AI uses an image of each PDF page to OCR the content.

Are there any flags to make Document AI use the embedded text, or any other ideas? I need to use Document AI because I want the structure of the text to be extracted correctly.

mooose
  • Could you provide more details? Are those errors like O and 0 (zero)? What language are those PDFs in? How are you using OCR, what language are you using, and are you using `DOCUMENT_TEXT_DETECTION`? Do you get the same result using the [Vision Demo](https://cloud.google.com/vision/docs/drag-and-drop)? – PjoterS Jul 05 '21 at 13:57
  • I am not using the Cloud Vision API; I am using [Document AI](https://cloud.google.com/document-ai) with https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr. The docs are in English, German, and Italian. The problems are with O 0 L 1 5 S. – mooose Jul 05 '21 at 14:36
  • @KJ yes, I know OCR is a best guess. But if the text is already in the PDF, it would be good if Document AI checked that too. I am sure they match known words against dictionaries, but with part numbers that is really hard. – mooose Jul 05 '21 at 15:09
  • @KJ my question is not around PDF but how to use Document AI in a better way to get a more correct answer. – mooose Jul 05 '21 at 18:31
  • How many pages do those PDFs have? Are you using Python code like in [these docs](https://cloud.google.com/document-ai/docs/send-request#batch-process)? Could you try the [Document AI Demo](https://cloud.google.com/document-ai#section-2) with a 5-page PDF and check whether you get the same result as with your own requests? Are those PDFs text only, or do they also contain images? – PjoterS Jul 06 '21 at 12:34

1 Answer

For the Document AI OCR processor, there are no input parameters that affect the output generated by the model. If you find that certain characters or words are recognized incorrectly, you can handle this with post-processing or by using Human-in-the-Loop (HITL) for supported processors.
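As a rough illustration of the post-processing route, here is a minimal Python sketch. It assumes you have a list of valid part numbers to check against (the catalog below is made up for the example) and that the only confusions are the ones named in the question (O/0, L/1, S/5). The helper names `canonical`, `build_index`, and `correct_token` are hypothetical and not part of Document AI. The idea is to collapse each confusable pair to one character, then accept a correction only when exactly one catalog entry matches.

```python
# Confusable pairs taken from the question: O/0, L/1, S/5.
# Each letter is mapped to the digit it is commonly mistaken for.
CONFUSABLE = str.maketrans({"O": "0", "L": "1", "S": "5"})


def canonical(token: str) -> str:
    """Collapse OCR-confusable characters so lookalike tokens compare equal."""
    return token.upper().translate(CONFUSABLE)


def build_index(known_part_numbers):
    """Index a catalog of valid part numbers by their canonical form."""
    index = {}
    for part in known_part_numbers:
        index.setdefault(canonical(part), []).append(part)
    return index


def correct_token(token: str, index) -> str:
    """Return the catalog spelling if exactly one entry matches, else the token unchanged."""
    matches = index.get(canonical(token), [])
    return matches[0] if len(matches) == 1 else token


# Hypothetical catalog; in practice this would come from your own data.
index = build_index(["SL50-10O", "AB1205"])
print(correct_token("5L5O-100", index))  # prints SL50-10O
print(correct_token("hello", index))     # no catalog match, prints hello
```

Tokens that map to more than one catalog entry are left untouched on purpose, since guessing between two valid part numbers could introduce new errors; those cases are better candidates for manual review.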

There currently isn't a feature to use pre-embedded text from a PDF (Document AI does use an image of the PDF to perform OCR), but you can ask your Google Cloud Account Manager (if you have one) to reach out to the product team to discuss the options.

If you don't currently have a Google Cloud Account Manager, you can reach out to the sales team from the Contact Us page.

Holt Skinner