Could you help me? I have a PDF in Hebrew that contains numbered paragraphs. After processing this PDF with the Google Document AI OCR API, I receive text in which the paragraph numbers always come before the actual text (screenshot: an example of paragraph numbers appearing before the paragraph text). Is it possible to solve this problem?

I tried examining the line and token layout in the JSON returned by Document AI, but the layout reflects the same problem: the numbers are not in the correct place.

```python
# documents: Document objects returned by the Document AI API
for document in documents:
    for page in document.pages:
        # Only inspect the first 10 pages
        if page.page_number > 10:
            continue
        for line in page.lines:
            # A line's text_anchor holds offsets into document.text
            segment = line.layout.text_anchor.text_segments[0]
            line_text = document.text[segment.start_index:segment.end_index]
            print(line_text)
```

I previously tried Google Vision AI, and I have also tried different documents; the same error occurred every time.

Thank you!

1 Answer


That's some interesting behavior. Just to clarify, does the text look something like this in the original document? (It would help if you could provide a redacted example document and the output you expect.)

.10 [hebrew text 1]
.11 [hebrew text 2]
etc.

But the output is like:

.10
.11
[hebrew text 1]
[hebrew text 2]

My hypothesis is that this could be an issue with how Document AI handles this type of input for right-to-left languages (like Hebrew). If that's the case, this can be reported to the product development team. But it will be difficult to tell without an input document and the expected output.
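In the meantime, one possible workaround is to ignore the reading order and rebuild each row from position alone. The sketch below is only an illustration, not Document AI's own method: it assumes the bounding boxes are accurate even though the text order is not, and `Line`, `group_into_rows`, and `y_tolerance` are hypothetical names. With real output, the `x` and `y` values would have to be computed from each line's `layout.bounding_poly.normalized_vertices`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for one OCR line: its text plus the
    normalized centre of its bounding box (0.0 to 1.0)."""
    text: str
    x: float  # horizontal centre
    y: float  # vertical centre


def group_into_rows(lines: List[Line], y_tolerance: float = 0.01) -> List[str]:
    """Group lines that share a vertical position, then order each row
    from right to left, so a number at the right margin precedes its text."""
    rows: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.y):
        # Join the current row if this line is close to the row's first line
        if rows and abs(rows[-1][0].y - line.y) <= y_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    # Larger x means further right, which comes first in right-to-left text
    return [
        " ".join(l.text for l in sorted(row, key=lambda l: -l.x))
        for row in rows
    ]


# Made-up coordinates that mimic the output described above:
# all numbers first, then all paragraph text.
ocr_lines = [
    Line(".10", x=0.95, y=0.100),
    Line(".11", x=0.95, y=0.150),
    Line("[hebrew text 1]", x=0.50, y=0.101),
    Line("[hebrew text 2]", x=0.50, y=0.149),
]
for row in group_into_rows(ocr_lines):
    print(row)
```

The tolerance would need tuning per document, and multi-line paragraphs would need extra handling, since only the first line of each paragraph sits next to its number.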

For your specific use case, it could also make sense to use the Form Parser if you're interested in extracting specific fields based on those numbers. Processor version `pretrained-form-parser-v2.0-2022-11-10` added support for all of the languages supported by Document OCR.

Holt Skinner
  • Hi Holt, thank you for your answer and for your Document AI videos! You are right: the problem is that I have a numbered list in the original doc, but in the hOCR output I get a list of numbers followed by all of their text. The quality of hOCR recognition in Document AI is much better than in the tools I currently use, but this numbering issue prevents me from using Document AI. How can I report it to the dev team? Thank you! – Julia Grobman Apr 23 '23 at 09:25
  • Here's the information on how to file an issue for the product team in the public issue tracker. If you link the bug here, I can ensure it gets to the correct people. – Holt Skinner Apr 24 '23 at 17:03