OCR multi column text with Google Document AI

Question

I have a document with text in two columns per page. When I uploaded a test file with this layout, I noticed that the gap between the columns was ignored and the text was recognized as if each page had a single column.

The data looks like this:

text of  first         text of second
more text test         second test

Expected output:

text of first more text test
text of second second test

Actual output:

text of first text of second
more text test second test

Note that the file was a PDF in Hebrew. The language was recognized correctly and the text was read from right to left, as expected.

What can I do about this? Do I need to split it by column or something?

score 0 · Answer 1 · answered Jun 08 '23 at 23:02


Refer to this post about correcting text output from Google Document AI. The idea is to identify the jumbled text blocks and reorder them with the tools in the daiR package. For example, if the blocks are currently detected in the order 1 - 2 - 3 - 4 (left to right, top to bottom), you might have to rearrange them to 1 - 3 - 2 - 4 to get your expected output.
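As a rough illustration of that reordering (in Python rather than daiR, with the block texts taken from the question and the helper name `reorder` purely hypothetical), a minimal sketch might look like this:

```python
def reorder(blocks, order):
    """Return the blocks rearranged according to a list of 0-based indices."""
    return [blocks[i] for i in order]


# Blocks in the order the OCR returned them (row by row across both columns).
ocr_blocks = [
    "text of first",   # block 1
    "text of second",  # block 2
    "more text test",  # block 3
    "second test",     # block 4
]

# 1 - 3 - 2 - 4 in the answer's numbering, written as 0-based indices.
columns_first = reorder(ocr_blocks, [0, 2, 1, 3])

print(" ".join(columns_first[:2]))  # first column
print(" ".join(columns_first[2:]))  # second column
```

The index list is hard-coded here; for real documents it would have to be derived per page, for example from the block positions.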

answered Jun 08 '23 at 23:02

Joevanie


I agree with the concepts presented in this post; just note that `daiR` is a third-party package and does not have any Google-managed SLA or support. You can also write your own post-processing in another programming language, using the text blocks' [Bounding boxes](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#layout) to determine the order in which the blocks should be read. – Holt Skinner Jun 12 '23 at 18:52
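The bounding-box idea could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the `Block` class, the `reading_order` function, and the fixed `split_x` column boundary are hypothetical and not part of Document AI. The code works on plain values so it does not depend on any client library; in a real response the coordinates would come from each block's `layout.boundingPoly.normalizedVertices`, and the text from the `layout.textAnchor.textSegments` offsets into the document's `text` field.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical container for one OCR text block (normalized 0..1 coordinates)."""
    text: str
    x_min: float
    x_max: float
    y_min: float


def reading_order(blocks, split_x=0.5, rtl=False):
    """Order blocks column by column, top to bottom inside each column.

    Assumes exactly two columns separated at `split_x`. With rtl=True the
    right-hand column is read first, as one would expect for Hebrew.
    """
    left, right = [], []
    for block in blocks:
        center = (block.x_min + block.x_max) / 2
        (left if center < split_x else right).append(block)
    left.sort(key=lambda b: b.y_min)
    right.sort(key=lambda b: b.y_min)
    return right + left if rtl else left + right


# Blocks in the row-by-row order described in the question.
ocr_blocks = [
    Block("text of first", 0.05, 0.45, 0.10),
    Block("text of second", 0.55, 0.95, 0.10),
    Block("more text test", 0.05, 0.45, 0.15),
    Block("second test", 0.55, 0.95, 0.15),
]

ordered = [b.text for b in reading_order(ocr_blocks)]
print(ordered)
```

A fixed midpoint split is the simplest possible choice; pages with uneven columns, full-width headings, or more than two columns would need a smarter way to find the column boundaries.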
