I am working on a project where I need to parse information from a reference book. To do this, I am using Google's Custom Document Extractor. I have been annotating my first few scanned documents, but I have noticed a problem.
When I select text to annotate, the OCRed text isn't detected in the right order. For example, if the actual text is "I walked to my home", it may be read as "my home I walked to". My guess is that Google's OCR determines the order of the text by looking at which text is higher up on the page. So when a line isn't perfectly straight and the OCR parses it as two separate blocks, it can put the second block first, because the slant of the line makes the second block sit higher on the page than the first one.
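To make my guess concrete, here is a toy sketch of the ordering I suspect is happening. This is not Google's actual algorithm, which I don't know; the `Block` class, the coordinates, and `naive_reading_order` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge of the block
    y: float  # top edge of the block (smaller value = higher on the page)


def naive_reading_order(blocks):
    # Hypothetical ordering rule: highest block first, ties broken left to right.
    return sorted(blocks, key=lambda b: (b.y, b.x))


# One slightly slanted line that was split into two blocks:
# the right-hand block ends up a few units higher than the left-hand one.
blocks = [
    Block("I walked to", x=50, y=102),
    Block("my home", x=300, y=98),
]

print(" ".join(b.text for b in naive_reading_order(blocks)))
# prints "my home I walked to"
```

With a rule like this, a difference of only a few pixels in height is enough to swap the two halves of a line.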
Here is a screenshot showing the problem (the text is in French):
As you can see, the block I highlighted starts with "c) chaque réservoir ...", but the OCR parses it as starting with "protection contre les ...", which is actually at the end of the line (though technically higher up on the page, because the text is slightly slanted).
This makes it extremely hard to restore the right order afterwards, because there is no way to tell that a line was wrongly ordered without a human reading the sentence and seeing that it makes no sense.
Note: I thought of rotating the PDF slightly the other way, so that the left-hand side of each line is also the higher one on the page, but this doesn't seem very reliable.
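Another idea I have been toying with is to re-sort the blocks myself: treat blocks whose vertical extents overlap as belonging to the same line, then order each line left to right. Below is a minimal sketch of that idea. It assumes I can get a bounding box for every block from the OCR output; the `Box` class, the `reorder` function, and the `min_overlap` threshold are my own invention and not part of any Google API. I also have not tried it on real pages, and it would probably break on multi-column layouts.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x: float       # left edge
    top: float     # top edge (smaller value = higher on the page)
    bottom: float  # bottom edge


def reorder(boxes, min_overlap=0.5):
    """Group boxes into lines by vertical overlap, then sort each line by x."""
    lines = []  # each entry is a list of boxes judged to be on the same line
    for box in sorted(boxes, key=lambda b: b.top):
        for line in lines:
            ref = line[-1]  # compare against the last box added to this line
            overlap = min(ref.bottom, box.bottom) - max(ref.top, box.top)
            shorter = min(ref.bottom - ref.top, box.bottom - box.top)
            if shorter > 0 and overlap / shorter >= min_overlap:
                line.append(box)
                break
        else:
            # No existing line overlaps enough: start a new one.
            lines.append([box])
    lines.sort(key=lambda line: min(b.top for b in line))
    return [b for line in lines for b in sorted(line, key=lambda b: b.x)]


boxes = [
    Box("my home", x=300, top=98, bottom=110),
    Box("I walked to", x=50, top=102, bottom=114),
    Box("Then I slept.", x=50, top=120, bottom=132),
]
print(" ".join(b.text for b in reorder(boxes)))
```

The weak point is the threshold: if the slant is strong enough that the two halves of a line barely overlap vertically, they would still be treated as separate lines.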