
I'm looking for a technique to detect text in a document.

For example, for a plain-text document it's easy: there are many libraries, APIs, and SDKs for image processing, and they usually provide methods that implement OCR algorithms.

But consider a "complex" printed document whose structure is well known and deterministic, for example the summary page of a pension program's annual report. I want to extract only the "bottom line" number. I know there is a header at the top center, a table in the middle, a paragraph at the bottom left, and, at the bottom right, the paragraph I'm looking for.

What is the approach to extract text from the document grouped by, and associated with, its location on the page? The main task is a technique for analysing the structure of the document against a predefined structure. Once we know which specific paragraph we are working on, the rest is easy: apply the standard OCR API mentioned above and collect the data in a custom data structure.

For example, for the linked document (page 1): what is the approach such that every time I apply a pure OCR API, I know exactly which part of the predefined template I am working on? The document template has:

Top section divided into 3 horizontal parts.

Middle section: a title and then the first table, another title and then another table.

Bottom section: some text in the right corner.
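
To make "text grouped by location" concrete, here is a minimal sketch, assuming the page image has already been cropped and straightened so that the template proportions hold, and assuming the OCR engine returns a bounding box for every word. The `Region` class, the `group_words` helper, and every fraction in `TEMPLATE` are hypothetical placeholders, not measurements of the real document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A template area in page-relative coordinates (0.0 to 1.0)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom


# Hypothetical fractions: replace with values measured on the real template.
TEMPLATE = [
    Region("top", 0.0, 0.0, 1.0, 0.2),
    Region("middle", 0.0, 0.2, 1.0, 0.8),
    Region("bottom_left", 0.0, 0.8, 0.5, 1.0),
    Region("bottom_right", 0.5, 0.8, 1.0, 1.0),
]


def group_words(words, page_width, page_height, template=TEMPLATE):
    """Group OCR words by the template region containing their center.

    words: iterable of (text, x0, y0, x1, y1) boxes in pixels.
    """
    grouped = {region.name: [] for region in template}
    for text, x0, y0, x1, y1 in words:
        # Normalize the box center so the template is resolution independent.
        cx = (x0 + x1) / 2 / page_width
        cy = (y0 + y1) / 2 / page_height
        for region in template:
            if region.contains(cx, cy):
                grouped[region.name].append(text)
                break
    return grouped
```

With this shape, the "bottom line" number would be whatever ends up in `grouped["bottom_right"]`. Fixed fractions only work if the section borders do not move between instances of the document.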

example

Thanks,

michael
  • How about a couple of examples? – Mark Setchell Aug 27 '15 at 06:26
  • @MarkSetchell Thanks, please see my edit with link to document template. – michael Aug 28 '15 at 05:43
  • Google for "document layout analysis" – Miki Aug 29 '15 at 11:38
  • @Miki Are you familiar with, or do you have experience with, something you'd recommend for easy integration in an Android environment? Thanks, – michael Aug 29 '15 at 11:44
  • I did something like this, but I didn't use any specific tool and wasn't in Android. Since the documents I used had a predefined layout, I _basically_ took the subimage at some specific predefined ROI on the image. – Miki Aug 30 '15 at 14:59
  • @Miki you mean something like "always go to the bottom 50px" and NOT "recognize that the middle part ended and the bottom part started, then go to that second (bottom) part"? – michael Aug 30 '15 at 18:35
  • If 1) the layout is always the same and 2) your input images are always the same size and skew (or you preprocess them so they are), that's a quite simple but effective solution. – Miki Aug 30 '15 at 18:39
  • @Miki Well, that's not the case. My input is something like a camera capture. I'm starting to wonder if it's even feasible, or if it's a mission for computer vision PhDs – michael Aug 30 '15 at 20:04
  • well, you never mentioned that... :D You can still, however, detect and rectify the paper, and resize the paper (or the layout) to fall in the simple case as above. – Miki Aug 30 '15 at 20:06
  • @Miki The main point is that the borders between the sections are not exactly the same. The overall layout template of the document is the same, but for a specific instance the middle paragraph may be longer, etc. – michael Aug 31 '15 at 10:33
  • So what was the algorithm that you came up with to do document structure analysis? – rresol Jun 13 '17 at 08:41
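
One way to handle section borders that move between instances, as raised in the comments above, is a horizontal projection profile, a basic building block of document layout analysis. The sketch below only illustrates that idea and is not the method anyone in this thread used. It assumes the photo has already been rectified and binarized into rows of 0/1 pixels (1 = ink); `find_bands` and `min_gap` are hypothetical names.

```python
def find_bands(page, min_gap=2):
    """Split a binarized page into horizontal content bands.

    page: list of rows, each row a list of 0/1 pixels (1 = ink).
    min_gap: number of consecutive blank rows that separates two bands.
    Returns a list of (start_row, end_row) pairs, end_row exclusive.
    """
    bands = []
    start = None     # first ink row of the band being built
    last_ink = None  # most recent row that contained ink
    for y, row in enumerate(page):
        if not any(row):
            continue  # blank row: nothing to record yet
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank run before this row is long enough: close the band.
            bands.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        bands.append((start, last_ink + 1))
    return bands
```

The detected bands can then be matched to the template sections by order (first band to the top section, last band to the bottom section) instead of by fixed pixel offsets, and the same scan over columns inside the last band would separate a bottom-left block from a bottom-right one. A real `min_gap` has to be tuned so that ordinary line spacing does not split a section.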

0 Answers