For a contract job, I need to digitize a large number of old, image-only scanned PDFs of plenary debate protocols from the German Federal Parliament.

The problem is that most of these files have a two-column format:

Sample Protocol http://sert.homedns.org/img/btp12001.png

I would love to read your answers to the following questions:

  1. How can I split the two columns before feeding them into OCR?
  2. Which commercial or open-source OCR software or framework do you recommend, and why?

Please note that any tool, programming language, framework, etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!

UPDATE: These documents have already been scanned by the parliament o_O (sample: same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents, cut them, and scan them myself. There are just too many of them.

Best Regards,
Cetin Sert

Charles Stewart
Cetin Sert

4 Answers

0

Cut the pages down the middle before you scan.

mcandre
0

It depends on what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out those parts of the page.
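
If the columns drift a little from page to page, the boundary between the two regions could be estimated per page rather than hard-coded. Here is a minimal sketch of that idea (it is not the API mentioned above): it builds a vertical projection profile and picks the widest ink-free strip near the middle of the page. The function name `find_gutter`, its parameters, and the list-of-rows grayscale input are all assumptions made for illustration; loading the pixels from a scan is left out.

```python
def find_gutter(rows, search=(0.35, 0.65), dark_below=128, max_ink=0):
    """Return the x position of the column gutter, or None if none is found.

    rows       -- page as a list of rows of grayscale values (0 = black, 255 = white)
    search     -- fraction of the page width in which to look for the gutter
    dark_below -- pixels darker than this count as ink
    max_ink    -- dark pixels tolerated per pixel column (raise it for noisy scans)
    """
    width = len(rows[0])
    start = int(width * search[0])
    stop = int(width * search[1])

    # Vertical projection profile: number of dark pixels in each pixel column.
    ink = [sum(1 for row in rows if row[x] < dark_below) for x in range(start, stop)]

    best_start, best_len = None, 0
    run_start = None
    # The extra sentinel value closes a run that reaches the end of the band.
    for offset, count in enumerate(ink + [max_ink + 1]):
        if count <= max_ink:
            if run_start is None:
                run_start = offset
        elif run_start is not None:
            run_len = offset - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None

    if best_start is None:
        return None
    # Centre of the widest empty strip inside the search band.
    return start + best_start + best_len // 2


# Synthetic 100 x 10 "page": two dark text blocks with a white gutter at x = 45..55.
page = [[0 if (5 <= x < 45 or 56 <= x < 95) else 255 for x in range(100)]
        for _ in range(10)]
gutter_x = find_gutter(page)
```

The two OCR regions would then be everything left of `gutter_x` and everything right of it. Real scans are noisy, and some layouts print a rule between the columns, so `max_ink` would probably need tuning.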

Gavin
0

I use OmniPage 17 for such things. It also has a batch mode: you put the documents into one folder, it picks them up from there, and it writes the results into another folder. It recognizes the layout automatically, including columns, or you can set the default layout to columns. There are many options for how the output should look. Try the demo first to see whether it handles your documents correctly. At the moment I have problems with ligatures in some of my documents: words like "fliegen" come out as "fl iegen", so you have to spell-check them.
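
Part of that spell-checking could probably be automated. The following is a rough sketch, not an OmniPage feature: it rejoins a fragment ending in a common ligature ("fl", "fi", "ff", ...) with the next word, but only if the joined word appears in a word list. The name `rejoin_ligatures` and the `vocabulary` set are hypothetical; you would have to supply a real German word list.

```python
import re

# A fragment ending in a common ligature, followed by one space and more letters.
# The next word sits in a lookahead so it can start a match of its own.
LIGATURE_SPLIT = re.compile(r"\b(\w*(?:ffi|ffl|ff|fi|fl)) (?=(\w+))")


def rejoin_ligatures(text, vocabulary):
    """Remove the space after a ligature when the joined word is known."""
    def fix(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in vocabulary:
            return match.group(1)   # drop the space, keep the fragment
        return match.group(0)       # unknown word: leave the text untouched
    return LIGATURE_SPLIT.sub(fix, text)


vocabulary = {"fliegen"}            # placeholder for a real word list
fixed = rejoin_ligatures("Wir fl iegen nach Berlin", vocabulary)
untouched = rejoin_ligatures("Schiff fahren", vocabulary)
```

Checking against a word list keeps legitimately separate words such as "Schiff fahren" apart, at the cost of missing words that are not in the list.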

0

Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to). There are also a bunch of settings you can play with (see the API documentation); you may have to tweak some of them before it works with 2 columns. Finally, as a last resort, if the 2-column split is always in the same place, you can first write a program that splits the input image into two images (this shouldn't be very difficult with a standard image processing library) and then feed the resulting images to the OCR process.
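
For that last-resort approach, a minimal sketch could look like this, assuming the Pillow library is installed and the split really is at a fixed position. The names `column_boxes` and `split_page`, the `split_ratio` and `overlap` parameters, and the output file naming are all made up for illustration.

```python
from pathlib import Path


def column_boxes(width, height, split_ratio=0.5, overlap=0):
    """Return (left, upper, right, lower) crop boxes for the left and right column."""
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    split_x = round(width * split_ratio)
    # An optional overlap keeps glyphs near the cut from being clipped.
    left_box = (0, 0, min(width, split_x + overlap), height)
    right_box = (max(0, split_x - overlap), 0, width, height)
    return left_box, right_box


def split_page(image_path, out_dir, split_ratio=0.5, overlap=0):
    """Write <name>_left.png and <name>_right.png into out_dir; return their paths."""
    from PIL import Image  # Pillow; imported here so column_boxes works without it

    image_path = Path(image_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    outputs = []
    with Image.open(image_path) as page:
        boxes = column_boxes(page.width, page.height, split_ratio, overlap)
        for suffix, box in zip(("left", "right"), boxes):
            target = out_dir / f"{image_path.stem}_{suffix}.png"
            page.crop(box).save(target)
            outputs.append(target)
    return outputs
```

Something like `split_page("btp12001.png", "columns")` would then produce two images to feed to the OCR step, left column first so the reading order is preserved. Any page header that spans both columns would be cut in half, so it may need separate handling.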

Eugene Osovetsky