For a contract work, I need to digitalize a lot of old, scanned-graphic-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answer to my following questions:
- How I can split the two columns before feeding them into OCR?
- Which commercial, open-source OCR software or framework, do you recommend and why?
Please note that any tool, programming-language, framework etc. is all fine. Don't hesitate recommend esoteric products, libraries if you think they are cut for the jub ^__^!!
UPDATE: These documents are already scanned by the parliament o_O: sample (same as the image above) and there are lots of them and I want to deliver on the contract ASAP so I can't go fetch print copies of the same documents, cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert