How to OCR multiple columns in a document using Tesseract

Question

I am working on a project to OCR the Sinhala language using Tesseract. My goal is to OCR a document whose text is laid out in multiple columns and to get the output file in the correct format. Is there any method to identify columns in a document using Tesseract?

score 12 · Answer 1 · edited Apr 04 '23 at 14:28

Setting Tesseract to work with a multi-column document is surprisingly easy, though I found very little information or discussion specifically about multi-column pages online. The basic idea is to set the page segmentation mode to do both "Automatic page segmentation" (the default) AND "Orientation and script detection" (OSD, not the default setting).

This is as simple as setting psm to 1, which selects "Automatic page segmentation with OSD." While it may not be obvious that OSD = recognize a multi-column document, in practical terms that's one of the outcomes. Another benefit is that the script detection helps Tesseract avoid trying to OCR non-text blocks like photographs.

For more on page segmentation methods, see: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
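
For quick reference, here is a small Java sketch of the mode numbers that come up in this thread. The numeric values and descriptions follow the list printed by `tesseract --help-psm`; the `Psm` enum, the `PsmTable` class, and their method names are made up for illustration and are not part of any Tesseract API.

```java
import java.util.Optional;

/** Hypothetical helper: a few Tesseract page segmentation modes by CLI number. */
enum Psm {
    OSD_ONLY(0, "Orientation and script detection (OSD) only"),
    AUTO_OSD(1, "Automatic page segmentation with OSD"),
    AUTO(3, "Fully automatic page segmentation, but no OSD (default)"),
    SINGLE_COLUMN(4, "Assume a single column of text of variable sizes"),
    SINGLE_BLOCK(6, "Assume a single uniform block of text");

    final int cliValue;        // number passed to the psm option
    final String description;  // wording from the Tesseract help text

    Psm(int cliValue, String description) {
        this.cliValue = cliValue;
        this.description = description;
    }

    /** Looks up a mode by its CLI number; empty if it is not in this short list. */
    static Optional<Psm> fromCliValue(int value) {
        for (Psm mode : values()) {
            if (mode.cliValue == value) {
                return Optional.of(mode);
            }
        }
        return Optional.empty();
    }
}

class PsmTable {
    public static void main(String[] args) {
        // Print the short table of modes, one per line.
        for (Psm mode : Psm.values()) {
            System.out.println(mode.cliValue + "  " + mode.description);
        }
    }
}
```

The list is deliberately incomplete: it only covers the modes relevant to column handling, so `fromCliValue` returns an empty result for the others.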

Here is a sample of the command line syntax to adjust the page segmentation mode (recent Tesseract releases spell the option `--psm`; older 3.x releases used `-psm`):

tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfile...]

For more on the syntax, see: https://tesseract-ocr.github.io/tessdoc/ImproveQuality
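
To make the syntax concrete, the sketch below assembles and optionally runs a `tesseract` command from Java. It assumes a `tesseract` binary on the PATH that accepts the two-dash `--psm` option, and that Sinhala traineddata is installed under the language code `sin`; `page.png`, `out`, and the `TesseractCommand` class are placeholder names chosen for this example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds a tesseract command line. */
class TesseractCommand {

    /** Builds: tesseract <image> <outputBase> [-l <lang>] --psm <psm> */
    static List<String> build(String image, String outputBase, String lang, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");
            cmd.add(lang);
        }
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        return cmd;
    }

    /** Runs the command, forwarding its output to this process; returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // psm 1 = automatic page segmentation with OSD, as suggested in the answer.
        List<String> cmd = build("page.png", "out", "sin", 1);
        System.out.println(String.join(" ", cmd));
        // Only launch tesseract when explicitly asked, so the sketch is safe to run anywhere.
        if (args.length > 0 && args[0].equals("--run")) {
            System.exit(run(cmd));
        }
    }
}
```

With these placeholder arguments the assembled command is `tesseract page.png out -l sin --psm 1`, which writes the recognized text to `out.txt`.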

The option is `--psm` (two dashes). A full example would also help clarify the syntax. E.g. `tesseract myimage.jpg outputbase --psm 1` — nealmcb, Nov 07 '22 at 16:39

score 1 · Answer 2 · answered Feb 18 '16 at 09:47

You can try the solution below to identify columns when scanning a picture.

import com.googlecode.tesseract.android.TessBaseAPI; // assuming the tess-two Android wrapper

TessBaseAPI baseApi = new TessBaseAPI();
baseApi.setDebug(true);
// DATA_PATH: directory that contains the "tessdata" folder; lang: language code, e.g. "eng" (English)
baseApi.init(DATA_PATH, lang);
// Set the page segmentation mode for the captured image; this is probably the line you are looking for
baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_COLUMN);
baseApi.setImage(bitmap);

// Get the recognized text from the captured image, then process it.
String recognizedText = baseApi.getUTF8Text();

If this is not the solution you were expecting, try the other PageSegMode values; hopefully one of them resolves your issue.
