Enhance readability of TessBaseAPI.getUTF8Text()

Question

I am trying to use Tesseract OCR via tess-two on Android (developed in Android Studio) to recognize text from an image.

In Gradle, I added the following line to the dependencies section:

    compile 'com.rmtheis:tess-two:5.4.1'

Then, in the main activity's onCreate(), I have the following code to initialize the library and load an image:

    final String lang = "eng";
    TessBaseAPI baseAPI = new TessBaseAPI();
    boolean initResult = baseAPI.init(Environment.getExternalStorageDirectory().getPath(), lang);
    if(initResult) {
        InputStream is = null;
        try {
            is = getAssets().open("test2.jpg");
            final Drawable drw = Drawable.createFromStream(is, null);
            Bitmap bmp = ((BitmapDrawable) drw).getBitmap();

            baseAPI.setDebug(true);
            baseAPI.setImage(bmp);
            ImageView imageView = (ImageView)findViewById(R.id.imageView);
            imageView.setImageBitmap(bmp);

            String recognizedText = baseAPI.getUTF8Text().trim();
            Log.d(TAG, recognizedText);
            TextView textView = (TextView) findViewById(R.id.txt_debug);
            textView.setText(recognizedText);
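        } catch (FileNotFoundException nfe) {
            Log.d(TAG, "File Not Found");
            nfe.printStackTrace();
        } catch (IOException ioe) {
            Log.d(TAG, "Unable to open the file");
            ioe.printStackTrace();
        } finally {
            // Release Tesseract's native resources and close the asset stream on every path
            baseAPI.end();
            if (is != null) {
                try {
                    is.close();
                } catch (IOException ignored) {
                    // Nothing useful can be done if closing fails
                }
            }
        }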
    } else {
        Log.d(TAG, "Unable to init Base API");
    }

Finally, I put the JPEG in the assets folder (app/src/main/assets/). The image is basically a paragraph of text.

However, the OCR result is pretty much garbage:

OWW WW ON
R W WWW WK
KW MK
214
3 W5 HE WM
M WW WWW
LFNWW VW QTY
VM ACNL 19 WE NH
5 332152391
HQ W M W

How can I improve the readability of the scan?

I tried the following page segmentation modes, but the results are empty:

    // Automatic page segmentation with orientation and script detection
    baseAPI.setPageSegMode(TessBaseAPI.PageSegMode.PSM_AUTO_OSD);
    // Treat the image as a single text line
    baseAPI.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);

score 0 · Answer 1 · answered Mar 07 '16 at 16:59

Tesseract's recognition depends mainly on two things: the font used in the image and the trained data file for it.

Tesseract usually does not recognize handwriting, but in theory it could work if you train it to recognize a font that resembles handwriting.
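
Since recognition depends on the trained data file, it may be worth confirming that the file is where init() will look for it. tess-two expects the data path passed to init() to be the parent of a tessdata directory that contains <lang>.traineddata. Below is a minimal sketch in plain Java; the class and method names (TrainedDataCheck, hasTrainedData) are hypothetical helpers and not part of tess-two:

```java
import java.io.File;

/** Hypothetical helper for illustration; not part of tess-two. */
final class TrainedDataCheck {
    private TrainedDataCheck() {}

    /**
     * Returns true if dataPath/tessdata/<lang>.traineddata exists
     * and is a non-empty regular file.
     */
    static boolean hasTrainedData(File dataPath, String lang) {
        File tessdataDir = new File(dataPath, "tessdata");
        File trained = new File(tessdataDir, lang + ".traineddata");
        return trained.isFile() && trained.length() > 0;
    }
}
```

In the question's code this could be called as TrainedDataCheck.hasTrainedData(Environment.getExternalStorageDirectory(), "eng") before baseAPI.init(...), logging a clear message if it returns false.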

Samzerge

Thanks for the input. I found that preprocessing the image with tools like Scantailor or the textcleaner ImageMagick script improves the OCR output. The key is to remove noise and increase the resolution to at least 300 DPI. – Raptor Mar 08 '16 at 02:42
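
The two steps named in this comment (reaching about 300 DPI and removing noise) can be roughly approximated in code before the bitmap is handed to setImage(). The sketch below is plain Java working on ARGB pixel arrays. It assumes a fixed luminance threshold, which is far simpler than what Scantailor or textcleaner actually do, and the names OcrPreprocess, scaleFactor and binarize are hypothetical:

```java
/** Hypothetical helper for cleaning up an image before OCR; not part of tess-two. */
final class OcrPreprocess {
    private OcrPreprocess() {}

    /**
     * Factor by which an image captured at currentDpi should be enlarged
     * to reach targetDpi. Never returns less than 1 (no downscaling).
     */
    static double scaleFactor(int currentDpi, int targetDpi) {
        if (currentDpi <= 0) {
            throw new IllegalArgumentException("currentDpi must be positive");
        }
        return Math.max(1.0, (double) targetDpi / currentDpi);
    }

    /**
     * Maps every ARGB pixel to opaque black or opaque white.
     * Pixels whose luminance is below the threshold (0-255) become black.
     */
    static int[] binarize(int[] argb, int threshold) {
        int[] out = new int[argb.length];
        for (int i = 0; i < argb.length; i++) {
            int r = (argb[i] >> 16) & 0xFF;
            int g = (argb[i] >> 8) & 0xFF;
            int b = argb[i] & 0xFF;
            // Integer approximation of Rec. 601 luma
            int luma = (299 * r + 587 * g + 114 * b) / 1000;
            out[i] = luma < threshold ? 0xFF000000 : 0xFFFFFFFF;
        }
        return out;
    }
}
```

On Android, the pixel array can be read with Bitmap.getPixels() and written back with Bitmap.setPixels(), and the enlargement itself can be done with Bitmap.createScaledBitmap() using the computed factor. A fixed threshold only works for evenly lit images; unevenly lit photos would likely need an adaptive method instead.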
