
I have been using Tesseract (tess-two, to be more precise) to build an Android app that recognizes certain unconventional symbols. The goal is to identify a symbol and redirect to its description.

The symbols are recognized almost perfectly, whether they appear alone in the image or next to each other, except for two (shown below).

[Image: symbols omitted from recognition]

Neither of these symbols is recognized when alone, but both are correctly recognized when they are next to any other symbol.

For example:

Not recognized: _

Correctly recognized:

_ b

_ y _

The problem is not that they are mistaken for other symbols; they are ignored completely. This happens when calling:

TessBaseAPI baseApi;

...

String text = baseApi.getUTF8Text();

The returned string is always null, as if it didn't even detect the black regions to begin with. Does anyone know how I could fix this?

UPDATE:

To make it clearer, here is my full code for initializing Tesseract.

TessBaseAPI baseApi = new TessBaseAPI();

// Work on a mutable ARGB_8888 copy of the source bitmap.
mainBitmap = mainBitmap.copy(Bitmap.Config.ARGB_8888, true);

baseApi.setDebug(true);
baseApi.init(MainActivity.DATA_PATH, MainActivity.lang);

// Treat the whole image as a single character.
baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_CHAR);

// Restrict results to the letters the custom symbol font is mapped to.
baseApi.setVariable("tessedit_char_whitelist", "abcdefghijklmnopqrst");
baseApi.setImage(mainBitmap);

// The bitmap is no longer needed once it has been handed to Tesseract.
mainBitmap.recycle();
mainBitmap = null;

// Iterate through the results, one symbol at a time.
ResultIterator iterator = baseApi.getResultIterator();
String lastUTF8Text;
float lastConfidence;

iterator.begin();
do {
    lastUTF8Text = iterator.getUTF8Text(TessBaseAPI.PageIteratorLevel.RIL_SYMBOL);
    lastConfidence = iterator.confidence(TessBaseAPI.PageIteratorLevel.RIL_SYMBOL);

    Log.i("string, intConfidence", lastUTF8Text + ", " + lastConfidence);
} while (iterator.next(TessBaseAPI.PageIteratorLevel.RIL_SYMBOL));

My whitelist ranges from "a" to "t" because I made a font for the symbols I need and mapped each symbol to one of those letters.
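For illustration, the letter-to-symbol mapping described above could be resolved with a small helper like the one below. This is only a sketch: the class name SymbolLookup, the method indexOf, and the convention of returning -1 for a null or unusable result are hypothetical and not part of my app or of tess-two. It assumes that "a" maps to the first symbol and "t" to the twentieth.

```java
/** Hypothetical helper: maps a recognized whitelist letter to a symbol index. */
final class SymbolLookup {
    // Assumption: the whitelist is the contiguous range 'a'..'t' (20 symbols).
    static final char FIRST = 'a';
    static final char LAST = 't';

    private SymbolLookup() {}

    /**
     * Returns the zero-based symbol index for a recognized string,
     * or -1 when the OCR result is null, empty, longer than one
     * character, or outside the whitelist.
     */
    static int indexOf(String recognized) {
        if (recognized == null) {
            return -1; // the case this question is about: nothing was recognized
        }
        String trimmed = recognized.trim();
        if (trimmed.length() != 1) {
            return -1;
        }
        char c = trimmed.charAt(0);
        if (c < FIRST || c > LAST) {
            return -1;
        }
        return c - FIRST;
    }
}
```

With a helper like this, a null result from Tesseract shows up as -1 instead of causing a NullPointerException further down, which at least makes the "ignored symbol" case easy to log.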

Samzerge

1 Answer


I would try setting the page segmentation mode to single character:

TessBaseAPI.PageSegMode.PSM_SINGLE_CHAR
Errol Green
  • I already tried that, but it keeps ignoring those two specific symbols. In fact, I also tried all the modes just in case, but it keeps returning a null String. – Samzerge Mar 09 '16 at 16:25
  • Have you tried whitelisting only the symbols you need? – Errol Green Mar 09 '16 at 16:26
  • Yes, that works correctly: every time it returns a String, it is within the range of that list. The problem is that it is being returned as null. – Samzerge Mar 09 '16 at 16:38
  • Could you paste the code where you initialize Tesseract? Also, what happens when you read regular text? Does it still return null? – Errol Green Mar 09 '16 at 16:41
  • I've added the code to the question; I hope it helps. As to your question, my code isn't supposed to read regular text, just the symbols. But when it tries to read regular text, it sometimes matches it to the most similar symbol and sometimes returns null. – Samzerge Mar 09 '16 at 17:21
  • This is what I think is happening: when Tesseract reads your characters, it looks at neighboring characters to help it decide which character it is currently reading. So let's say you are reading 123567B9; it will most likely read the B as an 8, since its neighboring characters are numbers. To fix this, I would recommend adding some spacing between your characters, perhaps with some image preprocessing that spaces them out. – Errol Green Mar 09 '16 at 18:04
  • But when they are together there is rarely any problem. Using your example, "1234567B9" is returned correctly as "1234567B9". My problem is that when reading just a "B" I get null, and this doesn't happen with "1", "2", "9", or any of the other characters, which are returned correctly as "1", "2", and "9" respectively. – Samzerge Mar 10 '16 at 21:49
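
One form the image preprocessing mentioned in the comments could take is adding a white margin around the symbol before handing it to Tesseract. Whether this helps with the two ignored symbols is an assumption; nothing in this thread confirms it. The sketch below works on a plain ARGB pixel array so it does not depend on Android; the class name ImagePadding and the method padWithWhite are hypothetical. On Android, the array could be read with Bitmap.getPixels and turned back into a bitmap with Bitmap.createBitmap.

```java
import java.util.Arrays;

/** Hypothetical helper: surrounds an ARGB image with a white margin. */
final class ImagePadding {
    static final int WHITE = 0xFFFFFFFF; // opaque white in ARGB

    private ImagePadding() {}

    /**
     * Returns a new row-major pixel array of size
     * (width + 2 * margin) x (height + 2 * margin), with the original
     * image centered and every added pixel set to white.
     */
    static int[] padWithWhite(int[] pixels, int width, int height, int margin) {
        if (width <= 0 || height <= 0 || pixels.length != width * height) {
            throw new IllegalArgumentException("pixels must hold width * height values");
        }
        if (margin < 0) {
            throw new IllegalArgumentException("margin must not be negative");
        }
        int newWidth = width + 2 * margin;
        int newHeight = height + 2 * margin;
        int[] out = new int[newWidth * newHeight];
        Arrays.fill(out, WHITE);
        // Copy each source row into its shifted position in the padded image.
        for (int y = 0; y < height; y++) {
            System.arraycopy(pixels, y * width, out, (y + margin) * newWidth + margin, width);
        }
        return out;
    }
}
```

The padded image is 2 * margin pixels wider and taller than the original, so the new dimensions have to be passed along when the bitmap is rebuilt.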