Tesseract False Space Recognition

Question

I'm using Tesseract to recognize serial numbers. This works acceptably, apart from the common misrecognitions such as zero and "O", 6 and 5, or M and H. Besides this, Tesseract adds spaces to the recognized words where there is no space in the image. The following image is recognized as "HI 3H".

Example Image 1

This image results in " FBKHJ 1R1"

Example image 2

So Tesseract added a space although there isn't really one in the image. Is there a way to parameterize Tesseract's spacing behavior?

Edit

Sorry, I forgot to add that I also have serial numbers that include spaces, so I cannot simply delete all spaces from the recognized serial number.

For example, the following image, which contains a space in the serial number, is recognized as: J4 F1583BB. Although the characters are recognized incorrectly, the space is recognized correctly in this image.

Example image 3

My current parameters for Tesseract are:

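tesseract::TessBaseAPI tess;
// "eng" selects the language data; OEM_TESSERACT_ONLY selects the legacy engine.
tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY);
tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);
tess.SetVariable("tessedit_char_whitelist",
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345789");

char* out = tess.GetUTF8Text();
string text = string(out);
delete[] out;  // GetUTF8Text returns a buffer that the caller must free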

Edit

As others have already noted, the gap between the "J" and the "I", for example, seems to be slightly larger than between the other characters. The font I chose is a monospace font, because I thought this would help Tesseract with character recognition. The drawback of a monospace font, where every character has the same width, is that the kerning (the space between the characters) varies. See the example image from the following source: Source

Proportional vs. Monospace

Which font type do you think will achieve better recognition results?

As a lazy dude, I would ask if your serials will ever contain a space? — Thomas Ayoub, Jun 26 '15 at 11:51
sorry, edited my question, serial numbers including spaces exists... — Mr.Sheep, Jun 26 '15 at 12:11
When you call `Init` on your `TessBaseAPI` object, you pass in "eng" as the second parameter. Is that to specify the character set or the language? If the latter, can you change it to an option that refers to just alphanumeric characters, but doesn't have the semantics of English proper? — Sam Estep, Jun 26 '15 at 12:17
dont know about you, but the distance between J and I in `FBK` may be a space, even for a human — UmNyobe, Jun 26 '15 at 12:20
@RedRoboHood: The init Function requires a language parameter, as far as i know. In general the serialnumber is language independent. — Mr.Sheep, Jun 26 '15 at 12:21
@UmNyobe: Yes, I see that there is a small gap between these characters. Since I created the serial numbers and took the photos myself, I know that there shouldn't be a space. I thought Tesseract does something like checking the mean distance between the characters, and can therefore distinguish between spaces and characters that belong together. If there is no option or parameter for this in Tesseract, I would write a function that calculates the mean distance between characters from the bounding boxes and checks whether two boxes are far enough apart to indicate a space between them. — Mr.Sheep, Jun 26 '15 at 12:24
Should I recreate the serial numbers, trying to use the same spacing between all characters? In general, if there is too little space between the characters, Tesseract cannot distinguish the individual characters. — Mr.Sheep, Jun 26 '15 at 12:26
same problem in 2009: https://groups.google.com/forum/#!topic/tesseract-ocr/5_3N6NShQck I guess there isn't such a parameter, but not guarantee... BUT maybe... have a look at `textord/tospace.cpp` as suggested by https://groups.google.com/forum/#!msg/tesseract-ocr/PepNaRySaHw/XzmKb_yZ7mkJ (all found on google) — Micka, Jun 26 '15 at 12:29
OK, Thank you. I have searched already too before opening a new question :) But haven't found something useful, ... But I will have a deeper look at the cpp file you mentioned. — Mr.Sheep, Jun 26 '15 at 12:32
What happened to the number "6"? Your variable `tessedit_char_whitelist` has excluded that digit!? — Stéphane, Aug 27 '19 at 00:25
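
The bounding-box idea raised in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: `CharBox` and `JoinWithGapThreshold` are hypothetical names, the boxes are assumed to be sorted left to right on a single line (they could come, for example, from Tesseract's result iterator at symbol level), and the sketch uses the median gap instead of the mean mentioned in the comment, because a single genuine space would otherwise inflate the reference value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-character result: the recognized symbol plus the left and
// right edge of its bounding box in pixels.
struct CharBox {
    char symbol;
    int left;
    int right;
};

// Rebuilds the text from boxes sorted left to right and inserts a space only
// where the gap to the previous box exceeds `factor` times the median gap.
std::string JoinWithGapThreshold(const std::vector<CharBox>& boxes, double factor) {
    assert(factor > 0.0);
    std::string text;
    if (boxes.empty()) return text;

    // Horizontal gap between each pair of neighbouring boxes.
    std::vector<int> gaps;
    for (std::size_t i = 1; i < boxes.size(); ++i)
        gaps.push_back(boxes[i].left - boxes[i - 1].right);

    // Upper median of the gaps; stays 0 if there is only one box.
    double median = 0.0;
    if (!gaps.empty()) {
        std::vector<int> sorted = gaps;
        std::sort(sorted.begin(), sorted.end());
        median = sorted[sorted.size() / 2];
    }

    text.push_back(boxes[0].symbol);
    for (std::size_t i = 1; i < boxes.size(); ++i) {
        // A non-positive median (touching or overlapping boxes) gives no
        // usable reference, so no space is inserted in that case.
        if (median > 0.0 && gaps[i - 1] > factor * median) text.push_back(' ');
        text.push_back(boxes[i].symbol);
    }
    return text;
}
```

With a factor of 3.0, a gap that is only slightly wider than its neighbours (such as 5 pixels among gaps of 2) is not treated as a space, while a clearly wider one (such as 18 pixels) is. The factor would have to be tuned on real images of the serial numbers.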

score 6 · Answer 1 · edited Nov 15 '15 at 10:52

Adjusting the parameter tosp_min_sane_kn_sp may help; doing so solved the problem for me.

If it doesn't help, you may try the other tosp_* parameters, or work on the spacing source code in "tospace.cpp".
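
A minimal sketch of how such parameters might be applied from C++, assuming they are set through `SetVariable` in the same way as the whitelist in the question. `ApplyParams` and `SpacingParams` are hypothetical helpers, and the value "3.0" is an arbitrary illustration, not a recommended setting. The helper is a template so that it works with any object offering `bool SetVariable(const char*, const char*)`, which is the signature `tesseract::TessBaseAPI` provides.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using ParamList = std::vector<std::pair<std::string, std::string>>;

// Applies each name/value pair through SetVariable and returns the names the
// API rejected, so that a misspelled parameter name does not go unnoticed.
template <typename Api>
std::vector<std::string> ApplyParams(Api& api, const ParamList& params) {
    std::vector<std::string> rejected;
    for (const auto& p : params) {
        assert(!p.first.empty());  // a parameter needs a name
        if (!api.SetVariable(p.first.c_str(), p.second.c_str()))
            rejected.push_back(p.first);
    }
    return rejected;
}

// Illustrative starting point only: the value is a guess to be tuned.
inline ParamList SpacingParams() {
    return {{"tosp_min_sane_kn_sp", "3.0"}};
}
```

With the `tess` object from the question, the call would be `ApplyParams(tess, SpacingParams());` after `Init` and before recognition. Checking the returned list is worthwhile because `SetVariable` reports an unknown name only through its return value.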

edited Nov 15 '15 at 10:52

Manjunath Ballur

answered Nov 15 '15 at 10:25

Tony_Tong

score 0 · Answer 2 · answered Jun 26 '15 at 12:48

I'm not a C++ programmer, but I think it's possible to calibrate the width of the space between letters. I found the parameter "textord_space_size_is_variable" on this site, and its description says: "If true, word delimiter spaces are assumed to have variable width, even though characters have fixed pitch."

Good luck! :)

answered Jun 26 '15 at 12:48

André Agostinho

I hadn't seen that so many parameters are adjustable. I will give them a try, thank you. – Mr.Sheep Jun 26 '15 at 12:53
