Tesseract OCR confuses slashed 0 as 8

Question

I have trained Tesseract on the Terminus font, but no matter what I try, I can't get it to recognize the 0s. I am using jTessBoxEditor to create the training TIFF and box files. Even during validation, it reads every 0 as an 8. Is there anything I am missing?

Here is an example of a 0 that it reads as an 8:

I use the following parameters:

--psm 10 -c tessedit_char_whitelist=0123456789# --oem 3 -l terminus

Have you tried propping up the zeros? I.e., just giving it a disproportionate number of 0s and 8s to learn from — Schalton, Oct 06 '21 at 18:39
You could also augment the zeros to prop them up (not sure if you're going to use other fonts). Some simple augmentations would be mirroring them vertically or horizontally. If that doesn't fix it, it may also depend on the size of the network or its receptive field. — smerkd, Jan 02 '22 at 21:19
This is not directly answering your question, but it might help you practically. I've tried to use tesseract many times, and find it very difficult to get consistent results. Now, for all my projects, I use either Amazon Textract or Google Vision. These APIs are quite cheap, easy to use, and do the job really well. You might not be allowed to use them but if you are, I'd suggest to have a look. — Colin Bernet, Jan 26 '22 at 10:24
Can you post a picture of an 8 as well? I think you need to create images with minor differences and then train a model on them, gradually increasing the accuracy for 0. — Nilanj, Jul 20 '22 at 19:03
I am noticing this is a rather high-resolution image with no margins. This might affect your results. Perhaps add margins and specify the `--dpi`? — Yuval, Jul 25 '22 at 08:07
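
The margin and `--dpi` suggestion in the last comment could be tried with something like the sketch below. It is only an illustration under stated assumptions: the function names, the 20-pixel white border, and the DPI value of 300 are hypothetical choices, not values from the question. It also assumes dark text on a light background, that Pillow is installed, and that the `tesseract` binary is on the PATH. The remaining flags are the ones from the question.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, dpi=300, lang="terminus"):
    """Return the tesseract CLI arguments from the question, plus --dpi.

    The dpi default of 300 is an assumption; use the real resolution if known.
    """
    return [
        "tesseract", str(image_path), "stdout",
        "--psm", "10",
        "--oem", "3",
        "-l", lang,
        "--dpi", str(dpi),
        "-c", "tessedit_char_whitelist=0123456789#",
    ]


def pad_image(src, dst, margin=20):
    """Save a grayscale copy of src with a white border of `margin` pixels."""
    # Pillow is imported lazily so the command builder works without it.
    from PIL import Image, ImageOps

    with Image.open(src) as img:
        # fill=255 is white in grayscale ("L") mode.
        ImageOps.expand(img.convert("L"), border=margin, fill=255).save(dst)
    return dst


def recognize(src, margin=20, dpi=300):
    """Pad the image, run tesseract on the padded copy, and return its text."""
    src = Path(src)
    padded = src.with_name(src.stem + "_padded.png")
    pad_image(src, padded, margin)
    result = subprocess.run(
        build_tesseract_cmd(padded, dpi),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling `recognize("zero.png")` (a hypothetical file name) would write `zero_padded.png` next to the original and return whatever Tesseract prints for it. Whether the padding or the DPI hint actually changes the 0/8 confusion is something to check on your own images.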

score 1 · Answer 1 · answered Feb 02 '22 at 12:31

EasyOCR is a lightweight model that performs well for receipt or PDF conversion. It gives more accurate results on organized text such as PDF files, receipts, and bills. EasyOCR also performs well on noisy images and recognizes numbers better than pytesseract.

code:

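    !pip install easyocr

    import cv2
    import easyocr

    # Load the image (replace "image path" with the path to your file)
    img = cv2.imread("image path")
    # Initialize the OCR reader for English
    text_reader = easyocr.Reader(['en'])
    # Each result is a (bounding box, text, confidence) tuple
    results = text_reader.readtext(img)
    for (bbox, text, prob) in results:
        print(text)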

score 1 · Answer 2 · answered Feb 08 '22 at 13:28

I cannot help you find a better way to use Tesseract.

I ran the image at the top of your question through the CIB deeper OCR engine, whose recognition is based purely on AI. It recognized the zeros in your example image as zeros.

The OCR engine can be used for free at https://doxiview.cib.de/ (click on text recognition on the right side). More info on Deeper: https://deeper.cib.de/en/
