tesseract 5.0 bazaar + user-words config doesn't work

Question

I tried to force Tesseract to use only my word list when performing OCR. First, I copied the bazaar file to /usr/share/tesseract-ocr/5/tessdata/configs/. This is my bazaar file:

load_system_dawg F
load_freq_dawg F
user_words_suffix user-words

Then I created eng.user-words in /usr/share/tesseract-ocr/5/tessdata. This is my user-words file:

Items
VAT
included
CASH

Then I ran OCR on my test image with the command: tesseract -l eng --oem 2 test_small.jpg stdout bazaar

This is my result:

2 Item(s) (VAT includsd) 36,000
casH 40,000
CHANGE 4. 000

As you can see, includsd is not in my user-words file; it should be 'included'. Besides, I get the same result even without the bazaar config in the command. It looks like my bazaar and eng.user-words configuration has no effect on the OCR output. How can I use the bazaar and user-words config to get the desired result?

Did you ever find a solution to this? My interpretation of the documentation is the same as yours, in that you should be able to provide a 'whitelist' of words. – atlas_scoffed, May 19 '22 at 10:53
At that time I didn't find any solution and gave up. But it has been 2 years, so you should check the documentation; it may be supported now. @comfytoday – voxter, May 20 '22 at 01:51
This document is somewhat relevant: https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html But in terms of using only the supplied user words, I can't find anything like that. I don't think it's possible without compiling your own dictionary. – atlas_scoffed, May 20 '22 at 03:36

score 0 · Answer 1 · answered Mar 24 '21 at 17:47

All you need to do is up-sample the image.

If you up-sample it by a factor of two, Tesseract now reads:

2 Item(s) (VAT included) 36,000
CASH 40,000
CHANGE 4,000

Code:

import cv2
import pytesseract

# Load the image
img = cv2.imread("4nGXo.jpg")

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Up-sample
gry = cv2.resize(gry, (0, 0), fx=2, fy=2)

# OCR
print(pytesseract.image_to_string(gry))

# Display
cv2.imshow("", gry)
cv2.waitKey(0)

answered Mar 24 '21 at 17:47

Ahmet


Your approach is about resizing the image to get a better result. But I want to configure Tesseract to use only the words in my dictionary; my question is not about getting a better result. – voxter Mar 29 '21 at 04:47

score -1 · Answer 2 · answered Dec 12 '19 at 15:02

user_words_suffix does not seem to work for --oem 2. A workaround is to set user_words_file to the path of your user-words file.
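
For reference, here is a minimal sketch of how an explicit user-words path might be passed from Python. It assumes the `--user-words` command-line option listed in the tesseract manual, which to my understanding sets the same user_words_file parameter. build_tesseract_command is a hypothetical helper name, and the image name and path are the ones from the question. The sketch only assembles the command; whether the word list then changes the recognized text is not guaranteed.

```python
import shlex


def build_tesseract_command(image_path, words_path, lang="eng", oem=2):
    """Assemble the argv list for a Tesseract run with an explicit user-words file.

    Hypothetical helper: it only builds the command and does not run Tesseract.
    """
    return [
        "tesseract",
        "-l", lang,
        "--oem", str(oem),
        "--user-words", words_path,  # assumed to map to user_words_file
        image_path,
        "stdout",
    ]


cmd = build_tesseract_command(
    "test_small.jpg",
    "/usr/share/tesseract-ocr/5/tessdata/eng.user-words",
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it (requires Tesseract to be installed):
#   import subprocess
#   result = subprocess.run(cmd, capture_output=True, text=True)
#   print(result.stdout)
```

Building the arguments as a list rather than one string avoids shell quoting problems if the path contains spaces.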

answered Dec 12 '19 at 15:02

jtmayer


Can you post an example command for me? I cannot find your option in the tesseract manual: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc – voxter Dec 12 '19 at 15:08
I added this option to my bazaar file, but it doesn't work: `user_words_file /usr/share/tesseract-ocr/5/tessdata/eng.user-words` – voxter Dec 12 '19 at 15:17
@voxter What do you mean by "it doesn't work"? Do you get any error message? A user-words file is not a word whitelist; Tesseract does not use only the words from the user-words file. – jtmayer Dec 13 '19 at 05:12
"It doesn't work" means that I got the same result as before using this option. – voxter Dec 13 '19 at 10:49
Do we have any other ideas? – voxter Dec 16 '19 at 01:20
@voxter As I mentioned in my comment above, the user-words file is only a hint for Tesseract; it does not mean that only words from this list are recognized. I think the only thing you could do is provide a better image. – jtmayer Dec 16 '19 at 07:20
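
Since the user-words file is only a hint, one way to get closer to a strict word list is to correct the OCR text afterwards, outside Tesseract. The following is a rough sketch of that idea using Python's standard difflib module; it is not a Tesseract feature. snap_to_user_words and the 0.8 cutoff are assumptions made for illustration, and the sample text is the output from the question. To stay conservative, it only considers list words of the same length as the OCR token, so it fixes character substitutions but not insertions or deletions.

```python
import difflib
import re

# The word list from the question's eng.user-words file.
USER_WORDS = ["Items", "VAT", "included", "CASH"]


def snap_to_user_words(text, words, cutoff=0.8):
    """Replace each alphabetic token with a close, same-length word from `words`.

    Tokens without a sufficiently similar same-length candidate are left as-is.
    """
    lookup = {w.lower(): w for w in words}

    def fix(match):
        token = match.group(0)
        # Only compare against list words of equal length (substitution errors).
        candidates = [w for w in lookup if len(w) == len(token)]
        close = difflib.get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
        return lookup[close[0]] if close else token

    return re.sub(r"[A-Za-z]+", fix, text)


ocr_output = "2 Item(s) (VAT includsd) 36,000\ncasH 40,000\nCHANGE 4. 000"
print(snap_to_user_words(ocr_output, USER_WORDS))
```

With the question's output, this turns includsd into included and casH into CASH, while words that are not near anything in the list (such as CHANGE) and the digits are left untouched. The cutoff would need tuning for a real word list, since a low value can pull unrelated words onto list entries.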
